How Bag of Words (BOW) Works in NLP

In this article, we are going to learn about one of the most popular concepts in NLP, the bag of words (BOW) model, which helps convert text data into meaningful numerical data.

After converting the text data to numerical data, we can build machine learning or natural language processing models to get key insights from the text data.

Before that, let’s take a step back and understand why NLP and NLU (natural language understanding) are challenging compared to other machine learning or deep learning tasks.

According to a report, there are over 7,110 languages across the globe. Meanwhile, just 23 languages account for more than half of the world’s population.

People communicate in the form of text, which is a combination of words. The volume of text being produced is so large that extracting insights from it becomes a necessity. A common example is user reviews.


Natural Language Processing, which is abbreviated as NLP, helps us in understanding or rather helps in getting the key insights from the raw text.

It is unique because it processes unstructured data, which is highly rich in information and can be used for different purposes. 

I hope you're excited to learn about the BOW and NLP. Before we dive further, let’s have a look at the topics you are going to learn.

What Is Natural Language Processing (NLP)?

Artificial Intelligence Branches

NLP (Natural Language Processing) is studied as a subset of artificial intelligence, which can also be seen from the above Venn diagram. This sub-branch of artificial intelligence focuses on extracting key insights from text data.

Let’s see some of the key applications for natural language processing.

Applications of NLP

Following are the applications.

Popular NLP Applications

Survey/Sentiment Analysis

In a survey, sentiment analysis focuses on the feedback of the customer/user. It helps the organization understand how its customers/users feel.

This sentiment analysis approach saves not only a lot of resources but also time. It can be implemented with many NLP models, like Bag of Words, TF-IDF, or even neural networks.

However, promising results are achieved using BERT and Transformers.

Language translators like Google Translate

The concept of RNNs is used here, and LSTMs in particular work well. Say you want to translate a sentence from English to French.

The moment you change a word with an associated gender, the pronoun that goes with it changes automatically. This is handled by the NLP-based RNNs used for translation.

Autocorrect and Autocomplete text recommendations

The words you use most often are recorded along with the order in which you use them.

When the system detects that an order of words is being repeated, it starts suggesting words and auto-completing sentences. This not only saves typing but also gives a great user experience.

One real-life example: Gmail suggests completions based on the words you write in an email.

Fake News Analyser

These kinds of applications help in categorising text.

Such models are trained on examples containing the text and a label indicating whether it is fake or not.

A model trained on this data can then classify whether a news item or email is fake. This approach is widely used, for example by Twitter and the Google News section.

One real-life example is email spam classification.

For any natural language processing model, the word corpus is a key thing. Let’s discuss that a bit more.

What Is Corpus?

NLP deals with data in the form of text. The text-based dataset we extract or receive is unstructured. We call it a corpus, which can be understood as a collection of words.

The plural of corpus is corpora, from its Latin derivation meaning “body.” When the corpus is labeled and structured properly, we call it a labeled corpus.

Text Preprocessing Techniques

Below are the basic text preprocessing techniques that need to be performed on raw text before building any NLP model.

Text Preprocessing Techniques
  1. Tokenization
  2. Stop words
  3. Stemming
  4. Lemmatization


Tokenization

Tokenization is a fundamental concept which deals with breaking text or the corpus into phrases/sentences or words, which are then stored in a list.

For Example

Let our corpus be:

“This is a detailed article on Bag of Words with NLP. It is a beginner-friendly article!”

The sentence tokenization for the above text would be:

  • Sentence 1: This is a detailed article on Bag of Words with NLP
  • Sentence 2: It is a beginner-friendly article!
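The split above can be sketched in plain Python. NLTK’s `sent_tokenize` does this far more robustly; the regex-based version below is a minimal illustration of the idea, not a production tokenizer:

```python
import re

def sentence_tokenize(corpus):
    """Split a corpus into sentences on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus.strip()) if s]

corpus = ("This is a detailed article on Bag of Words with NLP. "
          "It is a beginner-friendly article!")
sentences = sentence_tokenize(corpus)
# sentences now holds the two sentences shown above
```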

Stop Words

Dealing with text makes us realize that the complexity of processing it is proportional to the number of words. So we want to bring the complexity down to the bare minimum, but we cannot simply remove words arbitrarily.

However, we can surely discard the redundant words which don’t really add meaning to the corpus. A list of such words is stored as stop words, which can be understood as a list of words that are supposed to be filtered out.

Stop word lists are available for more than 30 languages. For English, the words removed from text include “a,” “an,” “the,” “to,” “for,” etc.
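Filtering stop words can be sketched with a small hand-picked list. This is only an illustration; NLTK’s English stop word list contains well over a hundred entries:

```python
# A tiny illustrative stop-word set; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "to", "for", "is", "of", "on", "with", "it"}

def remove_stop_words(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

tokens = remove_stop_words("This is a detailed article on Bag of Words")
# tokens keeps only the content-bearing words
```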


Stemming

As discussed earlier, the fewer the words, the simpler the model would be for NLP-based tasks.

Here, it is important to find out how we can shorten words so that words with almost the same meaning are not repeated in the vocabulary.

Vocabulary here means the list of distinct words. For this, we use stemming, which cuts words down to their root word.

For example

“History” and “historical” are similar and have almost the same meaning, so the root word for both would be “histori.”

Let us take another example.

The root word of goes and going would be “go.”

Removing the suffix from the word to get the root word is called suffix stripping.

There are different stemmers provided by the NLTK library in Python, like:

  • Porter Stemmer, 
  • Snowball Stemmer,
  • Lancaster Stemmer, etc.

While implementing things in this article, we will be trying Porter Stemmer for the time being.
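To make suffix stripping concrete, here is a deliberately crude toy stemmer. It is not the Porter algorithm (which applies many more rules and conditions via NLTK’s `PorterStemmer`); it only strips a handful of suffixes and applies a Porter-style “y” to “i” rewrite, enough to reproduce the examples above:

```python
def crude_stem(word):
    """Toy suffix stripper for illustration; not the real Porter algorithm."""
    word = word.lower()
    for suffix in ("ical", "ing", "es", "s"):
        # Strip the suffix only if at least two characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            word = word[: len(word) - len(suffix)]
            break
    if word.endswith("y"):
        word = word[:-1] + "i"  # Porter-style y -> i, as in "history" -> "histori"
    return word
```

Note that a real stemmer’s output can differ slightly from this sketch on longer words; the point is the suffix-stripping idea, not the exact rules.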


Lemmatization

Lemmatization is another text preprocessing concept, closely related to stemming. In the examples above, we saw that some words converted to their root form by stemming are not valid words.

Lemmatization keeps words intact by reducing them to a valid dictionary form instead of stripping suffixes. Lemmatization is considered better in terms of preprocessing but consumes more time than stemming.
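The difference from stemming can be sketched with a lookup-based lemmatizer. Real lemmatizers, such as NLTK’s `WordNetLemmatizer`, consult a full dictionary and the word’s part of speech; the tiny mapping below is a made-up illustration:

```python
# Minimal illustrative lemma dictionary; real lemmatizers use WordNet.
LEMMAS = {
    "goes": "go",
    "going": "go",
    "went": "go",     # lemmatization handles irregular forms a stemmer cannot
    "better": "good",
}

def lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word.lower(), word.lower())
```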

There are various other text preprocessing techniques to apply to text. You can refer to the article below to learn 20+ popular text preprocessing techniques along with their implementations.

Word Embeddings

We have understood the need to reduce the word list as much as possible, and we do it with text preprocessing techniques like stemming, stop word removal, etc.

After this comes the word embedding step. It can be understood as converting words to numbers so the system can work with them.

There are different word embedding techniques:

  1. Binary Encoding
  2. TF-IDF
  3. Word2vec
  4. Latent Semantic Analysis encoding

We suggest you read the below article to learn about the popular word embedding techniques along with implementation.

All of these are useful embedding techniques, but in this article, we will be focusing more on binary encoding, which is used in Bag of Words. It simply marks a word 1 if it is present in the sentence, else 0.

A detailed explanation of the same is given in the section below.

Understanding Bag of Words

As the name suggests, the concept is to create a bag of words from the clutter of words, which is also called the corpus.

It is the simplest form of representing words as numbers. We convert words to numbers because the system needs information in numerical form; otherwise it cannot process the data.

We convert the words to numbers by analyzing the presence of the word in a particular sentence. 

A number is encoded against each word: the number of times that word appears in the sentence.

If only the presence is to be considered, the word is denoted with 1’s and 0’s: 1 when the word is present in the sentence, else 0. This is called a binary bag of words.

Let us understand the Bag of words better with an example.

Bag of Words Calculation Example

After the text preprocessing step, we will end up with the below sentences.

  • Document 1: read SVM algorithm article dataaspirant blog
  • Document 2: read randomforest algorithm article dataaspirant blog

Here we will be making a vocabulary that will consist of all the words used in the above two sentences. 

A bag of words keeps a record of the occurrence/presence of each vocabulary word in each specific sentence. For the two documents above:

Bag of Words Example

  Word         | Document 1 | Document 2
  read         | 1          | 1
  svm          | 1          | 0
  randomforest | 0          | 1
  algorithm    | 1          | 1
  article      | 1          | 1
  dataaspirant | 1          | 1
  blog         | 1          | 1

This is precisely how we convert words to numbers.
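The whole procedure above can be written from scratch in a few lines of Python: build the vocabulary, then produce one count vector (or 0/1 vector, for the binary variant) per document. Libraries like scikit-learn provide this as `CountVectorizer`; the sketch below shows the mechanics:

```python
def bag_of_words(documents, binary=False):
    """Build a sorted vocabulary and one count (or 0/1) vector per document."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = []
    for doc in documents:
        words = doc.split()
        if binary:
            # Binary bag of words: presence/absence only.
            vectors.append([1 if w in words else 0 for w in vocabulary])
        else:
            # Count bag of words: number of occurrences.
            vectors.append([words.count(w) for w in vocabulary])
    return vocabulary, vectors

docs = ["read svm algorithm article dataaspirant blog",
        "read randomforest algorithm article dataaspirant blog"]
vocab, vectors = bag_of_words(docs)
```

Each row of `vectors` lines up with the sorted `vocab`, reproducing the table above.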

When should you use a bag of words?

Bag of words acts as a baseline model and can thus be used to get first results and learn more about the data being fed to the model. After that, one can proceed with deep learning approaches.

One can also use a bag of words when the data is context specific, meaning the data consists of reviews of a particular kind, like IMDB movie reviews, Yelp reviews, etc.

With all this, a bag of words also turns out to play well when the data is small and domain specific. 

Drawbacks of Bag of Words

Bag of words does not record the arrangement of words in a sentence, nor how a word relates to the rest of the document.

Bag of words also depends heavily on the vocabulary of the text: as the number of sentences increases, the vocabulary grows rapidly, which makes the model expensive both computationally and resource-wise.

Implementing Bag of Words Technique

This section is the roadmap and more like the contents section to the upcoming section where we would be implementing whatever we have learned in this article. 

We would be taking a basic classification dataset, would preprocess the text by using the concepts of stop words, stemming, etc. 

After that, we would be using a bag of words model on the resultant corpus to help us give the data in a numerical format for the system to understand and process. 

Lastly, we would be fitting the data with a classification algorithm that would predict the outcomes of the testing inputs.

Solving a Real Time NLP Problem Using Bag of Words

Bag Of Words Implementation

In this article, we will scratch the surface by solving a fundamental yet intriguing problem: spam-ham classification. You can download the dataset from here.

Required Packages Installation
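The original package list did not survive here, so the commands below are an assumption based on the libraries named in this article (NLTK for preprocessing, scikit-learn for the model, pandas for loading the dataset):

```shell
# Assumed requirements for this walkthrough
pip install nltk scikit-learn pandas
```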



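The roadmap above (preprocess, vectorize with bag of words, fit a classifier) can be sketched end to end. This assumes scikit-learn is installed, and the toy messages and labels here are made up for illustration; the real walkthrough would load the downloaded spam-ham dataset instead:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for the spam-ham dataset.
messages = [
    "win a free prize now",         # spam
    "free cash offer click now",    # spam
    "are we meeting for lunch",     # ham
    "see you at the office today",  # ham
]
labels = ["spam", "spam", "ham", "ham"]

# Step 1: Bag of Words turns each message into a word-count vector.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Step 2: fit a classifier on the count vectors.
model = MultinomialNB()
model.fit(X, labels)

# Step 3: predict on new text using the same vocabulary.
prediction = model.predict(vectorizer.transform(["free prize offer"]))
```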
Conclusion

This is a detailed article for beginners, enriched with the concepts used in NLP. It talks about NLP, its applications, and what a corpus means in NLP. To show how NLP is linked with AI, a Venn diagram is included too.

We learned about text preprocessing concepts like:

  • tokenization, 
  • stop words,
  • stemming,
  • lemmatization. 

Then we discussed the concept behind the bag of words and what a vocabulary is.

We also saw the cases where using a bag of words would be the best bet and, along with that, discussed its drawbacks too.

Lastly, we summed up the learnings by covering the installation of the packages used in this use case and implementing the approach in Python on the spam classification problem, whose dataset link is provided along with the code.

I hope you liked it! 

Recommended Natural Language Processing Courses


NLP Specialization with Python

Rating: 4.6/5


NLP Classification and vector spaces

Rating: 4.5/5

NLP Model Building With Python

Rating: 4.2/5



I hope you liked this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.

