How Bag of Words (BOW) Works in NLP

In this article, we are going to learn about one of the most popular concepts in NLP, the bag of words (BOW), which helps convert text data into meaningful numerical data.

After converting the text data to numerical data, we can build machine learning or natural language processing models to get key insights from the text data.

Before that, let’s take a step back and understand why NLP and NLU (natural language understanding) tasks are challenging compared to other machine learning or deep learning tasks.

According to a report, there are over 7,100 languages across the globe. Meanwhile, just 23 languages account for more than half the world’s population.

People communicate in the form of text, which is a combination of words. The volume of text content being produced is so large that extracting insights from these combinations of words has become a necessity. A common example is user reviews.

Natural Language Processing, abbreviated as NLP, helps us understand and extract key insights from raw text.

It is unique because it processes unstructured data, which is highly rich in information and can be used for different purposes. 

I hope you're excited to learn about the BOW and NLP. Before we dive further, let’s have a look at the topics you are going to learn.

What Is Natural Language Processing (NLP)?

Artificial Intelligence Branches

Natural Language Processing (NLP) stands at the fascinating intersection of artificial intelligence, machine learning, deep learning, and linguistics. It is dedicated to enabling computers to understand, interpret, and respond to human language in a valuable and meaningful way.

NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. At its core, NLP strives to bridge the gap between human communication and computer understanding, transforming the way we interact with technology.

It encompasses a range of techniques and tools that allow computers to process, analyze, and generate human language in both written and spoken forms.

From simple tasks like spell checks and keyword searches to complex operations like sentiment analysis, language translation, and speech recognition, NLP paves the way for more intuitive and seamless human-computer interactions. 

It empowers applications such as chatbots, virtual assistants, translation services, and helps in extracting insights and information from large volumes of text data. The advancements in NLP are continually reshaping our digital experiences, making technology more accessible, efficient, and attuned to our natural way of communicating.

As we progress, NLP stands as a testament to the incredible potential of blending human language with artificial intelligence.

Let’s see some of the key applications for natural language processing.

Applications of NLP

Below are some of the popular NLP applications.

Survey/Sentiment Analysis

In survey analysis, the focus is on the sentiment expressed in the customer/user feedback. It helps the organization understand how its customers/users feel.

This sentiment analysis approach saves not only a lot of resources but also time. It can be implemented with a lot of NLP techniques like Bag of Words, TF-IDF, or even neural networks.

However, promising results are achieved using BERT and Transformers.

Language translators like Google Translate

Sequence models based on RNNs are used here, and LSTMs in particular work really well. Let us say you want to translate a sentence from English to French.

The moment you change a word that carries gender, the pronoun associated with it changes automatically. This is possible because of the NLP-based RNNs powering the translation.

Autocorrect and Autocomplete text recommendations

The words that you have been using the most are recorded along with the order in which they are used. 

And when the system feels the order is being repeated, it starts suggesting and auto-completing the sentences. This not only records the words but also gives a great user experience.

A real-life example is Gmail, which these days suggests completions based on the words you type in your email.

Fake News Analyser

These kinds of applications help in categorizing text.

Such models are trained on examples consisting of the text and a label indicating whether it is fake or not.

A model trained on this data can then classify whether a piece of news or an email is fake or not. This approach is widely used, for example on Twitter and in the Google News section.

One real-life example is email spam classification.

For any natural language processing model, the word corpus is a key thing. Let’s discuss that a bit more.

What Is Corpus?

NLP deals with the data in the form of texts. So the text-based dataset that we extract or we get is unstructured. We call it a corpus, which can be understood as a collection of words. 

The plural of corpus is corpora, from the Latin word for “body.” When a corpus is labeled and structured properly, we call it a labeled corpus.

Text Preprocessing Techniques

Below are the basic text preprocessing techniques, which need to be performed on the raw text before building any NLP model.

  1. Tokenization
  2. Stop words
  3. Stemming
  4. Lemmatization

Tokenization

Tokenization is a fundamental concept that deals with breaking the text or corpus into phrases/sentences or words, and then storing them in a list.

For Example

Let our corpus be:

“This is a detailed article on Bag of Words with NLP. It is a beginner-friendly article!”

The sentence tokenize for the above text would be:

  • Sentence 1: This is a detailed article on Bag of Words with NLP
  • Sentence 2: It is a beginner-friendly article!
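
To reproduce this, here is a minimal sketch using NLTK’s tokenizers; it assumes the nltk package is installed and the “punkt” tokenizer models have been downloaded.

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models (one-time download)
from nltk.tokenize import sent_tokenize, word_tokenize

corpus = ("This is a detailed article on Bag of Words with NLP. "
          "It is a beginner-friendly article!")

print(sent_tokenize(corpus))  # splits the corpus into the two sentences
print(word_tokenize(corpus))  # splits the corpus into word tokens
```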

Stop Words

When dealing with text, we understand that the complexity of processing it is proportional to the number of words we have, so we want to bring that complexity down to the bare minimum. Of course, we cannot simply remove words at random.

However, we can surely discard the redundant ones which don’t really add meaning to the corpus. A list of such words is stored in stop words, which can be understood as a list of words that are supposed to be avoided. 

NLTK provides stop word lists for over 20 languages. So for English, it automatically removes words like “a,” “an,” “the,” “to,” and “for” from the text.
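
Below is a minimal sketch of stop word removal with NLTK; it assumes the “stopwords” corpus and “punkt” models have been downloaded.

```python
import nltk
nltk.download("punkt", quiet=True)      # tokenizer models (one-time download)
nltk.download("stopwords", quiet=True)  # stop word lists (one-time download)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is a detailed article on Bag of Words with NLP")

# Keep only the tokens that are not in the stop word list.
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['detailed', 'article', 'Bag', 'Words', 'NLP']
```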

Stemming

As discussed earlier, we know that the fewer the words, the simpler the model would be for NLP-based tasks.

It is therefore important to find out how we can shorten words so that words with almost the same meaning are not repeated in the vocabulary.

Vocabulary here means the list of distinct words. For this, we use stemming, which cuts words down to their root word.

For example

“History” and “historical” are closely related and have almost the same meaning, so stemming cuts both down to a root word like “histori.”

Let us take another example.

The root word of goes and going would be “go.”

Removing the suffix from the word to get the root word is called suffix stripping.

There are different stemmers provided by the NLTK library in Python, such as:

  • Porter Stemmer, 
  • Snowball Stemmer,
  • Lancaster Stemmer, etc.

In the implementation part of this article, we will be using the Porter Stemmer.
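
Here is a minimal sketch using NLTK’s PorterStemmer; note that stems are crude truncations and need not be dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["history", "historical", "going", "goes"]:
    # Each word is reduced towards its (not necessarily valid) root form.
    print(word, "->", stemmer.stem(word))
```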

Lemmatization

Lemmatization is another text preprocessing concept, closely related to stemming. In the stemming examples, we saw that the words converted to their root form often have no meaning on their own.

Lemmatization keeps those words meaningful: instead of stripping suffixes, it reduces each word to its dictionary form. Lemmatization is considered better in terms of preprocessing quality but consumes more time than stemming.
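
Below is a minimal sketch using NLTK’s WordNetLemmatizer; it assumes the “wordnet” corpus has been downloaded.

```python
import nltk
nltk.download("wordnet", quiet=True)  # WordNet data (one-time download)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("histories"))       # 'history' (a valid word)
print(lemmatizer.lemmatize("going", pos="v"))  # 'go' (verb lemma)
```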

There are various other text preprocessing techniques that can be applied to text. You can refer to the article below to learn 20+ popular text preprocessing techniques along with their implementations.

Word Embeddings

We have understood the need to reduce the vocabulary as much as possible, and we do it with text preprocessing techniques like stemming, stop word removal, etc.

After this comes the word embedding step, which can be understood as converting words into numeric vectors that the system can understand.

There are different word embedding techniques:

  1. Binary Encoding
  2. TF-IDF
  3. Word2vec
  4. Latent Semantic Analysis encoding

We suggest you read the below article to learn about the popular word embedding techniques along with implementation.

All of these are useful embedding techniques, but in this article, we will be focusing more on binary encoding, which is used in Bag of Words. It simply marks a word as 1 if it is present in the sentence and 0 otherwise.

A detailed explanation of the same is given in the section below.

Understanding Bag of Words

As the name suggests, the concept is to create a bag of words from the clutter of words, which is also called the corpus.

It is the simplest form of representing words in the form of numbers. We convert the words to digits because the system needs the information in the form of numbers, or else it won’t be able to process the data.

We convert the words to numbers by analyzing the presence of the word in a particular sentence. 

Each word is assigned an encoded value: the number of times that word appears in the sentence.

If only the presence of a word is to be considered, the values are denoted as 1’s and 0’s: when the word is present in the sentence it is denoted as 1, else 0. This is called a binary bag of words.

Let us understand the Bag of words better with an example.

Bag of Words Calculation Example

After the text preprocessing step, we will end up with the below sentences.

  • Document 1: read SVM algorithm article dataaspirant blog
  • Document 2: read randomforest algorithm article dataaspirant blog

Here we will be making a vocabulary that will consist of all the words used in the above two sentences. 

A bag of words keeps a record of the occurrence/presence of each vocabulary word in each sentence. It is demonstrated below.

Bag of Words Example

Word          | Document 1 | Document 2
read          | 1          | 1
svm           | 1          | 0
randomforest  | 0          | 1
algorithm     | 1          | 1
article       | 1          | 1
dataaspirant  | 1          | 1
blog          | 1          | 1

This is precisely how we convert words to numbers.
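
The same encoding can be reproduced in a few lines with scikit-learn; below is a minimal sketch using CountVectorizer, with binary=True so that only presence/absence is recorded.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "read svm algorithm article dataaspirant blog",
    "read randomforest algorithm article dataaspirant blog",
]

# binary=True records presence/absence (1/0) instead of raw counts.
vectorizer = CountVectorizer(binary=True)
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # one binary row per document
```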

When should you use a bag of words?

Bag of words acts as a baseline: a basic model used to test results against and to learn more about the data being fed to the model. After that, one can proceed further with deep learning approaches.

One can also use a bag of words when the data is context specific. It means when the data is of some review of a particular kind like that of IMDB movie reviews, Yelp reviews, etc. 

In addition, a bag of words also performs well when the dataset is small and domain specific.

Drawbacks of Bag of Words

Bag of words does not record the arrangement of words within a sentence, nor how a word relates to the words around it.

The bag of words also depends heavily on the vocabulary of the text. As the number of sentences increases, the vocabulary (and thus the vector dimensionality) grows rapidly, which makes the model expensive both computation- and resource-wise.

Implementing Bag of Words Technique

This section is a roadmap, more like a contents section, for the upcoming section, where we will implement what we have learned in this article.

We will take a basic classification dataset and preprocess the text using the concepts of stop words, stemming, etc.

After that, we will use a bag of words model on the resulting corpus to put the data in a numerical format that the system can understand and process.

Lastly, we will fit the data with a classification algorithm that predicts the outcomes for the test inputs.

Solving a Real Time NLP Problem Using Bag of Words

In this article, we will scratch the surface of the topic by solving a fundamental yet intriguing problem: spam-ham classification. You can download the dataset from here.

Required Packages Installation
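
The sketches above and the implementation below rely on a handful of common Python packages. Assuming a standard pip setup, they can be installed with:

```
pip install nltk scikit-learn pandas
```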

Implementation
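
Below is a minimal end-to-end sketch of the pipeline described in the roadmap above. It assumes the downloaded dataset is a CSV file named spam.csv with a "label" column ("ham"/"spam") and a "message" column containing the text; these names are illustrative, so adjust them to match the actual dataset.

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

nltk.download("stopwords", quiet=True)

# Hypothetical file/column names; adjust to the downloaded dataset.
df = pd.read_csv("spam.csv")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, keep letters only, drop stop words, and stem."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)

corpus = df["message"].apply(preprocess)

# Binary bag of words over a capped vocabulary.
vectorizer = CountVectorizer(binary=True, max_features=3000)
X = vectorizer.fit_transform(corpus)
y = (df["label"] == "spam").astype(int)  # 1 = spam, 0 = ham

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Naive Bayes is a common baseline classifier for bag of words features.
model = MultinomialNB()
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```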

Conclusion

This is a detailed article for beginners, enriched with the core concepts used in NLP. It talks about NLP, its applications, and what a corpus means in NLP. To show how NLP is linked with AI, a Venn diagram is included as well.

We learned about text preprocessing concepts like

  • tokenization, 
  • stop words,
  • stemming,
  • lemmatization. 

Then we discussed the concept behind the bag of words and what a vocabulary is.

We also saw the cases where using a bag of words would be the best bet and, along with that, discussed its drawbacks too.

Lastly, we summed everything up by covering the installation of the required packages and implementing the spam classification use case with Python-based code, with the dataset link provided alongside the code.

I hope you liked it! 

Frequently Asked Questions (FAQs) On Bag Of Words

1. What is Bag of Words (BoW) in Natural Language Processing (NLP)?

 Bag of Words (BoW) is a simple and widely used model in NLP for text analysis. It represents text data as a 'bag' (or collection) of words without considering grammar and word order but maintaining frequency.

2. How Does BoW Model Text Data?

 In BoW, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

3. What are the Key Steps in Creating a BoW Model?

 The main steps include tokenization (splitting text into words or tokens), building a vocabulary of words, and counting the occurrences of words in each document.

4. Why is BoW Popular in Text Analysis?

 BoW is straightforward to understand and implement. It's used in various applications like document classification, spam filtering, and sentiment analysis due to its simplicity and efficiency.

5. How is the Dimensionality of the Vocabulary Managed in BoW?

 Techniques like removing stop words, applying thresholds for maximum and minimum word frequency, and using n-grams can help manage dimensionality.

6. Can BoW Handle Context and Semantic Meaning?

 The traditional BoW model doesn't capture context or semantic meaning. The order of words is lost, making it hard to understand the tone, context, or syntax of the sentence.

7. What is the Difference Between BoW and TF-IDF?

 While BoW only counts word frequencies, TF-IDF (Term Frequency-Inverse Document Frequency) weighs these frequencies, emphasizing words that are unique to specific documents.
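
As an illustration, here is a minimal sketch applying scikit-learn's TfidfVectorizer to the two documents from the earlier example; the words shared by both documents receive lower weights than the distinguishing ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "read svm algorithm article dataaspirant blog",
    "read randomforest algorithm article dataaspirant blog",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs).toarray().round(2)

print(tfidf.get_feature_names_out())
print(weights)  # "svm" and "randomforest" get the highest weights
```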

8. Is BoW Effective for All Languages?

 BoW can be used for most languages, but its effectiveness can vary. For languages with rich morphology, compound words, or homonyms, it might not capture the nuances effectively.

9. How is BoW Implemented in Machine Learning Models?

 In practice, BoW is often implemented using libraries like scikit-learn in Python. It transforms text data into a numerical format (a vector) that machine learning models can process.

10. What are the Limitations of BoW?

  BoW struggles with capturing semantic relationships, it can lead to high dimensionality, and it doesn’t handle negations or varying sentence structures well.

11. Can BoW be Used with Deep Learning Models?

  While BoW can be used as an input feature for deep learning models, more advanced techniques like word embeddings are generally preferred for capturing context in deep learning.

12. Is BoW Still Relevant with the Advent of Word Embeddings?

  Despite the rise of more complex models like word embeddings, BoW remains relevant for many basic text classification tasks due to its simplicity and efficiency.


I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
