Most Popular Word Embedding Techniques In NLP

August 18, 2020 Sharmila Polamuri

Machine Learning, Natural Language Processing

In the realm of Natural Language Processing (NLP), the ability to understand and represent text data is crucial. Word embedding techniques have become a powerful tool for capturing the meaning and context of words within large text corpora.

Word embedding techniques transform words into dense vector representations that can be used effectively by machine learning algorithms.

Let's understand this by asking ourselves a few questions.

To build any machine learning or deep learning model, the final data has to be in numerical form, because models don't understand text or image data directly the way humans do.

So how do natural language processing (NLP) models learn patterns from text data?

We need smart ways to convert text into numbers. This conversion is called vectorization or, in the NLP world, word embedding.

Vectorization, or word embedding, is the process of converting text data into numerical vectors. These vectors are then used to build machine learning models. In other words, we are extracting features from text so that we can build natural language processing models.

In this comprehensive guide, we will explore various word embedding techniques, their applications, and how to implement them using Python.

Whether you're new to NLP or looking to expand your knowledge, this article is designed to provide you with a solid understanding of word embedding techniques and their significance in the field of NLP.



Introduction To Natural Language Processing (NLP)

Natural Language Processing, or NLP for short, is a subfield of data science. As more and more text data is captured, we need good methods to extract meaningful information from it. NLP is the subfield dedicated to this. Using NLP techniques, we build text-related applications and automate tasks.

In technical terms, NLP is the process of training machines to understand and generate language the way humans do. Based on these two tasks, NLP is further divided into:

Natural Language Understanding (NLU)
Natural Language Generation (NLG)

To get some motivation to work on natural language processing projects, let’s look at a few applications that belong to NLP.

Natural Language Processing (NLP) Applications

Below are some popular applications of NLP.

Email spam detection
Sentiment analysis
Document classification
Chatbots

By now we clearly understand the need for word embeddings, so let's look at the popular word embedding techniques.

Popular Word Embedding Techniques

At a high level, word embedding techniques can be classified into the following categories.

Traditional Word Embedding Techniques
Advanced Word Embedding Techniques
Contextualized Word Embedding Techniques

Traditional Word Embedding Techniques

One-hot encoding:

One-hot encoding is a simple technique for representing words as binary vectors. Each word in the vocabulary is assigned a unique position in the vector, with a value of 1 at that position and 0 elsewhere. While this method is straightforward, it suffers from high dimensionality and fails to capture any semantic relationships between words.
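As a rough illustration of the idea, here is a minimal sketch in plain Python. The three-word vocabulary and the helper name one_hot_encode are assumptions made for this example only.

```python
def one_hot_encode(vocabulary):
    """Map each word to a binary vector with a single 1 at the word's index."""
    vectors = {}
    for index, word in enumerate(vocabulary):
        vector = [0] * len(vocabulary)   # one position per vocabulary word
        vector[index] = 1                # mark this word's own position
        vectors[word] = vector
    return vectors


# Toy vocabulary (an assumption for illustration)
vocabulary = ["affordable", "pasta", "tasty"]
one_hot = one_hot_encode(vocabulary)

print(one_hot["pasta"])   # [0, 1, 0]

# The dot product of two different one-hot vectors is always 0,
# which is why this encoding carries no notion of word similarity.
dot = sum(a * b for a, b in zip(one_hot["pasta"], one_hot["tasty"]))
print(dot)                # 0
```

Note that the vector length equals the vocabulary size, so a vocabulary of 50,000 words would need 50,000-dimensional vectors.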

TF-IDF (Term Frequency-Inverse Document Frequency):

TF-IDF is a popular technique for representing text data in a numerical format. It combines the frequency of a word in a document (Term Frequency) with the inverse of its frequency across all documents (Inverse Document Frequency). This method weighs the importance of words, but it does not capture the context or semantic relationships between words.

Bag of words:

The bag of words approach represents documents as a vector of word frequencies. Although this method can capture the importance of words within a document, it ignores the order of words and fails to encode any semantic information.

Advanced Word Embedding Techniques

Word2Vec:

Word2Vec is a popular neural network-based word embedding technique that captures semantic relationships between words. It comes in two flavors:

Continuous Bag of Words (CBOW)
Skip-Gram.

CBOW predicts the target word based on its context, while Skip-Gram predicts the context words given the target word. Word2Vec efficiently learns high-quality vector representations, capturing semantic and syntactic information.

GloVe (Global Vectors for Word Representation):

GloVe is another word embedding method. Unlike Word2Vec, it is not trained as a predictive neural network; instead it combines the advantages of global matrix factorization and local context window methods. GloVe constructs a word-word co-occurrence matrix and learns the embeddings by minimizing the weighted squared difference between the dot product of word vectors and the logarithm of their co-occurrence count.

FastText:

FastText is an extension of the Word2Vec model, which represents words as the sum of their subword (n-gram) vectors. This technique allows FastText to capture morphological information and generate embeddings for out-of-vocabulary words.

Contextualized Word Embedding Techniques

ELMo (Embeddings from Language Models):

ELMo is a deep contextualized word embedding technique that generates embeddings based on the entire context of a sentence. It uses a bidirectional LSTM (Long Short-Term Memory) model to create a contextualized representation for each word in the sentence, capturing both semantic and syntactic information.

BERT (Bidirectional Encoder Representations from Transformers):

BERT is a powerful pre-trained language model that generates context-aware word embeddings. It leverages the Transformer architecture and a masked language modeling objective to learn deep bidirectional representations, making it highly effective for various NLP tasks.

Now let's look at these techniques in a bit more depth.

Bag of words

The bag of words method is simple to understand and easy to implement. It is mostly used in language modeling and text classification tasks. The idea is straightforward: we represent each sentence as a vector containing the frequency of each word that occurs in it.

Confusing?

Okay, let's go through how the bag of words approach works, step by step.

Bag of words approach

In this approach we perform two operations.

Tokenization
Vector creation

Tokenization

Tokenization is the process of dividing each sentence into words or smaller parts. Each word or symbol is called a token. After tokenization, we take the unique words from the corpus. Here, corpus means the tokens from all the documents we are considering for the bag of words.

Create vectors for each sentence

The size of each vector is equal to the number of unique words in the corpus. For each sentence, we fill each position of the vector with the frequency of the corresponding word in that sentence.

Let's understand this with an example.

This pasta is very tasty and affordable.
This pasta is not tasty and is affordable.
This pasta is very very delicious.

These three sentences are our examples. The first step is tokenization. Before tokenizing, we normalize the text by converting all sentences to either lowercase or uppercase. Here we convert everything to lowercase.

Output of sentences after converting to lowercase

this pasta is very tasty and affordable.
this pasta is not tasty and is affordable.
this pasta is very very delicious.

Now we will perform tokenization.

We divide the sentences into words, strip the punctuation, and create a list of all unique words in alphabetical order.

We get the output below after the tokenization step.

["affordable", "and", "delicious", "is", "not", "pasta", "tasty", "this", "very"]

What is our next step?

We create a vector for each sentence from its word frequencies. Stacked together, these vectors form a matrix that, for a realistic corpus, consists mostly of zeros, which is why it is called a sparse matrix. Below is the matrix for the example sentences.

Sentence	affordable	and	delicious	is	not	pasta	tasty	this	very
1	1	1	0	1	0	1	1	1	1
2	1	1	0	2	1	1	1	1	0
3	0	0	1	1	0	1	0	1	2

As the table above shows, every sentence is converted into a vector. Once sentences are vectors, we can also measure how similar they are.

How can we find similarities? By calculating the distance between any two sentence vectors with a distance measure, for example Euclidean distance.
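The steps so far (lowercasing, tokenization, vector creation, and distance calculation) can be sketched in plain Python as follows. This is only an illustrative sketch: the helper names tokenize, to_vector, and euclidean are made up for this example, and punctuation is stripped with a simple translation table.

```python
import math
import string
from collections import Counter

sentences = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is very very delicious.",
]


def tokenize(sentence):
    """Lowercase the sentence, strip punctuation, and split on whitespace."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


tokenized = [tokenize(sentence) for sentence in sentences]

# Unique words of the whole corpus, in alphabetical order
vocabulary = sorted({token for tokens in tokenized for token in tokens})


def to_vector(tokens):
    """Count how often each vocabulary word occurs in one sentence."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [to_vector(tokens) for tokens in tokenized]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(vocabulary)
for vector in vectors:
    print(vector)

print(round(euclidean(vectors[0], vectors[1]), 3))   # sentences 1 and 2: 1.732
print(round(euclidean(vectors[0], vectors[2]), 3))   # sentences 1 and 3: 2.236
```

By this measure, sentences 1 and 2 are the closest pair, even though one says "very tasty" and the other "not tasty". That is a small reminder that word counts alone do not capture meaning.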

In the above example we take each word as a feature. Another name for this is the unigram (1-gram) representation. We can also use bigrams, trigrams, and so on.

The bigram representation of the first sentence is shown below.

this, pasta
pasta, is
is, very
very, tasty
tasty, and
and, affordable

In the same way we can take trigrams and, in general, n-grams, where n is the number of consecutive words in each group. However, the bag of words technique cannot give us any semantic meaning or relations between words.
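Generating n-grams from a token list takes only a few lines. The sketch below uses a hypothetical helper named ngrams, written for this example, and applies it to the first example sentence.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "this pasta is very tasty and affordable".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

for pair in bigrams:
    print(pair)

print(len(bigrams), len(trigrams))   # 6 5
```

A sentence with k tokens yields k - n + 1 n-grams, so the seven tokens above give six bigrams and five trigrams.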

In the bag of words representation, the matrix contains many zeros. The number of columns grows with the number of unique words in the corpus, and in real-world applications the corpus contains thousands of unique words.

So we need more resources to build models with this technique on large datasets. The embedding techniques discussed later overcome this drawback. Now let's learn how to implement the bag of words technique in Python with scikit-learn (sklearn).

Implementation of Bag of words with python sklearn
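A minimal sketch of this implementation with scikit-learn's CountVectorizer on the three example sentences could look like the following. It relies on the vectorizer's default settings, which lowercase the text and keep only tokens of two or more word characters, so no manual cleaning is needed here. The variable names are chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is very very delicious.",
]

# Default settings: lowercase=True and a token pattern that drops punctuation
vectorizer = CountVectorizer()

# fit_transform learns the vocabulary and returns a SciPy sparse matrix
bow_matrix = vectorizer.fit_transform(sentences)

# vocabulary_ maps each word to its column index; sort words by that index
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(feature_names)
print(bow_matrix.toarray())
```

To use bigrams as well as single words, CountVectorizer accepts an ngram_range argument, for example ngram_range=(1, 2).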


TF-IDF

Another popular technique for extracting features from a corpus is TF-IDF. It is a statistical method for measuring how important a word is to a document relative to the other documents in the corpus.

Let's look at this technique in more detail: what TF and IDF stand for, why each one matters, and how the technique works.

TF

TF stands for Term Frequency. In TF, we score each word or token based on how often it occurs in a document. The raw frequency of a word depends on the length of the document: a word is likely to occur more often in a long document than in a short or medium-sized one.

To overcome this problem, we normalize by dividing the frequency of a word by the length of the document (the total number of words). As with bag of words, this produces a sparse matrix with one entry per word.

Formula to calculate Term Frequency (TF):

TF = (number of times the term occurs in a document) / (total number of words in the document)

IDF

IDF stands for Inverse Document Frequency. Here too we assign a score to each word; this score expresses how rare the word is across all documents. Rarer words have a higher IDF score.

Formula to calculate Inverse Document Frequency (IDF):

IDF = log_e (total number of documents / number of documents containing the term)

The complete TF-IDF value is calculated as:

TF-IDF = TF * IDF

The TF-IDF value increases with the frequency of the word in a document and decreases as the word appears in more documents. As with bag of words, this technique does not give us any semantic meaning for words.

However, this technique is widely used for document classification and has also been used successfully by search engines like Google as a ranking factor for content.

That completes the theory of TF-IDF. Now we will see how it works with an example, and then learn the implementation in Python.

Example sentences:

A: This pasta is very tasty and affordable.
B: This pasta is not tasty and is affordable.
C: This pasta is very very delicious.

Let's consider each sentence as a document. Here too, our first task is tokenization (dividing sentences into words or tokens), followed by taking the unique words.

When the TF-IDF scores are computed for these documents, rarer words get a higher score than common words. That shows us the significance of the words in our corpus.
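To make the calculation concrete, the sketch below applies the TF and IDF formulas given earlier to documents A, B, and C in plain Python. The function names tf, idf, and tf_idf are chosen for this example, and the idf helper assumes the term occurs in at least one document.

```python
import math

documents = {
    "A": "This pasta is very tasty and affordable.",
    "B": "This pasta is not tasty and is affordable.",
    "C": "This pasta is very very delicious.",
}


def tokenize(text):
    """Lowercase, drop the full stop, and split on whitespace."""
    return text.lower().replace(".", "").split()


tokens = {name: tokenize(text) for name, text in documents.items()}


def tf(term, doc_tokens):
    """Occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term):
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc_tokens in tokens.values() if term in doc_tokens)
    return math.log(len(tokens) / containing)


def tf_idf(term, name):
    return tf(term, tokens[name]) * idf(term)


# Scores for three words in document C
for term in ["pasta", "very", "delicious"]:
    print(term, round(tf_idf(term, "C"), 4))
# pasta 0.0
# very 0.1352
# delicious 0.1831
```

"pasta" appears in every document, so its IDF is log(3/3) = 0 and its score vanishes. "delicious" appears only in document C, so it scores highest there even though "very" occurs twice.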

Implementation of TF-IDF by using Sklearn
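A minimal sketch with scikit-learn's TfidfVectorizer is shown below. One caveat: with its default settings, scikit-learn does not use exactly the formulas given above. It smooths the IDF as ln((1 + n) / (1 + df)) + 1 and L2-normalizes each row, so the numbers differ from a hand calculation, although rarer words still receive higher weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is very very delicious.",
]

# Defaults: lowercase=True, smooth_idf=True, norm="l2"
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)   # SciPy sparse matrix

# vocabulary_ maps each word to its column index; sort words by that index
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Learned IDF weight of every word
for name, weight in zip(feature_names, vectorizer.idf_):
    print(f"{name:12s} idf = {weight:.3f}")

# TF-IDF vector of the third sentence
print(tfidf_matrix.toarray()[2].round(3))
```

Because of the "+ 1" in the smoothed IDF, words that occur in every document (such as "pasta") keep a small non-zero weight instead of dropping to zero.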


Word2vec


Image reference : https://devopedia.org

The Word2Vec model is used for learning vector representations of words, called "word embeddings". Did you notice that the previous methods did not give us any semantic meaning for the words in the corpus?

Most NLP applications, such as sentiment classification and sarcasm detection, require the semantic meaning of a word and its semantic relationships with other words.

So can we get semantic meaning from words?

Yes, we can: the word2vec technique gives us what we want.

Word embeddings can capture semantic and syntactic relationships between words, as well as the context of words in a document. Word2vec is a technique for learning such word embeddings.

Every word in a sentence depends on one or more other words. If we want to find similarities and relations between words, we have to capture these word dependencies.

With the bag of words and TF-IDF techniques, we cannot capture the meaning or relations of words from the vectors. Word2vec constructs vectors, called embeddings, that do.

The word2vec model takes a large corpus as input and produces a vector space as output. This vector space typically has several hundred dimensions, and each word is assigned a vector in it.

In this vector space, words that share common contexts in the corpus are located close to each other. A word's vector gives its position in the space.

The word2vec method learns these relationships between words while the model is trained. For this purpose, word2vec uses one of two methods:

Skip-gram
CBOW (Continuous Bag of Words)

Image reference : https://community.alteryx.com

There is one more thing to discuss here: window size. Do you remember that in the bag of words technique we discussed unigram, bigram, trigram, ..., n-gram representations of text?

Word2vec uses a similar idea of looking at neighbouring words, but here it is controlled by the window size: the number of words on each side of a center word that count as its context.

The word2vec model captures relationships between words with the help of this window, using either the skip-gram or the CBOW method.

What is the difference between these two methods?

It is really simple. But before discussing them, we should know why we use a window at all: it defines the center word and the context of that center word, so that the model works with a few nearby words rather than the whole sentence.

Skip-Gram

In this method, the center word of the window is the input and the context words (neighbouring words) are the outputs. With skip-gram, the word2vec model predicts the context words of a center word. Skip-gram works well with small datasets and represents rare words well.

Image reference : researchgate.net

Continuous Bag-of-words

CBOW is the reverse of the skip-gram method. Here the context words are the input and the model predicts the center word of the window. Another difference from skip-gram is that CBOW trains faster and gives better representations for more frequent words.

Image reference : researchgate.net

Difference between Skip-gram & CBOW

Skip-gram:

The input is the center word and the output is the context (neighbouring) words.
Works well with small datasets.
Represents rarer words better.

CBOW:

The context (neighbouring) words are the input and the output is the center word.
Works well with large datasets.
Gives better representations for frequent words than for rarer ones.
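The difference between the two methods is easiest to see in the training examples each one produces from the same window. The sketch below only builds those (input, output) pairs in plain Python; it does not train a model. The helper names and the five-word sentence are assumptions made for this example.

```python
def context_window(tokens, center, window):
    """Words within `window` positions to the left and right of the center word."""
    start = max(0, center - window)
    return tokens[start:center] + tokens[center + 1:center + 1 + window]


def skip_gram_pairs(tokens, window):
    """Skip-gram: input is the center word, output is one context word."""
    return [
        (tokens[i], context_word)
        for i in range(len(tokens))
        for context_word in context_window(tokens, i, window)
    ]


def cbow_pairs(tokens, window):
    """CBOW: input is the list of context words, output is the center word."""
    return [
        (context_window(tokens, i, window), tokens[i])
        for i in range(len(tokens))
    ]


tokens = "this pasta is very tasty".split()

print(skip_gram_pairs(tokens, window=1)[:3])
# [('this', 'pasta'), ('pasta', 'this'), ('pasta', 'is')]

print(cbow_pairs(tokens, window=1)[:3])
# [(['pasta'], 'this'), (['this', 'is'], 'pasta'), (['pasta', 'very'], 'is')]
```

Skip-gram produces one training example per (center, context) pair, while CBOW produces a single example per center word, which is one intuition for why CBOW trains faster.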

Word2vec implementation

Let's jump into the implementation. Here we will see:

How to build a word2vec model with these two methods
How to use pre-trained word embedding models
1. Google word2vec
2. Stanford GloVe embeddings

Building our word2vec model with custom text

Word2vec with gensim

For this I am taking a sample text file and building a word2vec model with the gensim Python library.

Required libraries

Gensim (pip install --upgrade gensim)
NLTK (pip install nltk)
re (regular expressions; part of the Python standard library, so no installation is needed)


Now I am removing punctuation from all sentences, because punctuation carries little information for this task. That is not true for every application, though.

For this sample we don't need punctuation or numbers, so I remove them with a regex pattern.

Now we have to apply tokenization to all sentences.
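A sketch of these two steps is shown below. The two sample sentences are hypothetical stand-ins for the sample text file, and for simplicity the tokenizer splits on whitespace rather than using NLTK, so that the example runs without downloading any tokenizer data.

```python
import re

# Hypothetical stand-ins for sentences read from the sample text file
raw_sentences = [
    "Random forest builds 100 decision trees!",
    "Each tree splits on a node, then votes.",
]


def clean_sentence(sentence):
    """Replace everything except letters and whitespace with a space."""
    return re.sub(r"[^A-Za-z\s]", " ", sentence)


def tokenize(sentence):
    """Clean the sentence, lowercase it, and split it into word tokens."""
    return clean_sentence(sentence).lower().split()


tokenized_sentences = [tokenize(sentence) for sentence in raw_sentences]

for tokens in tokenized_sentences:
    print(tokens)
# ['random', 'forest', 'builds', 'decision', 'trees']
# ['each', 'tree', 'splits', 'on', 'a', 'node', 'then', 'votes']
```

The result is a list of token lists, one per sentence, which is the input format that gensim's Word2Vec expects.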

We can give these tokenized sentences as input to the word2vec model.

Building word2vec with CBOW method

Output

Total number of words

array([-0.20608747, 0.05975117], dtype=float32)

Word2vec model building is done.

Let's see how it looks by visualizing it with matplotlib.

In the figure above, the words node, tree, and random are close to each other, and we can also see the distance between movie and algorithm. We can't observe many such relationships because the dataset is small; with a larger dataset they become clearer.

Building word2vec skip-gram method

Let's see the visualization.

visualization of skip gram word2vec model

As in the CBOW visualization, the words node, tree, and random are close to each other, and we can again see the distance between movie and algorithm.

Word embedding model using Pre-trained models

If our dataset is small, we don't get many words, and if we can't provide more sentences, the model will not learn much from our dataset. On the other hand, building a word2vec model from a large corpus requires more resources, such as time and memory.

So how can we build a better word embedding model? Don't worry, we can use already trained models. Here we use the two most popular pre-trained word embedding models. We won't explain these pre-trained models in detail, but we will show how to use them.

Google word2vec

We can download the Google word2vec pretrained model from link. This is a compressed file, so you have to extract it before using it in the script.

We will see how word embeddings capture the relation between words with the following example:

King - man = ? - woman


Stanford GloVe Embeddings

GloVe stands for Global Vectors for Word Representation.

We can download this pretrained model from this link. This file is also compressed, so we have to extract it; after extracting, you can see several files. The GloVe download provides embeddings in several dimensionalities; for example, the glove.6B archive contains 50-, 100-, 200-, and 300-dimensional files.

For this we have to do a prerequisite task: convert the GloVe embedding file to the word2vec format using the glove2word2vec() function. From those files, I am taking the 100-dimensional file glove.6B.100d.txt.

load glove pretrained model and Apply on an example

Conclusion

We can use any of these text feature extraction techniques, depending on our project requirements, because every method has its advantages: bag of words is suitable for text classification, TF-IDF for document classification, and if you want semantic relations between words, go with word2vec.

We can't say blindly which type of feature extraction gives better results. One more thing: building word embeddings from our own dataset or corpus will generally give better results. But we don't always have a large enough dataset, and in that case we can use pre-trained models with transfer learning.

We didn't explain the transfer learning concept in this article. We will explain how to apply transfer learning to fine-tune pre-trained word embeddings on our own corpus in future articles.
