Most Popular Word Embedding Techniques In NLP

To build any machine learning or deep learning model, the data we finally feed in has to be in numerical form, because models don’t understand text or image data directly the way humans do.

So how do natural language processing (NLP) models learn patterns from text data 🤔?

We need smart ways to convert text data into numerical data. This is called vectorization or, in the NLP world, word embedding.

Vectorization, or word embedding, is simply the process of converting text data into numerical vectors. These vectors are later used to build various machine learning models. In other words, we are extracting features from text in order to build natural language processing models.

There are numerous ways to convert text data into numerical vectors. In this article, we will look at different word embedding techniques with examples, and we will also learn how to implement them in Python.

Before we dive further, let’s quickly see what you will learn in this blog post.

Natural Language Processing (NLP)

Natural Language Processing, NLP for short, is a subfield of data science. As more and more text data is captured, we need good methods to extract meaningful information from it, and that is what this subfield is about. Using natural language processing techniques, we build text-related applications and automate tasks.

In technical terms, Natural Language Processing is the process of training machines to understand natural language and to generate it the way humans do. Based on these two tasks, NLP is further classified as

  • Natural Language Understanding (NLU)
  • Natural Language Generation (NLG)

To get some motivation to work on natural language processing projects, let’s look at a few applications that belong to NLP.

Natural Language Processing (NLP) Applications

Below are some of the popular applications of NLP.

By now, we have clearly understood the need for word embeddings. Now let’s look at the popular word embedding techniques.

Word embedding techniques

Below are the popular and simple word embedding methods for extracting features from text:

  • Bag of words
  • TF-IDF
  • Word2vec
  • GloVe embedding
  • fastText
  • ELMo (Embeddings from Language Models)

In this article, we will cover only bag of words, TF-IDF, and Word2vec. The other, more advanced methods for converting text to a numerical vector representation will be explained in upcoming articles.

Bag of words

The bag of words method is simple to understand and easy to implement. It is mostly used in language modeling and text classification tasks. The concept behind it is straightforward: we represent each sentence as a vector holding the frequency of the words that occur in that sentence.

Confusing?

Okay, we will explain step by step how the bag of words approach works.

Bag of words approach

In this approach we perform two operations.

  1. Tokenization
  2. Vectors Creation

Tokenization

Tokenization is the process of dividing each sentence into words or smaller parts. Each word or symbol is called a token. After tokenization, we take the unique words from the corpus. Here, corpus means the tokens from all the documents we are considering for the bag of words creation.

Create vectors for each sentence

The size of each vector is equal to the number of unique words in the corpus. For each sentence, we fill each position of its vector with the frequency of the corresponding word in that sentence.

Let's understand this with an example

  1. This pasta is very tasty and affordable.
  2. This pasta is not tasty and is affordable.
  3. This pasta is very very delicious.

These are our three example sentences. Our first step is to perform tokenization. Before tokenizing, we have to normalize the sentences by converting them all to either lowercase or uppercase. Here we will convert all the words to lowercase.

Output of sentences after converting to lowercase:
  • this pasta is very tasty and affordable.
  • this pasta is not tasty and is affordable.
  • this pasta is very very delicious.

Now we will perform tokenization.

We divide the sentences into words, dropping the trailing full stops, and create a list of all unique words in alphabetical order.

We will get the below output after the tokenization step.

[“affordable”, “and”, “delicious”, “is”, “not”, “pasta”, “tasty”, “this”, “very”]

Now what is our next step?

Creating a vector for each sentence with the frequency of each word. Stacked together, these vectors form a matrix. For a real corpus most of its entries are zero, so it is called a sparse matrix. Below is the sparse matrix of the example sentences.

bag of words representation

s.no  affordable  and  delicious  is  not  pasta  tasty  this  very
1     1           1    0          1   0    1      1      1     1
2     1           1    0          2   1    1      1      1     0
3     0           0    1          1   0    1      0      1     2

As the table above shows, every sentence has been converted into a vector. Once sentences are vectors, we can also measure how similar they are.

How can we find similarities? Just calculate the distance between the vectors of any two sentences using a distance measure, for example Euclidean distance.
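
The steps so far can be sketched in plain Python. This is only a minimal illustration, not a reference implementation: the helper names tokenize, to_vector and euclidean are made up for this example, and the tokenizer simply lowercases, drops full stops and splits on spaces.

```python
import math
from collections import Counter

sentences = [
    "This pasta is very tasty and affordable.",
    "This pasta is not tasty and is affordable.",
    "This pasta is very very delicious.",
]

def tokenize(sentence):
    # Assumption: lowercase, drop full stops, split on whitespace.
    return sentence.lower().replace(".", "").split()

# Vocabulary: every unique token in the corpus, in alphabetical order.
vocabulary = sorted({token for s in sentences for token in tokenize(s)})

def to_vector(sentence):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]

def euclidean(u, v):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

vectors = [to_vector(s) for s in sentences]

print(vocabulary)
for vector in vectors:
    print(vector)
print(round(euclidean(vectors[0], vectors[1]), 3))  # sentence 1 vs sentence 2
print(round(euclidean(vectors[0], vectors[2]), 3))  # sentence 1 vs sentence 3
```

Sentences 1 and 2 differ in three positions by one count each, so their distance is the square root of 3 (about 1.732). Sentences 1 and 3 are further apart, at the square root of 5 (about 2.236).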

In the above example, we take each single word as a feature. Another name for this is the 1-gram (unigram) representation. We can also use bigrams, trigrams, and so on.

The bigram representation of the first sentence is as below.

  • this, pasta
  • pasta, is
  • is, very
  • very, tasty
  • tasty, and
  • and, affordable

In the same way we can take trigrams and, in general, n-grams, where n is the number of consecutive words in each group. However, we cannot get any semantic meaning or relation between words from the bag of words technique.
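
Generating n-grams takes only a few lines. The function name ngrams below is just a name chosen for this sketch, and the input is assumed to be an already tokenized, lowercased sentence.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this pasta is very tasty and affordable".split()

bigrams = ngrams(tokens, 2)   # pairs of neighboring words
trigrams = ngrams(tokens, 3)  # triples of neighboring words

print(bigrams)
print(trigrams)
```

With sklearn, the same idea is available through the ngram_range parameter of CountVectorizer, for example ngram_range=(1, 2) to use unigrams and bigrams together.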

A bag of words representation has many zeros in its sparse matrix. The size of the matrix grows with the number of unique words in the corpus, and in real-world applications the corpus will contain thousands of words. So for large datasets we need more resources to build models with this type of technique. Dense embeddings such as Word2vec, covered later in this article, address this drawback. Now let’s learn how to implement the bag of words technique in Python with sklearn.

Implementation of Bag of words with python sklearn

Implementation of bag of words with sklearn

Output

bag of words output

TF-IDF

Another popular word embedding technique for extracting features from a corpus is TF-IDF. It is a statistical method for measuring how important a word is to a document relative to all the other documents in the corpus.

Let me explain this technique in more detail: what TF and IDF stand for, why they matter, and how the technique works.

TF

TF stands for Term Frequency. In TF, we give each word or token a score based on how frequently it occurs. The raw frequency of a word depends on the length of the document: a word is likely to occur more often in a long document than in a short or medium-sized one.

To overcome this problem, we normalize by dividing the frequency of a word by the length of the document (its total number of words). With this technique too, we create a sparse matrix holding the frequency of every word.

Formula to calculate Term Frequency (TF):

TF = (number of times the term occurs in a document) / (total number of words in the document)

IDF

IDF stands for Inverse Document Frequency. Here too we assign a score to each word, and this score expresses how rare the word is across all documents. Rarer words have a higher IDF score.

Formula to calculate Inverse Document Frequency (IDF):

IDF = log base e (total number of documents / number of documents that contain the term)

Formula to calculate the complete TF-IDF value:

TF-IDF = TF * IDF

The TF-IDF value increases with the frequency of the word in a document and decreases as the word appears in more documents. As with bag of words, we cannot get any semantic meaning for words from this technique.

Still, this technique is widely used for document classification, and it has also been used successfully by search engines like Google as a ranking factor for content.

That completes the theory part of TF-IDF. Now we will see how it works with an example, and then we will learn the implementation in Python.

Example sentences:

  • A: This pasta is very tasty and affordable.
  • B: This pasta is not tasty and is affordable.
  • C: This pasta is very very delicious.

Let's consider each sentence as a document. Here also our first task is tokenization (dividing sentences into words or tokens) and then taking unique words.

tf-idf calculation

From the above table, we can observe that rarer words have a higher score than common words. This shows us the significance of each word in our corpus.
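
The calculation can be reproduced in plain Python by applying the TF and IDF formulas given earlier to documents A, B and C. This is a small sketch under simple assumptions: the sentences are already lowercased with punctuation removed, and the function names tf, idf and tf_idf are chosen for this example only.

```python
import math

documents = {
    "A": "this pasta is very tasty and affordable",
    "B": "this pasta is not tasty and is affordable",
    "C": "this pasta is very very delicious",
}
tokenized = {name: text.split() for name, text in documents.items()}

def tf(term, tokens):
    # Term frequency, normalized by document length.
    return tokens.count(term) / len(tokens)

def idf(term):
    # Natural log of (number of documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for tokens in tokenized.values() if term in tokens)
    return math.log(len(tokenized) / containing)

def tf_idf(term, doc_name):
    return tf(term, tokenized[doc_name]) * idf(term)

for term in ("pasta", "very", "delicious"):
    print(term, round(tf_idf(term, "C"), 4))
```

In document C, "pasta" scores exactly 0 because it occurs in all three documents, so its IDF is log(3/3) = 0. "delicious" occurs only in C and gets the highest score of the three, even though "very" appears twice in that document.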

Implementation of TF-IDF by using Sklearn

Implementation of TF-IDF using Sklearn

Output

tf idf output

Word2vec

Image reference : https://devopedia.org

The Word2vec model is used for learning vector representations of words, called “word embeddings”. Did you notice that the previous methods gave us no semantic meaning for the words of the corpus? Yet most NLP applications, such as sentiment classification and sarcasm detection, require the semantic meaning of a word and its semantic relationships with other words.

So can we get semantic meaning from words?

Yes, we can. The Word2vec technique gives us exactly that.

Word embeddings are capable of capturing semantic and syntactic relationships between words, as well as the context of words in a document. Word2vec is one technique for building word embeddings.

Every word in a sentence depends on one or more other words. If we want to find similarities and relations between words, we have to capture these word dependencies.

With the bag of words and TF-IDF techniques, we cannot capture the meaning or relations of words from the vectors. Word2vec constructs vectors that do, and these are called embeddings.

The Word2vec model takes a large corpus as input and produces a vector space as output. This vector space may have hundreds of dimensions, and each word is assigned a vector in it.

Words that share common contexts in the corpus end up close to each other in the vector space. A word’s vector gives the position of that word in the space.

The Word2vec method learns all these kinds of relationships between words while building the model. For this purpose, Word2vec uses two methods. They are

  1. Skip-gram
  2. CBOW (Continuous Bag of Words)

Image reference : https://community.alteryx.com

One more thing we have to discuss here is the window size. Do you remember the 1-gram (unigram), bigram, trigram, …, n-gram representations of text we discussed for the bag of words technique?

Word2vec follows a similar idea, but here it is called the window size.

The Word2vec model captures relationships between words with the help of the window size, using the skip-gram and CBOW methods.

What is the difference between these two methods? Do you want to know?

It is really simple. Before discussing the two methods, we have to know one more thing: why do we use a window at all? We use it to pick out a center word and the context of that center word, instead of using the whole sentence at once.

Skip-Gram

In this method, we take the center word of the window as the input and the context words (neighbor words) as the outputs. In other words, the Word2vec model predicts the context words of a center word. Skip-gram works well with a small dataset and represents rare words really well.
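
To make the input and output of skip-gram concrete, here is a small sketch that lists the (center word, context word) training pairs for a sentence. It only illustrates how the pairs are formed. It does not train anything, and the function name skipgram_pairs is invented for this example.

```python
def skipgram_pairs(tokens, window=2):
    """List (center, context) pairs: the center word is the input and
    each neighbor within `window` positions is one prediction target."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "this pasta is very tasty".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With a window of 1, the word "is" produces two pairs, (is, pasta) and (is, very). A larger window adds more distant neighbors as targets.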

The architecture of Skip gram

Image reference : researchgate.net 

Continuous Bag-of-words

CBOW is simply the reverse of the skip-gram method. Here we take the context words as input and predict the center word of the window. Another difference from skip-gram is that CBOW trains faster and gives better representations for more frequent words.
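
The CBOW training examples can be listed in the same spirit: each example is a list of context words (the input) paired with the center word to predict (the output). Again this is only an illustration of how the examples are formed, with a made-up function name cbow_examples.

```python
def cbow_examples(tokens, window=2):
    """List (context_words, center) examples: the neighbors within
    `window` positions are the input, the center word is the target."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        context = left + right
        if context:  # skip one-word sentences with no context
            examples.append((context, center))
    return examples

tokens = "this pasta is very tasty".split()
examples = cbow_examples(tokens, window=1)
for context, center in examples:
    print(context, "->", center)
```

Each position yields a single example here, whereas skip-gram yields one pair per neighbor. That is one intuition for why CBOW tends to train faster.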

continuous bag of words

Image reference : researchgate.net 

Difference between Skip-gram & CBOW

Skip-gram:

  • The input is the center word and the output is the context words (neighbor words).
  • Works well with small datasets.
  • Skip-gram represents rarer words better.

CBOW:

  • The context (neighbor) words are the input and the output is the center word.
  • Works well with large datasets.
  • Gives a better representation for frequent words than for rarer ones.

Word2vec implementation

Let’s jump into the implementation part. Here we will see

  1. How to build a word2vec model with these two methods
  2. How to use pre-trained word embedding models
    1. Google word2vec
    2. Stanford GloVe embeddings

Building our word2vec model with custom text

Word2vec with gensim

For this, I am taking just a sample text file and will build a word2vec model using the gensim Python library.

Required libraries

  1. Gensim (pip install --upgrade gensim)
  2. NLTK (pip install nltk)
  3. Regex (the re module is part of Python’s standard library, so it needs no installation)

importing required libraries for word2vec_model

We will get output like this 

output
reading text file
output

Now I am removing punctuation from all the sentences, because punctuation does not carry much information here. That is not true for every application, though.

For this sample example we don’t need punctuation, numbers, and the like, so I will remove them with a regex pattern.

removing punctuations from sentences
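
A cleaning step of this kind could be sketched as follows. The regex pattern, the function name clean_sentence, the lowercasing and the two sample sentences are all assumptions made for illustration. They are not taken from the sample text file used in this article.

```python
import re

def clean_sentence(sentence):
    # Replace everything except letters and whitespace with a space.
    letters_only = re.sub(r"[^a-zA-Z\s]", " ", sentence)
    # Collapse runs of whitespace and lowercase the result.
    return re.sub(r"\s+", " ", letters_only).strip().lower()

# Hypothetical input sentences, for illustration only.
raw_sentences = [
    "This pasta is very tasty, and affordable!",
    "It costs 5 dollars.",
]
cleaned = [clean_sentence(s) for s in raw_sentences]
print(cleaned)
```

The pattern keeps only ASCII letters, so it would also strip accented characters. For non-English text a broader character class would be needed.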

output

Now we have to apply tokenization to all sentences.

apply word tokenization on sentences

Output

output

We can now give these tokenized sentences as input to the word2vec model.

Building word2vec with CBOW method

Building word2vec with CBOW method

Output

Total number of words 
79
array([-0.20608747,  0.05975117], dtype=float32)

The word2vec model is now built.

So let’s see what it looks like, using matplotlib for visualization.

visualize CBOW word2vec model
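
One way to draw such a plot is a labeled scatter plot, sketched below. The function name plot_word_vectors and the coordinates in example_points are hypothetical. With a trained 2-dimensional gensim model, the dictionary would instead be built from the model, for example {word: model.wv[word] for word in model.wv.key_to_index}.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display is required
import matplotlib.pyplot as plt

def plot_word_vectors(word_to_xy, path):
    """Scatter-plot 2-dimensional word vectors and label each point."""
    fig, ax = plt.subplots(figsize=(6, 6))
    for word, (x, y) in word_to_xy.items():
        ax.scatter(x, y)
        ax.annotate(word, (x, y), xytext=(3, 3), textcoords="offset points")
    ax.set_title("Word2vec embeddings (2 dimensions)")
    fig.savefig(path)
    plt.close(fig)
    return path

# Hypothetical coordinates, for illustration only.
example_points = {
    "node": (0.10, 0.20),
    "tree": (0.12, 0.25),
    "movie": (-0.40, 0.30),
}
saved_path = plot_word_vectors(
    example_points, os.path.join(tempfile.gettempdir(), "word2vec_cbow.png")
)
print(saved_path)
```

For models with more than two dimensions, the vectors would first need to be reduced to two dimensions, for example with PCA, before plotting.
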
output

In the above figure, we can see that the words node, tree, and random are close to each other, and we can also see the distance between movie and algorithm. We may not be able to observe many more relationships like this because the dataset is small. With a larger dataset, they would show up more clearly.

Building word2vec skip-gram method

Building word2vec skip gram method
output

Let’s see the visualization 

visualization of skip gram word2vec model
output

As in the CBOW visualization, the same thing happens here: the words node, tree, and random are close to each other, and we can again see the distance between movie and algorithm.

Word embedding model using Pre-trained models

If our dataset is small, we cannot get many words from it, and if we cannot provide more sentences, the model will not learn much from our dataset. On the other hand, if we want to build a word2vec model with a large corpus, it will require more resources such as time and memory.

So how can we build a better word embedding model? Don’t worry, we can use models that are already trained. Here we use the two most popular pre-trained word embedding models. We won’t explain these pre-trained models in detail, but we will show how to use them.

Google word2vec

We can download the Google word2vec pretrained model from the link. This is a compressed file, so you have to extract it before using it in the script.

load google word2vec pretrained model file

We will see how word embeddings capture the relation between words, with the example

king - man = ? - woman

apply pretrained word2vec on an example

Output

output

Stanford Glove Embeddings

GloVe stands for Global Vectors for Word Representation.

We can download this pretrained model from this link. This file is also compressed, so we have to extract it. After extracting, you will see several files, because GloVe provides models with different numbers of dimensions, as below.

glove file_extracting

Before using it, we have to do one prerequisite task: convert the GloVe word embedding file to the word2vec format using the glove2word2vec() function. From those files, I am taking the 100-dimension file, glove.6B.100d.txt.

load glove pretrained model and Apply on an example
output

Conclusion

We can use any of these text feature extraction methods, depending on our project requirements, because every method has its own advantages: bag of words is suitable for text classification, TF-IDF for document classification, and if you want semantic relations between words, go with word2vec.

We can’t say blindly which type of feature extraction gives better results. One more thing: building word embeddings from our own dataset or corpus will give better results, but we don’t always have a large enough dataset. In that case we can use pre-trained models with transfer learning.

We didn’t explain the transfer learning concept in this article. In future articles we will explain how to apply transfer learning to train pre-trained word embeddings on our own corpus.
