# Five most popular similarity measures and their implementation in Python

##### The buzzword "similarity distance measure" has a wide variety of definitions among math and data mining practitioners. As a result, those terms, their concepts, and their usage can go way over the head of a beginner encountering them for the first time. So in this post I give clearer, more intuitive definitions of similarity, and then walk through the five most popular similarity measures and their implementations.

Before explaining the different similarity distance measures, let me explain the key term similarity in data mining. Similarity is the basic building block for activities such as recommendation engines, clustering, classification, and anomaly detection.

## Similarity:

A similarity measure quantifies how much alike two data objects are. In a data mining context, a similarity measure is a distance whose dimensions represent features of the objects. A small distance means a high degree of similarity; a large distance means a low degree of similarity.

Similarity is subjective and highly dependent on the domain and application. For example, two fruits may be similar because of color, size, or taste. Care should be taken when calculating distance across dimensions/features that are unrelated: the relative values of each element must be normalized, or one feature could end up dominating the distance calculation. Similarity is measured in the range 0 to 1, i.e. [0, 1].

• Similarity = 1 if X = Y         (Where X, Y are two objects)
• Similarity = 0 if X ≠ Y

That’s all about similarity; let’s move on to the five most popular similarity distance measures.

## Euclidean distance: Euclidean distance is the most commonly used distance measure. In most cases, when people talk about distance, they are referring to Euclidean distance, which is also known simply as distance. When data is dense or continuous, this is the best proximity measure.

The Euclidean distance between two points is the length of the straight-line path connecting them. The Pythagorean theorem gives this distance between two points.

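A minimal implementation in plain Python (the sample vectors here are illustrative, not taken from the original post's script):

```python
from math import sqrt

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

print(euclidean_distance([0, 3, 4, 5], [7, 6, 3, -1]))  # sqrt(95) ≈ 9.747
```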

## Manhattan distance: Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, it is the total sum of the differences between the x-coordinates and y-coordinates.

Suppose we have two points A and B. To find the Manhattan distance between them, we just have to sum up the absolute x-axis and y-axis variations, i.e. find how much the two points A and B vary along the x-axis and the y-axis. Put more mathematically, the Manhattan distance between two points is measured along axes at right angles.

In a plane with p1 at (x1, y1) and p2 at (x2, y2).

Manhattan distance = |x1 – x2| + |y1 – y2|

This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, city block distance, Minkowski’s L1 distance, or taxicab metric.

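A sketch of the same idea in Python, again with illustrative inputs:

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan_distance([10, 20, 10], [10, 20, 20]))  # |10-10| + |20-20| + |10-20| = 10
```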

## Minkowski distance: The Minkowski distance is a generalized metric form of the Euclidean distance and the Manhattan distance: d(i, j) = (Σ_{k=1..n} |x_ik − x_jk|^λ)^(1/λ), where d(i, j) is the Minkowski distance between data records i and j, k is the index of a variable, n is the total number of variables, and λ is the order of the Minkowski metric. Although it is defined for any λ > 0, it is rarely used for values other than 1, 2, and ∞.

Minkowski metrics of different orders measure the distance between two objects with three variables in different ways (the original post illustrated this in a coordinate system with x-, y-, and z-axes).

Synonyms of Minkowski:
Different names for the Minkowski distance or Minkowski metric arise from the order:

• λ = 1 is the Manhattan distance. Synonyms are L1-Norm, Taxicab or City-Block distance. For two vectors of ranked ordinal variables, the Manhattan distance is sometimes called Foot-ruler distance.
• λ = 2 is the Euclidean distance. Synonyms are L2-Norm or Ruler distance. For two vectors of ranked ordinal variables, the Euclidean distance is sometimes called Spearman distance.
• λ = ∞ is the Chebyshev distance. Synonyms are Lmax-Norm or Chessboard distance.

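A minimal sketch of the general form; setting the order p recovers the cases above (the sample vectors are illustrative):

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 1))  # Manhattan: 17.0
print(minkowski_distance([0, 3, 4, 5], [7, 6, 3, -1], 2))  # Euclidean: sqrt(95) ≈ 9.747
```

For λ = ∞ (Chebyshev), the sum is replaced by `max(abs(a - b) for a, b in zip(x, y))`.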

## Cosine similarity: The cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we effectively find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle.

It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.

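A minimal implementation of the normalized dot product described above (the sample vectors are illustrative):

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return numerator / denominator

print(cosine_similarity([3, 45, 7, 2], [2, 54, 13, 15]))  # ≈ 0.972
print(cosine_similarity([1, 0], [0, 1]))                  # orthogonal vectors: 0.0
```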

## Jaccard similarity: So far we have discussed metrics that measure the similarity between objects that are points or vectors. With Jaccard similarity, the objects are sets. So first, let’s cover some very basic facts about sets.

Sets:

A set is an (unordered) collection of objects, e.g. {a, b, c}. We use the notation of elements separated by commas inside curly brackets { }. Sets are unordered, so {a, b} = {b, a}.

Cardinality:

The cardinality of A, denoted by |A|, counts how many elements are in A.

Intersection:

The intersection of two sets A and B, denoted A ∩ B, contains all items that are in both A and B.

Union:

The union of two sets A and B, denoted A ∪ B, contains all items that are in either set. Now, back to Jaccard similarity. The Jaccard similarity measures the similarity between finite sample sets and is defined as the cardinality of the intersection divided by the cardinality of the union of the sample sets. To find the Jaccard similarity between two sets A and B, take the ratio of the cardinalities:

Jaccard similarity = |A ∩ B| / |A ∪ B|
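In Python, the built-in `set` type gives this almost directly (the sample lists are illustrative):

```python
def jaccard_similarity(x, y):
    """|A ∩ B| / |A ∪ B| for the sets built from the sequences x and y."""
    s1, s2 = set(x), set(y)
    return len(s1 & s2) / len(s1 | s2)

# intersection {0, 2, 5} has 3 elements, union {0, 1, 2, 3, 5, 6, 7, 9} has 8
print(jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))  # 3/8 = 0.375
```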

### Implementation of all 5 similarity measures in one Similarity class:

file_name : similaritymeasures.py

Using Similarity class:
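The original similaritymeasures.py file is not reproduced here; the following is a minimal sketch of such a class collecting the five measures discussed above (method names are illustrative):

```python
from math import sqrt

class Similarity:
    """The five similarity/distance measures discussed above, in one class."""

    def euclidean_distance(self, x, y):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def manhattan_distance(self, x, y):
        return sum(abs(a - b) for a, b in zip(x, y))

    def minkowski_distance(self, x, y, p):
        # p = 1 -> Manhattan, p = 2 -> Euclidean
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

    def cosine_similarity(self, x, y):
        numerator = sum(a * b for a, b in zip(x, y))
        denominator = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
        return numerator / denominator

    def jaccard_similarity(self, x, y):
        # treat x and y as sets and compare their overlap
        s1, s2 = set(x), set(y)
        return len(s1 & s2) / len(s1 | s2)


measures = Similarity()
print(measures.euclidean_distance([0, 3, 4, 5], [7, 6, 3, -1]))        # ≈ 9.747
print(measures.jaccard_similarity([0, 1, 2, 5, 6], [0, 2, 3, 5, 7, 9]))  # 0.375
```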

You can get all the complete code from dataaspirant at dataaspirant data science codes.

### Related Courses:

Do check out unlimited data science courses

| Course | What You Will Learn |
| --- | --- |
| Introduction to Natural Language Processing | Introduction to the field of Natural Language Processing; relevant background material in linguistics, mathematics, probability, and computer science; text similarity, part-of-speech tagging, parsing, semantics, question answering, sentiment analysis, and text summarization. |
| Applied Text Mining in Python | The basics of text mining and text manipulation; handling text in Python and the concepts of the NLTK framework; common manipulation needs such as regular expressions (searching for text), cleaning text, and preparing text for machine learning; implementing your own text classifier in Python. |
| Easy Natural Language Processing (NLP) in Python | Spam detection and implementing a spam detection application in Python; basic concepts of sentiment analysis and implementing your first sentiment analysis code in Python; latent semantic analysis in Python; finally, writing an article spinner in Python. |

I hope you like this post. If you have any questions then feel free to comment below.  If you want me to write on one specific topic then do tell it to me in the comments below.

• sam says:

A few points:

1. A measure of similarity need not be symmetric
2. A measure of similarity need not satisfy the metric axioms
3. Information-theoretic measures, like KL divergence and mutual information, tend to be the most powerful, but the most difficult to manipulate mathematically.

• Mo says:

Thought you might cover Mahalanobis distance.

• sujoyrc says:

Agreed … Mahalanobis distance and Haversine distance are missing. I do not know of any application of the Minkowski distance for lambda > 2 (except Chebyshev).


• Jitesh Khandelwal says:

I don’t think there is any need to write your own implementation. All of them, and a lot more, are already available in the scipy.spatial.distance module in Python.

• saimadhu says:

Hi Jitesh Khandelwal!
What you say is true, but we still have to explain how to implement them.


• PKD says:

Reblogged this on Random and commented:
A good blog, explaining some important similarity metrics.

• Yakov Keselman says:

Your post seems to cover just one similarity measure: Jaccard. The remaining four are distance metrics; they must be transformed to provide similarity.

I actually found Jaccard’s metric to work nicely for weighted sets as well: if an item occurs in both A and B, its weight in the intersection is the minimum of the two weights, and its weight in the union is the maximum of the two weights.

• Anonymous says:

fantastic images

• saimadhu says:

Thank you 🙂

• Misgu says:

Nice post. It is easily understood with lists of x and y (two lists). I want to know more about how to implement this for large documents, especially cosine similarity in IR.

• saimadhu says:

Hi Misgu,

• Anonymous says:

Nice Post.Thanks a lot.:)

• Anonymous says:

Excellent work bro. I love you.

• saimadhu says:

Thanks 🙂


• Chaitanya Bapat says:

Good post
However multiple grammatical errors
Would love to correct them and contribute towards the site

• Saimadhu Polamuri says:

Hi Chaitanya Bapat,