Knn Classifier, Introduction to KNearest Neighbor Algorithm
Introduction to Knearest neighbor classifier
Knearest neighbor classifier is one of the introductory supervised classifier, which every data science learner should be aware of. Fix & Hodges proposed Knearest neighbor classifier algorithm in the year of 1951 for performing pattern classification task.
For simplicity, this classifier is called as Knn Classifier. To be surprised knearest neighbor classifier mostly represented as Knn, even in many research papers too. Knn address the pattern recognition problems and also the best choices for addressing some of the classification related tasks. The simple version of the Knearest neighbor classifier algorithms is to predict the target label by finding the nearest neighbor class. The closest class will be identified using the distance measures like Euclidean distance.
Knearest neighbor classification step by step procedure
Before diving into the knearest neighbor, classification process lets’s understand the applicationoriented example where we can use the knn algorithm.
Knn classification application
Let’s assume a money lending company “XYZ” like UpStart, IndiaLends, etc. Money lending XYZ company is interested in making the money lending system comfortable & safe for lenders as well as for borrowers. The company holds a database of customer’s details. Using customer’s detailed information from the database,it will calculate a credit score(discrete value) for each customer. The calculated credit score helps the company and lenders to understand the credibility of a customer clearly. So they can simply take a decision whether they should lend money to a particular customer or not.
The customer’s details could be:
 Educational background details.
 Highest graduated degree.
 Cumulative grade points average (CGPA) or marks percentage.
 The reputation of the college.
 Consistency in his lower degrees.
 Whether taken the education loan or not.
 Cleared education loan dues.
 Employment details.
 Salary.
 Year of experience.
 Got any onsite opportunities.
 Average job change duration.
The company(XYZ) use’s these kinds of details to calculate credit score of a customer. The process of calculating the credit score from the customer’s details is expensive. To reduce the cost of predicting credit score, they realized that the customers with similar background details are getting a similar credit score. So, they decided to use already available data of customers and predict the credit score using it by comparing it with similar data. These kinds of problems are handled by the knearest neighbor classifier for finding the similar kind of customers.
Knearest neighbor (Knn) algorithm pseudocode:
Let (X_{i}, C_{i}) where i = 1, 2……., n be data points. X_{i} denotes feature values & C_{i} denotes labels for X_{i} for each i.
Assuming the number of classes as ‘c’
_{Ci} ∈ {1, 2, 3, ……, c} for all values of i
Let x be a point for which label is not known, and we would like to find the label class using knearest neighbor algorithms.
Knn Algorithm Pseudocode:
 Calculate “d(x, x_{i})” i =1, 2, ….., n; where d denotes the Euclidean distance between the points.
 Arrange the calculated n Euclidean distances in nondecreasing order.
 Let k be a +ve integer, take the first k distances from this sorted list.
 Find those kpoints corresponding to these kdistances.
 Let k_{i} denotes the number of points belonging to the i^{th} class among k points i.e. k ≥ 0
 If k_{i} >k_{j} ∀ i ≠ j then put x in class i.
For data science beginners the about pseudocode will be hard to understand. So let’s understand the knn algorithm using an example.
K Nearest neighbor algorithm example
Let’s consider the above image where we have two different target classes white and orange circles. We have total 26 training samples. Now we would like to predict the target class for the blue circle. Considering k value as three, we need to calculate the similarity distance using similarity measures like Euclidean distance. If the similarity score is less which means the classes are close. In the image, we have calculated distance and placed the less distance circles to blue circle inside the Big circle. To learn about different similarity measures, please check out Five most similarity measures.
With the above example, you got some idea about the process of the knn algorithm. Now read the next paragraph to understand the knn algorithm in technical words.
Let’s consider a setup with “n” training samples, where x_{i} is the training data point. The training data points are categorized into “c” classes. Using KNN, we want to predict class for the new data point. So, the first step is to calculate the distance(Euclidean) between the new data point and all the training data points. Next step is to arrange all the distances in nondecreasing order. Assuming a positive value of “K” and filtering “K” least values from the sorted list. Now, we have K top distances. Let k_{i} denotes no. of points belonging to the i^{th} class among k points. If k_{i} >k_{j} ∀i ≠ j then put x in class i.
Nearest Neighbor Algorithm:
Nearest neighbor is a special case of knearest neighbor class. Where k value is 1 (k = 1). In this case, new data point target class will be assigned to the 1^{st} closest neighbor.
How to choose the value of K?
Selecting the value of K in Knearest neighbor is the most critical problem. A small value of K means that noise will have a higher influence on the result i.e., the probability of overfitting is very high. A large value of K makes it computationally expensive and defeats the basic idea behind KNN (that points that are near might have similar classes ). A simple approach to select k is k = n^(1/2).
To optimize the results, we can use Cross Validation. Using the crossvalidation technique, we can test KNN algorithm with different values of K. The model which gives good accuracy can be considered to be an optimal choice.
It depends on individual cases, at times best process is to run through each possible value of k and test our result.
Condensed Nearest Neighbor Data Reduction Rule:
Working on a big dataset can be an expensive task. Using the condensed nearest neighbor rule, we can clean our data and can sort the important observations out of it. This process can reduce the execution time of the machine learning algorithm. But there is a chance of accuracy reduction.
The steps to condense is to divide data points into these:
 Outliers: Observations that lie at an abnormal distance from all the data points. Most of these are extreme values. Removing these observations will increase the accuracy of the model.
 Prototypes: Minimum points in training set required to recognize nonoutlier points.
 Absorbed points: These are points that are correctly identified to be nonoutlier points.
KNearest Neighbor case study
Breast cancer diagnosis using knearest neighbor (Knn) algorithm
To diagnose Breast Cancer, the doctor uses his experience by analyzing details provided by a) Patient’s Past Medical History b) Reports of all the tests performed. At times, it becomes difficult to diagnose cancer even for experienced doctors. Since the information provided by the patient might be unclear and insufficient.
Breast cancer dataset details
Breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. It contains 699 samples with 10 attributes.
1 2 3 4 5 6 7 8 9 10 11 12 13 
# Attribute Domain   1. Sample code number id number 2. Clump Thickness 1  10 3. Uniformity of Cell Size 1  10 4. Uniformity of Cell Shape 1  10 5. Marginal Adhesion 1  10 6. Single Epithelial Cell Size 1  10 7. Bare Nuclei 1  10 8. Bland Chromatin 1  10 9. Normal Nucleoli 1  10 10. Mitoses 1  10 11. Class: (2 for benign, 4 for malignant) 
The Main objective is to predict whether it’s benign or malignant. Many Scientists worked on this dataset and predicted class using different algorithms.
According to a research paper of M.Sarkar & T.Y. Leong, Department of Computer Science, The National University Singapore
“Conceptually and implementationwise, the Knearest neighbors algorithm is simpler than other techniques that have been applied to this problem. In addition, the Knearest neighbors algorithm produces the overall classification result 1.17% better than the best result known for this problem.”
This is a classification problem and most of the classification techniques can be classified into the following three groups:
 Parametric
 Semiparametric
 NonParametric
Parametric & Semiparametric classifiers need specific information about the structure of data in training set. It is difficult to fulfill this requirement in many cases. So, nonparametric classifier like KNN was considered.
Data was randomly split into training, crossvalidation & testing data. Experimentation was done with the value of K from K = 1 to 15. With KNN algorithm, the classification result of test set fluctuates between 99.12% and 98.02%. The best performance was obtained when K is 1.
Advantages of Knearest neighbors algorithm
 Knn is simple to implement.
 Knn executes quickly for small training data sets.
 performance asymptotically approaches the performance of the Bayes Classifier.
 Don’t need any prior knowledge about the structure of data in the training set.
 No retraining is required if the new training pattern is added to the existing training set.
Limitation to Knearest neighbors algorithm
 When the training set is large, it may take a lot of space.
 For every test data, the distance should be computed between test data and all the training data. Thus a lot of time may be needed for the testing.
References
[2] Sarkar M, Leong TY. Application of Knearest neighbors algorithm on breast cancer diagnosis problem
[3] Breast Cancer Wisconsin (Original) Data Set, UCI Repository
Follow us:
FACEBOOK QUORA TWITTER GOOGLE+  LINKEDIN REDDIT  FLIPBOARD  MEDIUM  GITHUB
I hope you like this post. If you have any questions then feel free to comment below. If you want me to write on one specific topic then do tell it to me in the comments below.
Related Courses:
Do check out unlimited data science courses
Title of the course  Course Link  Course Link 
Machine Learning: Classification 
Machine Learning: Classification 

Data Mining with Python: Classification and Regression 
Data Mining with Python: Classification and Regression 

Machine learning with Scikitlearn 
Machine learning with Scikitlearn 

Hey, your article was great!
Do you have similar ones written for logistic regression, Lasso Regularization or Stochastic Gradient Descent?
Hi Python mate,
Thanks for your compliment. We will try to write on logistic regression in coming post.