Knearest neighbor algorithm implementation in Python from scratch
Knearestneighbor algorithm implementation in Python from scratch
In the introduction to knearestneighbor algorithm article, we have learned the key aspects of the knn algorithm. Also learned about the applications using knn algorithm to solve the real world problems. Now in this article, we are going to implement KNearest Neighbors algorithm from scratch in Python programming language.
After learning knn algorithm, we can use prepacked python machine learning libraries to use knn classifier models directly. Then everything seems like a black box approach. Using the input data and the inbuilt knearest neighbor algorithms models to build the knn classifier model and using the trained knn classifier we can predict the results for the new dataset.This approach seems easy and mostly applied in the machine learning era. This method of using the inbuild python packages to solve a problem will be helpful you to get insights of KNN algorithm.
Let’s consider the above image. Here, we have three classes, i.e., Red, Green, Blue. The data points are dispersed. To draw a KNearest neighbor decision boundary map.
First of all, we will have to divide data set into training & testing data. Train & Test data can be split in any ratio like 60:40, 70:30, 80:20 etc. For each value of test data. We need to calculate metrics like Euclidean Distance and estimate the value of K. For the above classification; we have used K = 15. If the Euclidean distance is less, then it means classes are close. By sorting Euclidean distances in increasing order and selecting the class with maximum no. of data points out of top 15 Euclidean distances as the class of that testing data point. This way we can build classifiers using knn algorithm.
Why we need to implement knn algorithm from scratch
Implementation of KNearest Neighbor algorithm in python from scratch will help you to learn the core concept of Knn algorithm. As we are going implement each every component of the knn algorithm and the other components like how to use the datasets and find the accuracy of our implemented model etc.
The components will be
 How to Load the dataset.
 Split the dataset into train and testing purpose.
 Euclidean distance calculation function.
 Prediction of classes for records with unknown labels.
 Accuracy calculation of our prediction model.
Prerequisite Note:
This tutorial will help you to get started with the implementation of Machine Learning Algorithms. We expect you to have at least basic programming experience in Python programming language.
Before going to implement the k Nearest neighbor algorithms in Python from scratch, Let’s quickly look at the knearest neighbor algorithm pseudocode for our previous article introduction to the knearest neighbor algorithm.
If you have any doubts about Knn algorithm or want to revise it. You can read our article Introduction to Knearest neighbor classifier.
KNearest Neighbor Algorithm Pseudocode:
Let (X_{i}, C_{i}) where i = 1, 2……., n be data points. X_{i} denotes feature values & C_{i} denotes labels for X_{i} for each i.
Assuming the number of classes as ‘c’
C_{i} ∈ {1, 2, 3, ……, c} for all values of i
Let x be a point for which label is not known, and we would like to find the label class using knearest neighbor algorithms.
Procedure:
 Calculate “d(x, x_{i})” i =1, 2, ….., n; where d denotes the Euclidean distance between the points.
 Arrange the calculated n Euclidean distances in nondecreasing order.
 Let k be a +ve integer, take the first k distances from this sorted list.
 Find those kpoints corresponding to these kdistances.
 Let k_{i} denotes the number of points belonging to the i^{th} class among k points i.e. k ≥ 0
 If k_{i} >k_{j} ∀ i ≠ j then put x in class i.
Let’s use the above pseudocode for implementing the knn algorithm in python.
Description of the Dummy Data Set
We will use a sample dataset extracted from ionosphere database by John Hopkins University. We have converted the database into a small dataset so as to simplify the learning curve for our readers.
This dummy dataset consists of 6 attributes and 30 records. Out Of these 5 attributes are continuous variables with values ranging from 1 to +1 i.e, [1,+1]. Last(6th) attribute is a categorical variable with values as “g”(good) or “b”(bad) according to the definition summarized above. This is a binary classification task.
Python modules we were going to use
We are importing only four python modules.
 Unicodecsv: Our dataset is in CSV format i.e, Comma Separated Values. We can open this dataset using any text editor like notepad++, sublime, emac editor. But for data analysis, we need to import our data. For importing CSV data to Python lists or arrays we can use python’s unicodecsv module. It can handle unicode characters and can parse all elements of CSV to list.
 Random: This module implements pseudorandom number generators for various distributions. We want to split our dataset into train and test dataset. For splitting data in random order, we can use the random module.
 Operator: The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python. We used itemgetter() method of this module to return a callable object that fetches item from its operand.
 Math: It provides access to the mathematical functions defined by the C standard. We used this module to calculate Euclidean distance.
KNearest neighbor algorithm implement in Python from scratch
We were going to follow the below workflow for implementing the knn algorithm in python.
 Getting Data
 Train & Test Data split
 Euclidean distance calculation
 Knn prediction function
 Accuracy calculation
Let’s get our hands dirty and start the coding stuff.
Getting Data:
For any type of programmatic implementation on the dataset, we first need to import it. Using unicodecsv reader, we inserted the dataset to a python list.
1 2 3 
with open('i_data_sample.csv','rb') as f: reader = unicodecsv.reader(f) i_data = list(reader) 
We can check the length of dataset using len()
1 2 
>>> len(dataset) 30 
Train & Test Data split:
We have written shuffle() method. It takes dataset in the form of python list as input. Using random.shuffle(), we shuffled the dataset. After shuffling, training and testing dataset are separated on 70:30 ratio i.e, training dataset consists of 70% data and testing consists of 30%.
1 2 3 4 5 
def shuffle(i_data): random.shuffle(i_data) train_data = i_data[:int(0.7*30)] test_data = i_data[int(0.7*30):] return train_data, test_data 
Euclidean Distance Calculation:
KNN algorithm classifies new data on the basis of nearest K neighbors according to Euclidean distance. We have written the euclideanDist() method to calculate Euclidean Distance between points and return its value.
1 2 3 4 5 6 
def euclideanDist(x, xi): d = 0.0 for i in range(len(x)1): d += pow((float(x[i])float(xi[i])),2) d = math.sqrt(d) return d 
KNN prediction function:
In KNN there is no training phase, for each new data it calculates Euclidean distance and compares with the nearest K neighbors. Class with maximum no. of data points in nearest K neighbors list is chosen as the class of new data point. We can choose the value of K to be n^(1/2) i.e., the square root of no. of data points else we can test our results by putting different K values.
We have written a knn_predict(), it takes training data, testing data & K value as parameters. For test data, it appends the predicted value “g” or “b” or “NaN” as 7th element of the list.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 
def knn_predict(test_data, train_data, k_value): for i in test_data: eu_Distance =[] knn = [] good = 0 bad = 0 for j in train_data: eu_dist = euclideanDist(i, j) eu_Distance.append((j[5], eu_dist)) eu_Distance.sort(key = operator.itemgetter(1)) knn = eu_Distance[:k_value] for k in knn: if k[0] =='g': good += 1 else: bad +=1 if good > bad: i.append('g') elif good < bad: i.append('b') else: i.append('NaN') 
Accuracy Calculation:
Now we have implemented our KNN algorithm. To test results of our algorithm, we can use “accuracy” as a metric. Accuracy is the ratio of no. of data points correctly classified to total no. of data points. Here, we are testing accuracy on test data set. For calculating accuracy, we have written a function.
1 2 3 4 5 6 7 8 
def accuracy(test_data): correct = 0 for i in test_data: if i[5] == i[6]: correct += 1 accuracy = float(correct)/len(test_data) *100 return accuracy 
We have implemented all the functions of our algorithm, and now it’s time to join up all the snippets to make a python script.
Complete Code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 
import unicodecsv import random import operator import math #getdata() function definition def getdata(filename): with open(filename,'rb') as f: reader = unicodecsv.reader(f) return list(reader) #random train test data split function definition def shuffle(i_data): random.shuffle(i_data) train_data = i_data[:int(0.7*30)] test_data = i_data[int(0.7*30):] return train_data, test_data def euclideanDist(x, xi): d = 0.0 for i in range(len(x)1): d += pow((float(x[i])float(xi[i])),2) #euclidean distance d = math.sqrt(d) return d #KNN prediction and model training def knn_predict(test_data, train_data, k_value): for i in test_data: eu_Distance =[] knn = [] good = 0 bad = 0 for j in train_data: eu_dist = euclideanDist(i, j) eu_Distance.append((j[5], eu_dist)) eu_Distance.sort(key = operator.itemgetter(1)) knn = eu_Distance[:k_value] for k in knn: if k[0] =='g': good += 1 else: bad +=1 if good > bad: i.append('g') elif good < bad: i.append('b') else: i.append('NaN') #Accuracy calculation function def accuracy(test_data): correct = 0 for i in test_data: if i[5] == i[6]: correct += 1 accuracy = float(correct)/len(test_data) *100 #accuracy return accuracy dataset = getdata('i_data_sample_30.csv') #getdata function call with csv file as parameter train_dataset, test_dataset = shuffle(dataset) #train test data split K = 5 # Assumed K value knn_predict(test_dataset, train_dataset, K) print test_dataset print "Accuracy : ",accuracy(test_dataset) 
Output:
1 2 3 4 5 6 7 8 9 
[[u'0', u'0.32937', u'0.20635', u'0.46429', u'0', u'b', 'b'], [u'0.14211', u'1', u'1', u'1', u'0.25', u'g', 'g'], [u'0.10827', u'1', u'0.94575', u'0.75906', u'0.72684', u'g', 'g'], [u'0.43204', u'0.92314', u'0.91586', u'0.87379', u'0.09385', u'g', 'g'], [u'0.02543', u'0.78575', u'0.74695', u'0.63613', u'0.17539', u'g', 'g'], [u'0.10123', u'0.96142', u'0.73781', u'0.59615', u'0.23568', u'g', 'g'], [u'0', u'0', u'0', u'0', u'1', u'b', 'g'], [u'0.76131', u'0.8706', u'1', u'0.80025', u'0.27387', u'g', 'g'], [u'1', u'0.07661', u'0.09677', u'0.31755', u'0.01672', u'b', 'b']] 
1 
Accuracy : 88.88% 
Our KNN algorithm is giving 88.88% accuracy. It means 11.22% of our predictions may be wrong.
We have written our KNearest Neighbor Algorithm code from Scratch just by using simple mathematics functions. Cheers! 😀
Want to replicate this code?
Don’s Just copy paste it. Try to type it line by line and understand the code. You can use Jupyter notebook for running this code or can directly run the code using python idle. Sample dataset which we have used is available here.
References:
[1] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: the University of California, School of Information and Computer Science.
[2] Implement knearest neighbors in python from machinelearningmastery
[3] Ionosphere Data Set; UCI Machine Learning Repository
Follow us:
FACEBOOK QUORA TWITTER GOOGLE+  LINKEDIN REDDIT  FLIPBOARD  MEDIUM  GITHUB
I hope you like this post. If you have any questions, then feel free to comment below. If you want me to write on one particular topic, then do tell it to me in the comments below.
Related Courses:
Do check out unlimited data science courses
Title of the course  Course Link  Course Link 
Machine Learning: Classification 
Machine Learning: Classification 

Data Mining with Python: Classification and Regression 
Data Mining with Python: Classification and Regression 

Machine learning with Scikitlearn 
Machine learning with Scikitlearn 

[…] Knn algorithm implementation purely in python without any machine learning libraries […]
[…] the introduction to k nearest neighbor and knn classifier implementation in Python from scratch, We discussed the key aspects of knn algorithms and implementing knn […]
[…] Knn algorithm implementation purely in python without any machine learning libraries […]
[…] in the previous article how the decision tree algorithm works we have given the enough introduction to the working aspects of decision tree algorithm. In this […]
Hi Rahul,
Thanks for posting this knowledge. Can you give link to file i_data_sample.csv. can use it to further understand the data
Hi Sk,
Thanks for your compliment. You can find the link to download the data set in the reference section of the post. Anyhow, I am sharing you the link again here.
https://archive.ics.uci.edu/ml/datasets/Ionosphere
Hi ..!The dataset u hav specified in ur tutorial knn from scratch i.e. ionosphere database by John Hopkins University…. i downloaded it from UCI machine learning repositary, but this dataset contains some values other than float type due to which ur program isn’t giving the accuracy dat u hav got for ur sample dataset. How do i deal with this probem…also the above specified dataset is not given with description such as column names…
Hi Divyabhadergade,
This post is to completely explain about how to implement the knearest neighbor algorithm in python from scratch. As you said you won’t get the same accuracy when you used the complete dataset, You need to fine tune your model by considering only the features which increase the accuracy of the model.
For reference, I am sharing you the article about implementing the knearest neighbor algorithm in python with scikitlearn.
“train_data = i_data[:int(0.7*30)]
test_data = i_data[int(0.7*30):]”
I believe test_data should be 0.3*30
Hi Marshal Gu,
Yes. You are correct. Generally, we use the 70% of data for training and 30% of data for testing.
The code says for training use data from the start index to 70% observations index. Whereas for testing use the remaining data.