K-nearest neighbor algorithm implementation in Python from scratch


In the introduction to the k-nearest-neighbor algorithm article, we learned the key aspects of the knn algorithm, as well as real-world applications that the algorithm can solve. Now, in this article, we are going to implement the K-Nearest Neighbors algorithm from scratch in the Python programming language.

K-nearest-neighbor black box model

After learning the knn algorithm, we could simply use pre-packaged Python machine learning libraries and their built-in knn classifier models. That, however, is a black-box approach: we feed the input data to an inbuilt k-nearest neighbor model, train it, and use the trained classifier to predict results for new data. This approach is easy and is how most problems are solved in the machine learning era, but implementing the algorithm from scratch is what will give you real insight into how KNN works.

K-nearest neighbor decision boundaries (Image credit: Scikit-Learn)

Let’s consider the above image. Here we have three classes: red, green, and blue, with the data points dispersed across the plane. To draw a k-nearest neighbor decision boundary map, we first have to divide the dataset into training and testing data. Train and test data can be split in any ratio, such as 60:40, 70:30, or 80:20. For each test data point, we calculate a metric such as the Euclidean distance to every training point and choose a value of K; the classification above uses K = 15. A smaller Euclidean distance means the points are closer together. We sort the Euclidean distances in increasing order, take the top 15 (the K nearest neighbors), and assign the test point to the class with the maximum number of data points among those neighbors. This is how we build a classifier using the knn algorithm.
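For reference, a decision-boundary plot like the one above is typically produced along the following lines with scikit-learn (a minimal sketch, assuming the Iris dataset that scikit-learn's own example uses; the plotting details here are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target          # first two features, three classes

clf = KNeighborsClassifier(n_neighbors=15)    # K = 15, as in the image above
clf.fit(X, y)

# predict over a dense grid to colour the decision regions
xx, yy = np.meshgrid(np.arange(X[:, 0].min() - 1, X[:, 0].max() + 1, 0.02),
                     np.arange(X[:, 1].min() - 1, X[:, 1].max() + 1, 0.02))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
plt.title('3-class KNN classification (k = 15)')
plt.show()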

Why implement the knn algorithm from scratch

Implementing the K-Nearest Neighbor algorithm in Python from scratch will help you learn the core concepts of the knn algorithm, because we are going to implement each and every component of it ourselves, along with the supporting pieces such as loading the dataset and measuring the accuracy of our implemented model.

The components will be:

  • Loading the dataset.
  • Splitting the dataset into training and testing sets.
  • A Euclidean distance calculation function.
  • Prediction of classes for records with unknown labels.
  • Accuracy calculation for our prediction model.

Prerequisite Note:

This tutorial will help you get started with implementing machine learning algorithms. We expect you to have at least basic programming experience in the Python programming language.

Before we implement the k-nearest neighbor algorithm in Python from scratch, let’s quickly look at the k-nearest neighbor pseudocode from our previous article, the introduction to the k-nearest neighbor algorithm.

If you have any doubts about the knn algorithm or want to revise it, you can read our article Introduction to K-nearest neighbor classifier.

K-Nearest Neighbor Algorithm Pseudocode:

Let (Xi, Ci), where i = 1, 2, …, n, be the data points. Xi denotes the feature values and Ci denotes the label of Xi for each i.
Assuming the number of classes is c,
Ci ∈ {1, 2, 3, …, c} for all values of i.

Let x be a point for which the label is not known; we would like to find its class label using the k-nearest neighbor algorithm.

Procedure:

  1. Calculate d(x, xi) for i = 1, 2, …, n, where d denotes the Euclidean distance between the points.
  2. Arrange the calculated n Euclidean distances in non-decreasing order.
  3. Let k be a positive integer; take the first k distances from this sorted list.
  4. Find the k points corresponding to these k distances.
  5. Let ki denote the number of points belonging to the ith class among the k points, i.e., ki ≥ 0.
  6. If ki > kj ∀ i ≠ j, then put x in class i.
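To make the steps concrete, here is a compact sketch that maps them almost line for line onto Python (assuming train is a list of (features, label) pairs; this is an illustration, not the implementation we build below):

import math
from collections import Counter

def knn_classify(train, x, k):
    # steps 1-2: compute the Euclidean distances and sort them
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, x))), ci)
        for xi, ci in train
    )
    # steps 3-4: keep the labels of the k nearest points
    top_k_labels = [label for _, label in distances[:k]]
    # steps 5-6: majority vote among the k labels
    return Counter(top_k_labels).most_common(1)[0][0]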

Let’s now use the above pseudocode to implement the knn algorithm in Python.

Description of the Dummy Data Set

We will use a sample dataset extracted from the ionosphere database by Johns Hopkins University. We have condensed the database into a small dataset to ease the learning curve for our readers.

This dummy dataset consists of 6 attributes and 30 records. Of these, the first 5 attributes are continuous variables with values ranging from -1 to +1, i.e., [-1, +1]. The last (6th) attribute is a categorical variable with the value “g” (good) or “b” (bad). This is a binary classification task.

Python modules we are going to use

We are importing only four Python modules:

  1. Unicodecsv: Our dataset is in CSV (Comma-Separated Values) format. We can open this dataset in any text editor, such as Notepad++, Sublime Text, or Emacs, but for data analysis we need to import the data into Python. To import CSV data into Python lists or arrays, we can use the unicodecsv module. It handles unicode characters and parses all elements of the CSV into lists.
  2. Random: This module implements pseudo-random number generators for various distributions. We want to split our dataset into train and test datasets, and for splitting the data in random order we can use the random module.
  3. Operator: The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python. We use its itemgetter() method, which returns a callable object that fetches an item from its operand.
  4. Math: It provides access to the mathematical functions defined by the C standard. We use this module to calculate the Euclidean distance.
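As a quick taste of the random, operator, and math modules before we use them (a small illustrative snippet; unicodecsv is used with the real dataset below):

import random
import operator
import math

pairs = [('g', 0.9), ('b', 0.2), ('g', 0.5)]
pairs.sort(key=operator.itemgetter(1))  # sort tuples by their second element
print(pairs)                            # [('b', 0.2), ('g', 0.5), ('g', 0.9)]

items = [1, 2, 3, 4, 5]
random.shuffle(items)                   # shuffles the list in place
print(math.sqrt(25))                    # 5.0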

K-Nearest neighbor algorithm implementation in Python from scratch

We are going to follow the workflow below for implementing the knn algorithm in Python:

  1. Getting Data
  2. Train & Test Data split
  3. Euclidean distance calculation
  4. Knn prediction function
  5. Accuracy calculation

Let’s get our hands dirty and start the coding stuff.

Getting Data:

For any kind of programmatic work on a dataset, we first need to import it. Using the unicodecsv reader, we load the dataset into a Python list.

import unicodecsv

with open('i_data_sample.csv', 'rb') as f:
    reader = unicodecsv.reader(f)
    i_data = list(reader)   # each row becomes a list of unicode strings

We can check the length of the dataset using len():

>>> len(i_data)
30
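It is also worth peeking at one record to confirm the layout described above: five feature values stored as strings, followed by the class label. (The exact row you see depends on the file order; the example record below is taken from the sample output shown later in this post.)

>>> i_data[0]
[u'0.14211', u'1', u'1', u'1', u'0.25', u'g']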

 

Train & Test Data split:

We have written a shuffle() method that takes the dataset, as a Python list, as input. Using random.shuffle(), we shuffle the dataset in place. After shuffling, the training and testing datasets are separated in a 70:30 ratio, i.e., the training dataset gets 70% of the records (21 of our 30) and the testing dataset gets the remaining 30% (9 records).

def shuffle(i_data):
    random.shuffle(i_data)              # randomise the record order in place
    train_data = i_data[:int(0.7*30)]   # first 70% (21 of the 30 records) for training
    test_data = i_data[int(0.7*30):]    # remaining 30% (9 records) for testing
    return train_data, test_data

Euclidean Distance Calculation:

The KNN algorithm classifies new data on the basis of its K nearest neighbors according to Euclidean distance. We have written the euclideanDist() method to calculate the Euclidean distance between two points and return its value.

def euclideanDist(x, xi):
    d = 0.0
    for i in range(len(x)-1):   # skip the last element, which holds the class label
        d += pow((float(x[i]) - float(xi[i])), 2)
    d = math.sqrt(d)
    return d
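A quick sanity check with two hypothetical records (note that the function deliberately skips the last element of each record, which holds the class label):

>>> euclideanDist(['0', '3', '0', '4', '0', 'g'], ['0', '0', '0', '0', '0', 'b'])
5.0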

KNN prediction function:

In KNN there is no training phase; for each new data point we calculate the Euclidean distances to the training points and look at its K nearest neighbors. The class with the maximum number of data points among those K nearest neighbors is chosen as the class of the new data point. A common starting point is K = n^(1/2), i.e., the square root of the number of data points, but we can also test our results with different K values.
We have written knn_predict(), which takes the testing data, training data, and a K value as parameters. For each test record, it appends the predicted value, “g”, “b”, or “NaN” (for a tie), as the 7th element of the record’s list.

def knn_predict(test_data, train_data, k_value):
    for i in test_data:
        eu_Distance = []
        good = 0
        bad = 0
        for j in train_data:
            eu_dist = euclideanDist(i, j)
            eu_Distance.append((j[5], eu_dist))       # store (label, distance) pairs
        eu_Distance.sort(key=operator.itemgetter(1))  # sort by distance, ascending
        knn = eu_Distance[:k_value]                   # the k nearest neighbours
        for k in knn:
            if k[0] == 'g':
                good += 1
            else:
                bad += 1
        if good > bad:
            i.append('g')
        elif good < bad:
            i.append('b')
        else:
            i.append('NaN')                           # tie between the two classes
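As mentioned above, the square root of the number of data points is a common starting point for K. A minimal sketch of that heuristic (with our 21 training records it suggests K = 5, which is the value used in the complete script below):

def suggest_k(train_data):
    # square-root-of-n heuristic; a starting point to tune from, not a rule
    return int(round(math.sqrt(len(train_data))))  # sqrt(21) ~ 4.58 -> 5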

Accuracy Calculation:

Now that we have implemented our KNN algorithm, we can test its results using “accuracy” as a metric. Accuracy is the ratio of the number of correctly classified data points to the total number of data points. Here, we measure accuracy on the test dataset, using the function below.

def accuracy(test_data):
    correct = 0
    for i in test_data:
        if i[5] == i[6]:   # actual label (6th element) vs predicted label (7th element)
            correct += 1

    accuracy = float(correct) / len(test_data) * 100
    return accuracy

We have implemented all the functions of our algorithm; now it’s time to join all the snippets into a single Python script.

Complete Code:

import unicodecsv
import random
import operator
import math

#getdata() function definition
def getdata(filename):
    with open(filename,'rb') as f:
        reader = unicodecsv.reader(f)
        return list(reader)

#random train/test data split function definition
def shuffle(i_data):
    random.shuffle(i_data)              # randomise the record order in place
    train_data = i_data[:int(0.7*30)]   # first 70% (21 of the 30 records) for training
    test_data = i_data[int(0.7*30):]    # remaining 30% (9 records) for testing
    return train_data, test_data

def euclideanDist(x, xi):
    d = 0.0
    for i in range(len(x)-1):   # skip the last element, which holds the class label
        d += pow((float(x[i]) - float(xi[i])), 2)  #euclidean distance
    d = math.sqrt(d)
    return d

#KNN prediction function
def knn_predict(test_data, train_data, k_value):
    for i in test_data:
        eu_Distance = []
        good = 0
        bad = 0
        for j in train_data:
            eu_dist = euclideanDist(i, j)
            eu_Distance.append((j[5], eu_dist))       # store (label, distance) pairs
        eu_Distance.sort(key=operator.itemgetter(1))  # sort by distance, ascending
        knn = eu_Distance[:k_value]                   # the k nearest neighbours
        for k in knn:
            if k[0] == 'g':
                good += 1
            else:
                bad += 1
        if good > bad:
            i.append('g')
        elif good < bad:
            i.append('b')
        else:
            i.append('NaN')                           # tie between the two classes

#Accuracy calculation function
def accuracy(test_data):
    correct = 0
    for i in test_data:
        if i[5] == i[6]:   # actual label vs predicted label
            correct += 1
    accuracy = float(correct) / len(test_data) * 100  #accuracy
    return accuracy

dataset = getdata('i_data_sample_30.csv')       # getdata function call with the csv file as parameter
train_dataset, test_dataset = shuffle(dataset)  # train/test data split
K = 5                                           # assumed K value (~ square root of the 21 training records)
knn_predict(test_dataset, train_dataset, K)     # appends the predicted label to each test record
print(test_dataset)
print("Accuracy : ", accuracy(test_dataset))

Output:

[[u'0', u'-0.32937', u'0.20635', u'0.46429', u'0', u'b', 'b'],
 [u'0.14211', u'1', u'1', u'1', u'0.25', u'g', 'g'],
 [u'0.10827', u'1', u'0.94575', u'0.75906', u'0.72684', u'g', 'g'],
 [u'0.43204', u'0.92314', u'0.91586', u'0.87379', u'0.09385', u'g', 'g'],
 [u'0.02543', u'0.78575', u'0.74695', u'0.63613', u'0.17539', u'g', 'g'],
 [u'0.10123', u'0.96142', u'0.73781', u'0.59615', u'0.23568', u'g', 'g'],
 [u'0', u'0', u'0', u'0', u'-1', u'b', 'g'],
 [u'0.76131', u'0.8706', u'1', u'0.80025', u'-0.27387', u'g', 'g'],
 [u'-1', u'-0.07661', u'-0.09677', u'0.31755', u'-0.01672', u'b', 'b']]
Accuracy :  88.88%

Our KNN algorithm gives 88.88% accuracy, which means 11.12% of our predictions may be wrong.

We have written our K-Nearest Neighbor algorithm code from scratch using only simple mathematical functions. Cheers! 😀
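One practical note: random.shuffle() reorders the records on every run, so the train/test split, and hence the reported accuracy, will vary from run to run. To reproduce a particular result, you can seed the generator before shuffling (a sketch; 42 is an arbitrary choice):

import random

random.seed(42)  # any fixed seed makes the shuffle, and hence the split, repeatable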

 

Want to replicate this code?

Don’t just copy-paste it. Try to type it out line by line and understand the code. You can run this code in a Jupyter notebook or directly with Python IDLE. The sample dataset we have used is available here.

 

References:

[1] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

[2] Implement k-nearest neighbors in Python, machinelearningmastery.

[3] Ionosphere Data Set; UCI Machine Learning Repository

 


I hope you like this post. If you have any questions, feel free to comment below. If you want me to write about a particular topic, tell me in the comments below.

Related Courses:

Do check out our unlimited data science courses.

Machine Learning: Classification
  • Learn the basics of classification.
  • This course introduces the most popular classification algorithms.
  • You will get a chance to implement them in Python using the Python machine learning libraries.

Data Mining with Python: Classification and Regression
  • Understand the key concepts in data mining and learn how to apply them to solve real-world problems.
  • Get hands-on experience with the Python programming language.
  • Get hands-on experience with the numpy, pandas, and matplotlib libraries.

Machine Learning with Scikit-learn
  • Load data into scikit-learn and run machine learning algorithms for both unsupervised and supervised data.
  • Assess model accuracy and performance.
  • Learn to decide which model is best for every scenario.

8 Responses to “K-nearest neighbor algorithm implementation in Python from scratch”


  • Marshall Gu
    6 years ago

    “train_data = i_data[:int(0.7*30)]
    test_data = i_data[int(0.7*30):]”

    I believe test_data should be 0.3*30

    • Hi Marshall Gu,
      Yes, you are correct. Generally, we use 70% of the data for training and 30% of the data for testing.

      The code says: for training, use the data from the start index up to the 70% index; for testing, use the remaining data.

  • Hi! The dataset you specified in your tutorial (knn from scratch), i.e., the ionosphere database by Johns Hopkins University: I downloaded it from the UCI machine learning repository, but that dataset contains some values other than float type, due to which your program doesn’t give the accuracy you got for your sample dataset. How do I deal with this problem? Also, the dataset does not come with a description such as column names.

    • Hi Divyabhadergade,
      This post is meant to completely explain how to implement the k-nearest neighbor algorithm in Python from scratch. As you said, you won’t get the same accuracy when you use the complete dataset; you need to fine-tune your model by considering only the features that increase the accuracy of the model.

      For reference, I am sharing the article about implementing the k-nearest neighbor algorithm in Python with scikit-learn.

  • Hi Rahul,
    Thanks for posting this knowledge. Can you give a link to the file i_data_sample.csv? I could use it to further understand the data.
