K-Nearest Neighbor Classifier Implementation in R Programming from Scratch
In the introduction to the k-nearest neighbor algorithm article, we learned the core concepts of the knn algorithm, as well as the kinds of real-world problems the algorithm can solve.
In this post, we will implement the k-nearest neighbor algorithm on a dummy dataset using the R programming language, from scratch. Along the way, we will build a prediction model to predict classes for new data.
Knn Implementation in R
Why implement the knn algorithm from scratch in R?
Implementing the k-nearest neighbor algorithm in R from scratch will help us properly apply the concepts behind knn, as we are going to implement each and every component of the algorithm ourselves, along with the supporting pieces: loading the dataset, splitting it, finding the accuracy of our implemented model, and so on.
Problem Set
We will use a sample dataset extracted from the ionosphere database by Johns Hopkins University. We have converted the database into a small dataset to simplify the learning curve for our readers.
Our objective is to program a knn classifier in R without using any machine learning package. We have two classes, “g” (good) and “b” (bad), describing the response of a radar signal from the ionosphere. The classifier should be capable of predicting the “g” or “b” class for new records, based on the training data.
Ionosphere Dataset Description
This dummy dataset consists of 6 attributes and 30 records. Of these, 5 attributes are continuous variables with values ranging from -1 to +1, i.e., [-1, +1]. The last (6th) attribute is a categorical variable with values “g” (good) or “b” (bad), according to the definition summarized above. This is a binary classification task.
K-Nearest Neighbor Algorithm Pseudocode
Let (Xi, Ci), where i = 1, 2, …, n, be the data points. Xi denotes the feature values and Ci denotes the label of Xi for each i.
Assuming the number of classes is c,
Ci ∈ {1, 2, 3, …, c} for all values of i.
Let x be a point whose label is not known; we would like to find its class label using the k-nearest neighbor algorithm.
Procedure:
- Calculate d(x, xi), i = 1, 2, …, n, where d denotes the Euclidean distance between the points.
- Arrange the calculated n Euclidean distances in non-decreasing order.
- Let k be a positive integer; take the first k distances from this sorted list.
- Find the k points corresponding to these k distances.
- Let ki denote the number of points belonging to the ith class among the k points, so that ki ≥ 0 and Σ ki = k.
- If ki > kj ∀ j ≠ i, then put x in class i.
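As a quick illustration, the entire procedure above can be sketched in a few lines of vectorized R for a single query point. This is only a minimal sketch (knn_one, X, and y are illustrative names, not part of the tutorial code); the full from-scratch implementation follows below.

# Minimal sketch of the pseudocode for one query point x.
# X: n-by-p feature matrix, y: factor of class labels, k: number of neighbors.
knn_one <- function(x, X, y, k){
  d  <- sqrt(rowSums(sweep(X, 2, x)^2))  # step 1: Euclidean distances d(x, xi)
  nn <- order(d)[1:k]                    # steps 2-4: the k nearest points
  names(which.max(table(y[nn])))         # steps 5-6: majority class among them
}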
Let’s use the above pseudocode for implementing the knn algorithm in R Language.
Prerequisites:
- Basic programming experience is required
- Install R and RStudio on your system.
K-Nearest neighbor algorithm implementation in R from scratch
We are going to follow the below workflow for implementing the knn algorithm in R:
- Getting Data
- Train & Test Data Split
- Euclidean Distance Calculation
- KNN prediction function
- Accuracy calculation
Let’s get our hands dirty and start the coding stuff.
Getting Data in R
For any programmatic implementation, we first need to import the dataset. Using read.csv(), we import the dataset into the knn.df dataframe. Since the dataset has no header row, we pass header = FALSE. The sep parameter defines the character that separates values in the file.
knn.df is a dataframe. A dataframe is a table or 2-D array, in which each column contains measurements on one variable, and each row contains one record.
knn.df <- read.csv('i_data_sample_30.csv', header = FALSE, sep = ',')
To check the dimensions of the dataset, we can call the dim() function, passing the data frame as the argument.
dim(knn.df)
[1] 30  6
It shows that the data frame consists of 30 records and 6 columns.
To check a summary of our dataset, we can use the summary() function.
summary(knn.df)
       V1                V2                V3                V4                V5
 Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000
 1st Qu.:-0.1079   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:-0.6477
 Median : 0.0000   Median : 0.8099   Median : 0.4059   Median : 0.5891   Median : 0.0000
 Mean   :-0.0137   Mean   : 0.5611   Mean   : 0.3246   Mean   : 0.2769   Mean   :-0.0633
 3rd Qu.: 0.1336   3rd Qu.: 1.0000   3rd Qu.: 0.9730   3rd Qu.: 0.8269   3rd Qu.: 0.2464
 Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000
 V6
 b:15
 g:15
It shows that there are 15 records each of the bad and good classes.
Train & Test Data split in R
Before the train & test split, we need to shuffle the data randomly. In R, we can use the sample() function to randomize the order of the dataframe's records.
We call set.seed(2) first so the random shuffle produces reproducible results. In the next line, we index the dataframe with sample(nrow(knn.df)), which randomizes all 30 records of knn.df. Now we are ready for the split. We divide the data in a 70:30 ratio, i.e., 70% of the records become the train set and 30% the test set.
set.seed(2)
knn.df <- knn.df[sample(nrow(knn.df)),]
train.df <- knn.df[1:as.integer(0.7 * 30),]
test.df <- knn.df[as.integer(0.7 * 30 + 1):30,]
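As a quick sanity check on the split (an optional snippet), the row counts should come out as 21 and 9:

nrow(train.df)  # 21 records, i.e., 70% of 30
nrow(test.df)   # 9 records, i.e., the remaining 30%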
Euclidean Distance Calculation in R
The below snippet defines a function in R to calculate the Euclidean distance between two points a and b. The formula of Euclidean distance is:
d(a, b) = √((a1 − b1)² + (a2 − b2)² + … + (an − bn)²)
# Euclidean distance between two records a and b.
# The last column (the class label) is excluded from the calculation.
euclideanDist <- function(a, b){
  d = 0
  for(i in c(1:(length(a) - 1)))
  {
    d = d + (a[[i]] - b[[i]])^2
  }
  d = sqrt(d)
  return(d)
}
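To sanity-check the function, you can compute the distance between any two rows of the dataframe, for example the first two records (an illustrative call, assuming knn.df is loaded as above):

# distance between the first two records, ignoring the class column
euclideanDist(knn.df[1,], knn.df[2,])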
KNN prediction function in R
This function is the core part of this tutorial. We are writing a function knn_predict. It takes 3 arguments: the test data, the train data, and the value of K. It loops over every record of the test data and, for each one, over every record of the train data. It returns the predicted class labels for the test data.
knn_predict <- function(test_data, train_data, k_value){
  pred <- c()  # empty pred vector

  # LOOP-1: looping over each record of the test data
  for(i in c(1:nrow(test_data))){
    eu_dist = c()  # eu_dist & eu_char empty vectors
    eu_char = c()
    good = 0       # good & bad counters initialized to 0
    bad = 0

    # LOOP-2: looping over the train data
    for(j in c(1:nrow(train_data))){
      # adding Euclidean distance b/w test data point and train data point to eu_dist vector
      eu_dist <- c(eu_dist, euclideanDist(test_data[i,], train_data[j,]))
      # adding class variable of training data to eu_char
      eu_char <- c(eu_char, as.character(train_data[j,][[6]]))
    }

    eu <- data.frame(eu_char, eu_dist)  # eu dataframe created with eu_char & eu_dist columns
    eu <- eu[order(eu$eu_dist),]        # sorting eu dataframe to get the top K neighbors
    eu <- eu[1:k_value,]                # eu dataframe with the top K neighbors

    # LOOP-3: loops over eu and counts the classes of the neighbors
    for(k in c(1:nrow(eu))){
      if(as.character(eu[k, "eu_char"]) == "g"){
        good = good + 1
      }
      else
        bad = bad + 1
    }

    # compares the number of neighbors with class label good or bad
    # (with an odd k_value and two classes, a tie cannot occur)
    if(good > bad){
      # if the majority of neighbors are good, put "g" in the pred vector
      pred <- c(pred, "g")
    }
    else if(good < bad){
      # if the majority of neighbors are bad, put "b" in the pred vector
      pred <- c(pred, "b")
    }
  }
  return(pred)  # return the pred vector
}
It returns a vector with the predicted classes of the test dataset. These predictions can then be used to calculate the accuracy metric.
Accuracy Calculation in R
The accuracy metric calculates the ratio of the number of correctly predicted class labels to the total number of predicted labels.
accuracy <- function(test_data){
  correct = 0
  for(i in c(1:nrow(test_data))){
    # column 6 holds the actual label, column 7 the predicted label
    if(test_data[i, 6] == test_data[i, 7]){
      correct = correct + 1
    }
  }
  accu = correct / nrow(test_data) * 100  # accuracy as a percentage
  return(accu)
}
KNN algorithm accuracy print: In this code snippet, we join all our functions together. We call the knn_predict function with the train and test dataframes that we split earlier and a K value of 5.
We append the prediction vector as the 7th column of our test dataframe, and then, using the accuracy() function, we print the accuracy of our KNN model.
K = 5
predictions <- knn_predict(test.df, train.df, K)  # calling knn_predict()
test.df[,7] <- predictions  # adding predictions in test data as 7th column
print(accuracy(test.df))
Script Output:
Accuracy of our KNN model is 77.77778
It prints the accuracy of our knn model. Here our accuracy is 77.78%. That’s pretty good 🙂 for our randomly selected dummy dataset.
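Beyond the single accuracy number, a confusion matrix shows exactly where the model goes wrong. Base R’s table() builds one directly from the actual and predicted columns (a small optional sketch, run after the predictions have been added as the 7th column):

# confusion matrix: actual labels (column 6) vs predicted labels (column 7)
table(actual = test.df[, 6], predicted = test.df[, 7])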
You can download the RMD file of this code from our GitHub repository.
Finally, we have implemented our knn model in R programming without using any specific R packages. Hope you enjoyed learning it.
I hope you like this post. If you have any questions, then feel free to comment below. If you want me to write on one particular topic, then do tell it to me in the comments below.
Related Courses:
Do check out unlimited data science courses
- R Programming A-Z: R For Data Science With Real Exercises!
- R Programming: Advanced Analytics In R For Data Science
- Data Mining with R: Go from Beginner to Advanced!
Hi,
Happy to find your website.
Please post about the whole boosting family (XGBoost, LightGBM, CatBoost, Gradient Boosting).
Hi, I wanted to print out “predictions” but I get NULL as output. Any ideas?
Hi Sonia,
Please check the features you are passing to perform predictions. Do let us know if you are still not able to get the predictions.
Thanks,
What about a KNN regressor?
Hi,
Yes, we can perform regression with knn also. In knn regression, we average the K nearest neighbors’ target values to get the predicted value. But knn is usually not the optimal option for regression; it is generally better to go with dedicated regression algorithms.
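For illustration, the averaging step can be sketched in a few lines of R (a minimal sketch only; knn_regress, X, and y are hypothetical names, not part of the tutorial code):

# knn regression sketch: predict the mean of the k nearest target values
# X: n-by-p feature matrix, y: numeric targets, x: query point
knn_regress <- function(x, X, y, k){
  d <- sqrt(rowSums(sweep(X, 2, x)^2))  # Euclidean distances to all points
  mean(y[order(d)[1:k]])                # average the k nearest targets
}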
Thanks,
Saimadhu
Hi
do you implement KNN with Manhattan distance or KNN with Minkowski distance? If yes, then can you share that?
Hi Naila,
For the knn algorithm in this article, we used the Euclidean distance, not the Manhattan or Minkowski distance measures. You can check different similarity measures in this article.
https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
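That said, swapping in another distance only requires replacing euclideanDist in the tutorial code. Here is a hedged sketch of both measures in the same style (manhattanDist and minkowskiDist are illustrative names; like euclideanDist, they skip the class-label column):

# Manhattan (city-block) distance, excluding the class-label column
manhattanDist <- function(a, b){
  sum(abs(unlist(a[-length(a)]) - unlist(b[-length(b)])))
}

# Minkowski distance of order p (p = 1 gives Manhattan, p = 2 gives Euclidean)
minkowskiDist <- function(a, b, p){
  sum(abs(unlist(a[-length(a)]) - unlist(b[-length(b)]))^p)^(1/p)
}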
I need the data set.
Hi Vignesh,
Sorry to say, the dataset we used in the article is dummy data we created; the idea is to apply this code to any dataset. Feel free to use the code for other datasets and let me know if you face any issues.
Thanks,
Saimadhu
Euclidean is wrong. It should be for(i in c(1:(length(a)) )) otherwise you don’t consider the last dimension
Hi Cristian,
Thanks for identifying the code bug.
Thanks & happy learning
I want the link for this article's code on GitHub, thanks.
Hi Ayoub,
Sorry to say we don't have the code for this article in our GitHub, but you can find all the dataaspirant codes in our GitHub link.
Link: https://github.com/saimadhu-polamuri/DataAspirant_codes
Thanks and happy learning.
I need help in knn.
Hi Shahid,
Please let me know what kind of help you need in knn.
Thanks and happy learning.
From where will I get the dataset “i_data_sample_30.csv”?
Hi Devraj,
The data we are showing in the article is the dummy dataset, the main intention is to use the same model building workflow for any other dataset.
Thanks and happy learning.
The dataset in the link [https://archive.ics.uci.edu/index.php] is not available.
Hi Afonso,
I guess the archive repo has removed or changed the URL link.
Thanks and happy learning.
Can you send me the data file? Thanks a lot!
Hi Michael,
The dataset we have used in the article is a dummy dataset; the main intention is to apply the same model building workflow to any other dataset. Hope you can use the same model building framework for other datasets.
Thanks and happy learning!
Thanks so much for your informative post
Hi Jeza,
Thanks for your compliment. 🙂
Hi there.
I need the link with the file “i_data_sample_30.csv”, shown at the beginning. Thanks a lot!!
Ricardo (sorry the poor English)
Hi Ricardo,
Sent to your mail. 🙂
From where do I download the dataset for the above KNN implementation?
Hi Milind,
The dataset in this post is a dummy dataset. Sorry to say we misplaced it somehow. You can get unlimited classification-related datasets at the link https://archive.ics.uci.edu
Hi Sai,
Please could you share the dataset used for this practice to my email-id sunilarava@yahoo.com
Sunil
Hi Sunil Arava,
Sorry to say, the dataset we used in this article is dummy data we created; the idea is to use the code for any other dataset.
Thanks,
Saimadhu