K-Nearest Neighbor Classifier Implementation in R Programming from Scratch
In the introduction to the k-nearest neighbor algorithm article, we learned the core concepts of the knn algorithm, as well as the kinds of real-world problems the algorithm can solve.
In this post, we will implement the k-nearest neighbor algorithm on a dummy dataset using the R programming language, from scratch. Along the way, we will build a prediction model to predict classes for new data.
Knn Implementation in R
Why implement the knn algorithm from scratch in R?
Implementing the k-nearest neighbor algorithm in R from scratch will help us properly apply the concepts behind knn, as we are going to implement each and every component of the algorithm ourselves, along with the supporting pieces: loading the dataset, splitting it, finding the accuracy of our implemented model, and so on.
Problem Set
We will use a sample dataset extracted from the ionosphere database by Johns Hopkins University. We have converted the database into a small dataset to simplify the learning curve for our readers.
Our objective is to program a knn classifier in R without using any machine learning package. We have two classes, “g” (good) and “b” (bad), describing the response of a radar signal from the ionosphere. The classifier should be capable of predicting the “g” or “b” class for new records, based on the training data.
Ionosphere Dataset Description
This dummy dataset consists of 6 attributes and 30 records. Of these, 5 attributes are continuous variables with values ranging from -1 to +1, i.e., [-1, +1]. The last (6th) attribute is a categorical variable with values “g” (good) or “b” (bad), according to the definition summarized above. This is a binary classification task.
K-Nearest Neighbor Algorithm Pseudocode
Let (Xi, Ci), where i = 1, 2, …, n, be the data points. Xi denotes the feature values and Ci denotes the label of Xi for each i.
Assuming the number of classes is c,
Ci ∈ {1, 2, 3, …, c} for all values of i.
Let x be a point whose label is not known; we would like to find its class label using the k-nearest neighbor algorithm.
Procedure:
- Calculate d(x, xi), i = 1, 2, …, n, where d denotes the Euclidean distance between the points.
- Arrange the calculated n Euclidean distances in non-decreasing order.
- Let k be a positive integer; take the first k distances from this sorted list.
- Find the k points corresponding to these k distances.
- Let ki denote the number of points belonging to the ith class among the k points, so that ki ≥ 0 and Σ ki = k.
- If ki > kj ∀ j ≠ i, then put x in class i.
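As a quick illustration, the entire procedure above can be sketched in a few lines of vectorized R for a single query point. This is only a minimal sketch (knn_one, X, and y are illustrative names, not part of the tutorial code); the full from-scratch implementation follows below.

# Minimal sketch of the pseudocode for one query point x.
# X: n-by-p feature matrix, y: factor of class labels, k: number of neighbors.
knn_one <- function(x, X, y, k){
  d  <- sqrt(rowSums(sweep(X, 2, x)^2))  # step 1: Euclidean distances d(x, xi)
  nn <- order(d)[1:k]                    # steps 2-4: the k nearest points
  names(which.max(table(y[nn])))         # steps 5-6: majority class among them
}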
Let’s use the above pseudocode for implementing the knn algorithm in R Language.
Prerequisites:
- Basic programming experience is required
- Install R and RStudio on your system.
K-Nearest neighbor algorithm implementation in R from scratch
We are going to follow the below workflow for implementing the knn algorithm in R:
- Getting Data
- Train & Test Data Split
- Euclidean Distance Calculation
- KNN prediction function
- Accuracy calculation
Let’s get our hands dirty and start the coding stuff.
Getting Data in R
For any programmatic implementation, we first need to import the dataset. Using read.csv(), we import the dataset into the knn.df dataframe. Since the dataset has no header row, we pass header = FALSE. The sep parameter defines the character that separates values in the file.
knn.df is a dataframe. A dataframe is a table or 2-D array, in which each column contains measurements on one variable, and each row contains one record.
knn.df <- read.csv('i_data_sample_30.csv', header = FALSE, sep = ',')
To check the dimensions of the dataset, we can call the dim() function, passing the data frame as the argument.
dim(knn.df)
[1] 30  6
It shows that the data frame consists of 30 records and 6 columns.
To check a summary of our dataset, we can use the summary() function.
summary(knn.df)
       V1                V2                V3                V4                V5
 Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000
 1st Qu.:-0.1079   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:-0.6477
 Median : 0.0000   Median : 0.8099   Median : 0.4059   Median : 0.5891   Median : 0.0000
 Mean   :-0.0137   Mean   : 0.5611   Mean   : 0.3246   Mean   : 0.2769   Mean   :-0.0633
 3rd Qu.: 0.1336   3rd Qu.: 1.0000   3rd Qu.: 0.9730   3rd Qu.: 0.8269   3rd Qu.: 0.2464
 Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000
 V6
 b:15
 g:15
It shows that there are 15 records each of the bad and good classes.
Train & Test Data split in R
Before the train & test split, we need to shuffle the data randomly. In R, we can use the sample() function to randomize the order of the dataframe's records.
We call set.seed(2) first so the random shuffle produces reproducible results. In the next line, we index the dataframe with sample(nrow(knn.df)), which randomizes all 30 records of knn.df. Now we are ready for the split. We divide the data in a 70:30 ratio, i.e., 70% of the records become the train set and 30% the test set.
set.seed(2)
knn.df <- knn.df[sample(nrow(knn.df)),]
train.df <- knn.df[1:as.integer(0.7 * 30),]
test.df <- knn.df[as.integer(0.7 * 30 + 1):30,]
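As a quick sanity check on the split (an optional snippet), the row counts should come out as 21 and 9:

nrow(train.df)  # 21 records, i.e., 70% of 30
nrow(test.df)   # 9 records, i.e., the remaining 30%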
Euclidean Distance Calculation in R
The below snippet defines a function in R to calculate the Euclidean distance between two points a and b. The formula of Euclidean distance is:
d(a, b) = √((a1 − b1)² + (a2 − b2)² + … + (an − bn)²)
# Euclidean distance between two records a and b.
# The last column (the class label) is excluded from the calculation.
euclideanDist <- function(a, b){
  d = 0
  for(i in c(1:(length(a) - 1)))
  {
    d = d + (a[[i]] - b[[i]])^2
  }
  d = sqrt(d)
  return(d)
}
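To sanity-check the function, you can compute the distance between any two rows of the dataframe, for example the first two records (an illustrative call, assuming knn.df is loaded as above):

# distance between the first two records, ignoring the class column
euclideanDist(knn.df[1,], knn.df[2,])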
KNN prediction function in R
This function is the core part of this tutorial. We are writing a function knn_predict. It takes 3 arguments: the test data, the train data, and the value of K. It loops over every record of the test data and, for each one, over every record of the train data. It returns the predicted class labels for the test data.
knn_predict <- function(test_data, train_data, k_value){
  pred <- c()  # empty pred vector

  # LOOP-1: looping over each record of the test data
  for(i in c(1:nrow(test_data))){
    eu_dist = c()  # eu_dist & eu_char empty vectors
    eu_char = c()
    good = 0       # good & bad counters initialized to 0
    bad = 0

    # LOOP-2: looping over the train data
    for(j in c(1:nrow(train_data))){
      # adding Euclidean distance b/w test data point and train data point to eu_dist vector
      eu_dist <- c(eu_dist, euclideanDist(test_data[i,], train_data[j,]))
      # adding class variable of training data to eu_char
      eu_char <- c(eu_char, as.character(train_data[j,][[6]]))
    }

    eu <- data.frame(eu_char, eu_dist)  # eu dataframe created with eu_char & eu_dist columns
    eu <- eu[order(eu$eu_dist),]        # sorting eu dataframe to get the top K neighbors
    eu <- eu[1:k_value,]                # eu dataframe with the top K neighbors

    # LOOP-3: loops over eu and counts the classes of the neighbors
    for(k in c(1:nrow(eu))){
      if(as.character(eu[k, "eu_char"]) == "g"){
        good = good + 1
      }
      else
        bad = bad + 1
    }

    # compares the number of neighbors with class label good or bad
    # (with an odd k_value and two classes, a tie cannot occur)
    if(good > bad){
      # if the majority of neighbors are good, put "g" in the pred vector
      pred <- c(pred, "g")
    }
    else if(good < bad){
      # if the majority of neighbors are bad, put "b" in the pred vector
      pred <- c(pred, "b")
    }
  }
  return(pred)  # return the pred vector
}
It returns a vector with the predicted classes of the test dataset. These predictions can then be used to calculate the accuracy metric.
Accuracy Calculation in R
The accuracy metric calculates the ratio of the number of correctly predicted class labels to the total number of predicted labels.
accuracy <- function(test_data){
  correct = 0
  for(i in c(1:nrow(test_data))){
    # column 6 holds the actual label, column 7 the predicted label
    if(test_data[i, 6] == test_data[i, 7]){
      correct = correct + 1
    }
  }
  accu = correct / nrow(test_data) * 100  # accuracy as a percentage
  return(accu)
}
KNN algorithm accuracy print: In this code snippet, we join all our functions together. We call the knn_predict function with the train and test dataframes that we split earlier and a K value of 5.
We append the prediction vector as the 7th column of our test dataframe, and then, using the accuracy() function, we print the accuracy of our KNN model.
K = 5
predictions <- knn_predict(test.df, train.df, K)  # calling knn_predict()
test.df[,7] <- predictions  # adding predictions in test data as 7th column
print(accuracy(test.df))
Script Output:
Accuracy of our KNN model is 77.77778
It prints the accuracy of our knn model. Here our accuracy is 77.78%. That’s pretty good 🙂 for our randomly selected dummy dataset.
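Beyond the single accuracy number, a confusion matrix shows exactly where the model goes wrong. Base R’s table() builds one directly from the actual and predicted columns (a small optional sketch, run after the predictions have been added as the 7th column):

# confusion matrix: actual labels (column 6) vs predicted labels (column 7)
table(actual = test.df[, 6], predicted = test.df[, 7])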
You can download the RMD file of this code from our GitHub repository.
Finally, we have implemented our knn model in R programming without using any specific R packages. Hope you enjoyed learning it.
I hope you like this post. If you have any questions, then feel free to comment below. If you want me to write on one particular topic, then do tell it to me in the comments below.
Related Courses:
Do check out unlimited data science courses
- R Programming A-Z: R For Data Science With Real Exercises!
- R Programming: Advanced Analytics In R For Data Science
- Data Mining with R: Go from Beginner to Advanced!
Hi,
Happy to find your website.
Please post about the whole boosting family (XGBoost, LightGBM, CatBoost, Gradient Boosting).
Hi, I wanted to print out “predictions” but I get NULL as output. Any ideas?
Hi Sonia,
Please check the features you are passing to perform predictions. Do let us know if you are still not able to get the predictions.
Thanks,
What about a KNN regressor?
Hi,
Yes, we can perform regression with knn also. In knn regression, we average the K nearest neighbors’ target values to get the predicted value. But knn is usually not the optimal option for regression; it is generally better to go with dedicated regression algorithms.
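For illustration, the averaging step can be sketched in a few lines of R (a minimal sketch only; knn_regress, X, and y are hypothetical names, not part of the tutorial code):

# knn regression sketch: predict the mean of the k nearest target values
# X: n-by-p feature matrix, y: numeric targets, x: query point
knn_regress <- function(x, X, y, k){
  d <- sqrt(rowSums(sweep(X, 2, x)^2))  # Euclidean distances to all points
  mean(y[order(d)[1:k]])                # average the k nearest targets
}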
Thanks,
Saimadhu
Hi
do you implement KNN with Manhattan distance or KNN with Minkowski distance? If yes, then can you share that?
Hi Naila,
For the knn algorithm in this article, we used the Euclidean distance, not the Manhattan or Minkowski distance measures. You can check different similarity measures in this article.
https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
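That said, swapping in another distance only requires replacing euclideanDist in the tutorial code. Here is a hedged sketch of both measures in the same style (manhattanDist and minkowskiDist are illustrative names; like euclideanDist, they skip the class-label column):

# Manhattan (city-block) distance, excluding the class-label column
manhattanDist <- function(a, b){
  sum(abs(unlist(a[-length(a)]) - unlist(b[-length(b)])))
}

# Minkowski distance of order p (p = 1 gives Manhattan, p = 2 gives Euclidean)
minkowskiDist <- function(a, b, p){
  sum(abs(unlist(a[-length(a)]) - unlist(b[-length(b)]))^p)^(1/p)
}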
I need the data set.
Hi Vignesh,
Sorry to say, the dataset we used in the article is dummy data we created; the idea is to apply this code to any dataset. Feel free to use the code for other datasets and let me know if you face any issues.
Thanks,
Saimadhu
Euclidean is wrong. It should be for(i in c(1:(length(a)) )) otherwise you don’t consider the last dimension
Hi Cristian,
Thanks for identifying the code bug.
Thanks & happy learning
I want the link for this article's code on GitHub, thanks.
Hi Ayoub,
Sorry to say we don't have the code for this article in our GitHub, but you can find all the dataaspirant codes in our GitHub link.
Link: https://github.com/saimadhu-polamuri/DataAspirant_codes
Thanks and happy learning.
I need help in knn.
Hi Shahid,
Please let me know what kind of help you need in knn.
Thanks and happy learning.
From where will I get the dataset “i_data_sample_30.csv”?
Hi Devraj,
The data we are showing in the article is the dummy dataset, the main intention is to use the same model building workflow for any other dataset.
Thanks and happy learning.
The dataset in the link [https://archive.ics.uci.edu/index.php] is not available.
Hi Afonso,
I guess the archive repo has removed or changed the URL link.
Thanks and happy learning.
Can you send me the data file? Thanks a lot!
Hi Michael,
The dataset we have used in the article is a dummy dataset; the main intention is to apply the same model building workflow to any other dataset. Hope you can use the same model building framework for other datasets.
Thanks and happy learning!
Thanks so much for your informative post
Hi Jeza,
Thanks for your compliment. 🙂
Hi there.
I need the link with the file “i_data_sample_30.csv”, shown at the beginning. Thanks a lot!!
Ricardo (sorry the poor English)
Hi Ricardo,
Sent to your mail. 🙂
From where do I download the dataset for the above KNN implementation?
Hi Milind,
The dataset in this post is a dummy dataset. Sorry to say we misplaced it somehow. You can get unlimited classification-related datasets at the link https://archive.ics.uci.edu
Hi Sai,
Please could you share the dataset used for this practice to my email-id sunilarava@yahoo.com
Sunil
Hi Sunil Arava,
Sorry to say, the dataset we used in this article is dummy data we created; the idea is to use the code for any other dataset.
Thanks,
Saimadhu