# Knn R, K-nearest neighbor classifier implementation in R programming from scratch

In the introduction to the k-nearest-neighbor algorithm article, we learned the core concepts of the KNN algorithm, as well as real-world applications that can be solved with it.

In this post, we will implement the K-nearest neighbor algorithm on a dummy dataset in the R programming language, from scratch. Along the way, we will build a prediction model that assigns a class to each test record.

### Knn Implementation in R

#### Why implement the KNN algorithm from scratch in R?

Implementing the K-nearest neighbor algorithm in R from scratch will help us truly absorb the concepts behind it, as we are going to build each and every component of the algorithm ourselves, along with the surrounding pieces: loading the dataset, splitting it, and measuring the accuracy of our implemented model.

### Problem Set

We will use a sample dataset extracted from the ionosphere database by Johns Hopkins University. We have trimmed the database down to a small dataset to flatten the learning curve for our readers.

Our objective is to program a KNN classifier in the R programming language without using any machine learning package. There are two classes, “g” (good) and “b” (bad), describing the radar return from the ionosphere. The classifier should be able to predict the “g” or “b” class for new records, based on the training data.

### Ionosphere Dataset Description

This dummy dataset consists of 6 attributes and 30 records. Of these, 5 attributes are continuous variables with values ranging from -1 to +1, i.e., in [-1, +1]. The last (6th) attribute is a categorical variable with the values “g” (good) or “b” (bad), as defined above. This is a binary classification task.

### K-Nearest Neighbor Algorithm Pseudocode

Let (Xi, Ci), where i = 1, 2, …, n, be the data points. Xi denotes the feature vector and Ci the class label of Xi for each i.
Assuming the number of classes is c,
Ci ∈ {1, 2, 3, …, c} for all values of i.

Let x be a point whose label is unknown; we would like to find its class label using the k-nearest neighbor algorithm.

### Procedure:

1. Calculate d(x, xi) for i = 1, 2, …, n, where d denotes the Euclidean distance between the points.
2. Arrange the n calculated Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find the k points corresponding to these k distances.
5. Let ki denote the number of points belonging to the ith class among those k points, so that ki ≥ 0 and Σi ki = k.
6. If ki > kj for all j ≠ i, then put x in class i.

Let’s use the above pseudocode for implementing the knn algorithm in R Language.

#### Prerequisites:

1. Basic programming experience.
2. R and RStudio installed on your system.

## K-Nearest Neighbor Algorithm Implementation in R from Scratch

We are going to follow the below workflow for implementing the knn algorithm in R:

1. Getting Data
2. Train & Test Data Split
3. Euclidean Distance Calculation
4. KNN prediction function
5. Accuracy calculation

Let’s get our hands dirty and start the coding stuff.

### Getting Data in R

For any programmatic work on the dataset, we first need to import it. Using read.csv(), we import the dataset into the knn.df data frame. Since the dataset has no header row, we pass header = FALSE; the sep parameter defines the character that separates values in the file.
knn.df is a data frame: a table or 2-D array in which each column contains measurements of one variable and each row contains one record.

```
knn.df <- read.csv('i_data_sample_30.csv', header = FALSE, sep = ',')
```

To check the dimensions of the dataset, we can call the dim() function, passing the data frame as a parameter.

```
dim(knn.df)
 30  6
```

It shows that the data frame consists of 30 records and 6 columns.

To check a summary of our dataset, we can use the summary() function.

```
summary(knn.df)

V1                V2                V3                V4                V5
Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000   Min.   :-1.0000
1st Qu.:-0.1079   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:-0.6477
Median : 0.0000   Median : 0.8099   Median : 0.4059   Median : 0.5891   Median : 0.0000
Mean   :-0.0137   Mean   : 0.5611   Mean   : 0.3246   Mean   : 0.2769   Mean   :-0.0633
3rd Qu.: 0.1336   3rd Qu.: 1.0000   3rd Qu.: 0.9730   3rd Qu.: 0.8269   3rd Qu.: 0.2464
Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000   Max.   : 1.0000
V6
b:15
g:15
```

It shows that there are 15 records each of the bad (“b”) and good (“g”) classes.

### Train & Test Data split in R

Before the train & test split, we need to shuffle the records. In R, the sample() function generates a random permutation, which we use to reorder all the rows of the data frame.
We call set.seed(2) first so that the random shuffle, and therefore our results, are reproducible. In the next line we index the data frame with sample(nrow(knn.df)) to randomize all 30 records of knn.df. Now we are ready for the split: we divide the data in a 70:30 ratio, i.e., 70% of the records form the training set and 30% the test set.

```
set.seed(2)
knn.df <- knn.df[sample(nrow(knn.df)),]
train.df <- knn.df[1:as.integer(0.7 * nrow(knn.df)),]
test.df <- knn.df[(as.integer(0.7 * nrow(knn.df)) + 1):nrow(knn.df),]
```
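As a quick sanity check, the same shuffle-and-split can be reproduced on a self-contained dummy data frame (the names df, train, and test here are illustrative, not from the tutorial):

```r
# Sketch of the same 70:30 shuffle-and-split on a stand-in 30-row data frame
set.seed(2)
df <- data.frame(x = runif(30), y = sample(c("g", "b"), 30, replace = TRUE))
df <- df[sample(nrow(df)), ]            # shuffle the rows
n_train <- as.integer(0.7 * nrow(df))   # 21 of 30 rows go to the training set
train <- df[1:n_train, ]
test  <- df[(n_train + 1):nrow(df), ]
c(nrow(train), nrow(test))              # 21 and 9
```

With 30 records this gives 21 training rows and 9 test rows, which matches the split used in the rest of the post.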

### Euclidean Distance Calculation in R

The snippet below defines a function in R to calculate the Euclidean distance between two points a and b: d(a, b) = sqrt(Σi (ai − bi)²). Note that the loop runs only up to length(a) - 1, because the last column of each record is the class label, not a feature.

```
euclideanDist <- function(a, b){
  d = 0
  for(i in c(1:(length(a) - 1))){  #the last column is the class label, so skip it
    d = d + (a[[i]] - b[[i]])^2
  }
  d = sqrt(d)
  return(d)
}
```
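As a quick check, the function can be exercised on two made-up records. The definition is repeated here so the snippet runs on its own; p1 and p2 are hypothetical rows with five features and a label:

```r
# Euclidean distance over the feature columns; the last element is the class
# label, so the loop stops at length(a) - 1
euclideanDist <- function(a, b){
  d <- 0
  for(i in c(1:(length(a) - 1))){
    d <- d + (a[[i]] - b[[i]])^2
  }
  sqrt(d)
}

p1 <- list(0.2, 0.5, -0.1, 0.9, 0.0, "g")
p2 <- list(0.1, 0.4, -0.1, 0.7, 0.2, "b")
euclideanDist(p1, p2)   # sqrt(0.01 + 0.01 + 0 + 0.04 + 0.04) = sqrt(0.1), about 0.316
```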

### KNN prediction function in R

This function is the core part of this tutorial. We write a function knn_predict that takes 3 arguments: the test data, the train data, and the value of K. For each record of the test data it loops over all records of the train data, and it returns the predicted class labels for the test data.

```
knn_predict <- function(test_data, train_data, k_value){
  pred <- c()  #empty pred vector
  #LOOP 1: looping over each record of test data
  for(i in c(1:nrow(test_data))){
    eu_dist = c()   #eu_dist & eu_char start as empty vectors
    eu_char = c()
    good = 0        #good & bad counters initialized with 0
    bad = 0

    #LOOP 2: looping over train data
    for(j in c(1:nrow(train_data))){

      #adding euclidean distance b/w test data point and train data point to eu_dist vector
      eu_dist <- c(eu_dist, euclideanDist(test_data[i,], train_data[j,]))

      #adding class variable (6th column) of training data to eu_char
      eu_char <- c(eu_char, as.character(train_data[j, 6]))
    }

    eu <- data.frame(eu_char, eu_dist)  #eu dataframe created with eu_char & eu_dist columns

    eu <- eu[order(eu$eu_dist),]        #sorting eu dataframe to get top K neighbors
    eu <- eu[1:k_value,]                #eu dataframe with top K neighbors

    #LOOP 3: loops over eu and counts the classes of the neighbors
    for(k in c(1:nrow(eu))){
      if(as.character(eu[k, "eu_char"]) == "g"){
        good = good + 1
      } else {
        bad = bad + 1
      }
    }

    #compare the numbers of good and bad neighbors
    if(good > bad){        #if the majority of neighbors are good then put "g" in pred vector
      pred <- c(pred, "g")
    } else {               #if the majority of neighbors are bad then put "b" in pred vector
      pred <- c(pred, "b")
    }
  }
  return(pred)  #return pred vector
}
```
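The same distance-sort-vote logic can also be sketched more compactly for a single query point using vectorized operations. This is an illustrative alternative, not the tutorial's function; knn_one and the toy data are made up for the example:

```r
# Compact KNN vote for one query point x; the last column of train is the label
knn_one <- function(x, train, k){
  feat <- train[, -ncol(train)]                        # drop the label column
  d <- sqrt(rowSums(sweep(feat, 2, as.numeric(x))^2))  # Euclidean distances
  nn <- order(d)[1:k]                                  # indices of the k nearest
  names(which.max(table(train[nn, ncol(train)])))      # majority class among them
}

# Toy data: "g" points cluster near (+1, +1), "b" points near (-1, -1)
train <- data.frame(v1 = c(1, 0.9, 1.1, -1, -0.9, -1.1),
                    v2 = c(1, 1.1, 0.9, -1, -1.1, -0.9),
                    label = c("g", "g", "g", "b", "b", "b"),
                    stringsAsFactors = FALSE)
knn_one(c(0.8, 1.0), train, k = 3)   # "g"
```

Vectorizing the distance computation with rowSums and sweep avoids the explicit inner loop, at the cost of being less transparent for teaching purposes.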

It returns a vector with the predicted classes for the test dataset. These predictions are then used to calculate the accuracy metric.

### Accuracy Calculation in R

The accuracy metric calculates the ratio of the number of correctly predicted class labels to the total number of predicted labels.

```
accuracy <- function(test_data){
  correct = 0
  for(i in c(1:nrow(test_data))){
    if(test_data[i, 6] == test_data[i, 7]){  #actual (6th) vs predicted (7th) label
      correct = correct + 1
    }
  }
  accu = correct / nrow(test_data) * 100
  return(accu)
}
```
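A toy check of the metric, with the function repeated so the snippet is self-contained; toy is a made-up data frame whose 6th and 7th columns play the roles of actual and predicted labels:

```r
accuracy <- function(test_data){
  correct = 0
  for(i in c(1:nrow(test_data))){
    if(test_data[i, 6] == test_data[i, 7]){
      correct = correct + 1
    }
  }
  correct / nrow(test_data) * 100
}

toy <- data.frame(matrix(0, nrow = 4, ncol = 5))  # five dummy feature columns
toy[, 6] <- c("g", "g", "b", "b")                 # actual labels
toy[, 7] <- c("g", "b", "b", "b")                 # predicted labels
accuracy(toy)                                     # 3 of 4 correct -> 75
```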

KNN algorithm accuracy print: in this code snippet we join all our functions. We call the knn_predict function with the train and test data frames that we split earlier, and a K value of 5.
We append the prediction vector as the 7th column of our test data frame, and then print the accuracy of our KNN model using the accuracy() function.

```
K <- 5
predictions <- knn_predict(test.df, train.df, K)  #calling knn_predict()

test.df[,7] <- predictions  #adding predictions in test data as 7th column
print(accuracy(test.df))
```

#### Script Output:

```
Accuracy of our KNN model is
77.77778
```

It prints the accuracy of our KNN model. Here the accuracy is 77.78%, which is pretty good 🙂 for our randomly selected dummy dataset.
You can download the RMD file of this code from our GitHub repository.

Finally, we have implemented our KNN model in R without using any machine-learning packages. Hope you enjoyed learning it.

I hope you like this post. If you have any questions, feel free to comment below. If you would like me to write about a particular topic, let me know in the comments as well.
