
K-Nearest Neighbor Algorithm Implementation in R Programming from Scratch

In the introduction to k-nearest-neighbor algorithm article, we learned the core concepts of the knn algorithm. We also learned about applications that use the knn algorithm to solve real-world problems.

In this post, we will implement the K-Nearest Neighbor algorithm on a dummy dataset using the R programming language from scratch. Along the way, we will build a prediction model to predict classes for the data.

Knn Implementation in R

Why we need to implement the knn algorithm from scratch in R Programming Language

Implementing the K-Nearest Neighbor algorithm in R from scratch helps us apply the concepts of the knn algorithm. We are going to implement each and every component of the algorithm ourselves, along with the supporting pieces such as loading the dataset and measuring the accuracy of our model.

Problem Set

We will use a sample dataset extracted from the ionosphere database by Johns Hopkins University. We have reduced the database to a small dataset to simplify the learning curve for our readers.

Our objective is to program a Knn classifier in the R programming language without using any machine learning package. We have two classes, “g” (good) and “b” (bad), which describe the response of the radar from the ionosphere. The classifier should be capable of predicting the “g” or “b” class for new records, based on the training data.

Ionosphere Dataset Description

This dummy dataset consists of 6 attributes and 30 records. Of these, 5 attributes are continuous variables with values in the range [-1, +1]. The last (6th) attribute is a categorical variable with values “g” (good) or “b” (bad), as defined above. This is a binary classification task.

K-Nearest Neighbor Algorithm Pseudocode

Let (Xi, Ci), where i = 1, 2, …, n, be the data points. Xi denotes the feature values and Ci denotes the label of Xi for each i.
Assume the number of classes is ‘c’, so that
Ci ∈ {1, 2, 3, …, c} for all values of i.

Let x be a point whose label is not known, and whose class we would like to find using the k-nearest neighbor algorithm.

Procedure:

1. Calculate d(x, xi) for i = 1, 2, …, n, where d denotes the Euclidean distance between the points.
2. Arrange the calculated n Euclidean distances in non-decreasing order.
3. Let k be a positive integer; take the first k distances from this sorted list.
4. Find the k points corresponding to these k distances.
5. Let ki denote the number of points belonging to the ith class among these k points, i.e. ki ≥ 0.
6. If ki > kj ∀ i ≠ j, then put x in class i.

Let’s use the above pseudocode for implementing the knn algorithm in R Language.

Prerequisites:

1. Basic programming experience is required
2. Install RStudio on your system.

K-Nearest neighbor algorithm implementation in R Language from scratch

We are going to follow the below workflow for implementing the knn algorithm in R:

1. Getting Data
2. Train & Test Data Split
3. Euclidean Distance Calculation
4. KNN prediction function
5. Accuracy calculation

Let’s get our hands dirty and start the coding stuff.

Getting Data in R

For any programmatic implementation on the dataset, we first need to import it. Using read.csv(), we import the dataset into the knn.df dataframe. Since the dataset has no header row, we use header = FALSE. The sep parameter defines the character that separates the values in the file.
knn.df is a dataframe. A dataframe is a table, similar to a 2-D array, in which each column contains measurements on one variable, and each row contains one record.
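The import step described above can be sketched as follows. The original CSV file is not available here, so this example writes three made-up rows with the same layout (5 numeric columns, then a “g”/“b” label, no header) to a temporary file and reads that instead. The rows and the temporary path are assumptions for illustration only.

```r
# Hypothetical stand-in for the article's CSV: three made-up rows with
# 5 numeric columns in [-1, 1] followed by a "g"/"b" label, and no header row.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("0.5,-0.2,0.1,0.9,-0.7,g",
             "-0.3,0.8,-0.6,0.2,0.4,b",
             "0.7,0.1,0.3,-0.5,0.6,g"), csv_path)

# header = FALSE because the file has no header row;
# sep = "," is the character that separates the values.
knn.df <- read.csv(csv_path, header = FALSE, sep = ",")
print(knn.df)
```

Because there is no header, R names the columns V1 to V6 automatically.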

To check the dimensions of the dataset, we can call the dim() method, passing the data frame as a parameter.

It shows that the data frame consists of 30 records and 6 columns.

To check a summary of our dataset, we can use the summary() method.

It shows that there are 15 records in the bad class and 15 in the good class.
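The two checks above can be tried on a stand-in data frame. The data below is randomly generated, not the article's dataset; it only mimics the shape described in the post (30 records, 5 numeric attributes in [-1, 1], and a label column with 15 “g” and 15 “b” values). The label is stored as a factor here so that summary() reports per-class counts.

```r
# Made-up stand-in data (NOT the article's dataset), same shape as described.
set.seed(1)
knn.df <- as.data.frame(matrix(runif(30 * 5, min = -1, max = 1), nrow = 30))
knn.df$V6 <- factor(rep(c("g", "b"), each = 15))

print(dim(knn.df))        # 30 rows, 6 columns
print(summary(knn.df))    # per-column summary; V6 shows counts per class
print(table(knn.df$V6))   # explicit class counts: 15 "b" and 15 "g"
```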

Train & Test Data split in R

Before the train and test data split, we need to shuffle the records randomly. In R, we can use the sample() method, which helps to randomize all the records of the dataframe.
Call set.seed(2) first; set.seed() fixes the random number generator so that the results are reproducible. We then pass sample() as the row index of the dataframe, which randomizes all 30 records of knn.df. Now we are ready for the split. We divide the data in a 70:30 ratio, i.e., 70% of the data (21 records) is used as the train set and 30% (9 records) as the test set.
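A minimal sketch of the shuffle and 70:30 split, again on randomly generated stand-in data rather than the article's file. The names train.df, test.df, and n_train are assumptions chosen for this example.

```r
# Made-up stand-in data (NOT the article's dataset): 30 rows, 6 columns.
set.seed(1)
knn.df <- as.data.frame(matrix(runif(30 * 5, min = -1, max = 1), nrow = 30))
knn.df$V6 <- factor(rep(c("g", "b"), each = 15))

set.seed(2)                                # makes the shuffle reproducible
knn.df <- knn.df[sample(nrow(knn.df)), ]   # randomize the row order

n_train  <- round(0.7 * nrow(knn.df))      # 70% of 30 rows = 21
train.df <- knn.df[1:n_train, ]
test.df  <- knn.df[(n_train + 1):nrow(knn.df), ]

print(dim(train.df))   # 21 rows, 6 columns
print(dim(test.df))    # 9 rows, 6 columns
```

Shuffling before slicing matters here because the stand-in rows are ordered by class; without it, the test set would contain only one class.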

Euclidean Distance Calculation in R

We define a function in R to calculate the Euclidean distance between 2 points a & b. The formula of Euclidean distance is: d(a, b) = √((a1 − b1)² + (a2 − b2)² + … + (an − bn)²), where n is the number of features.
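One way to write such a distance function is sketched below. The name euclidean_dist is an assumption for this example, and it expects two numeric vectors of equal length that contain only feature values (no class label).

```r
# Euclidean distance between two numeric vectors of the same length.
# Assumption: a and b hold feature values only, with the label removed.
euclidean_dist <- function(a, b) {
  sqrt(sum((a - b)^2))
}

a <- c(0, 0)
b <- c(3, 4)
print(euclidean_dist(a, b))   # 5, the classic 3-4-5 right triangle
```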

KNN prediction function in R

This function is the core part of this tutorial. We write a function knn_predict that takes 3 arguments: the test data, the train data, and the value of K. It loops over every record of the test data and, for each one, over every record of the train data to compute the distances.

It returns a vector with the predicted classes of the test dataset. These predictions can be used to calculate the accuracy metric.
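A minimal sketch of such a prediction function follows. It is one possible implementation, not necessarily the original one, and it assumes that the class label sits in the last column and that all other columns are numeric features. Ties in the vote go to the class that comes first alphabetically, which cannot happen with two classes and an odd K. The toy train and test frames are made up for illustration.

```r
knn_predict <- function(test_data, train_data, k_value) {
  label_col <- ncol(train_data)   # assumption: the label is the last column
  train_x <- as.matrix(train_data[, -label_col, drop = FALSE])
  test_x  <- as.matrix(test_data[, seq_len(label_col - 1), drop = FALSE])
  train_y <- as.character(train_data[[label_col]])

  pred <- character(nrow(test_x))
  for (i in seq_len(nrow(test_x))) {        # loop over test records
    dists <- numeric(nrow(train_x))
    for (j in seq_len(nrow(train_x))) {     # loop over train records
      dists[j] <- sqrt(sum((test_x[i, ] - train_x[j, ])^2))
    }
    nearest <- order(dists)[seq_len(k_value)]   # indices of the k closest
    votes   <- table(train_y[nearest])          # count labels among them
    pred[i] <- names(votes)[which.max(votes)]   # majority class
  }
  pred
}

# Made-up toy data: three "b" points near (0, 0), three "g" points near (1, 1).
train <- data.frame(x = c(0, 0.1, 0.2, 1, 1.1, 1.2),
                    y = c(0, 0.1, 0, 1, 1, 1.1),
                    class = c("b", "b", "b", "g", "g", "g"))
test <- data.frame(x = c(0.05, 1.05), y = c(0.05, 1.0), class = c("b", "g"))

print(knn_predict(test, train, 3))   # "b" "g"
```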

Accuracy Calculation in R

The accuracy metric calculates the ratio of the number of correctly predicted class labels to the total number of predicted labels.
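The ratio described above can be computed in one line. This sketch assumes a two-argument signature (actual labels, predicted labels) and returns a percentage; both choices are assumptions for this example.

```r
# Percentage of predictions that match the actual labels.
accuracy <- function(actual, predicted) {
  mean(actual == predicted) * 100
}

actual    <- c("g", "g", "b", "b")
predicted <- c("g", "b", "b", "b")
print(accuracy(actual, predicted))   # 75, since 3 of 4 labels match
```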

KNN Algorithm accuracy print: In this step we join all our functions. We call the knn_predict function with the train and test dataframes that we split earlier and a K value of 5.
We append the prediction vector as the 7th column of our test dataframe and then use the accuracy() method to print the accuracy of our KNN model.
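The whole workflow can be sketched end to end as below. Since the original CSV is not available, the data is made up: “g” rows are drawn from [0, 1] and “b” rows from [-1, 0], so the classes are easy to separate and the printed accuracy says nothing about the real ionosphere sample. The function bodies are compact variants written for this example, with the features assumed to be in columns 1 to 5 and the label in column 6.

```r
# Made-up stand-in data (NOT the article's dataset): 30 rows, label in column 6.
set.seed(1)
labels <- rep(c("g", "b"), each = 15)
lo <- ifelse(labels == "g", 0, -1)          # lower bound per row
knn.df <- as.data.frame(matrix(runif(30 * 5, min = lo, max = lo + 1), nrow = 30))
knn.df$V6 <- labels

# Shuffle, then split 70:30.
set.seed(2)
knn.df   <- knn.df[sample(nrow(knn.df)), ]
n_train  <- round(0.7 * nrow(knn.df))
train.df <- knn.df[1:n_train, ]
test.df  <- knn.df[(n_train + 1):nrow(knn.df), ]

# Predict a class for every test row by majority vote of the k nearest rows.
knn_predict <- function(test_data, train_data, k_value) {
  sapply(seq_len(nrow(test_data)), function(i) {
    x <- unlist(test_data[i, 1:5])
    d <- apply(train_data[, 1:5], 1, function(r) sqrt(sum((r - x)^2)))
    votes <- table(train_data[order(d)[seq_len(k_value)], 6])
    names(votes)[which.max(votes)]
  })
}

# Compare the actual labels (column 6) with the predictions (column 7).
accuracy <- function(test_data) {
  mean(test_data[, 6] == test_data[, 7]) * 100
}

K <- 5
test.df[, 7] <- knn_predict(test.df, train.df, K)   # append as 7th column
print(accuracy(test.df))
```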

Script Output:

It prints the accuracy of our knn model. Here our accuracy is 77.78%, which corresponds to 7 of the 9 test records being predicted correctly. That’s pretty good 🙂 for our randomly selected dummy dataset.
You can download the RMD file of this code from our GitHub repository.

Finally, we have implemented our KNN model in R programming without using any specific R packages. Hope you enjoyed learning it.

I hope you like this post. If you have any questions, feel free to comment below. If you want me to write on a particular topic, let me know in the comments below.

• milind says:

from where do i download the data set for the above KNN implementation?

• Saimadhu Polamuri says:

Hi Milind,

The dataset in this post is a dummy dataset. Sorry to say, we misplaced it somehow. You can find many classification datasets at https://archive.ics.uci.edu

• Ricardo says:

Hi there.

I need the link with the file “i_data_sample_30.csv”, shown at the begining. Thanks a lot!!

Ricardo (sorry the poor English)

• Saimadhu Polamuri says:

Hi Ricardo,

Sent to your mail. 🙂

• jeza says:

Thanks so much for your informative post

• Saimadhu Polamuri says:

Hi Jeza,

Thanks for your compliment. 🙂