Gaussian Naive Bayes Classifier implementation in Python

Building Gaussian Naive Bayes Classifier in Python

In this post, we are going to implement the Naive Bayes classifier in Python using my favorite machine learning library, scikit-learn. We will then use the trained Naive Bayes model (a supervised classification model) to predict income on the Census Income dataset.

We discussed Bayes' theorem in the Naive Bayes classifier post, so we hope you know its basics. If not, let's quickly review Bayes' theorem.

Bayes' theorem is based on conditional probability. Conditional probability helps us calculate the probability that something will happen, given that something else has already happened. If that doesn't click yet, let's look at a few examples.

Conditional Probability Examples

Below are a few examples that help clarify the definition of conditional probability.

  • Purchasing a MacBook when you have already purchased an iPhone.
  • Having a refreshing drink when you are in the movie theater.
  • Buying peanuts when you have bought a chilled soft drink.

The Naive Bayes classifier works using Bayes' theorem. It assumes that all the features are independent of each other. Even if the features depend on each other or on the existence of other features, the Naive Bayes classifier treats each of these properties as contributing independently to the probability that the user buys the MacBook.
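To make the theorem concrete, here is a tiny sketch with made-up numbers (the 0.15/0.80/0.40 values are assumptions for illustration only):

# Bayes' theorem: P(buys MacBook | owns iPhone)
#   = P(owns iPhone | buys MacBook) * P(buys MacBook) / P(owns iPhone)
p_macbook = 0.15               # prior: fraction of customers who buy a MacBook (assumed)
p_iphone_given_macbook = 0.80  # likelihood: MacBook buyers who own an iPhone (assumed)
p_iphone = 0.40                # evidence: fraction of customers who own an iPhone (assumed)

p_macbook_given_iphone = p_iphone_given_macbook * p_macbook / p_iphone
print(p_macbook_given_iphone)  # 0.3: knowing the customer owns an iPhone doubles the prior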

To learn the key concepts related to Naive Bayes, you can read our article Introduction to Naive Bayes; it covers the core concepts in detail.

In the introduction to Naive Bayes post, we discussed three popular Naive Bayes algorithms:

  • Gaussian Naive Bayes,
  • Multinomial Naive Bayes.
  • Bernoulli Naive Bayes.

Continuing from the Naive Bayes algorithm article, we are now going to implement Gaussian Naive Bayes on the “Census Income” dataset.

Gaussian Naive Bayes

Gaussian Naive Bayes is a special type of Naive Bayes algorithm, used when the features have continuous values. It also assumes that all the features follow a Gaussian distribution, i.e., a normal distribution.
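As a quick illustration (a sketch we are adding here, not part of the original walkthrough), this is the per-feature Gaussian likelihood the classifier evaluates; the mean and standard deviation values below are made up:

import math

def gaussian_likelihood(x, mean, std):
    # Probability density of x under a normal distribution N(mean, std^2)
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

# e.g. likelihood of age = 40 for a class whose ages have mean 38.5 and std 13.0
print(gaussian_likelihood(40, 38.5, 13.0))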

Census Income Dataset

The Census Income task is to predict whether a person's income is >50K/yr (greater than $50K/yr) or <=50K/yr. The data was extracted by Barry Becker from the 1994 Census database.

The dataset was contributed to the UCI repository and is openly available at this link. It consists of 15 columns with a mix of discrete and continuous data.

Variable Name: Variable Range

1. age: [17 – 90]
2. workclass: [‘State-gov’, ‘Self-emp-not-inc’, ‘Private’, ‘Federal-gov’, ‘Local-gov’, ‘Self-emp-inc’, ‘Without-pay’, ‘Never-worked’]
3. fnlwgt: [77516 – 257302]
4. education: [‘Bachelors’, ‘HS-grad’, ’11th’, ‘Masters’, ‘9th’, ‘Some-college’, ‘Assoc-acdm’, ‘Assoc-voc’, ‘7th-8th’, ‘Doctorate’, ‘Prof-school’, ‘5th-6th’, ’10th’, ‘1st-4th’, ‘Preschool’, ’12th’]
5. education_num: [1 – 16]
6. marital_status: [‘Never-married’, ‘Married-civ-spouse’, ‘Divorced’, ‘Married-spouse-absent’, ‘Separated’, ‘Married-AF-spouse’, ‘Widowed’]
7. occupation: [‘Adm-clerical’, ‘Exec-managerial’, ‘Handlers-cleaners’, ‘Prof-specialty’, ‘Other-service’, ‘Sales’, ‘Craft-repair’, ‘Transport-moving’, ‘Farming-fishing’, ‘Machine-op-inspct’, ‘Tech-support’, ‘Protective-serv’, ‘Armed-Forces’, ‘Priv-house-serv’]
8. relationship: [‘Not-in-family’, ‘Husband’, ‘Wife’, ‘Own-child’, ‘Unmarried’, ‘Other-relative’]
9. race: [‘White’, ‘Black’, ‘Asian-Pac-Islander’, ‘Amer-Indian-Eskimo’, ‘Other’]
10. sex: [‘Male’, ‘Female’]
11. capital_gain: [0 – 99999]
12. capital_loss: [0 – 4356]
13. hours_per_week: [1 – 99]
14. native_country: [‘United-States’, ‘Cuba’, ‘Jamaica’, ‘India’, ‘Mexico’, ‘South’, ‘Puerto-Rico’, ‘Honduras’, ‘England’, ‘Canada’, ‘Germany’, ‘Iran’, ‘Philippines’, ‘Italy’, ‘Poland’, ‘Columbia’, ‘Cambodia’, ‘Thailand’, ‘Ecuador’, ‘Laos’, ‘Taiwan’, ‘Haiti’, ‘Portugal’, ‘Dominican-Republic’, ‘El-Salvador’, ‘France’, ‘Guatemala’, ‘China’, ‘Japan’, ‘Yugoslavia’, ‘Peru’, ‘Outlying-US(Guam-USVI-etc)’, ‘Scotland’, ‘Trinadad&Tobago’, ‘Greece’, ‘Nicaragua’, ‘Vietnam’, ‘Hong’, ‘Ireland’, ‘Hungary’, ‘Holand-Netherlands’]
15. income: [‘<=50K’, ‘>50K’]

The final target variable consists of two values: ‘<=50K’ and ‘>50K’.

Implementation of Gaussian NB on Census Income dataset

Importing Python Machine Learning Libraries

We need to import the pandas, numpy, and sklearn libraries. From sklearn, we need preprocessing modules such as an imputer. The imputer class (SimpleImputer in current scikit-learn versions; the older Imputer class has been removed) helps to impute missing values.

If you haven't set up the Python machine learning libraries yet, you can do that first in order to run the code in this article.

# Required Python Machine learning Packages
import pandas as pd
import numpy as np
# For preprocessing the data
from sklearn import preprocessing
# Imputer for missing values (SimpleImputer replaced the old Imputer class)
from sklearn.impute import SimpleImputer
# To split the dataset into train and test datasets
# (older scikit-learn versions used sklearn.cross_validation instead)
from sklearn.model_selection import train_test_split
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
# To calculate the accuracy score of the model
from sklearn.metrics import accuracy_score

The train_test_split function splits the dataset into training and testing sets. The accuracy_score function will be used for calculating the accuracy of our Gaussian Naive Bayes model.

Data Import

For importing the census data, we are using pandas' read_csv() method. It is a very simple and fast way to import data.

We are passing four parameters. The ‘adult.data’ parameter is the file name. The header parameter tells pandas whether the first row of the data contains headers; our dataset has no header, so we pass None.

The delimiter parameter specifies the delimiter that separates the data values. Here, we are using the regular expression ‘ *, *’ as the delimiter, which also strips any spaces before and after each data value. This is very helpful when the spacing around data values is inconsistent.

adult_df = pd.read_csv('adult.data',
                       header = None, delimiter=' *, *', engine='python')

Let’s add headers to our dataframe. The below code snippet can be used to perform this task.

adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'income']

Handling Missing Data

Let's test whether there is any null value in our dataset. We can do this using the isnull() method.

adult_df.isnull().sum()

Script Output

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

The above output shows that there is no “null” value in our dataset.

Let's test whether any categorical attribute contains a “?”. At times, “?” or ” ” appears in place of missing values. Using the below code snippet, we are going to test whether our adult_df dataframe contains categorical variables with the value “?”.

for value in ['workclass', 'education',
              'marital_status', 'occupation',
              'relationship', 'race', 'sex',
              'native_country', 'income']:
    print(value, ":", sum(adult_df[value] == '?'))

Script Output

workclass : 1836
education : 0
marital_status : 0
occupation : 1843
relationship : 0
race : 0
sex : 0
native_country : 583
income : 0

The output of the above code snippet shows that the workclass attribute has 1836 missing values, the occupation attribute has 1843, and the native_country attribute has 583.

Data preprocessing

For preprocessing, we are going to make a duplicate copy of our original dataframe, duplicating adult_df into the adult_df_rev dataframe. Note that a plain assignment (adult_df_rev = adult_df) would only create a second reference to the same dataframe, so we use copy() to get an independent one.

adult_df_rev = adult_df.copy()

We want to perform some imputation for the missing values. Before doing that, we need some summary statistics of our dataframe. For this, we can use the describe() method, which generates various summary statistics, excluding NaN values.

We pass the “include” parameter with the value “all” to specify that we want summary statistics for all the attributes.

adult_df_rev.describe(include='all')

You can check the basic statistics of the dataset after running the above command. It is worth spending some time here to understand each statistic provided.
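Before imputing, it may help to peek at the row we are about to use. As a small optional check (our addition, not in the original post), the 'top' row of describe() holds each column's most frequent value, which is exactly what the imputation in the next step substitutes for “?”:

# The 'top' row of describe() is each column's most frequent value
print(adult_df_rev.describe(include='all').loc['top'])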

Data Imputation Step

Now it's time to impute the missing values. Some of our categorical attributes have missing values, i.e., “?”. We are going to replace each “?” with the attribute's most frequent value, which is the “top” row of the describe() output above. For example, we replace the “?” values of the workclass attribute with the value “Private”.

for value in ['workclass', 'education',
              'marital_status', 'occupation',
              'relationship', 'race', 'sex',
              'native_country', 'income']:
    adult_df_rev[value].replace(['?'], [adult_df_rev.describe(include='all')[value][2]],
                                inplace=True)

We have performed the data imputation step. 🙂
You can check the changes in the dataframe by printing adult_df_rev.

For Naive Bayes, we need all the data values in one numeric format. We are going to encode each label as a value between 0 and n_classes-1.

Label Encoder

To implement this, we are going to use the LabelEncoder of the scikit-learn library. One-hot encoding, which encodes the data into a binary format, is an alternative; see the sketch after the code below.
le = preprocessing.LabelEncoder()
workclass_cat = le.fit_transform(adult_df.workclass)
education_cat = le.fit_transform(adult_df.education)
marital_cat   = le.fit_transform(adult_df.marital_status)
occupation_cat = le.fit_transform(adult_df.occupation)
relationship_cat = le.fit_transform(adult_df.relationship)
race_cat = le.fit_transform(adult_df.race)
sex_cat = le.fit_transform(adult_df.sex)
native_country_cat = le.fit_transform(adult_df.native_country)
#initialize the encoded categorical columns
adult_df_rev['workclass_cat'] = workclass_cat
adult_df_rev['education_cat'] = education_cat
adult_df_rev['marital_cat'] = marital_cat
adult_df_rev['occupation_cat'] = occupation_cat
adult_df_rev['relationship_cat'] = relationship_cat
adult_df_rev['race_cat'] = race_cat
adult_df_rev['sex_cat'] = sex_cat
adult_df_rev['native_country_cat'] = native_country_cat
#drop the old categorical columns from dataframe
dummy_fields = ['workclass', 'education', 'marital_status', 
                  'occupation', 'relationship', 'race',
                  'sex', 'native_country']
adult_df_rev = adult_df_rev.drop(dummy_fields, axis = 1)

Using the above code snippet, we have created multiple encoded categorical columns such as “marital_cat” and “race_cat”. You can see the top five lines of the dataframe using adult_df_rev.head().
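For comparison, here is a minimal sketch of the one-hot alternative mentioned above (our addition; the tutorial itself proceeds with label encoding). Pandas' get_dummies expands each categorical column into one binary 0/1 column per category:

# One-hot encode the categorical columns instead of label encoding them
one_hot_df = pd.get_dummies(adult_df, columns=['workclass', 'education',
                                               'marital_status', 'occupation',
                                               'relationship', 'race', 'sex',
                                               'native_country'])
print(one_hot_df.shape)  # many more columns than the label-encoded dataframe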

By printing adult_df_rev.head(), you will see that the columns are not in their proper order and need to be reindexed. For reindexing the columns, you can use the code snippet provided below:

adult_df_rev = adult_df_rev.reindex(columns=['age', 'workclass_cat', 'fnlwgt', 'education_cat',
                                             'education_num', 'marital_cat', 'occupation_cat',
                                             'relationship_cat', 'race_cat', 'sex_cat', 'capital_gain',
                                             'capital_loss', 'hours_per_week', 'native_country_cat',
                                             'income'])

adult_df_rev.head(1)

The output of the above code snippet shows that all the columns are reindexed properly. I passed the list of column names to the columns parameter of reindex().

Standardization of Data

All the data values of our dataframe are now numeric. Next, we need to bring them onto a single scale by standardizing the values. We can use the below formula for standardization.

$latex x_i = \frac{x_i - \mathrm{mean}(x)}{\sigma(x)}$

num_features = ['age', 'workclass_cat', 'fnlwgt', 'education_cat', 'education_num',
                'marital_cat', 'occupation_cat', 'relationship_cat', 'race_cat',
                'sex_cat', 'capital_gain', 'capital_loss', 'hours_per_week',
                'native_country_cat']

scaled_features = {}
for each in num_features:
    mean, std = adult_df_rev[each].mean(), adult_df_rev[each].std()
    scaled_features[each] = [mean, std]
    adult_df_rev.loc[:, each] = (adult_df_rev[each] - mean)/std

We have converted our data values into standardized values. You can print the dataframe and check the output.
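Equivalently, scikit-learn's StandardScaler can do this in one step. A sketch of the alternative (not what the loop above uses; note that StandardScaler uses the population standard deviation, so its output can differ very slightly from the pandas sample std used above):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the numeric feature columns and transform them in place
scaler = StandardScaler()
adult_df_rev[num_features] = scaler.fit_transform(adult_df_rev[num_features])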

Data Slicing

Let's split the data into training and test sets. We can easily perform this step using sklearn's train_test_split() method.

features = adult_df_rev.values[:, :14]
target = adult_df_rev.values[:, 14]
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.33, random_state=10)

Using the above code snippet, we have divided the data into a feature set and a target set. The feature set consists of 14 columns (the predictor variables) and the target set consists of one column with the class values.

The features_train & target_train consists of training data and the features_test & target_test consists of testing data.

Gaussian Naive Bayes Implementation

After completing the data preprocessing, it's time to apply the machine learning algorithm. We are going to use sklearn's GaussianNB module.

clf = GaussianNB()
clf.fit(features_train, target_train)
target_pred = clf.predict(features_test)

We have built a GaussianNB classifier and trained it on the training data using the fit() method. After fitting, the model is ready to make predictions: we call the predict() method with the test set features as its parameter.

Accuracy of our Gaussian Naive Bayes model

It's time to test the quality of our model. We have made some predictions; let's compare the model's predictions with the actual target values of the test set. This comparison gives us the accuracy of our model.

accuracy_score(target_test, target_pred, normalize = True)

Script Output

0.80141447980643965

Awesome! Our model gives an accuracy of 80%, which is not bad for such a simple implementation. You can create random test datasets and test the model to get to know how well the trained Gaussian Naive Bayes model performs.
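If you want a fuller picture than a single accuracy number, a confusion matrix and a per-class report are a quick extra check (our addition, using standard sklearn.metrics functions):

from sklearn.metrics import confusion_matrix, classification_report

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(target_test, target_pred))
# Precision, recall, and F1-score for each income class
print(classification_report(target_test, target_pred))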

We can save the trained scikit-learn model with Python pickle, so it can be reused later without retraining. You can check out our article on how to save a trained scikit-learn model with Python pickle.
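A minimal sketch of the pickle save/load round trip (the file name here is our own hypothetical choice):

import pickle

# Save the trained classifier to disk
with open('gaussian_nb_census.pkl', 'wb') as f:  # hypothetical file name
    pickle.dump(clf, f)

# Load it back later and predict without retraining
with open('gaussian_nb_census.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)
print(loaded_clf.predict(features_test[:5]))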


I hope you like this post. If you have any questions, feel free to comment below, and if you would like me to write on a particular topic, let me know there as well.


29 Responses to “Gaussian Naive Bayes Classifier implementation in Python”

  • Santiago
    4 years ago

    How can I get the coefficients of the trained model? Thanks!

    • Hi Santiago,

      For the Gaussian Naive Bayes algorithm, we won't have coefficients like a linear regression model does. If you want to know the other parameters of the model, you can simply print the fitted model; in our case, print the clf variable in the code.

      Thanks and Happy Learning!

  • I tried your code with Python 3.7 and pandas 0.25, but the reindex is returning NaN values.

  • Jayaraj
    5 years ago

    Can you provide adult.data so we can run this locally?

    • Hi Jayaraj,

      The data we used is a dummy; the idea is to learn how to apply the same algorithm to different datasets. Feel free to use the same code on a different dataset with a few modifications.

      Happy Learning!

  • Vaidotas Ivoška
    5 years ago

    How do I predict new 0/1 values using the trained model on a new dataset without target values?

    • Hi Vaidota Ivoska,

      For that, you need the same feature columns you used for training the model. Pass that features data to the predict method (model.predict) to get the target values, like 0/1.

      Thanks & Happy Learning!

  • Hi Saimadhu, using your example above, how can I view the actual coefficients? Is there an equivalent of “coef_” using GaussianNB?

    • Hi Mart,

      For GaussianNB we won't get the coef_ that we get from the linear regression and support vector machine algorithms. The reason is simple to understand: in linear regression and support vector machines we are trying to create lines/planes that can separate the target classes with a reasonable error, whereas in GaussianNB this is not the case. I hope that clears up your question; if not, please let me know and I will try to explain with an example.

      Thanks and happy learning!

  • Is there a place where you confirm that your features follow the assumption you stated, that “all the features are following a gaussian distribution”?

    • Hi Daniel,

      Yes, you said it correctly. Before we assume the features follow the Gaussian distribution, we need to cross-check whether they actually do. As this blog post is more about the Gaussian Naive Bayes classifier, I haven't spent much time explaining that.

  • Getting this error : ValueError: Found input variables with inconsistent numbers of samples: [21815, 10746]

  • Madhivarman
    6 years ago

    I checked the data points while applying fit_transform, and the attributes are still NaN... there seems to be no error in my code. 🙁

  • Madhivarman
    6 years ago

    race_cat 32561
    sex_cat 32561

    All values of the attributes race_cat and sex_cat become NaN.

    • Hi Madhivarman,

      Have you checked the data points before applying fit_transform? If fit_transform worked properly, you shouldn't face any issues.

  • Madhivarman
    6 years ago

    Hello. When I encode race_cat and sex_cat, the encoding completes successfully, but when I store the whole dataframe, the race_cat and sex_cat attributes are stored as NaN.

    Any idea why this happened?


  • Chandar
    6 years ago

    Hi, I have created the model. Now, in order to predict on new feature values that are not in the training/testing set, how should we approach it, given that we have converted all the categorical values to numeric? Which is the best method for predicting on a new set of features?

  • Valerio
    7 years ago

    In the `clf.fit(features_train, target_train)` step, I get the following error: “Input contains NaN, infinity or a value too large for dtype(‘float64’).”

    • Hi Valerio,

      The dataset we are using does not have NaN values. You can use the code below to check which features have NaN values:
      df.isnull().sum()

      You can find similar code in the article, and you can observe that no feature has NaN values.

  • I met the same problem.
    ValueError: could not convert string to float: <=50K

    Could somebody help?

  • Kishori
    7 years ago

    Getting an error when GaussianNB is applied :
    ValueError: could not convert string to float: <=50K

    • Hi kishori,
      I guess there is some issue with the format in your code. Please check the data type and do the needed format conversion.

      # Data type checking
      type(input)
      # Converting to float (If it is string)
      input_float = float(input)
