# Gaussian Naive Bayes Classifier implementation in Python


## Building Gaussian Naive Bayes Classifier in Python

In this post, we are going to implement the Naive Bayes classifier in Python using my favorite machine learning library, scikit-learn. We will then use the trained Naive Bayes model, a supervised classifier, to predict census income.

We discussed Bayes’ theorem in the naive Bayes classifier post, so we hope you know the basics. If not, let’s quickly review them.

Bayes’ theorem is based on conditional probability. Conditional probability lets us calculate the probability that something will happen, given that something else has already happened. If that sounds abstract, a few examples will help.

### Conditional Probability Examples

The examples below help make the definition of conditional probability concrete.

• Having a refreshing drink when you are in the movie theater.
• Buying peanuts when you have bought a chilled soft drink.
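To make the first example concrete, here is a minimal sketch that computes a conditional probability from counts and then applies Bayes’ theorem to reverse the condition. All the counts are hypothetical numbers invented for illustration; they do not come from any dataset.

```python
# Hypothetical survey counts, invented purely for illustration.
total_people = 1000
went_to_movie = 200          # people who went to the movie theater
bought_drink = 300           # people who bought a refreshing drink
movie_and_drink = 150        # people who did both

p_movie = went_to_movie / total_people
p_drink = bought_drink / total_people

# Conditional probability: P(drink | movie) = P(drink and movie) / P(movie).
# With counts from the same population this is a ratio of two counts.
p_drink_given_movie = movie_and_drink / went_to_movie

# Bayes' theorem: P(movie | drink) = P(drink | movie) * P(movie) / P(drink)
p_movie_given_drink = p_drink_given_movie * p_movie / p_drink

print(f"P(drink | movie) = {p_drink_given_movie:.2f}")
print(f"P(movie | drink) = {p_movie_given_drink:.2f}")
```

The second value can be checked directly from the counts as well (people who did both, divided by people who bought a drink), which is a handy sanity check on the formula.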


The naive Bayes classifier is built on Bayes’ theorem. It assumes that all the features are independent of each other. Even if the features depend on each other or on the existence of other features, the classifier treats each of them as contributing independently to the probability of the outcome, for example the probability that a user buys a MacBook.

To learn the key concepts behind Naive Bayes, you can read our article Introduction to Naive Bayes.

In the introduction to Naive Bayes post, we discussed three popular Naive Bayes algorithms:

• Gaussian Naive Bayes
• Multinomial Naive Bayes
• Bernoulli Naive Bayes

As a continuation of the Naive Bayes algorithm article, we are now going to implement Gaussian Naive Bayes on the “Census Income” dataset.

## Gaussian Naive Bayes

Gaussian Naive Bayes is a special type of NB algorithm. It is used when the features have continuous values, and it assumes that, within each class, every feature follows a Gaussian (i.e., normal) distribution.
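As a rough illustration of what the Gaussian assumption means in practice, the sketch below evaluates the normal density of one feature value under two classes. The per-class means and standard deviations are hypothetical placeholders, not statistics computed from the census data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

# Hypothetical (mean, std) of the "age" feature for each class.
class_stats = {"<=50K": (37.0, 14.0), ">50K": (44.0, 10.5)}

# Likelihood of observing age = 40 under each class. Gaussian Naive Bayes
# multiplies such per-feature likelihoods together with the class prior.
likelihoods = {
    label: gaussian_pdf(40, mean, std)
    for label, (mean, std) in class_stats.items()
}
print(likelihoods)
```

The class with the larger product of prior and likelihoods is the one the classifier predicts.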

## Census Income Dataset

The task for the Census Income dataset is to predict whether a person’s income is greater than \$50K/yr or at most \$50K/yr. The data was collected by Barry Becker from the 1994 Census dataset.

This dataset was contributed to the UCI repository and is openly available there. It consists of 15 columns with a mix of discrete and continuous data.

| No. | Variable Name | Variable Range |
| --- | --- | --- |
| 1 | age | [17 – 90] |
| 2 | workclass | [‘State-gov’, ‘Self-emp-not-inc’, ‘Private’, ‘Federal-gov’, ‘Local-gov’, ‘Self-emp-inc’, ‘Without-pay’, ‘Never-worked’] |
| 3 | fnlwgt | [77516 – 257302] |
| 4 | education | [‘Bachelors’, ‘HS-grad’, ‘11th’, ‘Masters’, ‘9th’, ‘Some-college’, ‘Assoc-acdm’, ‘Assoc-voc’, ‘7th-8th’, ‘Doctorate’, ‘Prof-school’, ‘5th-6th’, ‘10th’, ‘1st-4th’, ‘Preschool’, ‘12th’] |
| 5 | education_num | [1 – 16] |
| 6 | marital_status | [‘Never-married’, ‘Married-civ-spouse’, ‘Divorced’, ‘Married-spouse-absent’, ‘Separated’, ‘Married-AF-spouse’, ‘Widowed’] |
| 7 | occupation | [‘Adm-clerical’, ‘Exec-managerial’, ‘Handlers-cleaners’, ‘Prof-specialty’, ‘Other-service’, ‘Sales’, ‘Craft-repair’, ‘Transport-moving’, ‘Farming-fishing’, ‘Machine-op-inspct’, ‘Tech-support’, ‘Protective-serv’, ‘Armed-Forces’, ‘Priv-house-serv’] |
| 8 | relationship | [‘Not-in-family’, ‘Husband’, ‘Wife’, ‘Own-child’, ‘Unmarried’, ‘Other-relative’] |
| 9 | race | [‘White’, ‘Black’, ‘Asian-Pac-Islander’, ‘Amer-Indian-Eskimo’, ‘Other’] |
| 10 | sex | [‘Male’, ‘Female’] |
| 11 | capital_gain | [0 – 99999] |
| 12 | capital_loss | [0 – 4356] |
| 13 | hours_per_week | [1 – 99] |
| 14 | native_country | [‘United-States’, ‘Cuba’, ‘Jamaica’, ‘India’, ‘Mexico’, ‘South’, ‘Puerto-Rico’, ‘Honduras’, ‘England’, ‘Canada’, ‘Germany’, ‘Iran’, ‘Philippines’, ‘Italy’, ‘Poland’, ‘Columbia’, ‘Cambodia’, ‘Thailand’, ‘Ecuador’, ‘Laos’, ‘Taiwan’, ‘Haiti’, ‘Portugal’, ‘Dominican-Republic’, ‘El-Salvador’, ‘France’, ‘Guatemala’, ‘China’, ‘Japan’, ‘Yugoslavia’, ‘Peru’, ‘Outlying-US(Guam-USVI-etc)’, ‘Scotland’, ‘Trinadad&Tobago’, ‘Greece’, ‘Nicaragua’, ‘Vietnam’, ‘Hong’, ‘Ireland’, ‘Hungary’, ‘Holand-Netherlands’] |
| 15 | income | [‘<=50K’, ‘>50K’] |

The final target variable consists of two values: ‘<=50K’ and ‘>50K’.

## Implementation of Gaussian NB on Census Income dataset

### Importing Python Machine Learning Libraries

We need to import the pandas, numpy and sklearn libraries. From sklearn, we need preprocessing utilities such as Imputer, which helps to impute missing values.

If you have not yet set up the Python machine learning libraries, complete that setup first so that you can run the code in this article.

The train_test_split function splits the dataset into training and testing sets. The accuracy_score function will be used to calculate the accuracy of our Gaussian Naive Bayes model.
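The imports described above might look like the sketch below. One assumption to flag: the `Imputer` class in `sklearn.preprocessing` has been removed from recent scikit-learn releases, so this sketch imports its replacement, `SimpleImputer` from `sklearn.impute`, instead. If you are on an old scikit-learn version, the original `Imputer` import may be the one that works for you.

```python
import numpy as np
import pandas as pd

from sklearn import preprocessing                      # LabelEncoder lives here
from sklearn.impute import SimpleImputer               # replacement for the old Imputer
from sklearn.model_selection import train_test_split   # train/test splitting
from sklearn.naive_bayes import GaussianNB             # the classifier itself
from sklearn.metrics import accuracy_score             # evaluation metric
```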

### Data Import

To import the census data, we use the pandas read_csv() function, which is a simple and fast way to load data.

We pass four parameters. ‘adult.data’ is the file name. The header parameter tells pandas whether the first row of the data contains column headers. Our dataset has no header row, so we pass None.

The delimiter parameter tells pandas which delimiter separates the values. Here, we use the ‘ *, *’ delimiter, a regular expression that matches a comma together with any spaces before and after it, so those spaces are stripped from the data values. This is very helpful when the spacing around data values is inconsistent.
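A minimal sketch of this loading step is shown below. To stay self-contained, it reads two rows written in the adult.data format from an in-memory buffer; with the real file you would pass 'adult.data' instead of the buffer. Passing `engine='python'` is an assumption on my part about the fourth parameter, chosen because regular-expression delimiters need the Python parsing engine.

```python
import io
import pandas as pd

# Two sample rows in the adult.data layout (comma followed by a space).
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None: the file has no header row.
# delimiter=' *, *': a regex that swallows the spaces around each comma.
adult_df = pd.read_csv(sample, header=None, delimiter=' *, *', engine='python')

print(adult_df.shape)
print(repr(adult_df.iloc[0, 1]))  # no leading space in the value
```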

Let’s add headers to our dataframe. The below code snippet can be used to perform this task.
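One way to add the headers is to assign the 15 variable names from the dataset table to the `columns` attribute. The sketch below uses a single sample row so that it runs on its own; with the real data, `adult_df` would be the dataframe loaded from adult.data.

```python
import pandas as pd

# One sample row standing in for the dataframe loaded without headers.
adult_df = pd.DataFrame([[
    39, "State-gov", 77516, "Bachelors", 13, "Never-married", "Adm-clerical",
    "Not-in-family", "White", "Male", 2174, 0, 40, "United-States", "<=50K",
]])

# Column names taken from the variable table of the dataset.
adult_df.columns = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income",
]

print(adult_df.columns.tolist())
```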

### Handling Missing Data

Let’s check whether there are any null values in our dataset. We can do this using the isnull() method.
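A small sketch of the null check, using a tiny stand-in dataframe rather than the full census data:

```python
import pandas as pd

# Tiny stand-in for adult_df; the values are sample rows, not the full dataset.
adult_df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
    "income": ["<=50K", "<=50K", "<=50K"],
})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = adult_df.isnull().sum()
print(null_counts)
```

A count of zero in every column means pandas found no NaN values, although missing values may still be hidden behind placeholder strings.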

#### Script Output

The above output shows that there is no “null” value in our dataset.

Let’s check whether any categorical attribute contains a “?”. Sometimes “?” or “ ” appears in place of a missing value. Using the code snippet below, we test whether our adult_df dataframe has categorical variables with the value “?”.
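One possible way to count the “?” placeholders is sketched here on a small stand-in dataframe. Selecting the non-numeric columns with `select_dtypes` is my own choice for picking out the categorical attributes.

```python
import pandas as pd

# Small stand-in for adult_df with a few "?" placeholders.
adult_df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "occupation": ["Adm-clerical", "?", "Sales"],
    "native_country": ["United-States", "Cuba", "?"],
})

# Keep only the non-numeric (categorical) columns, then count "?" in each.
categorical = adult_df.select_dtypes(exclude="number")
missing_counts = (categorical == "?").sum()
print(missing_counts)
```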

#### Script Output

The output of the above code snippet shows that there are 1836 missing values in the workclass attribute, 1843 in the occupation attribute and 583 in the native_country attribute.

### Data preprocessing

For preprocessing, we make a copy of our original dataframe: adult_df is duplicated into the adult_df_rev dataframe.

We want to impute the missing values, but before doing that we need some summary statistics of our dataframe. For this, we can use the describe() method, which generates various summary statistics, excluding NaN values.

We pass the “include” parameter with the value “all” to specify that we want summary statistics for all the attributes.

After running the above command, you can check the basic statistics of the dataset. Spend some time here to go through each of the statistics provided.
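The copy and summary steps could look like the following sketch, again on a small stand-in dataframe. For categorical columns, the `top` row of the summary holds the most frequent value and `freq` holds how often it occurs.

```python
import pandas as pd

# Small stand-in for the original dataframe.
adult_df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["Private", "Private", "?", "State-gov"],
})

# Work on a copy so the original dataframe stays untouched.
adult_df_rev = adult_df.copy()

# include="all" reports numeric and categorical columns together.
summary = adult_df_rev.describe(include="all")
print(summary)
```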

### Data Imputation Step

Now, it’s time to impute the missing values. Some of our categorical attributes have missing values, i.e., “?”. We are going to replace each “?” with the value in the “top” row of the describe() output above, which is the most frequent value of that attribute. For example, we replace the “?” values of the workclass attribute with “Private”.

We have performed the data imputation step. 🙂
You can check the changes in the dataframe by printing adult_df_rev.

For naive Bayes, we need to convert all the data values into one format. We are going to encode all the labels with values between 0 and n_classes-1.
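A sketch of this imputation on a small stand-in dataframe follows. It computes the most frequent value of each column while ignoring the “?” entries, which on the real data should give the same value as the `top` row of describe() as long as “?” is not itself the most common entry.

```python
import pandas as pd

# Small stand-in for adult_df_rev with "?" marking missing values.
adult_df_rev = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
    "native_country": ["United-States", "?", "United-States", "Cuba"],
})

for column in ["workclass", "occupation", "native_country"]:
    # Most frequent value among the known (non-"?") entries.
    known = adult_df_rev.loc[adult_df_rev[column] != "?", column]
    most_frequent = known.mode()[0]
    # Replace every "?" placeholder with that value.
    adult_df_rev[column] = adult_df_rev[column].replace("?", most_frequent)

print(adult_df_rev)
```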

### Label Encoding

To implement this, we use LabelEncoder from the scikit-learn library. For encoding, we could also use a one-hot encoder, which encodes each category as a binary indicator column.

Using the above code snippets, we have created multiple categorical columns like “marital_cat”, “race_cat” etc. You can see the first five rows of the dataframe using adult_df_rev.head().

Printing the result of adult_df_rev.head() shows that the columns are not in a proper order and should be reindexed. To reindex the columns, you can use the code snippet provided below:

The output of the above code snippet shows that all the columns are now ordered properly. I passed the list of column names as a parameter, with axis=1 to reindex the columns.
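Here is a minimal sketch of label encoding on two sample columns. The `_cat` suffix follows the column names mentioned in this article; dropping the original string columns afterwards is an assumption on my part, made so that only numeric columns remain.

```python
import pandas as pd
from sklearn import preprocessing

# Small stand-in for adult_df_rev with two categorical columns.
adult_df_rev = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
})

for column in ["sex", "race"]:
    # A fresh encoder per column; classes are numbered in sorted order.
    encoder = preprocessing.LabelEncoder()
    adult_df_rev[column + "_cat"] = encoder.fit_transform(adult_df_rev[column])

# Assumption: keep only the encoded columns for modelling.
adult_df_rev = adult_df_rev.drop(["sex", "race"], axis=1)
print(adult_df_rev)
```

Keeping one encoder object per column (for example in a dictionary) makes it possible to apply the same mapping to new data later.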
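Reordering the columns with `reindex` can be sketched as follows. The three columns here are a shortened, illustrative subset; with the real dataframe the list would contain every column in the desired order.

```python
import pandas as pd

# Stand-in dataframe whose columns are in an inconvenient order.
adult_df_rev = pd.DataFrame({
    "income": [0, 1],
    "age": [39, 52],
    "workclass_cat": [2, 1],
})

# Desired column order: predictors first, target last.
column_order = ["age", "workclass_cat", "income"]
adult_df_rev = adult_df_rev.reindex(column_order, axis=1)

print(adult_df_rev.head())
```

Note that `reindex` fills any name that is missing from the dataframe with NaN values, so a typo in the list silently creates an empty column.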

### Standardization of Data

All the data values of our dataframe are now numeric. Next, we need to bring them onto a single scale by standardizing the values. We can use the formula below for standardization.

$x_i = \frac{x_i - \operatorname{mean}(x)}{\sigma(x)}$
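Applied column by column, the standardization formula can be sketched like this. The two columns are a small stand-in for the numeric features, and the sketch uses the pandas `std()` default, which is the sample standard deviation (ddof=1).

```python
import pandas as pd

# Small stand-in with two numeric features on different scales.
adult_df_rev = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours_per_week": [40, 13, 40, 40],
})

for column in ["age", "hours_per_week"]:
    mean = adult_df_rev[column].mean()
    std = adult_df_rev[column].std()   # sample standard deviation (ddof=1)
    # x_i = (x_i - mean(x)) / sigma(x)
    adult_df_rev[column] = (adult_df_rev[column] - mean) / std

print(adult_df_rev)
```

After this step each column has mean 0 and standard deviation 1. In a stricter workflow the mean and standard deviation would be computed on the training split only and then reused for the test split.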

We have converted our data values into standardized values. You can print the dataframe to check the output.

## Data Slicing

Let’s split the data into training and test sets. We can easily perform this step using sklearn’s train_test_split() function.

Using the above code snippet, we have divided the data into a feature set and a target set. The feature set consists of the 14 predictor columns, and the target set consists of 1 column with the class values.

features_train and target_train hold the training data, while features_test and target_test hold the testing data.
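The slicing and splitting step might look like the sketch below. It uses synthetic random data with 14 feature columns and 1 target column in place of the preprocessed census data, and both `test_size=0.33` and `random_state=10` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 rows, 14 feature columns plus 1 target column.
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 15))

features = data[:, :14]                    # first 14 columns: predictors
target = (data[:, 14] > 0).astype(int)     # last column: binary class label

# Features and target are passed together so that their rows stay aligned.
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.33, random_state=10
)

print(features_train.shape, features_test.shape)
```

Features and target must have the same number of rows when they are passed to any scikit-learn function; mixing a training array with a test array leads to an "inconsistent numbers of samples" error.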

### Gaussian Naive Bayes Implementation

After completing the data preprocessing, it’s time to apply a machine learning algorithm to the data. We are going to use sklearn’s GaussianNB class.

We have built a GaussianNB classifier. The classifier is trained on the training data using the fit() method. Once trained, the model is ready to make predictions: we call the predict() method with the test set features as its argument.
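A self-contained sketch of training and prediction with `GaussianNB` is given below. It uses two well-separated synthetic clusters rather than the census features, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic training data: two Gaussian clusters, one per class.
rng = np.random.default_rng(0)
class0 = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))
class1 = rng.normal(loc=2.0, scale=1.0, size=(50, 2))
features_train = np.vstack([class0, class1])
target_train = np.array([0] * 50 + [1] * 50)

# Train the classifier: it estimates a mean and variance per feature per class.
clf = GaussianNB()
clf.fit(features_train, target_train)

# Predict the class of two new points, one near each cluster centre.
features_test = np.array([[-2.0, -2.0], [2.0, 2.0]])
target_pred = clf.predict(features_test)
print(target_pred)
```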

### Accuracy of our Gaussian Naive Bayes model

It’s time to test the quality of our model. We have made some predictions. Let’s compare the model’s predictions with the actual target values for the test set. In this way, we calculate the accuracy of our model.
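The accuracy calculation can be sketched with `accuracy_score`. The two label lists below are made-up values for illustration; in the article's workflow they would be target_test and the output of predict().

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration.
target_test = [0, 1, 1, 0, 1]   # actual classes
target_pred = [0, 1, 0, 0, 1]   # predicted classes (one mistake)

# normalize=True returns the fraction of correct predictions.
accuracy = accuracy_score(target_test, target_pred, normalize=True)
print(accuracy)
```

Since the income classes are imbalanced, it can also be worth comparing the accuracy with the share of the majority class.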

#### Script Output

Awesome! Our model is giving an accuracy of 80%. This is not bad for a simple implementation. You can create random test datasets and test the model to see how well the trained Gaussian Naive Bayes model performs.

We can save the trained scikit-learn model with Python pickle. You can check out our article on how to save a trained scikit-learn model with Python pickle.
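As a rough sketch of the pickling idea, the example below serializes a small trained model to bytes and restores it. The training data is a toy stand-in; to write to disk you would use `pickle.dump` with a file opened in binary mode instead of `pickle.dumps`.

```python
import pickle

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy training data standing in for the census features.
features = np.array([[-2.0, -1.5], [-1.8, -2.2], [2.1, 1.9], [1.7, 2.3]])
target = np.array([0, 0, 1, 1])
clf = GaussianNB().fit(features, target)

# Serialize the trained model to bytes, then load it back.
serialized = pickle.dumps(clf)
restored_clf = pickle.loads(serialized)

print(restored_clf.predict(features))
```

Only unpickle files that you trust, and load them with the same scikit-learn version that was used to save them where possible.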

I hope you like this post. If you have any questions, then feel free to comment below.  If you want me to write on one particular topic, then do tell it to me in the comments below.


• Kishori says:

Getting an error when GaussianNB is applied :
ValueError: could not convert string to float: <=50K

• Hi kishori,
I guess there is some issue with the data format in your code. Please check the data type and do the needed format conversion.

    # Data type checking
    type(input)
    # Converting to float (if it is a string)
    input_float = float(input)

• ziqi says:

I met the same problem.
ValueError: could not convert string to float: <=50K

Could somebody help?

• Hi Ziqi,
Could you please let me know which block of the code is giving you the error?

• Valerio says:

In the clf.fit(features_train, target_train) I get the following error “Input contains NaN, infinity or a value too large for dtype(‘float64’).”

• Hi Valerio,

The dataset we are using does not have any NaN values. You can use the code below to check which features have NaN values.
df.isnull().sum()

You can find similar code in the article, and you can observe that there are no NaN values in any of the features.

• Chandar says:

Hi, I have created the model and now In order to use any new features which are not in training/testing set to be predicted how should we approach, as here we have converted all the categorical values into numeric. So which is the best method to use to predict a new set of features.

hello..when i encode race_cat , sex_cat encoding is successfully done..but when i store the whole data frame it stored as NAN for race_cat and sex_cat attributes..

Any idea why this happened?

Have you checked the data points for applying the fit_transform. If the fit_transform worked properly, I hope you won’t face any issues.

race_cat 32561
sex_cat 32561

all values of attributes race_cat and sex_cat becomes Nan

Have you checked the data points for applying the fit_transform. If the fit_transform worked properly, I hope you won’t face any issues.

i checked the data points while applying fit_transform still the attributes is NAN… seems no error in my code.. 🙁

Try to debug from the beginning of the code. 😛

The code in the article has to work without any errors.

Sure thing 😂😂😎

• Sam says:

Getting this error : ValueError: Found input variables with inconsistent numbers of samples: [21815, 10746]

• Hi Sam,
Could you please let me know in which code section you are facing this error?

• Daniel says:

Is there a place that you confirm your features follow this assumption you stated – “assumed that all the features are following a gaussian distribution” ?

• Hi Daniel,

Yes, you said it correctly. Before we assume that the features follow a Gaussian distribution, we need to cross-check whether they actually do. As this blog post is focused on the Gaussian Naive Bayes classifier itself, I haven’t spent much time on explaining that.