# How to implement logistic regression model in python for binary classification

In the last few articles, we talked about different classification algorithms. For each algorithm, we first learned the background concepts and then, in the follow-up article, used those concepts to build a classification model. Later we used the model to perform regression or classification tasks.

Likewise, in this article we are going to implement the logistic regression model in Python to perform a binary classification task. We concentrate mainly on the implementation, as the background concepts are explained in the how the logistic regression model works article.

Take a cup of tea or coffee and brush up on the prerequisite concepts before we dive further.

Now let’s view the concepts we are going to learn by the end of this article.

• What is binary classification
• Logistic regression introduction
• Building logistic regression model in python
• Binary classification problem
• Dataset description
• Data creation for modeling and testing
• Selecting the features
• Split the data into train and test dataset
• Understanding the training data
• Implementing the logistic regression model in python with scikit-learn
• Logistic regression model  accuracy calculation

## What is binary classification

Binary classification is the task of classifying observations into one of two target classes using a supervised classification algorithm. A binary target has only two possible values (classes). To get a clear picture of binary classification, let’s look at the problems below.

• Identifying the image as a cat or not.
    • Targets: cat or not a cat
• Predicting whom the voter will vote for, Bill Clinton or Bob Dole.
    • Targets: Bill Clinton or Bob Dole
• Forecasting whether it will rain tomorrow.
    • Targets: will rain or sunny day

Hopefully the above problems give you a clear understanding of binary classification.

## Logistic regression introduction

The logistic regression algorithm is one of the simplest classification algorithms used for binary classification tasks. It can also be used to solve multi-class classification problems. In short, the logistic regression model takes the feature values and calculates class probabilities using the sigmoid or softmax function.

The sigmoid function is used for binary classification problems and the softmax function is used for multi-class classification problems.

The calculated probabilities are then used to find the target class. In general, the class with the highest probability is treated as the final target class.
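To make the probability step concrete, here is a minimal sketch of the sigmoid calculation for a single observation. The weights, bias and feature values are hypothetical numbers chosen only for illustration; they are not learned from the election dataset.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias and feature values (illustration only)
weights = np.array([0.8, -0.4])
bias = 0.1
features = np.array([2.0, 1.0])

# Linear combination of the features: 0.8*2.0 - 0.4*1.0 + 0.1 = 1.3
score = np.dot(weights, features) + bias
probability = sigmoid(score)               # probability of class 1
predicted_class = int(probability >= 0.5)  # choose the more probable class

print("probability of class 1 ::", round(float(probability), 3))
print("predicted class ::", predicted_class)
```

A score of 0 maps to a probability of exactly 0.5, which is why 0.5 is the usual cut-off between the two classes.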

The above explanation is a quick recap if you already know the logistic regression algorithm. If you are new to it, please check out how the logistic regression algorithm works before you continue with this article.

## Building logistic regression model in python

To build the logistic regression model in python we are going to use the Scikit-learn package. We are going to follow the below workflow for implementing the logistic regression model.

• Understanding the data.
• Split the data into training and test dataset.
• Use the training dataset to train the logistic regression model.
• Calculate the accuracy of the trained model on the training dataset.
• Calculate the accuracy of the model on the test dataset.
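The workflow above can be sketched end to end in a few lines. This is only an outline under stated assumptions: it uses synthetic data from scikit-learn's make_classification as a stand-in for the election dataset, and the random_state values are arbitrary choices to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 observations, 5 features, 2 target classes
X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Split the data into training (70%) and test (30%) datasets
train_x, test_x, train_y, test_y = train_test_split(
    X, y, train_size=0.7, random_state=0)

# Train the logistic regression model on the training dataset
model = LogisticRegression()
model.fit(train_x, train_y)

# Accuracy on the training dataset and on the test dataset
train_accuracy = model.score(train_x, train_y)
test_accuracy = model.score(test_x, test_y)

print("Train Accuracy :: ", train_accuracy)
print("Test Accuracy :: ", test_accuracy)
```

The rest of the article follows the same steps, with the real dataset in place of the synthetic one.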

We are going to build the logistic regression model in the above workflow to address the binary classification problem. Let’s look into the problem we are going to solve.

### Binary classification problem

Predicting Election Results with Logistic Regression model

We are going to build the logistic regression model to predict whom a voter will vote for, given the voter’s details.

• Will the voter vote for Bill Clinton?
• Will the voter vote for Bob Dole?

### Dataset description

The dataset we are going to use is the 1996 United States presidential election dataset. In a moment we are going to look into all the features in the dataset. Before that, let me give you a quick summary of this election.

### The United States 1996 President Election Summary

• The 53rd United States presidential election was held on November 5, 1996.
• Nominees:
    • Bill Clinton
    • Bob Dole
    • Ross Perot
• With 49.2% of the vote, Bill Clinton won the election.

Want to read more about the election? Check out the Wikipedia article on the 1996 United States presidential election.

Before we begin the modeling let’s import the required python packages.

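```
# Required Python Packages
import pandas as pd
import numpy as np
# In plotly 4 and later, plotly.plotly lives in the separate chart_studio package
import chart_studio.plotly as py
import plotly.graph_objs as go
py.sign_in('YOUR_PLOTLY_USER_NAME', 'API_KEY')

# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
```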
• The pandas package is required for data analysis. In the process of modeling the logistic regression classifier, we first load the dataset (CSV format) into a pandas DataFrame and then play around with the loaded dataset.
• The numpy package is for performing numerical calculations.
• The plotly package is for visualizing the dataset for better understanding.
    • You need to sign in with your plotly credentials to use this package. You can find the credentials in your plotly account after you create one.
• The sklearn package is for modeling machine learning algorithms.
    • The train_test_split function splits the dataset into the train and test datasets.
    • The LogisticRegression class is for modeling the logistic regression classifier.
    • The metrics module is for calculating the accuracy of trained classifiers.

Now let’s load the data set and look into all the features available to model the logistic regression model in python. You can download the data set from our GitHub.

```
# Files
DATA_SET_PATH = "../Inputs/anes_dataset.csv"


def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    print("Number of Observations :: ", len(dataset))


if __name__ == "__main__":
    main()
```
• We created the main function and load the elections dataset into a pandas DataFrame. As the dataset is in CSV format, we call the pandas read_csv function with the dataset path as a parameter.
• To know the number of observations (rows) in the dataset, we call the Python len() function on the loaded dataset.

#### Script Output:

`Number of Observations:: 944`

From the script output, the dataset has 944 observations. We are going to play with these observations to build the logistic regression model 🙂

As we already know the dataset size, now let’s check out a few observations to learn about the features in the dataset.

```
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    print("Number of Observations :: ", len(dataset))

    # Get the first observations
    print(dataset.head())


if __name__ == "__main__":
    main()
```

We can use the pandas head method to get the top observations of the loaded dataset.

#### Script Output:

```
popul  TVnews  selfLR  ClinLR  DoleLR  PID  age  educ  income  vote
    0       7       7       1       6    6   36     3       1     1
  190       1       3       3       5    1   20     4       1     0
   31       7       2       2       6    1   24     6       1     0
   83       4       3       4       5    1   28     6       1     0
  640       7       5       6       4    0   68     6       1     0
```

Let’s write a function to get the header names of the given dataset. Later we store all the header names, which can be used in modeling the logistic regression.

```
def dataset_headers(dataset):
    """
    To get the dataset header names
    :param dataset: loaded dataset into pandas DataFrame
    :return: list of header names
    """
    return list(dataset.columns.values)
```

The dataset_headers function takes the loaded dataset and returns its header names. As the function expects the loaded dataset, we are going to call it inside the main function we wrote earlier.

```
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    print("Data set headers :: ", dataset_headers(dataset))


if __name__ == "__main__":
    main()
```

#### Script Output:

`Data set headers :: ['popul', 'TVnews', 'selfLR', 'ClinLR', 'DoleLR', 'PID', 'age', 'educ', 'income', 'vote']`

Now let’s discuss each header (feature).

1. popul
    • Population of the census place, in thousands.
2. TVnews
    • Number of times per week the voter watches TV news.
3. selfLR
    • The person’s self-reported political leaning, from left to right.
4. ClinLR
    • The person’s impression of Bill Clinton’s political leaning, from left to right.
5. DoleLR
    • The person’s impression of Bob Dole’s political leaning, from left to right.
6. PID
    • Party identification of the person.
    • 0 means Strong Democrat, 1 means Weak Democrat, 2 means Independent-Democrat, and so on.
7. age
    • Age of the voter.
8. educ
    • Education qualification of the voter.
9. income
    • Income of the voter.
10. vote
    • The vote is the target which we are going to predict using the trained logistic regression model.
    • vote has two possible outcomes: 0 means Clinton, 1 means Dole.

### Data creation for modeling and testing

#### Selecting the features

Out of the above features (headers), we are going to use only the ones below. Selecting the best features from all the available ones is normally done with feature engineering techniques.

As feature engineering is too broad a topic to explain here, we are going to use the selected features below, which logically have a high chance of predicting whom the voter will vote for.

```
training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
target = 'vote'
```

To train the logistic regression model we are going to use the features in training_features and the target.
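As a small illustration of how these two names are used, the sketch below selects the feature columns and the target column from a pandas DataFrame. The toy DataFrame is built by hand from the first three observations shown earlier; in the article the DataFrame comes from the CSV file instead.

```python
import pandas as pd

# Toy DataFrame built from the first three observations printed earlier
dataset = pd.DataFrame({
    'popul': [0, 190, 31], 'TVnews': [7, 1, 7], 'selfLR': [7, 3, 2],
    'ClinLR': [1, 3, 2], 'DoleLR': [6, 5, 6], 'PID': [6, 1, 1],
    'age': [36, 20, 24], 'educ': [3, 4, 6], 'income': [1, 1, 1],
    'vote': [1, 0, 0],
})

training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
target = 'vote'

features = dataset[training_features]  # DataFrame with the five selected columns
targets = dataset[target]              # Series with the vote (0 or 1)

print(features.shape)
print(targets.tolist())
```

Indexing with a list of names returns a DataFrame, while indexing with a single name returns a Series; this is the form train_test_split receives in the next step.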

#### Split the data into train and test dataset

```
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)

    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train , Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features], dataset[target], train_size=0.7)

    print("train_x size :: ", train_x.shape)
    print("train_y size :: ", train_y.shape)

    print("test_x size :: ", test_x.shape)
    print("test_y size :: ", test_y.shape)


if __name__ == "__main__":
    main()
```

• To split the dataset into train and test datasets we use the scikit-learn (sklearn) function train_test_split with the selected training features and the target.
• We use a train_size of 0.7, which means 70% of the observations are used for training and the remaining 30% for testing.
• train_test_split gives four outputs: train_x, test_x, train_y and test_y.
• To know the size of each of these four outputs we print their shapes.
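To see the split in isolation, here is a minimal sketch on ten toy observations. The arrays are made up for illustration, and random_state=42 is an arbitrary value added here so that the same rows land in the same split on every run; the article's own code does not set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy observations with two features each, and a binary target
features = np.arange(20).reshape(10, 2)
targets = np.array([0, 1] * 5)

train_x, test_x, train_y, test_y = train_test_split(
    features, targets, train_size=0.7, random_state=42)

print("train_x size :: ", train_x.shape)
print("test_x size :: ", test_x.shape)
```

With ten observations, 70% gives seven training rows and three test rows. Without random_state the shapes stay the same, but the rows that end up in each part change from run to run.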

#### Script Output:

```
train_x size ::  (660, 5)
train_y size ::  (660,)
test_x size ::  (284, 5)
test_y size ::  (284,)
```

### Understanding the training data

To understand the training data features, let’s look at each possible value of each feature and how it relates to the target classes (0 for Clinton, 1 for Dole).

We can find the relation between a feature and the target with a histogram. To create the histogram we need the frequency count of each possible value of the feature for each target class.

Let me explain what I am talking about with an example.

Example:

Suppose we are building a classification model that predicts whether a student will get an A grade in the exam, using two features: the number of study classes attended in a day and the gender of the student. To create the histogram showing the relation between gender and the target (A grade or not), we need frequencies like the ones below.

As the feature has two possible values (boy or girl) and the target also has two possible outcomes (A grade or not), we need four counts:

• Frequency count of boy and A grade
• Frequency count of boy and Not
• Frequency count of girl and A grade
• Frequency count of girl and Not
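The four counts from the example can be produced with the pandas crosstab function. This is a sketch on six made-up students; it is a shortcut for checking the idea and not the function the article builds next.

```python
import pandas as pd

# Six made-up students (illustration only)
students = pd.DataFrame({
    "gender": ["boy", "boy", "girl", "girl", "boy", "girl"],
    "grade": ["A", "Not", "A", "A", "Not", "Not"],
})

# Rows are the feature values, columns are the target classes
counts = pd.crosstab(students["gender"], students["grade"])
print(counts)
```

Each cell of the resulting table is one of the four frequency counts listed above, for example counts.loc["boy", "A"] is the number of boys with an A grade.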

So now let’s write a function which takes the dataset, the feature header and the target header and returns this kind of frequency result.

#### Frequencies on feature and target relation

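```
def unique_observations(dataset, header, method=1):
    """
    To get unique observations in the loaded pandas DataFrame column
    :param dataset: loaded dataset (pandas DataFrame)
    :param header: column name
    :param method: Method to perform the unique (default method=1 for pandas and method=0 for numpy )
    :return: sorted unique values of the column, or None on error
    """
    try:
        if method == 0:
            # With Numpy (np.unique returns the values sorted)
            observations = np.unique(dataset[header].values)
        elif method == 1:
            # With Pandas (sorted so that both methods agree)
            observations = np.sort(dataset[header].unique())
        else:
            observations = None
            print("Wrong method type, Use 1 for pandas and 0 for numpy")
    except Exception as e:
        observations = None
        print("Error: {error_msg} \n Please check the inputs once..!".format(error_msg=e))
    return observations


def feature_target_frequency_relation(dataset, f_t_headers):
    """
    To get the frequency relation between targets and the unique feature observations
    :param dataset: loaded dataset (pandas DataFrame)
    :param f_t_headers: [feature header, target header]
    :return: feature unique observations dictionary of frequency count dictionary
    """
    feature_unique_observations = unique_observations(dataset, f_t_headers[0])
    unique_targets = unique_observations(dataset, f_t_headers[1])

    frequencies = {}
    for feature in feature_unique_observations:
        # Rows where the feature takes this value, counted per target class
        frequencies[feature] = {unique_targets[0]: len(
            dataset[(dataset[f_t_headers[0]] == feature) & (dataset[f_t_headers[1]] == unique_targets[0])]),
            unique_targets[1]: len(
            dataset[(dataset[f_t_headers[0]] == feature) & (dataset[f_t_headers[1]] == unique_targets[1])])}
    return frequencies
```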

To get the frequency relation between the target and a feature, we have written two functions: unique_observations and feature_target_frequency_relation.

The unique_observations function takes the dataset and a header as input parameters and returns the unique values in the dataset for that header. If the data is [1, 2, 1, 2, 3, 5, 1, 4], the output will be [1, 2, 3, 4, 5].
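The example values can be checked directly. The sketch below also shows a difference worth knowing between the two libraries: numpy returns the unique values sorted, while pandas returns them in order of first appearance.

```python
import numpy as np
import pandas as pd

values = [1, 2, 1, 2, 3, 5, 1, 4]

numpy_unique = np.unique(values)             # sorted unique values
pandas_unique = pd.Series(values).unique()   # order of first appearance

print(numpy_unique.tolist())
print(pandas_unique.tolist())
```

If the order of the values matters, for example on a histogram axis, sort the pandas result before using it.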

The next function, feature_target_frequency_relation, takes the dataset and a list containing the feature header and the target header as input parameters and returns the frequencies.

Let’s run the feature_target_frequency_relation function with the feature header (“educ”) and the target header (“vote”).

```
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)

    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train , Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features], dataset[target], train_size=0.7)

    print("edu_target_frequencies :: ", feature_target_frequency_relation(dataset, [training_features[3], target]))


if __name__ == "__main__":
    main()
```

#### Script Output

`edu_target_frequencies ::  {1: {0: 10, 1: 3}, 2: {0: 38, 1: 14}, 3: {0: 153, 1: 95}, 4: {0: 106, 1: 81}, 5: {0: 53, 1: 37}, 6: {0: 119, 1: 108}, 7: {0: 72, 1: 55}}`

This is interesting: in the dataset the education feature (educ) takes the value 1 for 13 (10 + 3) voters, and out of those 13, 10 voted for Clinton and only 3 voted for Dole. Patterns like this are the signal our classifier uses when predicting whom a voter will vote for.

Now that the frequencies are ready, we need a function which takes the calculated frequencies as input and stores the histogram.

```
def feature_target_histogram(feature_target_frequencies, feature_header):
    """
    Grouped bar chart of Clinton and Dole vote counts per feature value
    :param feature_target_frequencies: output of feature_target_frequency_relation
    :param feature_header: name of the feature (used in the titles)
    :return:
    """
    # list() is needed in Python 3, where dict.keys() is not JSON serializable
    keys = list(feature_target_frequencies.keys())
    y0 = [feature_target_frequencies[key][0] for key in keys]
    y1 = [feature_target_frequencies[key][1] for key in keys]
    trace1 = go.Bar(
        x=keys,
        y=y0,
        name='Clinton'
    )
    trace2 = go.Bar(
        x=keys,
        y=y1,
        name='Dole'
    )

    data = [trace1, trace2]
    layout = go.Layout(
        barmode='group',
        title='Feature :: ' + feature_header + ' Clinton Vs Dole votes Frequency',
        xaxis=dict(title="Feature :: " + feature_header + " classes"),
    )
    fig = go.Figure(data=data, layout=layout)
    # plot_url = py.plot(fig, filename=feature_header + ' - Target - Histogram')
```

Don’t get scared by the code; it’s just the histogram template from plotly. We only need a few modifications to the template for our needs. Once the modifications are done, we just need to call the function with the proper inputs.

As the target (vote) has two possible outcomes, we compare them side by side in the histogram. For that we collect the results in keys, y0 and y1. For the feature educ, these are the values of keys, y0 and y1:

```
edu_target_frequencies ::  {1: {0: 10, 1: 3}, 2: {0: 38, 1: 14}, 3: {0: 153, 1: 95}, 4: {0: 106, 1: 81}, 5: {0: 53, 1: 37}, 6: {0: 119, 1: 108}, 7: {0: 72, 1: 55}}

keys ::  [1, 2, 3, 4, 5, 6, 7]
y0 ::  [10, 38, 153, 106, 53, 119, 72]
y1 ::  [3, 14, 95, 81, 37, 108, 55]
```

If you compare edu_target_frequencies with keys, y0 and y1, you can clearly see what we are doing here.
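The transformation from the nested dictionary to the three lists can be reproduced on its own, without plotly. The sketch below copies the educ frequencies from the output above; sorting the keys is a small addition here to keep the bars in a fixed order.

```python
# Frequencies for the educ feature, copied from the script output above
edu_target_frequencies = {
    1: {0: 10, 1: 3}, 2: {0: 38, 1: 14}, 3: {0: 153, 1: 95},
    4: {0: 106, 1: 81}, 5: {0: 53, 1: 37}, 6: {0: 119, 1: 108},
    7: {0: 72, 1: 55},
}

keys = sorted(edu_target_frequencies.keys())         # feature values (x axis)
y0 = [edu_target_frequencies[key][0] for key in keys]  # Clinton counts
y1 = [edu_target_frequencies[key][1] for key in keys]  # Dole counts

print("keys :: ", keys)
print("y0 :: ", y0)
print("y1 :: ", y1)
```

Here keys holds the education levels, y0 the Clinton votes and y1 the Dole votes for each level, which is exactly what the two bar traces need.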

Now let’s call the feature_target_histogram function for all the training_features and check out the results.

```
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)

    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train , Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features], dataset[target], train_size=0.7)

    for feature in training_features:
        feature_target_frequencies = feature_target_frequency_relation(dataset, [feature, target])
        feature_target_histogram(feature_target_frequencies, feature)


if __name__ == "__main__":
    main()
```

Below are the stored histogram images after running the above code.

#### TVnews and Target(vote) histogram

[Figure: TVnews target histogram]

#### PID and Target(vote) histogram

[Figure: PID target histogram]

#### Age and Target(vote) histogram

[Figure: Age target histogram]

#### Education and Target(vote) histogram

[Figure: Educ target histogram]

#### Income and Target(vote) histogram

[Figure: Income target histogram]

Please spend some time understanding each histogram and how each feature relates to the target. Now let’s implement the logistic regression model in Python with the selected training features and the target.

### Implementing the logistic regression model in python with scikit-learn

```
def train_logistic_regression(train_x, train_y):
    """
    Training logistic regression model with train dataset features(train_x) and target(train_y)
    :param train_x:
    :param train_y:
    :return: trained LogisticRegression model
    """

    logistic_regression_model = LogisticRegression()
    logistic_regression_model.fit(train_x, train_y)
    return logistic_regression_model


def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)

    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train , Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features], dataset[target], train_size=0.7)

    # Training Logistic regression model
    trained_logistic_regression_model = train_logistic_regression(train_x, train_y)


if __name__ == "__main__":
    main()
```

To implement the logistic regression model we created the function train_logistic_regression with train_x and train_y as input parameters. With this, the logistic regression model is created and trained on the training dataset. Now let’s check out the accuracy of the model.

### Logistic regression model accuracy calculation

Let’s write a function which takes the trained logistic regression model, the feature values (train_x or test_x) and the target values (train_y or test_y) and calculates the accuracy.

```
def model_accuracy(trained_model, features, targets):
    """
    Get the accuracy score of the model
    :param trained_model:
    :param features:
    :param targets:
    :return: fraction of correctly predicted targets
    """
    accuracy_score = trained_model.score(features, targets)
    return accuracy_score
```

This function takes the trained model, features and targets as input. It uses the trained_model and the features to predict the targets, compares the predictions with the actual targets and returns the accuracy score.
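To show what the score method does under the hood, the sketch below computes the same accuracy in three ways: with score, with accuracy_score from sklearn.metrics, and by hand. The six toy observations are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Six made-up observations with one feature and a binary target
features = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
targets = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(features, targets)

predicted = model.predict(features)
score_accuracy = model.score(features, targets)        # what model_accuracy returns
metrics_accuracy = accuracy_score(targets, predicted)  # same value via sklearn.metrics
manual_accuracy = np.mean(predicted == targets)        # fraction of correct predictions

print(score_accuracy, metrics_accuracy, manual_accuracy)
```

All three values are equal, because for classifiers score is defined as the mean accuracy of predict against the given targets.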

Now let’s call the above function with train_x and train_y to get the accuracy of our model on the train dataset, and later call the same function with test_x and test_y to get the accuracy of our model on the test dataset.

#### Logistic regression model accuracy on train dataset

```
train_accuracy = model_accuracy(trained_logistic_regression_model, train_x, train_y)

print("Train Accuracy :: ", train_accuracy)
```

#### Script output:

`Train Accuracy ::  0.901515151515`

#### Logistic regression model accuracy on test dataset

```
# Testing the logistic regression model
test_accuracy = model_accuracy(trained_logistic_regression_model, test_x, test_y)

print("Test Accuracy :: ", test_accuracy)
```

#### Script output:

`Test Accuracy ::  0.911971830986`

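With the selected training features we got a test accuracy of 91%. Your numbers may differ slightly from run to run, because train_test_split shuffles the data randomly. Play with different features and let me know the test accuracy you got in the comments.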

We can save this trained logistic regression model and reuse it in other applications without retraining. Check out the how to dump and load the trained classifier article.
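One possible way to do this is Python's built-in pickle module, sketched below on a toy model. The training data is made up, and the model is serialized to an in-memory bytes object to keep the example self-contained; writing to a file with pickle.dump works the same way.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model trained on six made-up observations
features = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
targets = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(features, targets)

# Dump the trained model to bytes and load it back
serialized = pickle.dumps(model)
restored_model = pickle.loads(serialized)

# The restored model gives the same predictions as the original
print(restored_model.predict(features))
```

Note that scikit-learn still has to be installed in the application that loads the model, ideally in the same version that was used for training.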

Use the scikit-learn predict method to predict whom a voter will vote for, given the voter’s features, and let me know the results in the comments section. If you face any difficulty in using the predict method, do check out how I use the predict method in implementing the decision tree classifier in Python.
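As a starting point, here is a hedged sketch of predict and predict_proba for a single voter. The first five training rows reuse the sample observations printed earlier; the last three rows and the new voter are hypothetical, and max_iter=1000 is an arbitrary setting to give the solver room to converge on unscaled data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Column order: TVnews, PID, age, educ, income
train_x = np.array([
    [7, 6, 36, 3, 1], [1, 1, 20, 4, 1], [7, 1, 24, 6, 1],
    [4, 1, 28, 6, 1], [7, 0, 68, 6, 1],
    [3, 5, 45, 5, 10], [2, 6, 50, 4, 12], [5, 4, 33, 7, 15],  # hypothetical rows
])
train_y = np.array([1, 0, 0, 0, 0, 1, 1, 1])  # 0 means Clinton, 1 means Dole

model = LogisticRegression(max_iter=1000).fit(train_x, train_y)

# Hypothetical new voter
new_voter = np.array([[5, 2, 40, 5, 8]])

predicted_class = int(model.predict(new_voter)[0])
probabilities = model.predict_proba(new_voter)  # one row: [P(Clinton), P(Dole)]

print("Predicted vote :: ", "Dole" if predicted_class == 1 else "Clinton")
print("Class probabilities :: ", probabilities[0])
```

The predict method returns the class with the higher probability, while predict_proba exposes the probabilities themselves, in the order given by model.classes_.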

### Logistic regression model complete code

```
#!/usr/bin/env python
# logistic_regression.py
# Date: 19-March-2017
# About: Implementing Logistic Regression Classifier to predict to whom the voter will vote.

# Required Python Packages
import pandas as pd
import numpy as np
# In plotly 4 and later, plotly.plotly lives in the separate chart_studio package
import chart_studio.plotly as py
import plotly.graph_objs as go

# Use your own plotly credentials here; never publish a real API key
py.sign_in('YOUR_PLOTLY_USER_NAME', 'API_KEY')

# sklearn.cross_validation was replaced by sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Files
DATA_SET_PATH = "../Inputs/anes_dataset.csv"


def dataset_headers(dataset):
    """
    To get the dataset header names
    :param dataset: loaded dataset into pandas DataFrame
    :return: list of header names
    """
    return list(dataset.columns.values)


def unique_observations(dataset, header, method=1):
    """
    To get unique observations in the loaded pandas DataFrame column
    :param dataset: loaded dataset (pandas DataFrame)
    :param header: column name
    :param method: Method to perform the unique (default method=1 for pandas and method=0 for numpy )
    :return: sorted unique values of the column, or None on error
    """
    try:
        if method == 0:
            # With Numpy (np.unique returns the values sorted)
            observations = np.unique(dataset[header].values)
        elif method == 1:
            # With Pandas (sorted so that both methods agree)
            observations = np.sort(dataset[header].unique())
        else:
            observations = None
            print("Wrong method type, Use 1 for pandas and 0 for numpy")
    except Exception as e:
        observations = None
        print("Error: {error_msg} \n Please check the inputs once..!".format(error_msg=e))
    return observations


def feature_target_frequency_relation(dataset, f_t_headers):
    """
    To get the frequency relation between targets and the unique feature observations
    :param dataset: loaded dataset (pandas DataFrame)
    :param f_t_headers: [feature header, target header]
    :return: feature unique observations dictionary of frequency count dictionary
    """
    feature_unique_observations = unique_observations(dataset, f_t_headers[0])
    unique_targets = unique_observations(dataset, f_t_headers[1])

    frequencies = {}
    for feature in feature_unique_observations:
        # Rows where the feature takes this value, counted per target class
        frequencies[feature] = {unique_targets[0]: len(
            dataset[(dataset[f_t_headers[0]] == feature) & (dataset[f_t_headers[1]] == unique_targets[0])]),
            unique_targets[1]: len(
            dataset[(dataset[f_t_headers[0]] == feature) & (dataset[f_t_headers[1]] == unique_targets[1])])}
    return frequencies


def feature_target_histogram(feature_target_frequencies, feature_header):
    """
    Grouped bar chart of Clinton and Dole vote counts per feature value
    :param feature_target_frequencies: output of feature_target_frequency_relation
    :param feature_header: name of the feature (used in the titles)
    :return:
    """
    # list() is needed in Python 3, where dict.keys() is not JSON serializable
    keys = list(feature_target_frequencies.keys())
    y0 = [feature_target_frequencies[key][0] for key in keys]
    y1 = [feature_target_frequencies[key][1] for key in keys]

    trace1 = go.Bar(
        x=keys,
        y=y0,
        name='Clinton'
    )
    trace2 = go.Bar(
        x=keys,
        y=y1,
        name='Dole'
    )
    data = [trace1, trace2]
    layout = go.Layout(
        barmode='group',
        title='Feature :: ' + feature_header + ' Clinton Vs Dole votes Frequency',
        xaxis=dict(title="Feature :: " + feature_header + " classes"),
    )
    fig = go.Figure(data=data, layout=layout)
    # plot_url = py.plot(fig, filename=feature_header + ' - Target - Histogram')


def train_logistic_regression(train_x, train_y):
    """
    Training logistic regression model with train dataset features(train_x) and target(train_y)
    :param train_x:
    :param train_y:
    :return: trained LogisticRegression model
    """

    logistic_regression_model = LogisticRegression()
    logistic_regression_model.fit(train_x, train_y)
    return logistic_regression_model


def model_accuracy(trained_model, features, targets):
    """
    Get the accuracy score of the model
    :param trained_model:
    :param features:
    :param targets:
    :return: fraction of correctly predicted targets
    """
    accuracy_score = trained_model.score(features, targets)
    return accuracy_score


def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    print("Number of Observations :: ", len(dataset))

    # Get the first observations
    print(dataset.head())

    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train , Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features], dataset[target], train_size=0.7)
    print("train_x size :: ", train_x.shape)
    print("train_y size :: ", train_y.shape)

    print("test_x size :: ", test_x.shape)
    print("test_y size :: ", test_y.shape)

    print("edu_target_frequencies :: ", feature_target_frequency_relation(dataset, [training_features[3], target]))

    for feature in training_features:
        feature_target_frequencies = feature_target_frequency_relation(dataset, [feature, target])
        feature_target_histogram(feature_target_frequencies, feature)

    # Training Logistic regression model
    trained_logistic_regression_model = train_logistic_regression(train_x, train_y)

    train_accuracy = model_accuracy(trained_logistic_regression_model, train_x, train_y)

    # Testing the logistic regression model
    test_accuracy = model_accuracy(trained_logistic_regression_model, test_x, test_y)

    print("Train Accuracy :: ", train_accuracy)
    print("Test Accuracy :: ", test_accuracy)


if __name__ == "__main__":
    main()
```

You can get the complete code in Dataaspirant Github

I hope you like this post. If you have any questions, then feel free to comment below.  If you want me to write on one particular topic, then do tell it to me in the comments below.

### 6 Responses to “How to implement logistic regression model in python for binary classification”

• Hey man, good tutorial. Just wanted to remind you that in the complete code, you have put your username and password for plotly sign in. Thanks for the tutorial again.

• Hi Omkaar Kamath,

We have updated the password of plotly in the credential section, thanks a lot again.

Thanks and happy learning!

• Hello, thanks for a concise explanation,

1) I have a question about feature selection, even though you briefly mentioned it: “As the feature engineering concepts too broad to explain we are going to use the below, selected features which are logically having the high chance in predicting to whom the voter will vote.”, what method did you use when picking [‘TVnews’, ‘PID’, ‘age’, ‘educ’, ‘income’] as the important features (I tried SelectKBest and RFE)

2) Can you give any advice on selecting a specific method when it comes to feature selection.

Thanks

• Hi Alan,

You can try different feature selection methods. In scikit-learn you can rank the features for modeling in order of importance (the most influential feature comes first).

For this article I took the features which generally make an impact on voting. It’s not a good idea to apply different selection methods blindly; we also need to add features based on domain knowledge.

• Hi,

Thank you for clear demonstration of logistic regression. I tried running this code but I’m getting the following error.

File "C:\Users\Banu\Anaconda3\lib\json\encoder.py", line 179, in default
raise TypeError(repr(o) + " is not JSON serializable")

TypeError: dict_keys([0, 1, 2, 3, 4, 5, 6, 7]) is not JSON serializable

I even tried using keys = json.dumps( feature_target_frequencies.keys()). But couldn’t make it work. Any suggestions or corrections is highly appreciated.

Thank you