How to implement a logistic regression model in Python for binary classification

Logistic Regression Python

In the last few articles, we talked about different classification algorithms. For each classification algorithm, we first learned the background concepts and, in the following article, used those concepts to build a classification model. Later we used the model to perform regression or classification tasks.

Likewise, in this article we are going to implement a logistic regression model in Python to perform a binary classification task. We mainly concentrate on the implementation, as the background concepts are explained in the how the logistic regression model works article.

Take a cup of tea or coffee and check out the below prerequisite concepts before we dive further.

Now let’s view the concepts we are going to learn by the end of this article.

Table of contents

  • What is binary classification
  • Logistic regression introduction
  • Building logistic regression model in python
    • Binary classification problem
    • Dataset description
    • Data creation for modeling and testing
      • Selecting the features
      • Split the data into train and test dataset
    • Understanding the training data
    • Implementing the logistic regression model in python with scikit-learn
    • Logistic regression model accuracy calculation

What is binary classification

Binary classification is the task of classifying observations into one of two target classes using a supervised classification algorithm. A binary target has only two possible values (classes). To get a clear picture of binary classification, let’s look at the binary classification problems below.

  • Identifying the image as a cat or not.
    • Targets: cat or not a cat
  • Predicting whom the voter will vote for: Bill Clinton or Bob Dole.
    • Targets: Bill Clinton or Bob Dole
  • Forecasting whether it will rain tomorrow.
    • Targets: rainy day or sunny day

Hopefully the above examples give you a clear understanding of binary classification problems.

Logistic regression introduction

Logistic regression is one of the simplest classification algorithms used for binary classification tasks, and it can also be used to solve multi-class classification problems. In short, the logistic regression model takes the feature values and calculates class probabilities using the sigmoid or softmax function.

The sigmoid function is used for binary classification problems and the softmax function is used for multi-class classification problems.

The calculated probabilities are then used to find the target class. In general, the class with the highest probability is treated as the final target class.
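As a minimal sketch of the idea above (not the scikit-learn implementation), the sigmoid function squashes a weighted sum of the feature values into a probability, and a threshold turns that probability into a class. The weights, the bias, and the 0.5 threshold are illustrative assumptions, not values from this article.

```python
import numpy as np


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_class(features, weights, bias, threshold=0.5):
    """Return (class, probability) for one observation.

    weights, bias and threshold are hypothetical values chosen for
    illustration; a trained model would learn weights and bias from data.
    """
    score = np.dot(features, weights) + bias
    probability = sigmoid(score)
    return int(probability >= threshold), probability


label, probability = predict_class(
    features=np.array([1.0, 2.0]),
    weights=np.array([1.0, 1.0]),
    bias=-1.0,
)
print("Predicted class:", label)
print("Probability of class 1:", probability)
```

A score of zero maps to a probability of exactly 0.5, which is why 0.5 is the usual cut-off between the two classes.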

The above explanation is a quick recap that assumes you already know the logistic regression algorithm. If you are new to it, please check out how the logistic regression algorithm works before you continue with this article.

Building logistic regression model in python

To build the logistic regression model in Python we are going to use the scikit-learn package. We are going to follow the workflow below.

  • Load the dataset.
  • Understand the data.
  • Split the data into training and test datasets.
  • Use the training dataset to train the logistic regression model.
  • Calculate the accuracy of the trained model on the training dataset.
  • Calculate the accuracy of the model on the test dataset.

We are going to follow this workflow to address a binary classification problem. Let’s look into the problem we are going to solve.

Binary classification problem

Predicting Election Results with Logistic Regression model

We are going to build the logistic regression model to predict whom the voter will vote for, given the voter details.

  • Will the voter vote for Bill Clinton?
  • Will the voter vote for Bob Dole?

Dataset description

The dataset we are going to use is the 1996 United States presidential election dataset. In a moment we are going to look into all the features in the dataset. Before that, let me give you a quick summary of this election.

The United States 1996 Presidential Election Summary

  • The 53rd United States presidential election was held on November 5, 1996.
  • Nominees:
    • Bill Clinton
    • Bob Dole
    • Ross Perot
  • Bill Clinton won the election with 49.2% of the vote.

Would you love to read more about the election? Then check out the details in the wiki United States President Election article.

Before we begin the modeling, let’s import the required Python packages.

  • The pandas package is required for data analysis. In the process of modeling the logistic regression classifier, we first load the dataset (CSV format) into a pandas dataframe and then play around with the loaded dataset.
  • The NumPy package is for performing numerical calculations.
  • The plotly package is for visualizing the dataset for better understanding.
    • You need to sign_in with your plotly credentials to use this package. You can find the credentials in your plotly account after you create one.
  • The sklearn package is for modeling the machine learning algorithms.
    • The train_test_split function splits the dataset into train and test datasets.
    • The LogisticRegression class models the logistic regression classifier.
    • The metrics module calculates the accuracy of the trained classifiers.

Now let’s load the dataset and look into all the features available to build the logistic regression model in Python. You can download the dataset from our GitHub.

Load the dataset

  • We create the main function and load the elections dataset into a pandas dataframe. As the dataset is in CSV format, we call the pandas read_csv function with the dataset path as a parameter.
  • To know the number of observations (rows) in the dataset, we call the Python len() function with the loaded dataset.
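The two steps above can be sketched as follows. The function name load_dataset and the three inline rows are assumptions made so the sketch runs without the downloaded file; the column names are the headers described later in this article. With the real file you would pass its path to the function instead.

```python
import io

import pandas as pd


def load_dataset(csv_source):
    """Load a CSV file (a path or a file-like object) into a pandas dataframe."""
    return pd.read_csv(csv_source)


# Stand-in for the real CSV file: three made-up rows, for illustration only.
sample_csv = io.StringIO(
    "popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,educ,income,vote\n"
    "0,7,7,1,6,6,36,3,1,1\n"
    "190,1,3,3,5,1,20,4,1,0\n"
    "31,7,2,2,6,1,24,6,1,0\n"
)

dataset = load_dataset(sample_csv)
print("Number of observations:", len(dataset))
```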

Script Output:

From the script output, the number of observations in the dataset is 944. We are going to play with these observations to build the logistic regression model 🙂

As we already know the dataset size, now let’s check out a few observations to learn about the features in the dataset.

We can use the pandas head method to get the top observations of the loaded dataset.
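A small sketch of the head method, using a made-up dataframe in place of the loaded elections dataset (the values are illustrative only):

```python
import pandas as pd

# Made-up rows standing in for the loaded elections dataset.
dataset = pd.DataFrame({
    "TVnews": [7, 1, 7, 4, 7, 3, 7],
    "age": [36, 20, 24, 28, 68, 21, 77],
    "vote": [1, 0, 0, 0, 0, 0, 1],
})

top_rows = dataset.head()      # first 5 rows by default
top_three = dataset.head(3)    # or ask for a specific number of rows
print(top_rows)
print(top_three)
```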

Script Output:

Let’s write a function to get the header names of the given dataset. We store all the header names so that we can use them later in modeling the logistic regression.

The dataset_headers function takes the loaded dataset and returns its header names. As the dataset_headers function expects the loaded dataset, we are going to call it inside the main function we wrote earlier.
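One possible way to write dataset_headers is sketched below. The function name comes from the article, but the body and the tiny dataframe are assumptions for illustration.

```python
import pandas as pd


def dataset_headers(dataset):
    """Return the header (column) names of the dataframe as a list."""
    return list(dataset.columns.values)


# Made-up rows standing in for the loaded elections dataset.
dataset = pd.DataFrame({"TVnews": [7, 1], "PID": [6, 1], "vote": [1, 0]})
headers = dataset_headers(dataset)
print("Dataset headers:", headers)
```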

 

Script Output:

Now let’s discuss each header (feature).

  1. popul:
    • Population of the census place, in thousands.
  2. TVnews:
    • The number of times the voter watches TV news in a week.
  3. selfLR:
    • The person’s self-reported political leaning, from left to right.
  4. ClinLR:
    • The person’s impression of Bill Clinton’s political leaning, from left to right.
  5. DoleLR:
    • The person’s impression of Bob Dole’s political leaning, from left to right.
  6. PID:
    • Party identification of the person.
    • If the PID is
      • 0, it means Strong Democrat,
      • 1, it means Weak Democrat,
      • 2, it means Independent Democrat, and so on.
  7. age:
    • Age of the voter.
  8. educ:
    • Education qualification of the voter.
  9. income:
    • Income of the voter.
  10. vote:
    • The vote is the target which we are going to predict using the trained logistic regression model.
    • vote has two possible outcomes: 0 means Clinton, 1 means Dole.

Data creation for modeling and testing

Selecting the features

Out of the above features (headers), we are going to use only the headers below. To select the best features from all the available features, we would normally use feature engineering concepts.

As feature engineering is too broad to explain here, we are going to use the selected features below, which logically have a high chance of predicting whom the voter will vote for.

 

For training the logistic regression model we are going to use the features in training_features and the target.
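A sketch of the selection step: the five feature names are the ones plotted in the histograms later in this article, and vote is the target. The three rows of data are made up so that the snippet runs on its own.

```python
import pandas as pd

# Features used for training, and the target column to predict.
training_features = ["TVnews", "PID", "age", "educ", "income"]
target = "vote"

# Made-up rows standing in for the loaded elections dataset.
dataset = pd.DataFrame({
    "popul": [0, 190, 31],
    "TVnews": [7, 1, 7],
    "PID": [6, 1, 1],
    "age": [36, 20, 24],
    "educ": [3, 4, 6],
    "income": [1, 1, 1],
    "vote": [1, 0, 0],
})

features_data = dataset[training_features]   # only the selected columns
target_data = dataset[target]                # the column we want to predict
print(features_data)
print(target_data)
```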

Split the data into train and test dataset

 

  • To split the dataset into train and test datasets we use the scikit-learn (sklearn) function train_test_split with the selected training features data and the target.
  • We use a train_size of 0.7, which means that 70% of the observations are used for training and the remaining 30% for testing.
  • train_test_split gives four outputs: train_x, test_x, train_y and test_y.
  • To know the size of each of the above four outputs we print their shapes.
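The split described above can be sketched like this. The ten observations are made up, and random_state is an addition (not mentioned in the article) so that the split is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

training_features = ["TVnews", "PID", "age", "educ", "income"]
target = "vote"

# Ten made-up observations standing in for the elections dataset.
dataset = pd.DataFrame({
    "TVnews": [7, 1, 7, 4, 7, 3, 7, 1, 7, 0],
    "PID": [6, 1, 1, 1, 0, 1, 5, 0, 6, 0],
    "age": [36, 20, 24, 28, 68, 21, 77, 21, 31, 39],
    "educ": [3, 4, 6, 6, 6, 4, 4, 4, 4, 3],
    "income": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "vote": [1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
})

# 70% of the rows go to training, the remaining 30% to testing.
train_x, test_x, train_y, test_y = train_test_split(
    dataset[training_features],
    dataset[target],
    train_size=0.7,
    random_state=42,  # assumption: fixed seed for a repeatable split
)

print("train_x size:", train_x.shape)
print("train_y size:", train_y.shape)
print("test_x size:", test_x.shape)
print("test_y size:", test_y.shape)
```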

Script Output:

Understanding the training data

To understand the training data, let’s look at each possible value of each feature and how it relates to the target classes (0 for Clinton, 1 for Dole).

We can find the relation between a feature and the target with a histogram. To create the histogram we need the frequency count of each possible value of the feature for each target class.

Let me explain what I am talking about with an example.

Example:

Suppose we have two features for building a classification model which predicts whether a student will get an A grade in the exam. Let’s say the features are the number of study classes attended in a day and the gender of the student. To create the histogram that shows the relation between gender and the target (A grade or not), we need frequencies like the ones below.

The feature has two possible values, boy or girl, and the target also has two possible outcomes, A grade or not, so we need:

  • Frequency count of boy and A grade
  • Frequency count of boy and not A grade
  • Frequency count of girl and A grade
  • Frequency count of girl and not A grade

So now let’s write a function which takes the dataset, the feature header and the target header to get the above kind of frequency results.

Frequencies on feature and target relation

To get the frequency relation between the target and a feature, we have written two functions: unique_observations and feature_target_frequency_relation.

The unique_observations function takes the dataset and a header as input parameters and returns the unique values in the dataset for that header. For example, if the data is [1, 2, 1, 2, 3, 5, 1, 4] then the output will be [1, 2, 3, 4, 5].
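A possible implementation of unique_observations is sketched below. The function name is from the article; the body is an assumption that reproduces the example given above.

```python
import pandas as pd


def unique_observations(dataset, header):
    """Return the sorted unique values found in the given column."""
    return sorted(dataset[header].unique().tolist())


# The example values from the text, stored under a column named "educ".
dataset = pd.DataFrame({"educ": [1, 2, 1, 2, 3, 5, 1, 4]})
print(unique_observations(dataset, "educ"))  # [1, 2, 3, 4, 5]
```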

The next function, feature_target_frequency_relation, takes the dataset, the feature header and the target header as input parameters and returns the frequencies.

Let’s run the feature_target_frequency_relation function with the feature header (“educ”) and the target header (“vote”).
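The sketch below shows one way feature_target_frequency_relation could work. The nested dictionary it returns (feature value, then target class, then count) is an assumed output shape, and the six rows are made up for illustration.

```python
import pandas as pd


def feature_target_frequency_relation(dataset, feature_header, target_header):
    """Count how often each feature value occurs with each target class.

    Returns a dictionary like {feature_value: {target_class: count}}.
    """
    feature_values = sorted(dataset[feature_header].unique().tolist())
    target_values = sorted(dataset[target_header].unique().tolist())

    frequencies = {}
    for feature_value in feature_values:
        rows = dataset[dataset[feature_header] == feature_value]
        frequencies[feature_value] = {
            target_value: int((rows[target_header] == target_value).sum())
            for target_value in target_values
        }
    return frequencies


# Made-up rows standing in for the training data.
dataset = pd.DataFrame({
    "educ": [1, 1, 1, 2, 2, 3],
    "vote": [0, 0, 1, 0, 1, 1],
})
educ_target_frequencies = feature_target_frequency_relation(dataset, "educ", "vote")
print(educ_target_frequencies)
# {1: {0: 2, 1: 1}, 2: {0: 1, 1: 1}, 3: {0: 0, 1: 1}}
```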

 

Script Output

This is interesting: in the training dataset the education (educ) feature has the value 1 for 13 (10 + 3) observations, and out of those 13, 10 voted for Clinton and only 3 voted for Dole. This is a strong signal for our classifier when predicting whom the voter will vote for.

Now that the frequencies are ready, we need to write a function which takes the calculated frequencies as input and stores the histogram.

 

Don’t get scared by the code; it’s just the histogram template from plotly. We only need to make a few modifications to the template for our needs. Once the modifications are done, we just need to call the function with the proper inputs.

As the target (vote) has two possible outcomes, we need to compare them side by side in the histogram. For that, I am collecting the results in keys, y0 and y1. For the feature educ, these are the results for keys, y0 and y1.

If you observe the edu_target_frequencies and the keys, y0, y1 you can clearly understand what we are trying to do here.
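A sketch of how keys, y0 and y1 could be pulled out of the frequencies, assuming they are stored as a nested dictionary of the form {feature_value: {target_class: count}}. The helper name split_frequencies is hypothetical. The counts for educ value 1 (10 and 3) come from the output discussed above; the counts for value 2 are made up.

```python
def split_frequencies(feature_target_frequencies):
    """Split the nested frequencies into x-axis keys and per-class counts."""
    # list(...) turns the dict_keys view into a plain list, which plotting
    # and JSON libraries can handle in Python 3.
    keys = list(feature_target_frequencies.keys())
    y0 = [feature_target_frequencies[key][0] for key in keys]  # votes for Clinton
    y1 = [feature_target_frequencies[key][1] for key in keys]  # votes for Dole
    return keys, y0, y1


edu_target_frequencies = {
    1: {0: 10, 1: 3},   # from the script output discussed above
    2: {0: 4, 1: 2},    # made-up counts for illustration
}
keys, y0, y1 = split_frequencies(edu_target_frequencies)
print("keys:", keys)
print("y0:", y0)
print("y1:", y1)
```

Here keys become the x-axis values, while y0 and y1 become the two bar series drawn next to each other.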

Now let’s call the feature_target_histogram function for all the training_features and check out the results.

Below are the stored histogram images after running the above code.

TVnews and Target (vote) histogram

[Figure: TVnews Target Histogram]

PID and Target (vote) histogram

[Figure: PID Target Histogram]

Age and Target (vote) histogram

[Figure: Age Target Histogram]

Education and Target (vote) histogram

[Figure: Educ Target Histogram]

Income and Target (vote) histogram

[Figure: Income Target Histogram]

Please spend some time understanding each histogram and how each feature relates to the target. Now let’s implement the logistic regression model in Python with the selected training features and the target.

Implementing the logistic regression model in python with scikit-learn

To implement the logistic regression model we created the function train_logistic_regression with train_x and train_y as input parameters. With this, the logistic regression model is created and trained on the training dataset. Now let’s check out the accuracies of the model.
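A minimal sketch of train_logistic_regression with scikit-learn. The function name and parameters come from the article; the default LogisticRegression settings and the eight made-up training rows are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def train_logistic_regression(train_x, train_y):
    """Create a logistic regression model and fit it on the training data."""
    logistic_regression_model = LogisticRegression()
    logistic_regression_model.fit(train_x, train_y)
    return logistic_regression_model


# Made-up training rows: two features and the vote target.
train_x = pd.DataFrame({
    "PID": [0, 1, 0, 1, 5, 6, 5, 6],
    "age": [30, 45, 52, 23, 41, 60, 35, 58],
})
train_y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1], name="vote")

trained_model = train_logistic_regression(train_x, train_y)
print("Classes seen in training:", trained_model.classes_)
```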

Logistic regression model accuracy calculation

Let’s write a function which takes the trained logistic regression model, the feature values (train_x or test_x) and the target values (train_y or test_y) and calculates the accuracy.

This function takes the trained model, the features and the targets as input. It uses the trained_model and the features to predict the targets, compares the predictions with the actual targets, and returns the accuracy score.
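The accuracy function described above can be sketched as follows. The function name model_accuracy is hypothetical, and the small single-feature dataset is made up for illustration.

```python
from sklearn import metrics
from sklearn.linear_model import LogisticRegression


def model_accuracy(trained_model, features, targets):
    """Predict the targets for the features and compare with the actual targets."""
    predictions = trained_model.predict(features)
    return metrics.accuracy_score(targets, predictions)


# Made-up data with a single feature (PID) and the vote target.
train_x = [[0], [1], [0], [1], [5], [6], [5], [6]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [[1], [6], [0]]
test_y = [0, 1, 0]

trained_model = LogisticRegression().fit(train_x, train_y)

train_accuracy = model_accuracy(trained_model, train_x, train_y)
test_accuracy = model_accuracy(trained_model, test_x, test_y)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

Comparing the two numbers is a quick check for overfitting: a train accuracy far above the test accuracy suggests the model has memorized the training data.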

Now let’s call the above function with train_x and train_y to get the accuracy of our model on the train dataset, and later call the same function with test_x and test_y to get the accuracy of our model on the test dataset.

Logistic regression model accuracy on train dataset

Script output:

Logistic regression model accuracy on test dataset

Script output:

With the selected training features we got a test accuracy of 91%. Play with different features and let me know the test accuracy you got in the comments.

We can save this trained logistic regression model to reuse it in other applications without retraining it. Check out the how to dump and load the trained classifier article.
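One common way to save and restore a trained scikit-learn model is Python’s pickle module; this is an assumption about the approach, not necessarily what the linked article uses. The sketch keeps the pickled bytes in memory; writing them to a file with pickle.dump works the same way. The training data is made up.

```python
import pickle

from sklearn.linear_model import LogisticRegression

# Made-up data: a single feature (PID) and the vote target.
trained_model = LogisticRegression().fit([[0], [1], [5], [6]], [0, 0, 1, 1])

# Serialize the trained model to bytes, then restore it again.
serialized_model = pickle.dumps(trained_model)
restored_model = pickle.loads(serialized_model)

original_predictions = trained_model.predict([[0], [6]])
restored_predictions = restored_model.predict([[0], [6]])
print("Original model predictions:", original_predictions)
print("Restored model predictions:", restored_predictions)
```

Note that scikit-learn still has to be installed wherever the model is loaded, and that pickle files should only be loaded from sources you trust.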

Use the scikit-learn predict method to predict whom a voter will vote for, given the voter’s features, and let me know the results in the comments section. If you face any difficulty in using the predict method, do check out how I use the predict method in the implementing decision tree classifier in python article.
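As a starting point, here is a sketch of the predict method on one new voter. All the values below are made up, and max_iter=1000 is an added setting to give the solver enough iterations on unscaled features.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

training_features = ["TVnews", "PID", "age", "educ", "income"]

# Made-up training rows standing in for train_x and train_y.
train_x = pd.DataFrame({
    "TVnews": [7, 1, 7, 4, 7, 3, 7, 1],
    "PID": [6, 1, 1, 0, 5, 1, 6, 0],
    "age": [36, 20, 24, 28, 68, 21, 77, 21],
    "educ": [3, 4, 6, 6, 6, 4, 4, 4],
    "income": [1, 5, 9, 3, 12, 7, 15, 2],
})[training_features]
train_y = pd.Series([1, 0, 0, 0, 1, 0, 1, 0], name="vote")

trained_model = LogisticRegression(max_iter=1000).fit(train_x, train_y)

# One hypothetical voter, with the same columns in the same order.
new_voter = pd.DataFrame(
    [{"TVnews": 3, "PID": 5, "age": 45, "educ": 4, "income": 10}],
    columns=training_features,
)

predicted_vote = trained_model.predict(new_voter)[0]        # 0 = Clinton, 1 = Dole
probabilities = trained_model.predict_proba(new_voter)[0]   # [P(Clinton), P(Dole)]
print("Predicted vote:", predicted_vote)
print("Class probabilities:", probabilities)
```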

Logistic regression model complete code

You can get the complete code in the Dataaspirant GitHub repository.

I hope you like this post. If you have any questions, then feel free to comment below.  If you want me to write on one particular topic, then do tell it to me in the comments below.

4 Responses to “How to implement logistic regression model in python for binary classification”

  • Hi,

    Thank you for clear demonstration of logistic regression. I tried running this code but I’m getting the following error.

    File "C:\Users\Banu\Anaconda3\lib\json\encoder.py", line 179, in default
    raise TypeError(repr(o) + " is not JSON serializable")

    TypeError: dict_keys([0, 1, 2, 3, 4, 5, 6, 7]) is not JSON serializable

    I even tried using keys = json.dumps( feature_target_frequencies.keys()). But couldn’t make it work. Any suggestions or corrections is highly appreciated.

    Thank you

  • Hello, thanks for a concise explanation,

    1) I have a question about feature selection, even though you briefly mentioned it: “As the feature engineering concepts too broad to explain we are going to use the below, selected features which are logically having the high chance in predicting to whom the voter will vote.”, what method did you use when picking [‘TVnews’, ‘PID’, ‘age’, ‘educ’, ‘income’] as the important features (I tried SelectKBest and RFE)

    2) Can you give any advice on selecting a specific method when it comes to feature selection.

    Thanks

    • Hi Alan,

      You can try different feature selection methods. In scikit-learn you can get the best features for modeling in ranked order (the most influential feature comes first).

      For this article I have taken the features which generally make an impact on voting. It’s not good to directly apply different methods to select the best features for modeling; we also need to add features based on domain knowledge.
