How to implement a logistic regression model in Python for binary classification
In the last few articles, we talked about different classification algorithms. For every classification algorithm, we first learned the background concepts, and in the follow-up article we used those concepts to build the classification model. Later we used the model to perform regression or classification tasks.
Likewise, in this article we are going to implement the logistic regression model in Python to perform a binary classification task. Here we concentrate mainly on the implementation of logistic regression in Python, as the background concepts are explained in the how the logistic regression model works article.
Grab a cup of tea or coffee and check out the prerequisite concepts below before we dive further.
Now let’s view the concepts we are going to learn by the end of this article.
Table of contents
- What is binary classification
- Logistic regression introduction
- Building logistic regression model in python
- Binary classification problem
- Dataset description
- Data creation for modeling and testing
- Selecting the features
- Split the data into train and test dataset
- Understanding the training data
- Implementing the logistic regression model in python with scikit-learn
- Logistic regression model accuracy calculation
What is binary classification
Binary classification is the task of classifying binary targets with supervised classification algorithms. A binary target means there are only 2 target values/classes. To get a clear picture of binary classification, let’s look at the binary classification problems below.
- Identifying the image as a cat or not.
- Targets: cat or not a cat
- Predicting whom the voter will vote for: Bill Clinton or Bob Dole.
- Targets: Bill Clinton or Bob Dole
- Forecasting whether it will rain tomorrow.
- Targets: rain or sunny day
Hopefully the above problems give you a clear understanding of binary classification.
Logistic regression introduction
The logistic regression algorithm is the simplest classification algorithm used for binary classification tasks, and it can also be used for solving multi-class classification problems. To summarize, the logistic regression model takes the feature values and calculates the probabilities using the sigmoid or softmax function.
The sigmoid function is used for binary classification problems, and the softmax function is used for multi-class classification problems.
Later, the calculated probabilities are used to find the target class. In general, the class with the highest probability is treated as the final target class.
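To make this concrete, below is a minimal sketch of the sigmoid calculation in Python. The weights, bias and feature values are made-up numbers for illustration only, not values learned from any dataset.

```python
# A minimal sketch of how logistic regression turns feature values into a
# probability and then a class. All numbers here are invented for illustration.
import numpy as np


def sigmoid(z):
    # Squashes any real number into the (0, 1) probability range
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([0.4, -0.2, 0.1])   # hypothetical learned coefficients
bias = -0.3                            # hypothetical learned intercept
features = np.array([1.0, 2.0, 3.0])   # one observation's feature values

probability = sigmoid(np.dot(weights, features) + bias)
predicted_class = 1 if probability >= 0.5 else 0
print "Probability :: ", probability
print "Predicted class :: ", predicted_class
```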
The above explanation is brief, as you may already know the logistic regression algorithm. If you are new to the logistic regression algorithm, please check out how the logistic regression algorithm works before you continue with this article.
Building logistic regression model in python
To build the logistic regression model in Python, we are going to use the scikit-learn package. We are going to follow the workflow below for implementing the logistic regression model.
- Load the data set.
- Understanding the data.
- Split the data into training and test dataset.
- Use the training dataset to train the logistic regression model.
- Calculate the accuracy of the trained model on the training dataset.
- Calculate the accuracy of the model on the test dataset.
We are going to follow the above workflow to build a logistic regression model that addresses a binary classification problem. Let’s look at the problem we are going to solve.
Binary classification problem
We are going to build the logistic regression model to predict whom a voter will vote for, given the voter’s details.
- Will the voter vote for Bill Clinton?
- Will the voter vote for Bob Dole?
Dataset description
The dataset we are going to use is the 1996 United States presidential election dataset. In a moment we are going to look at all the features in the dataset. Before that, let me give you a quick summary of this election.
The 1996 United States presidential election summary
- The 53rd United States presidential election was held on November 5, 1996.
- Nominees:
- Bill Clinton
- Bob Dole
- Ross Perot
- Bill Clinton won the election with 49.2% of the vote.
Would you love to read more about the election? Then check out the details in the Wikipedia United States presidential election article.
Before we begin the modeling let’s import the required python packages.
```python
# Required Python Packages
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

py.sign_in('YOUR_PLOTLY_USER_NAME', 'API_KEY')

from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
```
- The pandas package is required for data analysis. In the process of modeling the logistic regression classifier, we first load the dataset (in CSV format) into a pandas data frame and then play around with the loaded dataset.
- The numpy package is for performing numerical calculations.
- The plotly package is for visualizing the dataset for a better understanding.
- We need to sign in with our plotly credentials to use this package. You can find the credentials in your plotly account after you create one.
- The sklearn package is for modeling the machine learning algorithms.
- The train_test_split method splits the dataset into train and test datasets.
- The LogisticRegression method models the logistic regression classifier.
- The metrics module calculates the accuracy of the trained classifiers.
Now let’s load the dataset and look at all the features available to model the logistic regression model in Python. You can download the dataset from our GitHub.
Load the dataset
```python
# Files
DATA_SET_PATH = "../Inputs/anes_dataset.csv"


def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    print "Number of Observations :: ", len(dataset)


if __name__ == "__main__":
    main()
```
- We created the main function and loaded the elections dataset into a pandas dataframe. As the dataset is in CSV format, we call the pandas read_csv function with the dataset path as a parameter.
- To know the number of observations (rows) in the dataset, we call the Python len() function on the loaded dataset.
Script Output:
Number of Observations ::  944
From the script output, the number of observations in the dataset is 944. We are going to play with these observations to model the logistic regression model 🙂
As we already know the dataset size, let’s now check out a few observations to learn about the features in the dataset.
```python
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    print "Number of Observations :: ", len(dataset)

    # Get the first few observations
    print dataset.head()


if __name__ == "__main__":
    main()
```
We can use the pandas head method to get the top observations of the loaded dataset.
Script Output:
```
   popul  TVnews  selfLR  ClinLR  DoleLR  PID  age  educ  income  vote
0      0       7       7       1       6    6   36     3       1     1
1    190       1       3       3       5    1   20     4       1     0
2     31       7       2       2       6    1   24     6       1     0
3     83       4       3       4       5    1   28     6       1     0
4    640       7       5       6       4    0   68     6       1     0
```
Let’s write a function to get the header names of the given dataset. Later we store all the header names, which can be used in modeling the logistic regression.
```python
def dataset_headers(dataset):
    """
    To get the dataset header names
    :param dataset: loaded dataset into pandas DataFrame
    :return: list of header names
    """
    return list(dataset.columns.values)
```
This dataset_headers function takes the loaded dataset and returns the header names found in it. As the dataset_headers function expects the loaded dataset, we are going to call it inside the main function we wrote earlier.
```python
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)

    headers = dataset_headers(dataset)
    print "Data set headers :: {headers}".format(headers=headers)


if __name__ == "__main__":
    main()
```
Script Output:
Data set headers :: ['popul', 'TVnews', 'selfLR', 'ClinLR', 'DoleLR', 'PID', 'age', 'educ', 'income', 'vote']
Now let’s discuss each header (feature).
- popul:
- The population of the census place, in thousands.
- TVnews:
- The number of times per week the voter watches TV news.
- selfLR
- The person’s self-reported political leanings, from left to right.
- ClinLR
- The person’s impression of Bill Clinton’s political leanings, from left to right.
- DoleLR
- The person’s impression of Bob Dole’s political leanings, from left to right.
- PID
- Party identification of the person.
- If the PID is:
- 0 means Strong Democrat,
- 1 means Weak Democrat,
- 2 means Independent Democrat, and so on.
- age
- Age of the voter.
- educ
- Education qualification of the voter.
- income
- Income of the voter.
- vote
- The vote is the target which we are going to predict using the trained logistic regression model.
- vote has two possible outcomes: 0 means Clinton, 1 means Dole.
Data creation for modeling and testing
Selecting the features
Out of the above features (headers), we are going to use only the headers below. To select the best features from all the available features, we would use feature engineering concepts.
As the feature engineering concepts are too broad to explain here, we are going to use the selected features below, which logically have a high chance of predicting whom the voter will vote for.
```python
training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
target = 'vote'
```
For training the logistic regression model, we are going to feed in the training_features and the target.
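If you are curious about an automated alternative to hand-picking features, below is a minimal sketch using scikit-learn’s SelectKBest with the chi-squared test. It assumes the dataset and target are already loaded as above, and it is a starting point rather than the method this article follows.

```python
# A hedged sketch of automated feature selection with SelectKBest.
# Assumes `dataset` and `target` are already defined as above.
from sklearn.feature_selection import SelectKBest, chi2

all_features = ['popul', 'TVnews', 'selfLR', 'ClinLR', 'DoleLR',
                'PID', 'age', 'educ', 'income']
selector = SelectKBest(score_func=chi2, k=5)
selector.fit(dataset[all_features], dataset[target])

# Keep only the headers whose chi-squared scores rank in the top 5
selected = [header for header, keep
            in zip(all_features, selector.get_support()) if keep]
print "Top 5 features by chi-squared score :: ", selected
```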
Split the data into train and test dataset
```python
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train, Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features],
                                                        dataset[target], train_size=0.7)
    print "train_x size :: ", train_x.shape
    print "train_y size :: ", train_y.shape
    print "test_x size :: ", test_x.shape
    print "test_y size :: ", test_y.shape


if __name__ == "__main__":
    main()
```
- To split the dataset into train and test datasets, we use the scikit-learn (sklearn) method train_test_split with the selected training features data and the target.
- We use train_size=0.7, which means that out of all the observations, 70% are considered for training and the remaining 30% for testing.
- The train_test_split method gives four outputs: train_x, test_x, train_y and test_y.
- To know the size of each of the above four outputs, we print their shapes.
Script Output:
```
train_x size ::  (660, 5)
train_y size ::  (660,)
test_x size ::  (284, 5)
test_y size ::  (284,)
```
Understanding the training data
To understand the training data features, let’s look at each possible value of each feature and its relation with the target classes (0 for Clinton, 1 for Dole).
We can find the relation between a feature and the target with a histogram. To create the histogram, we need the frequency count of each possible value of the feature against each of the target classes.
Let me explain what I am talking about with an example.
Example:
Suppose we have two features for building a classification model which predicts whether a student will get an A grade in the exam. Let’s say the features are the number of study classes attended in a day and the gender of the student. To create the histogram to find the relation between gender and the target (A grade or not), we need frequencies like the ones below.
As the feature has two possible values (boy or girl) and the target also has two possible outcomes (A grade or not), we need:
- Frequency count of boy and A grade
- Frequency count of boy and not A grade
- Frequency count of girl and A grade
- Frequency count of girl and not A grade
So now let’s write a function which takes the dataset, a feature header and the target header to get the above kind of frequency results.
Frequencies on feature and target relation
```python
def unique_observations(dataset, header, method=1):
    """
    To get unique observations in the loaded pandas DataFrame column
    :param dataset:
    :param header:
    :param method: Method to perform the unique (default method=1 for pandas and method=0 for numpy)
    :return:
    """
    try:
        if method == 0:
            # With Numpy
            observations = np.unique(dataset[[header]])
        elif method == 1:
            # With Pandas
            observations = pd.unique(dataset[header].values.ravel())
        else:
            observations = None
            print "Wrong method type, Use 1 for pandas and 0 for numpy"
    except Exception as e:
        observations = None
        print "Error: {error_msg} \n Please check the inputs once..!".format(error_msg=e.message)
    return observations


def feature_target_frequency_relation(dataset, f_t_headers):
    """
    To get the frequency relation between targets and the unique feature observations
    :param dataset:
    :param f_t_headers: feature and target header
    :return: feature unique observations dictionary of frequency count dictionary
    """
    feature_unique_observations = unique_observations(dataset, f_t_headers[0])
    unique_targets = unique_observations(dataset, f_t_headers[1])

    frequencies = {}
    for feature in feature_unique_observations:
        frequencies[feature] = {
            unique_targets[0]: len(dataset[(dataset[f_t_headers[0]] == feature) &
                                           (dataset[f_t_headers[1]] == unique_targets[0])]),
            unique_targets[1]: len(dataset[(dataset[f_t_headers[0]] == feature) &
                                           (dataset[f_t_headers[1]] == unique_targets[1])])
        }
    return frequencies
```
To get the frequency relation between the target and a feature, we wrote two functions: unique_observations and feature_target_frequency_relation.
The unique_observations function takes the dataset and a header as input parameters and returns the unique values in the dataset for that header. For example, if the data is [1, 2, 1, 2, 3, 5, 1, 4] then the output will be [1, 2, 3, 4, 5].
The next function, feature_target_frequency_relation, takes the dataset and the feature and target headers as input parameters and returns the frequencies.
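As a quick sanity check on our hand-rolled function, pandas can produce the same counts in a single call with pd.crosstab. This sketch is only a cross-verification (assuming the dataset is already loaded as above), not part of the article’s workflow.

```python
# Cross-checking feature_target_frequency_relation with pandas.
# pd.crosstab counts how often each (feature value, target value) pair occurs.
educ_vote_counts = pd.crosstab(dataset['educ'], dataset['vote'])
print educ_vote_counts
```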
Let’s run the feature_target_frequency_relation function with the feature header (“educ”) and the target header (“vote”).
```python
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train, Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features],
                                                        dataset[target], train_size=0.7)

    print "edu_target_frequencies :: ", feature_target_frequency_relation(dataset, [training_features[3], target])


if __name__ == "__main__":
    main()
```
Script Output
edu_target_frequencies :: {1: {0: 10, 1: 3}, 2: {0: 38, 1: 14}, 3: {0: 153, 1: 95}, 4: {0: 106, 1: 81}, 5: {0: 53, 1: 37}, 6: {0: 119, 1: 108}, 7: {0: 72, 1: 55}}
This seems interesting: in the dataset, the education (educ) feature value is 1 for 13 (10 + 3) voters, and out of those 13, 10 voted for Clinton and only 3 voted for Dole. That is a strong signal for our classifier when predicting whom the voter will vote for.
Now that we are ready with the frequencies, we need to write a function which takes the calculated frequencies as input and stores the histogram.
```python
def feature_target_histogram(feature_target_frequencies, feature_header):
    """
    :param feature_target_frequencies:
    :param feature_header:
    :return:
    """
    keys = feature_target_frequencies.keys()
    y0 = [feature_target_frequencies[key][0] for key in keys]
    y1 = [feature_target_frequencies[key][1] for key in keys]

    trace1 = go.Bar(
        x=keys,
        y=y0,
        name='Clinton'
    )
    trace2 = go.Bar(
        x=keys,
        y=y1,
        name='Dole'
    )
    data = [trace1, trace2]
    layout = go.Layout(
        barmode='group',
        title='Feature :: ' + feature_header + ' Clinton Vs Dole votes Frequency',
        xaxis=dict(title="Feature :: " + feature_header + " classes"),
        yaxis=dict(title="Votes Frequency")
    )
    fig = go.Figure(data=data, layout=layout)
    # plot_url = py.plot(fig, filename=feature_header + ' - Target - Histogram')
    py.image.save_as(fig, filename=feature_header + '_Target_Histogram.png')
```
Don’t get scared by the code; it’s just the histogram template from plotly. We only need to make a few modifications to the template for our needs. Once the modifications are done, we just need to call the function with the proper inputs.
As the target (vote) has two possible outcomes, we need to compare the relation with histograms. For that, I am getting the results in keys, y0 and y1. For the feature (educ), these are the results for keys, y0 and y1:
```
edu_target_frequencies :: {1: {0: 10, 1: 3}, 2: {0: 38, 1: 14}, 3: {0: 153, 1: 95}, 4: {0: 106, 1: 81}, 5: {0: 53, 1: 37}, 6: {0: 119, 1: 108}, 7: {0: 72, 1: 55}}
keys :: [1, 2, 3, 4, 5, 6, 7]
y0 :: [10, 38, 153, 106, 53, 119, 72]
y1 :: [3, 14, 95, 81, 37, 108, 55]
```
If you observe the edu_target_frequencies output and the keys, y0 and y1 values, you can clearly understand what we are trying to do here.
Now let’s call the feature_target_histogram function for all the training_features and check out the results.
```python
def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train, Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features],
                                                        dataset[target], train_size=0.7)

    for feature in training_features:
        feature_target_frequencies = feature_target_frequency_relation(dataset, [feature, target])
        feature_target_histogram(feature_target_frequencies, feature)


if __name__ == "__main__":
    main()
```
Below are the stored histogram images after running the above code.
TVnews and Target(vote) histogram
PID and Target(vote) histogram
Age and Target(vote) histogram
Education and Target(vote) histogram
Income and Target(vote) histogram
Please spend some time understanding each histogram and how it relates to the target. Now let’s implement the logistic regression model in Python with the selected training features and the target.
Implementing the logistic regression model in python with scikit-learn
```python
def train_logistic_regression(train_x, train_y):
    """
    Training logistic regression model with train dataset features (train_x) and target (train_y)
    :param train_x:
    :param train_y:
    :return:
    """
    logistic_regression_model = LogisticRegression()
    logistic_regression_model.fit(train_x, train_y)
    return logistic_regression_model


def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train, Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features],
                                                        dataset[target], train_size=0.7)

    # Training Logistic regression model
    trained_logistic_regression_model = train_logistic_regression(train_x, train_y)


if __name__ == "__main__":
    main()
```
To implement the logistic regression model, we created the function train_logistic_regression with train_x and train_y as input parameters. With this, the logistic regression model is created and trained on the training dataset. Now let’s check out the accuracies of the model.
Logistic regression model accuracy calculation
Let’s write a function which takes the trained logistic regression model, the feature values (train_x or test_x) and the target values (train_y or test_y) and calculates the accuracy.
```python
def model_accuracy(trained_model, features, targets):
    """
    Get the accuracy score of the model
    :param trained_model:
    :param features:
    :param targets:
    :return:
    """
    accuracy_score = trained_model.score(features, targets)
    return accuracy_score
```
This function takes the trained model, features and targets as input. It uses the trained_model and the features to predict the targets, compares the predictions with the actual targets and returns the accuracy score.
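For reference, the score method of a scikit-learn classifier is equivalent to predicting the targets and measuring the accuracy explicitly. Below is a minimal sketch of the same calculation with the metrics module we imported earlier, using the same parameter names as model_accuracy.

```python
# The same accuracy computed explicitly with sklearn.metrics. For a
# classifier, trained_model.score(features, targets) returns this value.
predicted_targets = trained_model.predict(features)
explicit_accuracy = metrics.accuracy_score(targets, predicted_targets)
print "Accuracy (explicit) :: ", explicit_accuracy
```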
Now let’s call the above function with train_x and train_y to get the accuracy of our model on the train dataset, and later call the same function with test_x and test_y to get the accuracy of our model on the test dataset.
Logistic regression model accuracy on train dataset
```python
train_accuracy = model_accuracy(trained_logistic_regression_model, train_x, train_y)
print "Train Accuracy :: ", train_accuracy
```
Script output:
Train Accuracy :: 0.901515151515
Logistic regression model accuracy on test dataset
```python
# Testing the logistic regression model
test_accuracy = model_accuracy(trained_logistic_regression_model, test_x, test_y)
print "Test Accuracy :: ", test_accuracy
```
Script output:
Test Accuracy :: 0.911971830986
With the selected training features we got a test accuracy of 91%. Play with different features and let me know the test accuracy you get in the comments.
We can save this trained logistic regression model to use in other applications without retraining it from scratch. Check out the how to dump and load the trained classifier article.
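As a quick preview of that article, here is a minimal sketch using Python’s built-in pickle module; the file name is just an example, not one used elsewhere in this article.

```python
# A minimal sketch of saving and reloading the trained model with pickle.
import pickle

# Save the trained model to disk
with open('logistic_regression_model.pkl', 'wb') as model_file:
    pickle.dump(trained_logistic_regression_model, model_file)

# Load it back later, in this or another application
with open('logistic_regression_model.pkl', 'rb') as model_file:
    reloaded_model = pickle.load(model_file)
```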
Use the scikit-learn predict method to predict whom the voter will vote for, given the voter features, and let me know the results in the comments section. If you face any difficulty in using the predict method, do check out how I use the predict method in implementing the decision tree classifier in Python.
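To get you started, below is a minimal sketch of the predict call. The feature values describe a made-up voter, so treat the prediction as an illustration only.

```python
# Predicting a single made-up voter with the trained model. The feature order
# must match training_features: ['TVnews', 'PID', 'age', 'educ', 'income'].
hypothetical_voter = [[7, 2, 45, 4, 3]]  # invented values for illustration
prediction = trained_logistic_regression_model.predict(hypothetical_voter)
print "Predicted vote (0 = Clinton, 1 = Dole) :: ", prediction[0]
```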
Logistic regression model complete code
```python
#!/usr/bin/env python
# logistic_regression.py
# Author : Saimadhu
# Date: 19-March-2017
# About: Implementing Logistic Regression Classifier to predict to whom the voter will vote.

# Required Python Packages
import pandas as pd
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go

py.sign_in('YOUR_PLOTLY_USER_NAME', 'API_KEY')

from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Files
DATA_SET_PATH = "../Inputs/anes_dataset.csv"


def dataset_headers(dataset):
    """
    To get the dataset header names
    :param dataset: loaded dataset into pandas DataFrame
    :return: list of header names
    """
    return list(dataset.columns.values)


def unique_observations(dataset, header, method=1):
    """
    To get unique observations in the loaded pandas DataFrame column
    :param dataset:
    :param header:
    :param method: Method to perform the unique (default method=1 for pandas and method=0 for numpy)
    :return:
    """
    try:
        if method == 0:
            # With Numpy
            observations = np.unique(dataset[[header]])
        elif method == 1:
            # With Pandas
            observations = pd.unique(dataset[header].values.ravel())
        else:
            observations = None
            print "Wrong method type, Use 1 for pandas and 0 for numpy"
    except Exception as e:
        observations = None
        print "Error: {error_msg} \n Please check the inputs once..!".format(error_msg=e.message)
    return observations


def feature_target_frequency_relation(dataset, f_t_headers):
    """
    To get the frequency relation between targets and the unique feature observations
    :param dataset:
    :param f_t_headers: feature and target header
    :return: feature unique observations dictionary of frequency count dictionary
    """
    feature_unique_observations = unique_observations(dataset, f_t_headers[0])
    unique_targets = unique_observations(dataset, f_t_headers[1])

    frequencies = {}
    for feature in feature_unique_observations:
        frequencies[feature] = {
            unique_targets[0]: len(dataset[(dataset[f_t_headers[0]] == feature) &
                                           (dataset[f_t_headers[1]] == unique_targets[0])]),
            unique_targets[1]: len(dataset[(dataset[f_t_headers[0]] == feature) &
                                           (dataset[f_t_headers[1]] == unique_targets[1])])
        }
    return frequencies


def feature_target_histogram(feature_target_frequencies, feature_header):
    """
    To save a grouped bar chart of the feature-target frequencies
    :param feature_target_frequencies:
    :param feature_header:
    :return:
    """
    keys = feature_target_frequencies.keys()
    y0 = [feature_target_frequencies[key][0] for key in keys]
    y1 = [feature_target_frequencies[key][1] for key in keys]

    trace1 = go.Bar(x=keys, y=y0, name='Clinton')
    trace2 = go.Bar(x=keys, y=y1, name='Dole')
    data = [trace1, trace2]
    layout = go.Layout(
        barmode='group',
        title='Feature :: ' + feature_header + ' Clinton Vs Dole votes Frequency',
        xaxis=dict(title="Feature :: " + feature_header + " classes"),
        yaxis=dict(title="Votes Frequency")
    )
    fig = go.Figure(data=data, layout=layout)
    # plot_url = py.plot(fig, filename=feature_header + ' - Target - Histogram')
    py.image.save_as(fig, filename=feature_header + '_Target_Histogram.png')


def train_logistic_regression(train_x, train_y):
    """
    Training logistic regression model with train dataset features (train_x) and target (train_y)
    :param train_x:
    :param train_y:
    :return:
    """
    logistic_regression_model = LogisticRegression()
    logistic_regression_model.fit(train_x, train_y)
    return logistic_regression_model


def model_accuracy(trained_model, features, targets):
    """
    Get the accuracy score of the model
    :param trained_model:
    :param features:
    :param targets:
    :return:
    """
    accuracy_score = trained_model.score(features, targets)
    return accuracy_score


def main():
    """
    Logistic Regression classifier main
    :return:
    """
    # Load the data set for training and testing the logistic regression classifier
    dataset = pd.read_csv(DATA_SET_PATH)
    print "Number of Observations :: ", len(dataset)

    # Get the first few observations
    print dataset.head()

    headers = dataset_headers(dataset)
    print "Data set headers :: {headers}".format(headers=headers)

    training_features = ['TVnews', 'PID', 'age', 'educ', 'income']
    target = 'vote'

    # Train, Test data split
    train_x, test_x, train_y, test_y = train_test_split(dataset[training_features],
                                                        dataset[target], train_size=0.7)
    print "train_x size :: ", train_x.shape
    print "train_y size :: ", train_y.shape
    print "test_x size :: ", test_x.shape
    print "test_y size :: ", test_y.shape

    print "edu_target_frequencies :: ", feature_target_frequency_relation(dataset, [training_features[3], target])

    for feature in training_features:
        feature_target_frequencies = feature_target_frequency_relation(dataset, [feature, target])
        feature_target_histogram(feature_target_frequencies, feature)

    # Training Logistic regression model
    trained_logistic_regression_model = train_logistic_regression(train_x, train_y)
    train_accuracy = model_accuracy(trained_logistic_regression_model, train_x, train_y)

    # Testing the logistic regression model
    test_accuracy = model_accuracy(trained_logistic_regression_model, test_x, test_y)

    print "Train Accuracy :: ", train_accuracy
    print "Test Accuracy :: ", test_accuracy


if __name__ == "__main__":
    main()
```
You can get the complete code in the Dataaspirant GitHub repository.
I hope you like this post. If you have any questions, feel free to comment below. If you want me to write about one particular topic, do tell me in the comments.
Hey man, good tutorial. Just wanted to remind you that in the complete code, you have put your username and password for plotly sign in. Thanks for the tutorial again.
Hi Omkaar Kamath,
We have updated the password of plotly in the credential section, thanks a lot again.
Thanks and happy learning!
Hello, thanks for a concise explanation,
1) I have a question about feature selection, even though you briefly mentioned it: “As the feature engineering concepts too broad to explain we are going to use the below, selected features which are logically having the high chance in predicting to whom the voter will vote.”, what method did you use when picking [‘TVnews’, ‘PID’, ‘age’, ‘educ’, ‘income’] as the important features (I tried SelectKBest and RFE)
2) Can you give any advice on selecting a specific method when it comes to feature selection.
Thanks
Hi Alan,
You can try different feature selection methods. In scikit-learn you can rank the best features for modeling in order (the most influential feature comes first).
When it comes to this article, I have taken the features which generally make an impact on voting. It’s not good to directly apply different methods to select the best features for modeling; we need to add features based on domain knowledge.
Hi,
Thank you for clear demonstration of logistic regression. I tried running this code but I’m getting the following error.
File “C:\Users\Banu\Anaconda3\lib\json\encoder.py”, line 179, in default
raise TypeError(repr(o) + ” is not JSON serializable”)
TypeError: dict_keys([0, 1, 2, 3, 4, 5, 6, 7]) is not JSON serializable
I even tried using keys = json.dumps(feature_target_frequencies.keys()), but couldn’t make it work. Any suggestions or corrections are highly appreciated.
Thank you
Hi Bhavana,
Thanks for your compliment. I guess the issue is with the Python versions: in Python 3, dict.keys() returns a view object rather than a list, which is why plotly fails to serialize it. Changing that line to keys = list(feature_target_frequencies.keys()) inside the feature_target_histogram function should fix the TypeError. Else you can create a Python 2 virtual environment and run the code. Below is the link.
https://dataaspirant.com/2016/03/22/python-datamining-packages-virtual-environment-setup-in-ubuntu/