How to save Scikit Learn models with Python Pickle library

February 13, 2017 Saimadhu Polamuri

Save the Scikit Learn models

Save the trained scikit learn models with Python Pickle

The final and most exciting phase of solving a data science problem is seeing how well the trained model performs on the test dataset or in production. Sometimes the trained model outperforms our expectations; sometimes it falls short of them.


However, this phase is iterative. We keep changing the way we train the model until it is well optimized and gives reasonable results. Because we test the model repeatedly, and not always on our local system, we need a way to save the trained model as a black box and send that black box wherever it is needed.

Why we need to save the scikit learn models

Let’s understand the need to save trained scikit-learn models a bit more. In the learning phase of machine learning, we follow the workflow below for any machine learning project.

Machine Learning Project workflow

Getting the project-related data from different sources.
Performing data cleaning techniques on the gathered data.
Choosing the split criteria (for example, 70% training and 30% testing) to split the data into training and testing datasets.
Using the training dataset to train a classification or clustering algorithm.
Using the trained model to predict the target class for the test dataset.
If the trained model's accuracy is not good enough, making changes in any of the above stages.

After data cleaning, the phase where we spend most of our time is the last one: getting a trained model that performs well enough to place in production. Testing this trained model won’t always happen on the local system. In most cases, the performance of the trained model is measured in the real environment.

Why save the trained models example

Suppose we build an email classification model to classify emails as spam or not spam for a free email service provider. We may create the spam identification model on our local system, but to classify every email sent to a user, the trained classifier needs to be on the production server or integrated into an app.

Even though the whole project is Python based, we usually can’t run the model training on the servers where the entire application framework is running. So in such cases, we need to dump the trained classifier. In plain words, dumping the trained classifier means storing its learned parameters (for example, coefficients) and some related information.

In the next step, we need to load the dumped model wherever it is required. Once we have successfully loaded the previously dumped model, the classification of emails should happen without any issues. So in situations like these, we need a way to dump the trained models and use them whenever and wherever they are required.

Methods to save the scikit learn models

Two common ways to save trained scikit-learn models are the packages

Pickle (Python's object serialization library)
Joblib (a separate library that the scikit-learn documentation suggests for saving models)

In this article, we are going to cover only the Pickle library. Let’s first understand what the Pickle library does. Then we will learn how to save scikit-learn models and load them back as the same trained models.

What is Pickle?

Pickle is part of the Python standard library. It is a powerful and common choice for tasks like

Serialization
Deserialization

In Python, these two operations are popularly known as pickling and unpickling.

Pickling
- Pickling is the process of converting a Python object hierarchy into a stream of bytes.
Unpickling
- Unpickling is the inverse process: converting the pickled stream of bytes back into the original Python object hierarchy.
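
The two definitions above can be tried without touching the file system. The following is a minimal sketch that uses pickle.dumps and pickle.loads, the in-memory counterparts of dump and load; the dictionary contents are made up for illustration.

```python
import pickle

# A nested object: a dict holding a list and a tuple (illustrative values).
original = {"name": "numbers", "values": [1, 2, 3], "shape": (3,)}

# Pickling: Python object -> stream of bytes.
pickled_bytes = pickle.dumps(original)

# Unpickling: stream of bytes -> a new Python object equal to the original.
restored = pickle.loads(pickled_bytes)

print(type(pickled_bytes))    # <class 'bytes'>
print(restored == original)   # True
print(restored is original)   # False: unpickling builds a separate copy
```

The restored object compares equal to the original but is a different object in memory, which is what makes it possible to rebuild it on another machine.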

The objects can be of many kinds; for example, we can pickle a Python list or dictionary. Below is a list of which kinds of objects can be pickled and which cannot.

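What can be pickled and what can’t be pickled in Python

What can be pickled

All numeric data types, including complex numbers.
The Boolean data type.
Python strings, and lists, tuples, and dictionaries that contain only picklable objects.
Built-in functions, and functions and classes defined at the top level of a module.

What can’t be pickled

Lambda functions, generators, and objects tied to live system resources such as open files.
Highly recursive (very deeply nested) Python objects, which can exceed the recursion limit.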
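
One way to check whether a particular object can be pickled is simply to try it and catch the failure. The sketch below assumes a small helper named is_picklable, which is a hypothetical name for this illustration and not part of the pickle library. A lambda and a generator are used as examples of objects that pickle refuses.

```python
import pickle


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # Different Python versions report unpicklable objects
        # with different exception types, so catch all three.
        return False


print(is_picklable([1, 2.5, 3 + 4j, True, "text"]))  # True
print(is_picklable({"key": (1, 2)}))                 # True
print(is_picklable(len))                             # True (built-in function)
print(is_picklable(lambda x: x + 1))                 # False
print(is_picklable(n for n in range(3)))             # False (generator)
```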

Python Pickle Examples

Before we pickle scikit-learn models, let’s quickly see an example of how to pickle and unpickle a Python list object.

Pickle the Python List object

import pickle

# pickle list object

numbers_list = [1, 2, 3, 4, 5]
list_pickle_path = 'list_pickle.pkl'

# Open the pickle file in binary write mode
list_pickle = open(list_pickle_path, 'wb')
pickle.dump(numbers_list, list_pickle)
list_pickle.close()

Import the Python standard serialization package pickle.
Create a Python list object with the numbers 1 to 5.
Set the path where the pickled list will be stored ('list_pickle.pkl').
Open the file at that path in binary write mode as list_pickle.
Call pickle's dump method with numbers_list and the opened list_pickle to write the pickle.
Close the pickle file.

The above code creates list_pickle.pkl on our local system. We can use this pkl file wherever we like. Now let’s write the code to unpickle it and use the pickled list object again.

# unpickling the list object

# Need to open the pickled list object in binary read mode

list_pickle_path = 'list_pickle.pkl'
list_unpickle = open(list_pickle_path, 'rb')

# load the unpickled object into a variable
numbers_list = pickle.load(list_unpickle)
list_unpickle.close()

print("Numbers List :: ", numbers_list)

Open list_pickle.pkl in binary read mode ('rb').
Use the pickle load method to load the opened list_unpickle.
Print the numbers list again.
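
The same dump and load steps can also be written with Python's with statement, which closes the file automatically even if an error occurs. This is a minimal sketch of that alternative style. The temporary directory is an assumption made here so the example leaves no files behind; in a real project the path would point wherever the pickle should live.

```python
import os
import pickle
import tempfile

numbers_list = [1, 2, 3, 4, 5]

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    list_pickle_path = os.path.join(tmp_dir, "list_pickle.pkl")

    # 'wb': pickle writes bytes, so the file is opened in binary write mode.
    with open(list_pickle_path, "wb") as list_pickle:
        pickle.dump(numbers_list, list_pickle)

    # 'rb': read the bytes back; text mode ('r') fails on Python 3.
    with open(list_pickle_path, "rb") as list_unpickle:
        loaded_list = pickle.load(list_unpickle)

print("Numbers List :: ", loaded_list)
```

Both files are closed as soon as their with blocks end, so no explicit close() call is needed.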

Now that we have seen how to pickle Python objects with the above example, let’s create a simple scikit-learn model and dump it into a pickle file.

Building a Decision tree classifier with Scikit-Learn

To implement the decision tree classifier in Python with Scikit-Learn I am using the code from the article How to build the decision tree classifier in Python with scikit-learn

import pickle
import pandas as pd
# Scikit-learn method to split the dataset into train and test dataset
# (the old sklearn.cross_validation module was replaced by sklearn.model_selection)
from sklearn.model_selection import train_test_split
# Scikit-learn method to implement the decision tree classifier
from sklearn.tree import DecisionTreeClassifier


# Load the dataset
balance_scale_data = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/balance-scale/balance-scale.data', sep=',', header=None)
print("Dataset Length:: ", len(balance_scale_data))
print("Dataset Shape:: ", balance_scale_data.shape)

# Split the dataset into train and test dataset
X = balance_scale_data.values[:, 1:5]
Y = balance_scale_data.values[:, 0]


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)

# Decision model with Gini index criterion
decision_tree_model = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3, min_samples_leaf=5)
decision_tree_model.fit(X_train, y_train)
print("Decision Tree classifier :: ", decision_tree_model)

Import all the required Python packages.
Download the balance scale dataset from the UCI web link.
Split the downloaded dataset into train and test datasets.
Fit the decision tree classifier and print the trained model.

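After running the above script for modeling the decision tree classifier, we can expect the below output. Recent scikit-learn versions print only the parameters that differ from their defaults, so the classifier line may be shorter than shown here.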

Script Output

Dataset Length::  625
Dataset Shape::  (625, 5)
Decision Tree classifier ::  DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

Now let’s save the trained decision tree classifier to the local system as a pkl file so we can use it on other systems or in other applications.

Dump the scikit learn models with Python Pickle

# Dump the trained decision tree classifier with Pickle
decision_tree_pkl_filename = 'decision_tree_classifier_20170212.pkl'
# Open the file to save as pkl file
decision_tree_model_pkl = open(decision_tree_pkl_filename, 'wb')
pickle.dump(decision_tree_model, decision_tree_model_pkl)
# Close the pickle instances
decision_tree_model_pkl.close()

Create decision_tree_pkl_filename with the path where the pickled file should be placed.
Open that file in binary write mode as decision_tree_model_pkl.
Call the pickle dump method to pickle the trained decision tree classifier.
Close the opened decision_tree_model_pkl.

Now load the pickled decision tree model.

Loading the scikit learn models with Pickle

# Loading the saved decision tree model pickle
decision_tree_model_pkl = open(decision_tree_pkl_filename, 'rb')
decision_tree_model = pickle.load(decision_tree_model_pkl)
decision_tree_model_pkl.close()
print("Loaded Decision tree model :: ", decision_tree_model)

Open decision_tree_pkl_filename in binary read mode.
Use the pickle load method to load the saved decision_tree_model.
Print the loaded decision tree classifier.

Script Output

Loaded Decision tree model ::  DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=5,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=100, splitter='best')

Once we have successfully loaded the saved scikit-learn model, we can use it in the usual way to predict on the test dataset or on production servers.
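
As a quick sanity check, a loaded model should give exactly the same predictions as the model that was dumped. The sketch below illustrates this with a tiny made-up dataset (hypothetical values, not the balance scale data) and an in-memory round trip through pickle.dumps and pickle.loads instead of a file, so it runs without downloading anything.

```python
import pickle

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up dataset (hypothetical values, not the balance scale data).
X_train = [[1, 1], [1, 2], [2, 1], [4, 5], [5, 4], [5, 5]]
y_train = ["L", "L", "L", "R", "R", "R"]
X_new = [[1, 1], [5, 5], [2, 2]]

model = DecisionTreeClassifier(random_state=100)
model.fit(X_train, y_train)

# Round trip through pickle, kept in memory for brevity.
loaded_model = pickle.loads(pickle.dumps(model))

original_predictions = model.predict(X_new)
loaded_predictions = loaded_model.predict(X_new)

# The loaded model should reproduce the original model's predictions.
print(list(loaded_predictions))
print(list(original_predictions) == list(loaded_predictions))  # True
```

When the pickle is moved to another machine, it is safest to load it with the same scikit-learn version that was used to train and dump the model.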

Conclusion

In this article, we learned about the Python serialization library Pickle. Then we learned how to use Pickle to save trained scikit-learn models and how to load them back as trained models.

If you would like to learn more about building machine learning models in Python, please have a look at the machine learning models implementation in python.

8 Responses to “How to save Scikit Learn models with Python Pickle library”

Manoj
5 years ago

Hi Saimadhu,

Really an useful article.This gave some good awareness on how trained models will be implemented in production. Thanks for all your efforts.

May I please request to write an article on evaluating model performance over a period of time and how model learns from the new incoming data . If there is an existing article available, please redirect me.
- Saimadhu Polamuri
  4 years ago
  
  Hi Manoj,
  
  We have written an article about how to evaluate the machine learning classification model. Below is the link for that.
  
  Link: https://dataaspirant.com/six-popular-classification-evaluation-metrics-in-machine-learning/
  
  Thanks and happy learning.
Debarati
5 years ago

Hi, I tried the above sample pickle script:
—————–
import pickle

numbers_list = [1, 2, 3, 4, 5]
list_pickle_path = ‘list_pickle.pkl’

# Create an variable to pickle and open it in write mode
list_pickle = open(list_pickle_path, ‘wb’)
pickle.dump(numbers_list, list_pickle)
list_pickle.close()
—————-
list_pickle_path = ‘list_pickle.pkl’
list_unpickle = open(list_pickle_path, ‘r’)

# load the unpickle object into a variable
numbers_list = pickle.load(list_unpickle)

print(“Numbers List :: “, numbers_list)
————–
the file got created, but while unpickling, i get the below error:

—————————————————————————
TypeError Traceback (most recent call last)
in ()
3
4 # load the unpickle object into a variable
—-> 5 numbers_list = pickle.load(list_unpickle)
6
7 print(“Numbers List :: “, numbers_list)

TypeError: a bytes-like object is required, not ‘str’

can you please suggest.
- Saimadhu Polamuri
  4 years ago
  
  Hi Debarati,
  
  Could you please use the “rb” method for reading, instead of using “r”?
  
  Thanks and happy learning!
Randi Griffin
6 years ago

This line threw an error: list_unpickle = open(list_pickle_path, ‘r’). I had to change ‘r’ to ‘rb’ and then it worked.
- Saimadhu Polamuri
  4 years ago
  
  Hi Randi Griffin,
  
  You are correct. As I mentioned in my replies to the other comments, use “rb” instead of “r”.
  
  Thanks and happy learning!
Raj
6 years ago

open the pickle file in rb mode, not r mode
- Saimadhu Polamuri
  4 years ago
  
  Hi Raj,
  
  You are correct, please open the file in rb mode.
  
  Thanks and happy learning!