In machine learning, building pipelines and getting the best out of them is crucial nowadays. It is difficult for a single library to provide every service well, and libraries that do offer many high-performing functions tend to become heavyweight. Steppy is a lightweight library that tries to help us build optimal pipelines. In this article, we are going to discuss the steppy library and look at its implementation for a simple classification problem. The major points are covered in the sections below.
What is Steppy?
Steppy is an open-source Python library for performing data science experiments. The main reason behind developing it was to make experimentation fast and reproducible. It is also lightweight and enables us to build high-performing machine learning pipelines. Its developers aim to let data science practitioners focus on the data instead of software development issues.
Why do we need Steppy?
In the above section, we discussed what steppy is, and from those points we can say that this library provides an environment where experiments are fast, reproducible, and easy. It helps remove difficulties with reproducibility and provides functions that beginners can use as well. The library has two main abstractions with which we can build machine learning pipelines:
- Step: a wrapper that handles the many aspects of building a machine learning pipeline, such as saving a model's results and validating the model during training.
- Transformer: this can be thought of as a complete model that produces output data from input data. It can be a machine learning algorithm, a neural network, or a pre- or post-processing routine.
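To make the transformer contract concrete, here is a minimal sketch in plain Python. The method names mirror the fit/transform pattern described above, but this standalone toy class (`ScalerTransformer` is our own illustrative name) does not import steppy:

```python
# A minimal, standalone sketch of the transformer contract:
# fit() learns from input data, transform() produces an output dictionary.
class ScalerTransformer:
    """Toy pre-processing transformer: centers features by their mean."""

    def fit(self, X):
        # Learn the per-feature means from the training data.
        n_rows = len(X)
        n_cols = len(X[0])
        self.means_ = [sum(row[j] for row in X) / n_rows for j in range(n_cols)]
        return self

    def transform(self, X):
        # Return outputs as a dictionary, the way steppy steps pass data around.
        centered = [[x - m for x, m in zip(row, self.means_)] for row in X]
        return {'X_centered': centered}


transformer = ScalerTransformer().fit([[1.0, 10.0], [3.0, 30.0]])
out = transformer.transform([[2.0, 20.0]])
print(out['X_centered'])  # [[0.0, 0.0]]
```

Returning a dictionary rather than a bare array is the key design choice: it lets downstream steps pick outputs by name.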
A simple implementation can make the intentions behind this library clear, but before that we need to install it, which requires Python 3.5 or above in the environment. If we have that, we can install the library using the following line of code:
!pip3 install steppy
After installation, we are ready to use steppy for data science experiments. Let’s take a look at a basic implementation.
Implementation of Steppy
In this implementation of steppy, we will look at how we can use it for creating steps in a classification task.
Importing data
In this article, we are going to use the sklearn-provided iris dataset, which can be imported using the following lines of code:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
Let's split the dataset into train and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
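As a side note, the stratify=y argument keeps the class proportions of each split the same as in the full dataset. A quick sklearn-only check (independent of steppy) illustrates this:

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Iris has 150 samples, 50 per class; a 70/30 stratified split
# keeps 35 of each class in train and 15 of each class in test.
print(Counter(y_train))  # each of the 3 classes appears 35 times
print(Counter(y_test))   # each of the 3 classes appears 15 times
```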
One thing we need to do while using steppy is to put our data into dictionaries so that the steps we create can communicate with each other. We can do this in the following way:
data_train = {'input': {'X': X_train, 'y': y_train}}
data_test = {'input': {'X': X_test, 'y': y_test}}
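The outer key names a data group that a step later looks up by name, so it must match the step's input_data argument. A small sanity check on the structure (with toy NumPy stand-ins for the split arrays):

```python
import numpy as np

# Toy stand-ins for the train split; only the nested structure matters here.
X_train = np.zeros((105, 4))
y_train = np.zeros(105)

data_train = {'input': {'X': X_train, 'y': y_train}}

# The group name 'input' is what the step will reference.
print(list(data_train.keys()))         # ['input']
print(data_train['input']['X'].shape)  # (105, 4)
```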
Now we are ready to create steps.
Creating a step
In this article, we are going to fit a random forest algorithm to classify the iris data, which means that for steppy we define the random forest as a transformer.
from sklearn.ensemble import RandomForestClassifier
import joblib

from steppy.base import BaseTransformer


class RandomForestTransformer(BaseTransformer):
    def __init__(self):
        self.estimator = RandomForestClassifier()

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X, **kwargs):
        y_pred = self.estimator.predict(X)
        return {'y_pred': y_pred}

    def persist(self, filepath):
        joblib.dump(self.estimator, filepath)

    def load(self, filepath):
        self.estimator = joblib.load(filepath)
        return self
Here we have defined the methods that initialize the random forest, fit and transform data, and save and reload the fitted estimator. Now we can fit the above transformer into a step in the following way:
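Before wiring the transformer into a step, its contract can be exercised on its own. The sketch below uses a plain-Python version of the class (without steppy's BaseTransformer base class, so it runs standalone) and checks that the persist/load round-trip preserves predictions:

```python
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


class RandomForestTransformer:
    """Same contract as above, minus the steppy base class."""

    def __init__(self):
        self.estimator = RandomForestClassifier(random_state=0)

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X, **kwargs):
        return {'y_pred': self.estimator.predict(X)}

    def persist(self, filepath):
        joblib.dump(self.estimator, filepath)

    def load(self, filepath):
        self.estimator = joblib.load(filepath)
        return self


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

transformer = RandomForestTransformer().fit(X_train, y_train)
preds = transformer.transform(X_test)

# persist/load round-trip: the reloaded model predicts identically.
with tempfile.NamedTemporaryFile(suffix='.joblib') as f:
    transformer.persist(f.name)
    reloaded = RandomForestTransformer().load(f.name)
assert (reloaded.transform(X_test)['y_pred'] == preds['y_pred']).all()
```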
from steppy.base import Step

EXPERIMENT_DIR = './ex1'

step = Step(name='RF_classifier',
            transformer=RandomForestTransformer(),
            input_data=['input'],
            experiment_directory=EXPERIMENT_DIR,
            is_fittable=True)
Training pipeline
We can train our defined pipeline using the following line of code.
preds_train = step.fit_transform(data_train)
In the output, we can see which steps were followed to train the pipeline. Let's evaluate the pipeline on the test data.
preds_test = step.transform(data_test)
Here we can see the testing procedure followed by the library. Let’s check the accuracy of the model.
from sklearn.metrics import accuracy_score
acc_test = accuracy_score(data_test['input']['y'], preds_test['y_pred'])
print('Test accuracy = {:.4f}'.format(acc_test))
Here we can see that the results are good, and if you use this library yourself, you will also notice how lightweight it is.
Final words
In this article, we discussed the steppy library, an open-source, lightweight, and easy way to implement machine learning pipelines. We also looked at the need for such a library and used it to create the steps of a pipeline.