Machine Learning Pipelines at Ntropy
At Ntropy (we’re hiring) we are currently in the process of evaluating different machine learning pipeline platforms. In this article we will take a look at Elyra, a framework that makes it easy to create pipelines and run them on existing pipeline platforms (Kubeflow Pipelines and Apache Airflow as of writing).
Why Machine Learning Pipelines
When creating new Machine Learning models we often start experimenting in Jupyter notebooks. Usually, these notebooks fetch some data, do some feature extraction and feature engineering, train a model and evaluate it. This works very well for prototyping, but when it comes to productionizing models there are several disadvantages.
Every time we make a change in the notebooks, we need to run them by hand and copy the resulting artifacts, such as trained models, to where they are needed. Furthermore, we usually don’t automatically store any history of our notebook runs, and doing so would require extra work. This makes it easy to lose results and lose track of a model’s history.
To tackle some of these issues we can create machine learning pipelines. The pipelines consist of several steps that are usually run in sequence. Whenever there is a change in one step, we can rerun the entire pipeline. Alternatively, the pipeline could run automatically whenever there is a change, or perhaps periodically. Almost all pipeline systems also automatically store logs and any other results and artifacts, so it is easy to keep track of them when we make changes later.
Kubeflow is a popular Machine Learning platform that provides, among other features, Jupyter notebook servers and pipelines. Creating a pipeline is done with the Kubeflow Pipelines SDK, which involves writing code and is not necessarily an easy process.
Elyra allows you to use Jupyter notebooks as pipeline steps with a simple visual editor. Files can easily be passed on to steps by specifying output files in each step. In the end, Elyra creates a pipeline that we can run without any extra code required.
Elyra makes it easy even for data scientists without much technical knowledge to create machine learning pipelines that can be used in production.
Getting started with Elyra on an existing Kubeflow installation is relatively easy. The only thing we need to do is create a new notebook server in Kubeflow and use Elyra’s Docker Hub image elyra/kf-notebook. Note that the resources we choose here aren’t terribly important as we can specify the resources for individual pipeline steps later.
To start learning how to use Elyra we will create a pipeline for training a customized transaction classification model which we recently released. Check out our post about it here.
Customized transaction classification pipeline
Our pipeline will consist of three steps:
- Fetch the training and test data
- Train and save a customized transaction classification model
- Evaluate the model and create visualizations
Each of the steps will have a corresponding notebook.
The full code is also available on GitHub: https://github.com/ntropy-network/ML-tools/tree/master/elyra
Step 1 — Fetching the data
Our example data contains transactions and their corresponding labels (e.g. “software”, “food distributors”). It sits in a public S3 bucket, so we will use the wget command to download it and store it under the data directory. Later, when we create the pipeline, we will have to tell Elyra to pass these files on to the next steps.
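The download step can be sketched in plain Python instead of shelling out to wget. This is a minimal sketch; the bucket URL below is a placeholder, not the real one (the actual URLs are in the linked GitHub repository).

```python
import os
import urllib.request

def local_path(url: str, dest_dir: str) -> str:
    """Map a file URL to its path under the destination directory."""
    return os.path.join(dest_dir, url.rsplit("/", 1)[-1])

def fetch(url: str, dest_dir: str = "data") -> str:
    """Download a file into dest_dir (roughly: wget -P data <url>)."""
    os.makedirs(dest_dir, exist_ok=True)
    path = local_path(url, dest_dir)
    urllib.request.urlretrieve(url, path)
    return path

# Hypothetical bucket URL -- substitute the real one from the repository:
# fetch("https://example-bucket.s3.amazonaws.com/train.csv")
# fetch("https://example-bucket.s3.amazonaws.com/test.csv")
```

The files then land under `data`, which is exactly the directory we will later declare as this step’s output so Elyra passes it on.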
Step 2 — Model training
Here we use the data we fetched in the previous step and train a customized transaction classification model using the Ntropy SDK. First, we have to install the SDK and some other dependencies. Each step has its own dependencies so we will need to install the required dependencies in the following steps too.
Next, we can load the training data which we downloaded in the previous step.
Loading the training data we downloaded in the first step
Now we can train a `CustomTransactionClassifier` on it with the Ntropy SDK. After training, we save the model to the `artifacts` directory. We will have to tell Elyra about this directory later again to pass it on to the next step.
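As a rough sketch of what this notebook does: first parse the CSV into descriptions and labels, then hand them to the classifier. The column names `description` and `label` are assumptions for illustration, not the real schema, and the SDK calls in the comments are a sketch rather than the exact API.

```python
import csv

def load_training_data(lines):
    """Parse transaction descriptions and labels from CSV lines.
    The column names 'description' and 'label' are assumed, not the real schema."""
    descriptions, labels = [], []
    for row in csv.DictReader(lines):
        descriptions.append(row["description"])
        labels.append(row["label"])
    return descriptions, labels

# In the notebook (sketch, not the exact Ntropy SDK API):
# with open("data/train.csv", newline="") as f:
#     descriptions, labels = load_training_data(f)
# model = CustomTransactionClassifier(...)
# model.fit(descriptions, labels)
# model.save("artifacts/model")  # `artifacts` is the directory Elyra passes on
```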
Step 3 — Evaluation and visualization
Now that we have a trained model we can evaluate it on our test data.
Loading the test data, loading the trained model and creating predictions for the test data
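Once we have ground-truth labels and predictions side by side, the confusion matrix is just a tally of (truth, prediction) pairs. A minimal sketch, assuming labels are plain strings:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (truth, prediction) pairs -- the raw data behind a confusion matrix."""
    return Counter(zip(y_true, y_pred))

counts = confusion_counts(
    ["software", "software", "food distributors"],
    ["software", "food distributors", "food distributors"],
)
```

These counts are what we feed into Kubeflow’s confusion-matrix viewer in the next part.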
We can now create visualizations that will be displayed in the pipeline run later with Kubeflow’s built-in visualizations. All we need to do is create a JSON file called mlpipeline-ui-metadata.json with the schema as described in the documentation and it will be picked up automatically.
In this example, we choose to visualize our test predictions with a table containing all the descriptions, ground truth labels, and predictions as well as a confusion matrix.
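Writing the metadata file can look roughly like the following. The field names follow the Kubeflow Pipelines v1 output-viewer schema for inline tables and confusion matrices; check Kubeflow’s documentation for the authoritative schema, and note that the table header names here are our own choice.

```python
import json

def write_ui_metadata(rows, labels, counts, path="mlpipeline-ui-metadata.json"):
    """Write Kubeflow's viewer metadata: a predictions table plus a confusion matrix.

    rows   -- list of [description, truth, prediction] triples
    labels -- list of all class labels
    counts -- mapping of (truth, prediction) -> count
    """
    table_csv = "\n".join(",".join(r) for r in rows)
    cm_csv = "\n".join(f"{t},{p},{n}" for (t, p), n in counts.items())
    metadata = {
        "outputs": [
            {
                "type": "table",
                "storage": "inline",
                "format": "csv",
                "header": ["description", "truth", "prediction"],
                "source": table_csv,
            },
            {
                "type": "confusion_matrix",
                "storage": "inline",
                "format": "csv",
                "schema": [
                    {"name": "target", "type": "CATEGORY"},
                    {"name": "predicted", "type": "CATEGORY"},
                    {"name": "count", "type": "NUMBER"},
                ],
                "labels": labels,
                "source": cm_csv,
            },
        ]
    }
    with open(path, "w") as f:
        json.dump(metadata, f)
    return metadata
```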
Creating the pipeline with Elyra
Now that we have the notebooks for our individual pipeline steps we can create our pipeline that ties it all together.
Creating the pipeline file
We can create a pipeline within the Jupyter notebook environment by opening the launcher and choosing one of the pipeline types. For us, either Generic or Kubeflow pipelines will work.
Once created, we can see the visual editor for our pipeline. We can drop our notebooks from the file browser onto the pipeline editor which will create pipeline steps for them. They can then be connected together as desired.
We can right-click steps to change their properties. The interesting ones here for us are
- Runtime image: Docker image that will be used for this step. Instead of installing our dependencies within the notebooks we could also create docker images with them already installed.
- CPU/GPU/RAM: Resources for the pipeline step
- Output files: Local paths to the output files of the step. These will be available in later steps. In our example, the data-fetching step outputs train and test files and the training step outputs a model file.
Running the pipeline
Now that we have finished creating our pipeline we can run it. To do that, we need to tell Elyra about our pipeline environment by creating a new Kubeflow Pipelines runtime configuration:
- Kubeflow Pipelines: where the pipeline will run; this depends on your Kubeflow setup
- Cloud Object Storage: where the pipeline will store its artifacts; any S3-compatible service should work
More information can be found in Elyra’s docs.
Elyra runtime configuration
Once this is done we can run our pipeline. If successful, this will create a new Kubeflow pipeline (or create a new version for an existing one) and also start a run.
Viewing our results
Once the run has started we can view it in Kubeflow’s UI, where we can see the state of the individual steps and view their logs.
Inspecting individual steps’ logs and results
In our evaluation step, we can see the visualizations we created earlier.
Finally, all of our artifacts such as the model itself are stored in the object storage specified in Elyra’s runtime configuration.
Automating pipeline runs
Let’s say we now have more training data. We could run the pipeline again by hand to generate an updated model, but we could also have it run automatically, for example once per week.
Fortunately, Kubeflow already provides this functionality. In Kubeflow all we have to do is create a run for our pipeline and set the run-type to “Recurring”.
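The same thing can be done programmatically with the Kubeflow Pipelines SDK. This is a sketch: the host, experiment and pipeline ids, and the exact client arguments depend on your cluster, and the job name below is our own invention.

```python
# KFP cron expressions use 6 fields (with seconds): Mondays at 03:00
cron_expression = "0 0 3 * * 1"

def schedule_weekly_retrain(client, experiment_id, pipeline_id):
    """Create a recurring run via a kfp.Client (sketch; ids are assumptions)."""
    return client.create_recurring_run(
        experiment_id=experiment_id,
        job_name="weekly-retrain",  # hypothetical name
        cron_expression=cron_expression,
        pipeline_id=pipeline_id,
    )

# client = kfp.Client(host="https://<your-kubeflow-host>/pipeline")
# schedule_weekly_retrain(client, experiment_id, pipeline_id)
```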
We successfully created a machine learning pipeline using Elyra. We created notebooks for fetching data, training a model on it, and finally evaluating the model and creating visualizations. Then all we had to do was wire them up with Elyra and we could run it on our Kubeflow pipeline environment.
Hopefully, this gave you an idea of what pipelines can do and shows how easy it is to create them with the help of Elyra.