Technical

27 Apr 2022

Using Elyra to create Machine Learning pipelines on Kubeflow


Author

Robin Kahlow

ML


Machine Learning Pipelines at Ntropy

At Ntropy (we’re hiring) we are currently in the process of evaluating different machine learning pipeline platforms. In this article we will take a look at Elyra, a framework that makes it easy to create pipelines and run them on existing pipeline platforms (Kubeflow Pipelines and Apache Airflow as of writing).


Why Machine Learning Pipelines

When creating new Machine Learning models we often start experimenting in Jupyter notebooks. Usually, these notebooks fetch some data, do some feature extraction and feature engineering, train a model and evaluate it. This works very well for prototyping, but when it comes to productionizing models there are several disadvantages.

Every time we make a change to the notebooks, we need to run them by hand and copy the resulting artifacts, such as trained models, to wherever they are needed. Furthermore, we usually don’t automatically store any history of our notebook runs, and doing so would require extra work. It is easy to lose results and hard to keep track of a model’s history.

To tackle some of these issues we can create machine learning pipelines. The pipelines consist of several steps that are usually run in sequence. Whenever there is a change in one step, we can rerun the entire pipeline. Alternatively, the pipeline could run automatically whenever there is a change, or perhaps periodically. Almost all pipeline systems also automatically store logs and any other results and artifacts, so it is easy to keep track of them when we make changes later.


Kubeflow Pipelines

Kubeflow is a popular Machine Learning platform that provides, among other features, Jupyter notebook servers and pipelines. Pipelines are created with the Kubeflow Pipelines SDK, which involves writing code and is not necessarily an easy process.


Elyra

Elyra allows you to use Jupyter notebooks as pipeline steps with a simple visual editor. Files can easily be passed on to subsequent steps by specifying output files in each step. In the end, Elyra creates a pipeline that we can run without any extra code required.

Machine learning pipeline in Elyra — notebooks wired together (image from https://elyra.readthedocs.io/en/latest/user_guide/pipelines.html)

Elyra makes it easy even for data scientists without much technical knowledge to create machine learning pipelines that can be used in production.


Setup

Getting started with Elyra on an existing Kubeflow installation is relatively easy. The only thing we need to do is create a new notebook server in Kubeflow using Elyra’s Docker Hub image elyra/kf-notebook. Note that the resources we choose here aren’t terribly important, as we can specify resources for individual pipeline steps later.

Setting up a notebook on Kubeflow with Elyra’s Docker image

To start learning how to use Elyra we will create a pipeline for training a customized transaction classification model which we recently released. Check out our post about it here.


Customized transaction classification pipeline

Our pipeline will consist of three steps:

  1. Fetch the training and test data
  2. Train and save a customized transaction classification model
  3. Evaluate the model and create visualizations

Each of the steps will have a corresponding notebook.

File structure of our project

The full code is also available on GitHub: https://github.com/ntropy-network/ML-tools/tree/master/elyra

Step 1 — Fetching the data

Our example data contains transactions and their corresponding labels (e.g. “software”, “food distributors”). It sits in a public S3 bucket, so we will use the wget command to download it and store it under the data directory. Later, when we create the pipeline, we will have to tell Elyra to pass these files on to the next steps.

Fetching the train and test data from S3
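A minimal sketch of what this cell might look like (the bucket URL and file names below are placeholders, not the actual ones used in the example):

```python
# Download the example train and test CSVs into the data/ directory.
# NOTE: the bucket URL and file names are illustrative placeholders.
!mkdir -p data
!wget -q -O data/train.csv https://example-bucket.s3.amazonaws.com/train.csv
!wget -q -O data/test.csv https://example-bucket.s3.amazonaws.com/test.csv
```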

Step 2 — Model training

Here we use the data we fetched in the previous step and train a customized transaction classification model using the Ntropy SDK. First, we have to install the SDK and some other dependencies. Each step has its own dependencies so we will need to install the required dependencies in the following steps too.

Installing dependencies for our pipeline step
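In the notebook this is just a pip cell, roughly along these lines (the exact package list beyond the Ntropy SDK is an assumption):

```python
# Install the Ntropy SDK and the other libraries this step needs.
# The package list beyond ntropy-sdk is an assumption for illustration.
!pip install --quiet ntropy-sdk pandas
```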

Next, we can load the training data which we downloaded in the previous step.

Loading the training data we downloaded in the first step
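For example, with pandas (the column names `description` and `label` are assumptions used throughout these sketches):

```python
import pandas as pd

# Read the training data downloaded in step 1.
# The column names ("description", "label") are illustrative assumptions.
train_df = pd.read_csv("data/train.csv")
train_df.head()
```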

Now we can train a `CustomTransactionClassifier` on it with the Ntropy SDK. After training, we save the model to the `artifacts` directory. We will have to tell Elyra about this directory later again to pass it on to the next step.

Training a customized transaction classification model with Ntropy SDK
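The cell below is a rough sketch of that flow, continuing from the previous one. It is not the verified Ntropy SDK API: the import path, constructor, and `fit` signature are assumptions, and the model is persisted with pickle purely for illustration.

```python
import os
import pickle

# ASSUMPTION: the import path and the constructor/fit signature below are
# illustrative, not the verified Ntropy SDK API; see the SDK docs for the
# real calls. An Ntropy API key would normally be configured here as well.
from ntropy_sdk import CustomTransactionClassifier

classifier = CustomTransactionClassifier()
classifier.fit(train_df["description"], train_df["label"])

# Persist the trained model to artifacts/ so Elyra can pass it to step 3.
os.makedirs("artifacts", exist_ok=True)
with open("artifacts/model.pkl", "wb") as f:
    pickle.dump(classifier, f)
```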

Step 3 — Evaluation and visualization

Now that we have a trained model we can evaluate it on our test data.

Loading the test data, loading the trained model and creating predictions for the test data
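Continuing the same sketch (the file names, pickle format, and `predict` signature all carry over from the assumptions made in step 2):

```python
import pickle

import pandas as pd

# Load the test data from step 1 and the trained model from step 2.
test_df = pd.read_csv("data/test.csv")
with open("artifacts/model.pkl", "rb") as f:
    classifier = pickle.load(f)

# Predict labels for the test transactions (assumed signature, as in step 2).
test_df["prediction"] = classifier.predict(test_df["description"])
```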

We can now create visualizations that will be displayed in the pipeline run later with Kubeflow’s built-in visualizations. All we need to do is create a JSON file called mlpipeline-ui-metadata.json with the schema as described in the documentation and it will be picked up automatically.

In this example, we choose to visualize our test predictions with a table containing all the descriptions, ground truth labels, and predictions as well as a confusion matrix.

Creating Kubeflow visualizations for our predictions by writing to mlpipeline-ui-metadata.json
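The file follows the Kubeflow Pipelines output-viewer schema. Below is a sketch of how the evaluation notebook might assemble it, reusing the columns from the earlier sketches; inline storage keeps the data inside the JSON itself.

```python
import json

# Inline table of descriptions, ground-truth labels and predictions.
table_csv = test_df[["description", "label", "prediction"]].to_csv(
    header=False, index=False
)

# Confusion matrix as (target, predicted, count) rows.
labels = sorted(test_df["label"].unique())
cm_rows = [
    f"{t},{p},{int(((test_df['label'] == t) & (test_df['prediction'] == p)).sum())}"
    for t in labels
    for p in labels
]

metadata = {
    "outputs": [
        {
            "type": "table",
            "storage": "inline",
            "format": "csv",
            "header": ["description", "label", "prediction"],
            "source": table_csv,
        },
        {
            "type": "confusion_matrix",
            "storage": "inline",
            "format": "csv",
            "schema": [
                {"name": "target", "type": "CATEGORY"},
                {"name": "predicted", "type": "CATEGORY"},
                {"name": "count", "type": "NUMBER"},
            ],
            "labels": [str(l) for l in labels],
            "source": "\n".join(cm_rows),
        },
    ]
}

# Kubeflow picks this file up automatically from the step's working directory.
with open("mlpipeline-ui-metadata.json", "w") as f:
    json.dump(metadata, f)
```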

Creating the pipeline with Elyra

Now that we have the notebooks for our individual pipeline steps we can create our pipeline that ties it all together.

Creating the pipeline file

We can create a pipeline within the Jupyter notebook environment by opening the launcher and choosing one of the pipeline types. For us, either Generic or Kubeflow pipelines will work.

Creating a new pipeline in Jupyter notebook’s launcher

Once created, we can see the visual editor for our pipeline. We can drop our notebooks from the file browser onto the pipeline editor which will create pipeline steps for them. They can then be connected together as desired.

We can right-click steps to change their properties. The interesting ones for us are:

  • Runtime image: Docker image that will be used for this step. Instead of installing our dependencies within the notebooks we could also create docker images with them already installed.
  • CPU/GPU/RAM: Resources for the pipeline step
  • Output files: Local paths to the output files of the step. These will be available in later steps. In our example, the data-fetching step outputs the train and test files and the training step outputs a model file.

Creating the pipeline with Elyra

Running the pipeline

Now that we finished creating our pipeline we can run it. To do that we need to tell Elyra something about our pipeline environment by creating a new Kubeflow pipeline runtime configuration.

Kubeflow Pipelines: where the pipeline will run; this depends on your Kubeflow setup

Cloud Object Storage: where the pipeline will store its artifacts; any S3-compatible service should work

More information can be found in Elyra’s docs.

Elyra runtime configuration

Once this is done we can run our pipeline. If successful, this will create a new Kubeflow pipeline (or create a new version for an existing one) and also start a run.

Viewing our results

Once the run has started, we can view it in Kubeflow’s UI. We can see the state of the individual steps and view their logs.

Inspecting individual steps’ logs and results

In our evaluation step, we can see the visualizations we created earlier.

Visualizations being displayed in the step’s visualizations tab

Finally, all of our artifacts such as the model itself are stored in the object storage specified in Elyra’s runtime configuration.

Pipeline run data stored on S3

Automating pipeline runs

Let’s say we now have more training data. We could manually run the pipeline again to generate an updated model, or better yet, have the pipeline run automatically once per week.

Fortunately, Kubeflow already provides this functionality. In Kubeflow all we have to do is create a run for our pipeline and set the run-type to “Recurring”.

Creating an automatic weekly run of our pipeline
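The same schedule can also be created programmatically with the Kubeflow Pipelines SDK instead of the UI. A sketch, assuming the KFP v1 SDK and with the endpoint, experiment name, and pipeline ID as placeholders:

```python
import kfp

# Connect to the Kubeflow Pipelines API (the endpoint is a placeholder).
client = kfp.Client(host="http://<kubeflow-pipelines-endpoint>")

experiment = client.create_experiment("transaction-classifier")

# Run the pipeline every Monday at 03:00 (KFP v1 uses 6-field cron expressions).
client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="weekly-retrain",
    cron_expression="0 0 3 * * 1",
    pipeline_id="<pipeline-id-created-by-elyra>",
)
```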

Conclusion

We successfully created a machine learning pipeline using Elyra. We created notebooks for fetching data, training a model on it, and finally evaluating the model and creating visualizations. Then all we had to do was wire them up with Elyra and we could run it on our Kubeflow pipeline environment.

Hopefully, this gave you an idea of what pipelines can do and shows how easy it is to create them with the help of Elyra.
