Bag of tricks for optimizing machine learning training pipelines

Author

Arseny Kravchenko

M L

At Ntropy, machine learning models are the core of our tech and product, and we spend a significant share of our engineering efforts improving them. Aiming for quicker iterations, we are constantly looking for ways to improve the efficiency of our machine learning pipelines, while keeping the budgets reasonable. In this post, we will share some of the techniques we use to speed up training, improve the machine learning engineer experience, and keep costs under control.

Basic infrastructure

Training deep learning models on large datasets requires adequate hardware. Our primary training infrastructure is based on Google Cloud Platform (GCP) because it offers a variety of GPUs. The cost of GPUs may seem intimidating at first, but it can be quite reasonable under certain conditions: choosing the right instance type and using preemptible (or “spot”) instances when possible.

Our workloads can generally be divided into three categories:

Debugging and prototyping, which we handle using small instances with a single, inexpensive GPU (usually a T4).
Training a regular-sized model, where we typically prefer a single A100 instance for maximum performance in terms of cost per flop (while avoiding the overhead of a multi-GPU setup and achieving the best performance per single GPU).
Training a large model, which requires a multi-GPU setup.

Accordingly, we maintain three instance presets, one for each of these workload types.

Preemptible GCP instances are very important because of pricing. They are significantly (60–91%) cheaper than regular ones, but can be terminated at any time, which can interrupt a training run. Most modern frameworks support resuming training from a checkpoint, so it’s not too difficult to restart the process (e.g. in the HuggingFace training framework it takes a single parameter). However, it’s still a good idea to save checkpoints regularly and to have a way to monitor the training process and restart it from a checkpoint if necessary. Most of Ntropy’s heavy training pipelines support checkpoint recovery through simple command line arguments, such as the --from_checkpoint flag. We don’t have a sophisticated monitoring system that alerts us if the training process is terminated (and, with properly designed pipelines, we shouldn’t have to care), but such alerting can be easily implemented using simpler tools like custom shutdown scripts.
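To make the checkpoint-and-resume idea concrete, here is a minimal sketch using only the standard library. It is not our actual pipeline: the train function, the state layout, and the boolean resume argument (standing in for a flag like --from_checkpoint) are assumptions made for illustration. The one detail worth copying is the atomic write, so that a preemption in the middle of saving can't leave a corrupted checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file first, then rename, so the checkpoint is never half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic when source and target are on the same filesystem


def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)


def train(total_steps: int, checkpoint_path: str, resume: bool, save_every: int = 100) -> dict:
    """Toy training loop (hypothetical) that can continue from a saved state."""
    if resume and os.path.exists(checkpoint_path):
        state = load_checkpoint(checkpoint_path)
    else:
        state = {"step": 0, "loss_sum": 0.0}

    while state["step"] < total_steps:
        # Stand-in for a real optimization step.
        state["loss_sum"] += 1.0 / (state["step"] + 1)
        state["step"] += 1
        if state["step"] % save_every == 0:
            save_checkpoint(checkpoint_path, state)

    save_checkpoint(checkpoint_path, state)
    return state
```

If the instance is preempted, rerunning the same command with resume enabled picks up from the last saved step instead of step 0, at the cost of redoing at most save_every steps.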

One more aspect of lean GPU usage is avoiding human mistakes like “spin up a large instance and forget about it”. We use a simple script that checks the lifetime of active instances and, if any have been running for too long, messages the relevant people in Slack to remind them to shut the instances down.
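The core of such a reminder script can be sketched as below. This is an illustration rather than our script: the Instance record, its fields, and the 24-hour threshold are hypothetical, and the two environment-specific parts (listing instances through the cloud provider and posting the message to Slack) are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Instance:
    # Hypothetical record; in practice this would be filled from the cloud API.
    name: str
    owner: str
    created_at: datetime


def find_stale_instances(
    instances: List[Instance], max_age: timedelta, now: Optional[datetime] = None
) -> List[Instance]:
    """Return instances that have been alive for longer than max_age."""
    now = now or datetime.now(timezone.utc)
    return [inst for inst in instances if now - inst.created_at > max_age]


def format_reminder(stale: List[Instance], now: datetime) -> str:
    """Build the text of the reminder; sending it is left to the caller."""
    lines = ["These instances have been running for a long time:"]
    for inst in stale:
        hours = (now - inst.created_at).total_seconds() / 3600
        lines.append(f"- {inst.name} (owner: {inst.owner}, up for {hours:.0f}h)")
    return "\n".join(lines)


now = datetime(2023, 1, 10, 12, 0, tzinfo=timezone.utc)
instances = [
    Instance("debug-t4", "bob", now - timedelta(hours=3)),
    Instance("train-a100", "alice", now - timedelta(hours=48)),
]
stale = find_stale_instances(instances, max_age=timedelta(hours=24), now=now)
print(format_reminder(stale, now))
```

Keeping the age check as a pure function makes it easy to test without touching any cloud account.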

Finally, one more interesting aspect of our training infrastructure is that we use a multi-cloud setup in practice. As mentioned earlier, GCP is our main vendor for training instances because of cost and the availability of powerful machines, while our default production infrastructure is AWS. This means that sometimes we need to combine the two: e.g. taking data from AWS S3, training on GCP, and finally putting training artifacts back to S3. We use Flyte to orchestrate this process. Flyte is a workflow management system that allows us to define a pipeline as a DAG of tasks. It is useful for us because we can define a pipeline once and run its steps on different machines with different computational resource allocations, and it also provides a nice UI for monitoring the progress of the pipeline. A multi-cloud setup is not a universal recipe: for companies working with huge datasets that compress poorly (e.g. video processing), syncing data between clouds can be a costly bottleneck. For us, though, it works well and lets us use the best of both worlds.

Docker images

We prefer to use Docker images to package our training code and dependencies. While we haven’t dockerized all of our training pipelines yet, it is a goal we are working towards. It allows us to easily reproduce the environment and be confident that we can retrain the model in the future, as well as simplify using different machines — for example, debugging on a cheaper GPU locally and then training on a more powerful instance.

When improving a machine learning pipeline, we often face conflicting requirements — on the one hand, the main training pipeline should be easy to reproduce with minimal effort, but on the other hand, it should be easy to experiment with new ideas. From our experience, we have found it easier to balance these two requirements by using a multitarget Dockerfile. This is a single Dockerfile that can be used to build multiple images, with one image being used purely for training jobs and another more suited for development. The development image is usually larger and includes more dependencies, such as Jupyter, which we include so it can be run on Kubeflow, the platform we use for experiments.

We use GitHub Actions to build Docker images for training pipelines on CI and push them to Google Container Registry (GCR). This means that every time our code changes on the master branch, a new image is built and pushed to GCR, and spinning up a new instance with the “latest” image tag will always give you the latest version of the code. At the same time, it is still possible to use a specific image tag to reproduce the results of a previous experiment. To reduce the overhead of building the image each time, we heavily cache the intermediate layers of the Dockerfile. Additionally, we only push images to GCR if the build is triggered by a push to the master branch or by a specific command in GitHub, so we don’t waste time and resources on images for every single commit.

If you use GitHub Actions and want to follow the same strategy, you can use https://github.com/tj-actions/changed-files to check whether relevant files have changed and build the image only if they have, e.g.:

- name: Check if training files modified
  id: changed-files-training
  uses: tj-actions/changed-files@v34
  with:
    files: |
      project_path/**/*
      some_other_path/**/*
...
- name: Build training image
  # the step id here must match the id of the check step above
  if: steps.changed-files-training.outputs.any_modified == 'true'
  uses: docker/build-push-action@v3
  with:
    builder: ${{ steps.builder.outputs.builder }}
    context: .
    file: ./path/to/Dockerfile
    target: training_job_image
    pull: true
    # push only on push events or when the PR body contains the trigger phrase
    push: ${{ contains(github.event.pull_request.body, 'push_training_image') || github.event_name == 'push' }}

Training pipelines: profiling and optimization

When working with billions of transactions, it is important to make sure that the training pipeline is as efficient as possible. Like many engineering teams, we tend to follow the pattern of “make it work, make it right, make it fast”. Many pipelines are initially drafted quickly and are not necessarily efficient; later, when fast iteration on experiments becomes important, we optimize them to run faster.

There is no silver bullet for optimizing training pipelines, but there are some common practices that can help, and using a profiler is a must. There are several patterns in imperfect training pipelines that are easy to detect and fix:

Using Pandas is good for prototyping, but it can be very slow when used in a training pipeline. Typical training pipelines use a lot of indexing and row-wise operations, and Pandas is not optimized for this. E.g. compare the performance of df.iloc[i] and array[i] to estimate the difference at the scale of many millions of calls. When column-wise operations are needed, we prefer to use polars, an optimized library written in Rust with an API similar to Pandas.

# real optimization example we applied a while ago

from typing import List
import pandas as pd

def fast_isin(small: pd.Series, big: set) -> List[bool]:
    vals = small.values
    mask = set(vals) - big
    return [x not in mask for x in vals]

series = pd.Series(["0", "a", "b", "c", "d", "e", "f"])
big_set = {"a", "b", "c", "d"}
big_set.update(set(range(100000)))

res_a = series.isin(big_set).tolist()
res_b = fast_isin(series, big_set)

assert res_a == res_b

# pandas’ isin brings a lot of overhead, 
# and doesn’t bring any value related to its usual benefits like numpy vectorization

In [2]: %timeit series.isin(big_set).tolist()
5.51 ms ± 46.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit fast_isin(series, big_set)
1.5 µs ± 3.16 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Preprocessing data on the fly instead of precomputing it during the initialization of the pipeline. This matters most when the number of training steps is large. Even very cheap operations, like basic string operations, become expensive when they are repeated on every __getitem__ call instead of being computed once in the __init__ method.
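The difference can be sketched with two toy dataset classes that follow the same __len__/__getitem__ protocol as a PyTorch dataset. Both classes and the normalize function are hypothetical and only illustrate where the work happens; precomputing is only appropriate for deterministic transformations whose results fit in memory.

```python
from typing import List


def normalize(text: str) -> str:
    """Hypothetical cheap preprocessing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class OnTheFlyDataset:
    """Anti-pattern: the same string work is redone on every access, every epoch."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = texts

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, i: int) -> str:
        return normalize(self.texts[i])


class PrecomputedDataset:
    """Preprocessing runs once per row in __init__; __getitem__ is a plain lookup."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = [normalize(t) for t in texts]

    def __len__(self) -> int:
        return len(self.texts)

    def __getitem__(self, i: int) -> str:
        return self.texts[i]
```

With N rows and E epochs, the first variant calls normalize N × E times, while the second calls it N times regardless of how long training runs.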

Cache things when possible. Imagine there is a classic PyTorch dataset class, which is used for both training and validation. Training datasets often use augmentation and other randomized transformations that make their output dynamic, while the output of the validation dataset is usually static. It is a good idea to cache the output of the validation dataset, so it doesn’t have to be recomputed every time.

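def __getitem__(self, i: int) -> dict:
    if self.augmentor is not None:
        # training mode: augmentation is random, so the output can't be cached
        row = self.rows[i]
        return self.preprocessor(self.augmentor(row))

    # validation mode: the output is deterministic, so compute it once and reuse it
    # (assumes self._cache was initialized as [None] * len(self.rows))
    if self._cache[i] is None:
        row = self.rows[i]
        self._cache[i] = self.preprocessor(row)
    return self._cache[i]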

When the data doesn’t fit in memory, reading it naively from disk can still be suboptimal, and specialized data stores can be used to improve performance. For example, for one of the pipelines where we need many embedding reads by key, we built a simple wrapper around lmdb named EmbeDB. It mimics and extends the API of dict and is optimized for reading by key.
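To illustrate the idea of a dict-like store backed by disk, here is a small sketch. It is not EmbeDB and does not use lmdb: to stay dependency-free it swaps in sqlite3 from the standard library, and the class name, table layout, and float32 encoding are all assumptions made for the example.

```python
import sqlite3
from array import array
from collections.abc import Mapping
from typing import Dict, Iterator, List


class DiskEmbeddings(Mapping):
    """Dict-like access to embeddings stored on disk, keyed by string."""

    def __init__(self, path: str) -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (key TEXT PRIMARY KEY, vec BLOB)"
        )

    def write_many(self, items: Dict[str, List[float]]) -> None:
        # Vectors are packed as raw float32 bytes.
        rows = [(k, array("f", v).tobytes()) for k, v in items.items()]
        with self._conn:  # one transaction for the whole batch
            self._conn.executemany(
                "INSERT OR REPLACE INTO embeddings VALUES (?, ?)", rows
            )

    def __getitem__(self, key: str) -> List[float]:
        row = self._conn.execute(
            "SELECT vec FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            raise KeyError(key)
        vec = array("f")
        vec.frombytes(row[0])
        return vec.tolist()

    def __iter__(self) -> Iterator[str]:
        for (key,) in self._conn.execute("SELECT key FROM embeddings"):
            yield key

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]


store = DiskEmbeddings(":memory:")  # pass a file path to persist to disk
store.write_many({"coffee shop": [0.5, 0.25], "ride share": [1.0, -2.0]})
print(store["coffee shop"], "ride share" in store, len(store))
```

Inheriting from Mapping means get, keys, items, and the in operator come for free, so training code can treat the store like an ordinary dict while only the requested vectors are read from disk.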

All these optimizations are easy to implement, but they are also easy to forget, especially when one is rushing to build a new feature or implement a cool state-of-the-art machine learning paper. To avoid this, we periodically use a profiler to detect bottlenecks in the training pipeline. None of these fixes is a game changer on its own, but they can add up to a significant speedup: e.g. one of our pipelines was recently accelerated by 3x by applying a number of small optimizations like these.
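For readers who haven’t profiled a pipeline before, here is a minimal sketch with cProfile from the standard library. The slow_preprocess and run_epoch functions are hypothetical stand-ins for a real data loading step; the post doesn’t say which profiler we use, so treat this as one possible option rather than our setup.

```python
import cProfile
import io
import pstats


def slow_preprocess(text: str) -> str:
    # Deliberately wasteful stand-in for a per-sample transformation.
    for _ in range(200):
        text = text.strip().lower()
    return text


def run_epoch(rows):
    return [slow_preprocess(r) for r in rows]


def profile_call(func, *args, top: int = 5) -> str:
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


report = profile_call(run_epoch, ["  Some Transaction Text  "] * 2000)
print(report)
```

Sorting by cumulative time puts the functions responsible for the most total runtime at the top, which is usually the quickest way to spot a hot __getitem__ or a slow Pandas call.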

Model-level optimizations

We already covered several aspects of training pipeline optimization but didn’t mention the very heart of a deep learning pipeline — the model itself. There are many ways to improve the performance of a model and reduce the number of parameters, but we will focus on two of them that are most relevant to Ntropy’s use case: reusing a shared backbone and embedding pruning.

We have many models serving various needs, and we often find ourselves in a situation where we need to build a new model that is similar to an existing one. In this case, it is a good idea to reuse the existing model’s backbone, which is the part of the model that is responsible for extracting features from the input data. This is a common practice in transfer learning, and it is especially useful when the input data is similar, and so is the problem we solve. For example, we have a model assigning labels for consumer transactions as well as a model estimating the type of account holder (consumer or business). Both models take the same input data, and the only difference is the output layer. In this case, it makes sense to reuse the backbone of the first model and only train the output layer of the second model. This is a very straightforward technique, but it can significantly reduce the number of parameters and training time (and also inference time, but that’s out of scope for today’s post).

# very simplified example
user = context.get_user()
models = [x for x in inference_models if user.has_access(x)]

if models:
    representation = backbone(inputs)  # shared backbone, most computations happen here
    outputs = {model.name: model(representation) for model in models}  # running tiny models on the same representations

Another common technique is embedding pruning, which reduces the number of parameters in the embedding layer of a pretrained model. Like most practitioners, we use pretrained models from the HuggingFace library, which are usually trained on a large corpus of text. These architectures are designed for more generic use cases than ours, so their input preprocessing is built to support a wide variety of texts, which leads to a large number of tokens in the tokenizer and of parameters in the embedding layers. We can reduce the number of parameters by retraining the tokenizer and pruning the embedding layer, removing tokens that are not used in our datasets. This is a very simple technique, but it can significantly reduce the number of parameters and the latency of the model. E.g. the model used for assigning labels to consumer transactions now has only 70% of the parameters it had before embedding layer pruning, and its inference time is reduced by 20% with no effect on the quality of the model.
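The pruning step itself boils down to dropping unused rows of the embedding matrix and remapping token ids. The sketch below shows that bookkeeping in plain Python on toy data. It is a simplification, not our implementation: it only removes unused rows and does not retrain the tokenizer, the special token names are assumptions, and a real version would operate on the tokenizer and embedding tensor of the actual model.

```python
from typing import Dict, List, Sequence, Tuple


def prune_embeddings(
    vocab: Dict[str, int],
    embeddings: Sequence[Sequence[float]],
    corpus_token_ids: Sequence[Sequence[int]],
    always_keep: Sequence[str] = ("[PAD]", "[UNK]"),
) -> Tuple[Dict[str, int], List[List[float]], Dict[int, int]]:
    """Keep only embedding rows for tokens seen in the corpus (plus special tokens)."""
    used_ids = {tid for sample in corpus_token_ids for tid in sample}
    used_ids.update(vocab[tok] for tok in always_keep if tok in vocab)

    # Keep the original relative order so the id mapping is deterministic.
    kept_old_ids = sorted(used_ids)
    old_to_new = {old: new for new, old in enumerate(kept_old_ids)}

    new_embeddings = [list(embeddings[old]) for old in kept_old_ids]
    new_vocab = {tok: old_to_new[tid] for tok, tid in vocab.items() if tid in old_to_new}
    return new_vocab, new_embeddings, old_to_new


vocab = {"[PAD]": 0, "[UNK]": 1, "coffee": 2, "zebra": 3, "uber": 4}
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]]
corpus = [[2, 4], [4, 1]]  # "zebra" never appears in our data

new_vocab, new_embeddings, old_to_new = prune_embeddings(vocab, embeddings, corpus)
print(new_vocab)
print(len(embeddings), "->", len(new_embeddings), "rows")
```

Any token that was pruned has to be mapped to the unknown token at inference time, which is why it is worth checking on held-out data that the removed tokens really are unused.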

Finally, we recently started to use our big unlabeled datasets for self-supervised pretraining. There is no need to reinvent the wheel: we use the masked language model approach, which is a popular technique for pretraining language models. This is a very powerful technique. It requires a lot of compute resources and time up front, but it can significantly improve the performance on the downstream task and its convergence speed. We are still experimenting with this technique, but we are already seeing some promising results.

E.g. the usual pipeline for training a text classification model looks like this:

public pretrained model => fine-tune on custom labeled dataset

We moved to the following pipeline:

public pretrained model => self-supervised pretraining on unlabeled dataset => fine-tune on custom labeled dataset

While the newly added step takes a lot of time and compute resources, it’s not something we need to perform every time we retrain the model. At the same time, both the convergence speed of the fine-tuning step and the model performance improve significantly, so the extra step pays off once we run tens of experiments.
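For readers unfamiliar with the masked language model objective, the data preparation behind it can be sketched in a few lines. This is a simplified illustration, not our training code: the mask token name and the masking probability are assumptions, and the sketch always substitutes the mask token, whereas the original BERT recipe sometimes keeps or randomizes the selected token.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = "[MASK]"  # assumed name of the tokenizer's mask token


def mask_tokens(
    tokens: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], List[Optional[str]]]:
    """Hide a random subset of tokens; labels hold the originals at masked positions."""
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)  # the model is trained to predict this token
        else:
            inputs.append(tok)
            labels.append(None)  # unmasked positions don't contribute to the loss
    return inputs, labels


rng = random.Random(0)
tokens = "uber trip help.uber.com san francisco ca".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, rng=rng)
print(inputs)
print(labels)
```

Because the labels come from the text itself, this step needs no manual annotation, which is what lets it run on the large unlabeled dataset.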

Conclusion

We have covered a lot of ground in this post, but hopefully, it was useful and interesting. We are still improving our pipelines in many aspects, and this post only covers a small part of our work. In the future, we will continue to share our experience and best practices (spoiler alert: a bag of tricks on machine learning inference at scale is planned), so stay tuned!
