While tech companies have embraced Machine Learning (ML) for critical tasks in the past decade, running ML pipelines in production is still a pain point. It’s a cross-team effort that often starts in notebooks and simple scripts, which are then expected to run efficiently and robustly in production. The tale of productionizing ML models or pipelines is a long one; from models to code, data and dependencies, there are a lot of ways (and tools) to streamline the ML development lifecycle, but in the end it’s always necessary to deploy the models, provision the essential infrastructure and monitor the system.
At the core of the Ntropy API there are multiple large ML models that process all incoming transactions. Some of our customers use the API in real time for time-critical processes such as approving bank transactions, and need to enrich millions of transactions per day. These use-cases bring latency, throughput and reliability requirements that make it challenging to run our own production-scale ML system.
In this post we will delve into how we provisioned our ML system to optimize for low latency while maintaining reasonable costs. We encourage you to follow our future blog posts, as this is the first article of a series that will cover the details of our production system — infrastructure, deployment, monitoring, backend, data management and so on.
To better understand our system it is important to have some context on what we do. The core of Ntropy’s system is processing bank transactions. We receive raw transaction data from customers (descriptions, dates, amounts, etc.) and run the transactions through a complex pipeline of large transformer models, database searches and various other operations. The system outputs structured information about each transaction such as the merchant name and transaction category. Although our system has more functionalities, the bulk of compute is spent in processing transactions, and other features built on top of the extracted data.
Our service is provided as a REST API served by a stateless backend server. All the transaction processing happens on a separate worker pool. In the diagram below you can get an idea of the components that a worker must execute to process a single transaction:
Each component requires one or more models to run, and all of these must run in sequence since each operates on the outputs of the previous model. The whole pipeline should take at most 200ms per transaction. The M1..Mn models are a variable subset of models that may run for each transaction depending on which additional predictions are needed, such as the MCC code of the transaction or insights at the account holder level (e.g., whether a transaction is recurring according to previously submitted transactions for that account).
While the focus of this article is the ML infrastructure component of our production system, as you can see above, the ML portion is closely linked to our server and worker system. As such, in some sections of this article we will provide some details about the rest of the system to contextualize our decisions and thought process.
One of the core principles we follow is to avoid over-engineering and premature optimization at all costs. A good start to avoiding over-engineering is clearly defining both the requirements you are going to address with a particular system and the ones you are not trying to address. Internally, we do this through design documents that require a clear specification of the requirements and non-requirements for any particular system, which must then be reflected in the solution.
For our ML system, besides the standard requirements for this type of system, we’ve focused on the following:
- Latency: as real use-cases were built on our system, it became clear that many of our customers process transactions in a time-critical manner, for example during transaction authorization. Each model execution should take at most 20ms. This ensures that, in the average case, we still have approximately half of the 200ms budget per transaction left for database operations and any other processing.
- Scalability: the system must adapt to varying levels of usage, including both continuous throughput during the day and significant spikes in demand for certain use-cases (e.g., processing large amounts of historical data). It should be able to scale up and down accordingly.
- Pricing: in order to provide competitive pricing to our customers, given the significant amount of compute needed per transaction, we must be as efficient as possible with our ML infrastructure without compromising on any of the other requirements.
- Reliability: our models are critical for our API, which is used in key services of companies around the globe. Hence, high availability is key to providing a reliable service that can be part of critical decision processes. The system should target 99.95% uptime.
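To make the latency and reliability targets concrete, here is a small back-of-the-envelope sketch in Python. The 200ms pipeline budget, the 20ms per-model limit and the 99.95% uptime target come from the text above. The count of five sequential model executions is an assumption inferred from the "approximately half" remark, not a measured figure.

```python
# Back-of-the-envelope numbers for the requirements above.
PIPELINE_BUDGET_MS = 200      # from the text: max time per transaction
MODEL_BUDGET_MS = 20          # from the text: max time per model execution
UPTIME_TARGET = 0.9995        # from the text: 99.95% uptime

# ASSUMPTION: five sequential model executions in the average case.
ASSUMED_MODELS_PER_TRANSACTION = 5


def remaining_budget_ms(n_models: int) -> int:
    """Time left for database operations and other processing."""
    return PIPELINE_BUDGET_MS - n_models * MODEL_BUDGET_MS


def allowed_downtime_minutes(uptime: float, period_days: float) -> float:
    """Maximum downtime within a period that still meets the uptime target."""
    return (1.0 - uptime) * period_days * 24 * 60


if __name__ == "__main__":
    print(remaining_budget_ms(ASSUMED_MODELS_PER_TRANSACTION))          # 100
    print(round(allowed_downtime_minutes(UPTIME_TARGET, 30), 1))        # 21.6
    print(round(allowed_downtime_minutes(UPTIME_TARGET, 365) / 60, 2))  # 4.38
```

In other words, a 99.95% target leaves roughly 22 minutes of downtime per 30 days, or a little over four hours per year.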
In the actual journey to build our system, we went through several iterations, always trying to match our needs with the requirements that were understood at the time. What we present in this article is the current iteration (surely not the last) of the system.
As the backbone of all our deployments we use Kubernetes (k8s), specifically a managed deployment on AWS EKS. EKS simplifies cluster deployment and upgrades, as well as the provisioning of new nodes, while letting us retain control over the system. We use k8s since it fulfills all our requirements and, most importantly, we have plenty of in-house experience with it.
Our backend server is implemented in Python, with some performance-critical code written in Rust, and most of our code is asynchronous where possible. The worker pool is also Python based, implemented with Celery, and shares a lot of code with the backend. We use Redis as both the broker and the result storage for Celery.
Each model we run in production is the output of a training pipeline. It is serialized to the ONNX format and made available in our main repository (we use git lfs for keeping artifacts in our repository, allowing them to be easily synchronized with code). This is particularly useful since we build our container images to deploy to production on our CI/CD pipelines, allowing us to containerize the models and generate reproducible images with all available code and models.
For the ML infrastructure of our production models, we chose to utilize a pre-existing service, but deploy and manage it ourselves. Particularly, we use a model inference server in nodes with GPUs. By separating model inference from the rest of the computation we can optimize the resource usage, not wasting CPU during inference and ensuring the GPUs we are paying for are actively used for what they’re optimized for.
All our models are placed in a version-controlled model repository that is replaced with the newer versions every time the models are updated. The model repository is embedded in the container image of the inference server which is swapped on every deployment, ensuring all the components are kept in sync. The inference service is then exposed internally within the cluster for any component that requires model predictions.
You can take a look at the simplified view of our transaction processing system below:
There are many available tools for model serving, each with a different focus. Some concentrate on ease of use and standardization (BentoML), others on ease of deployment (KServe, Cortex) or on deep integration with a particular framework (TorchServe, TensorFlow Serving), and so on.
As you can see in the image, we opted for Triton Inference Server from Nvidia. The main reasons were:
- Out of all the inference servers mentioned, it was the fastest one for our use case, when properly configured
- It supports model ensembles (pipelines of models) which are more efficient than running two models in a row from code
- It has many smaller features that are useful for us, such as dynamic batching, concurrent model execution and inference priority
- The project is developed and maintained by Nvidia, providing assurance for its long-term maintenance and compatibility with hardware
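Dynamic batching is the feature in this list that is easiest to illustrate in isolation: the server holds incoming requests for a short time and groups them, so the model runs once per batch rather than once per request. The sketch below mimics the idea with `asyncio`. It is not Triton's implementation, and `DynamicBatcher`, `max_batch_size`, `max_delay_s` and the toy `fake_model` are hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List


class DynamicBatcher:
    """Toy dynamic batcher: groups requests arriving close together in time."""

    def __init__(self, model: Callable[[List[str]], Awaitable[List[str]]],
                 max_batch_size: int = 8, max_delay_s: float = 0.005) -> None:
        self.model = model
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for inspection only

    async def infer(self, item: str) -> str:
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self) -> None:
        """Collect requests until the batch is full or the delay expires."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One model invocation serves every request in the batch.
            outputs = await self.model([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def fake_model(batch: List[str]) -> List[str]:
    """Stand-in for a model call; one invocation handles the whole batch."""
    await asyncio.sleep(0.001)
    return [text.upper() for text in batch]


async def demo() -> tuple:
    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_delay_s=0.05)
    runner = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer(f"tx{i}") for i in range(8)))
    runner.cancel()
    return results, batcher.batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs trade latency for throughput: a larger `max_delay_s` produces fuller batches but adds waiting time to every request, which matters when each model execution has a budget of only a few milliseconds.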
Choosing a higher-ownership solution to deploy your models, as we did, is cheaper and provides more control, but it doesn’t come with batteries included.
To scale the Triton deployment we use the k8s horizontal auto-scaler with a combination of inference server queue sizes and average GPU utilization as metrics. In simple terms, scaling horizontally means increasing the throughput of the system by increasing the number of instances, whereas scaling vertically means increasing the capability of existing instances to increase throughput. The horizontal auto-scaler continuously monitors these metrics and requests a new Triton instance when the metrics go above certain defined thresholds.
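The scaling decision itself follows the rule documented for the Kubernetes Horizontal Pod Autoscaler: for each metric, the desired replica count is `ceil(currentReplicas * currentValue / targetValue)`, and when several metrics are configured the largest result wins. The sketch below reproduces that arithmetic in Python. The metric names, target values and replica bounds are made-up examples, not our production thresholds, and the real autoscaler additionally applies a tolerance band and stabilization windows that are omitted here.

```python
import math
from typing import Dict


def desired_replicas(current_replicas: int, current: Dict[str, float],
                     target: Dict[str, float], min_replicas: int = 2,
                     max_replicas: int = 10) -> int:
    """Apply the HPA formula per metric and keep the largest proposal."""
    proposals = [
        math.ceil(current_replicas * current[name] / target[name])
        for name in target
    ]
    # Clamp to the configured bounds (HYPOTHETICAL defaults: 2 to 10 replicas).
    return max(min_replicas, min(max_replicas, max(proposals)))


if __name__ == "__main__":
    # HYPOTHETICAL targets: 60% average GPU utilization, 10 queued requests.
    target = {"gpu_utilization": 60.0, "queue_size": 10.0}
    # GPU utilization is on target, but the queue is three times too long.
    current = {"gpu_utilization": 60.0, "queue_size": 30.0}
    print(desired_replicas(2, current, target))  # 6
```

Combining a queue-based metric with GPU utilization means either signal can trigger a scale-up on its own, which is useful when a burst of requests fills the queue before utilization averages catch up.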
When you deploy a service that scales automatically you need a method of load balancing to ensure that the workloads are distributed equally between all nodes that are online at any given time. Since by default we use the gRPC API of Triton, the standard load balancer provided by k8s doesn’t work correctly. It works at the connection level, but gRPC uses long-lived connections, meaning that existing workers will not connect to any Triton nodes that are dynamically scheduled.
We introduced an additional service to load balance properly: Istio (other methods work too). We add an Istio sidecar to each Triton pod and then define an Istio service to route requests between the available Triton pods. This load balancing works at the application layer, meaning that workers connect to the Istio service instead of the Triton pods directly. In turn, Istio routes each request to a Triton pod in the pool while being aware of any new instance that is added, ensuring that we can seamlessly scale Triton up and down in real time.
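The difference between the two balancing modes can be shown with a small simulation. In the sketch below, which is a toy model and not Istio or Kubernetes code, each worker opens one long-lived connection. With connection-level balancing the backend is picked once, when the connection is opened; with request-level balancing it is picked for every request. The pod names and the `simulate` helper are hypothetical.

```python
from collections import Counter
from itertools import cycle
from typing import List


def simulate(mode: str, workers: int = 4, requests_per_phase: int = 100) -> Counter:
    """Count requests per pod before and after a third pod is added."""
    pods: List[str] = ["triton-0", "triton-1"]
    served: Counter = Counter()

    # Connection-level: each worker is pinned to a pod when it connects.
    pinned = [pods[i % len(pods)] for i in range(workers)]

    def send(n: int) -> None:
        rr = cycle(pods)  # request-level round robin over the pods currently online
        for i in range(n):
            if mode == "connection":
                served[pinned[i % workers]] += 1
            else:
                served[next(rr)] += 1

    send(requests_per_phase)   # phase 1: two pods online
    pods.append("triton-2")    # the auto-scaler adds a pod
    send(requests_per_phase)   # phase 2: existing connections stay open
    return served


if __name__ == "__main__":
    print(simulate("connection"))  # triton-2 never receives a request
    print(simulate("request"))     # triton-2 takes a share of phase 2
```

In the connection-level run the new pod sits idle because no worker ever reconnects, which is exactly the situation that makes scaling up pointless without application-layer routing.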
To guarantee high availability of our Triton deployment we always run more than one Triton instance in on-demand compute nodes. We use a simple script to regularly check the health of each Triton instance through k8s health probes, and k8s automatically attempts to resolve the issue if an instance is deemed unhealthy. This setup is pretty standard for any Kubernetes service and we thought it would be enough. However, we had two instances of brief downtime when particular availability zones in AWS had networking issues or went down, which taught us a valuable lesson: always spread your instances over different availability zones if you’re chasing multiple 9s of uptime. We also added the capability for our workers to fall back to running the models locally if Triton is out of reach, so that the system slows down but doesn’t stop completely if our ML infrastructure goes down temporarily.
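The worker-side fallback can be sketched as a thin wrapper around the inference call. Everything named here is hypothetical: `remote_infer` stands in for a call to the inference service, `local_infer` for running the same model inside the worker, and the choice of `ConnectionError` and `TimeoutError` as the signals that Triton is out of reach is an assumption, since a real client library raises its own exception types.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger("worker")

Predict = Callable[[Sequence[str]], List[str]]


def predict_with_fallback(batch: Sequence[str], remote_infer: Predict,
                          local_infer: Predict) -> List[str]:
    """Prefer the inference service; run the model locally if it is unreachable."""
    try:
        return remote_infer(batch)
    except (ConnectionError, TimeoutError) as exc:
        # Slower path: keeps transactions flowing while the service recovers.
        logger.warning("inference service unreachable (%s), running locally", exc)
        return local_infer(batch)


if __name__ == "__main__":
    def unreachable(batch: Sequence[str]) -> List[str]:
        raise ConnectionError("no route to inference service")

    def local(batch: Sequence[str]) -> List[str]:
        return [f"local:{text}" for text in batch]

    print(predict_with_fallback(["coffee shop"], unreachable, local))
```

Only connectivity failures trigger the fallback in this sketch; errors caused by a bad input would fail locally as well, so they are left to propagate.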
Putting it all together, we get this zoomed-in diagram of the structure of our Triton deployment:
The correct choice for your company
As you’ve seen, Ntropy runs entirely on cloud infrastructure. There are many ways to accomplish the same task, even within the same cloud provider. For example, if you want to serve your backend service in AWS, your options range from fully serverless function engines (Lambda) to serverless compute engines (Fargate), managed cluster services (EKS) or self-managed virtual / bare metal compute nodes (EC2), and everything in between.
The right decision for the cloud infrastructure varies from company to company, and what works for us might not work for you. A good starting point for any choice in cloud infrastructure is to consider it as a two-dimensional plane of ownership vs cost. The less ownership you have, the more expensive the solutions tend to be (managed is generally more expensive than self-hosted). Having less ownership has both pros and cons. Running managed services requires less in-house expertise and fewer resources to run and maintain a service, but it limits configurability and access to low-level features which might be key. Eventually, you will need to consider other factors that matter when you have a specific use case in mind (e.g., the latency of each solution).
The same rule of thumb applies to ML infrastructure. On AWS alone, the options range from services that provide fully managed models (e.g. Forecast, Rekognition) to managed platforms for deploying your own models (e.g. SageMaker). Or, on the other end of the spectrum, you can roll your own solution, deploying the models on CPU or GPU instances provisioned by EC2, as we do. Within this framework you should strongly consider two factors before making any choice:
- An extensive look at your requirements
- Your in-house expertise and experience with existing tools
You can then optimize your choice to fit your own needs. Do you have a small team to manage the ML systems, and are your use-cases standard? You should probably go for a fully managed solution. However, do you have an experienced DevOps team and require full control of your ML system? Then you might want to follow an approach closer to what we described here.
Moving ML models from development notebooks to production isn’t trivial. However, a lot of tooling has appeared in recent years to help with it, ranging from managed solutions to self-hosted ones. Working out what suits each company best requires an understanding of both the use-case at hand and internal limitations. We suggest you consider the resources you have available when choosing between ownership and cost, and that you do not design systems for requirements that don’t exist yet.
For our system we went with a Triton inference server and a version-controlled model repository, deployed with k8s on AWS GPU instances. We scaled, monitored and made the system reliable using standard k8s tools, and used Istio for application-level load balancing.
We hope we have shed some light on one of the many ways you can serve your models in production, as well as the thought process behind some of these choices. Stay tuned for more posts on our production system in the near future. If we’ve piqued your interest or you’re interested in using our API, you can visit our website at ntropy.com.