19 Apr 2022

Customized Transaction Classification



Ntropy Team


For a notebook version, please check out our Colab tutorial:

For an example of the absolute minimum code needed to create a model, please see our quickstart tutorial:

This post was authored by Ntropy’s Head of Product.

Figure 1: Overview of the Ntropy customization benefits.

In this post, we will walk through how to build custom models using Ntropy. In doing so, we will also present our own findings on the effectiveness of customization, measured across a variety of benchmarks and conditions. Finally, we will end with a real-world case study, in which one of our customers was able to achieve a +12% 🚀 increase in accuracy over using our core models alone!

Let’s square a few things away. Ntropy customization is NOT a direct mapping from the labels in our core models, but rather a separate model built on top of our core model that can be customized for individual sets of labels and transactions. In a future post we will explain why a label mapping is insufficient, but for now we hope the numbers will convince you of this fact. Customized models are also not trained from scratch, but instead adapted from our core models; this combines the generality and robustness of Ntropy categorization with the accuracy of user specification.

Here’s what the Ntropy customization endpoint is:

  • Fast and accurate.
  • Open-ended. You can create models for classic categories like debt and revenue, or you can try more niche categories like freelancer gig revenue.
  • Self-maintainable. You create your model once, and we’ll handle all performance updates due to core model changes on our end. When the core models improve, your custom models will automatically improve.
  • Lightweight. Once a model is trained, you can easily access it using its model ID. This allows you to train as many models as you like, with little to no overhead. Furthermore, models are compact and stored server-side.
  • Accessible. You do not need a background in Machine Learning or a dedicated MLOps team to understand and deploy models. You just need to feed it good transaction data, and we’ll handle the rest.

The other details and nuances we will handle during the tutorial. Let’s dive in.

Ntropy Customization Step-by-Step Guide

For a notebook version, please check out our Colab:

Broadly, the process for creating models will look something like this:

  1. Data Exploration [Optional]
  2. Creating your hierarchy/set of categories [Optional]
  3. External, internal, or Ntropy labeling [Optional]
  4. Train, test, and deploy your model. If necessary, return to step 2

If you already know the set of categories you want, and what transactions for those categories look like, you can skip the first two steps. Step 3 is required if you have no labeled data. One option for gathering labels is to use the Ntropy labeling team, which is handled on a per-request basis for customers. We have a vetted internal team of financial experts that can provide this service free of charge, as part of the deployment process.

Before starting, we’ve made public a small, synthetic dataset to use for testing. We will use this throughout the tutorial, and it can be downloaded from S3 at

1. Data Exploration [Optional]

Quite frequently, users come to us and don’t yet have a clear picture of what they want, other than wanting a deeper understanding of transactions. In these cases, the default is to use the outputs of our core classification model.

Instead, we’ll go one step further, and use the core model as an exploratory tool. We can use the outputs of the core model to get a rough idea of the distribution of labels in our dataset, and from there, narrow down where to look. For this part, we can use the existing Ntropy API to “enrich” a list of transactions (see the Colab for code). Enrichment is our term for passing a transaction through all of our services, including merchant extraction and categorization. Upon enrichment, your data will look something like this:

Table 1. A snapshot of the result of Ntropy enrichment.

The labels field presents the ground truth categories for each transaction. Since we are still in the exploratory phase, just pretend that that column doesn’t exist yet.

The model_predictions field is the output of our core transaction classifier. This is the main field we will use for exploration; however, there are four other outputs from Ntropy enrichment (merchant, website, person, and location) that will be important as well.

The first thing we will do is get a feel for the data distribution. Let’s plot the top 10 most common categories.

Fig. 2. The top most frequent labels assigned by the Ntropy core model to our sample dataset.

We see that, for the most part, these fall into about three categories: food and drink, software, and inventory & vendor payments. Broadly, the two most common topics in the data are software and food and drink.

Next, let’s take a look at the tail end of the distribution.
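If you want to reproduce this kind of head-and-tail look outside the Colab, a minimal sketch might look like the following. The records are made up, and we are assuming for illustration that model_predictions holds a single label string per transaction; adapt the field access to whatever your enriched data actually contains.

```python
from collections import Counter

# Hypothetical enriched records. Only `description` and `model_predictions`
# are assumed here; real enrichment output carries more fields.
enriched = [
    {"description": "SYSCO FOODS", "model_predictions": "vendor payment"},
    {"description": "GITHUB", "model_predictions": "software"},
    {"description": "CHIPOTLE 1234", "model_predictions": "food and drink"},
    {"description": "AWS", "model_predictions": "software"},
    {"description": "US FOODS", "model_predictions": "inventory"},
    {"description": "STARBUCKS", "model_predictions": "food and drink"},
    {"description": "RYPE, INC.", "model_predictions": "education"},
]


def label_distribution(records, field="model_predictions"):
    """Count how often each predicted label occurs, most frequent first."""
    counts = Counter(r[field] for r in records)
    return counts.most_common()


distribution = label_distribution(enriched)
top_10 = distribution[:10]   # head of the distribution
tail = distribution[-3:]     # rare labels worth inspecting by hand

for label, count in top_10:
    print(f"{label:>15}  {count}")
```

The head tells you where most of your volume sits; the tail is where edge cases and candidate new categories tend to hide.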

Table 2. Some examples of low-frequency categories.

In this dataset, it’s assumed that we will see a bunch of software and food and drink transactions, for whatever reason (maybe the transactions belong to startups or restaurant owners). The next most common categories were vendor payments and inventory, which are semantically related in the Ntropy hierarchy.

Looking through the vendor payment and inventory transactions, we start to notice a clear trend. These transactions all seem to be food distributors! Likewise, there appears to be a clear pattern amongst them. The amounts are all debits between about $5,000 and $50,000. This is a perfect candidate for a new category!
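To pull out transactions like these for manual review, a simple filter is usually enough. This is only a sketch: the field names (model_predictions, entry_type, amount) and the sample rows are assumptions for illustration, not the exact shape of the enrichment output.

```python
def candidate_transactions(records, labels, min_amount, max_amount,
                           entry_type="outgoing"):
    """Return records whose predicted label, entry type, and amount all match."""
    return [
        r for r in records
        if r["model_predictions"] in labels
        and r["entry_type"] == entry_type
        and min_amount <= r["amount"] <= max_amount
    ]


# Made-up rows for illustration only.
rows = [
    {"description": "SYSCO FOODS", "model_predictions": "vendor payment",
     "entry_type": "outgoing", "amount": 12500.00},
    {"description": "US FOODS", "model_predictions": "inventory",
     "entry_type": "outgoing", "amount": 48000.00},
    {"description": "OFFICE DEPOT", "model_predictions": "vendor payment",
     "entry_type": "outgoing", "amount": 230.00},
    {"description": "SUPPLIER REFUND", "model_predictions": "inventory",
     "entry_type": "incoming", "amount": 9000.00},
    {"description": "GITHUB", "model_predictions": "software",
     "entry_type": "outgoing", "amount": 7000.00},
]

food_distributor_candidates = candidate_transactions(
    rows, labels={"vendor payment", "inventory"},
    min_amount=5_000, max_amount=50_000,
)
```

Treat the result as a review queue rather than a labeled set; a human still needs to confirm each candidate.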

At this point, we can solidify our 4 categories — food distributors, food and drink, software, and cybersecurity — and proceed to dataset construction. However, it’s worth pointing out one more thing. We can continue to look at the tail end of the distribution, and hunt for edge cases. For example, consider the following transaction:

Description: RYPE, INC. Entry type: outgoing Amount: 79.99 USD.

RYPE, INC. is a language learning software product, which seems to fall equally under both education and software. For any given schema, ambiguous transactions like these will always appear. Finding borderline cases like these is critical to tuning our model correctly. When we construct our schema, we have to decide whether the software label should or should not include these types of transactions.

In this case, we will make the interpretation that, yes, this qualifies as software. However, you could imagine another scenario where we may be interested only in software that aids on the developer side, but not software that aids on the business side. The choice is yours, and that’s the beauty of custom models!

2. Creating your datasets [Optional]

It says optional, but the reality is that you will likely need to alter (or at the very least inspect) your current datasets if you want to get great performance.

What makes a good dataset? Let’s continue building the one in the previous section and we will see how the thought process works.

1. After exploration, create a temporary label mapping from Ntropy’s core model to the categories that you are interested in. In this case, for food distributors, we would look at transactions that were labeled as inventory, vendor payment, or food and drink. Note that the Ntropy label food and drink can map to more than one thing at this stage (food distributors plus food and drink).

2. Find clean, unambiguous, informative transactions first. Such examples give the model full signal to learn from. A good transaction will have:

  • (a) recognizable merchant correctly identified by Ntropy enrichment (see the merchant column)
  • (b) correct entry types and amounts. This is absolutely critical, and needs to be double checked when aggregating transaction sources.
  • (c) optionally a clear pattern such as (FOREIGN TX FEE) if (a) does not hold.

3. Find as many edge cases as you can. The previous example of Rype, Inc. is an excellent demonstration. We need to feed the model as many borderline cases as we can, in order to tilt it into the direction that we want.

4. Make a test set! This is a separate set of data that’s used to gauge how good your model is. You can skip this step if you like, but the procedure is the same as for the training set, so it costs little extra. Just make sure there is no overlap between train and test sets.
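The temporary mapping from step 1 can be as simple as a dictionary. The sketch below is illustrative: the mapping entries (including sending software to both software and cybersecurity) and the model_predictions field name are our assumptions for the running example, and the output is a set of review pools, not final labels.

```python
# Temporary, hand-written mapping from core-model labels to candidate custom
# categories. A core label may point at several candidates at this stage.
CANDIDATE_MAP = {
    "inventory": ["food distributors"],
    "vendor payment": ["food distributors"],
    "food and drink": ["food distributors", "food and drink"],
    "software": ["software", "cybersecurity"],
}


def build_review_pools(records, candidate_map):
    """Group transactions by the custom categories they might belong to."""
    pools = {}
    for r in records:
        for category in candidate_map.get(r["model_predictions"], []):
            pools.setdefault(category, []).append(r)
    return pools


# Made-up records for illustration only.
records = [
    {"description": "SYSCO FOODS", "model_predictions": "vendor payment"},
    {"description": "CHIPOTLE 1234", "model_predictions": "food and drink"},
    {"description": "OKTA", "model_predictions": "software"},
    {"description": "DELTA AIR", "model_predictions": "travel"},
]

pools = build_review_pools(records, CANDIDATE_MAP)
```

Each pool is then hand-checked: clean examples go into the dataset for that category, and borderline ones get an explicit decision.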

And that’s it. You should be able to make your categories. In our running example, this means we would have chosen food distributors, food and drink, software, and cybersecurity. Finally, you can repeat the same procedure for constructing a test set, to gauge your performance.
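One way to carve out a test set without train/test overlap is sketched below. The deduplication key (description, entry type, amount), the label field name, and the 20% test fraction are assumptions chosen for illustration; pick whatever identifies a unique transaction in your data.

```python
import random
from collections import defaultdict


def _key(record):
    """Identity of a transaction for overlap checks (an assumed definition)."""
    return (record["description"], record["entry_type"], record["amount"])


def split_train_test(records, test_fraction=0.2, seed=0):
    """Per-label random split, after dropping exact duplicates."""
    rng = random.Random(seed)
    unique = {}
    for r in records:
        unique.setdefault(_key(r), r)  # keep the first copy of each transaction

    by_label = defaultdict(list)
    for r in unique.values():
        by_label[r["label"]].append(r)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Hold out at least one example per label when the label has several.
        n_test = max(1, round(len(group) * test_fraction)) if len(group) > 1 else 0
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Made-up labeled data: 10 examples per label plus one exact duplicate.
sample = [
    {"description": f"MERCHANT {i}", "entry_type": "outgoing", "amount": 10.0 + i,
     "label": "software" if i % 2 else "food and drink"}
    for i in range(20)
]
sample.append(dict(sample[0]))

train, test = split_train_test(sample)
overlap = {_key(r) for r in train} & {_key(r) for r in test}
```

Splitting per label keeps rare categories represented in both sets, which matters when you only have a handful of examples per class.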

🚨 Warning: the Other and Not Enough Information categories

Customization is incredibly flexible, but there are two special cases of categories that you need to be aware of (beyond categories that are obviously poor, like special transactions). These are the Other and Not Enough Information categories.

The first can be considered an actual category. When labeling a transaction as Other, the assumption is that there IS enough information for the model to label the transaction, but that it is not one of the things we care about. These categories can be challenging, as they need more data for the model to learn. In future versions we will expose models that are optimized for this situation. For now, to keep things simple, let’s just supply examples of the Other category.

The second category, Not Enough Information, is a default category that triggers when… there is not enough information in the transaction. Currently, this is not yet supported by the API, but will be available in future versions.

4. Training, Testing, and Deploying

The training step is pretty easy, and is best shown in the quickstart Colab and the docs. For demonstration purposes, we show just how little code is required.

Code 1. The minimal code needed to train and evaluate a custom Ntropy model

Benchmarking, Case Studies, and Looking Under the Hood

Until now we’ve only described how to use the API. In this section, we will take the time to discuss how the models work behind the scenes, as well as demonstrate performance across a number of tests and benchmarks that we’ve run. As a result, we hope that you will gain a greater understanding of the performance you can expect, under what circumstances, and what steps you can take to make it better.

Machine Learning Black magic

There’s a famous quote by the physicist Isaac Newton: “If I have seen further it is by standing on the shoulders of giants.” It is taken to mean that no progress occurs in isolation; it builds on what came before. Our customization model is no different. When you train a customized model, you are not training anything from scratch, but instead training on top of our core model. This means there are several things you don’t need to worry about:

  • Custom models have access to the same databases and APIs as our core models do.
  • Custom models inherit general transaction knowledge from our core models.
  • Preprocessing and handling of input is all taken care of. Additional metadata can also be included to further customize.
  • Since the Custom models work off of our core model, they can succeed even in the low-data regime (~16 samples per class, or even fewer).
  • Any updates to our core model will automatically propagate to custom models. This means that over time your model will improve, even if you do nothing!

In terms of how we make this happen, we are bringing to production advances in the Machine Learning subfield of Meta-Learning, which can cutely be described as “Learning to learn”. There are essentially two parts to this. The first part is a model architecture choice, where we teach our models to adapt quickly. This part is handled on our end, and for all intents and purposes can be thought of as black magic. The second part involves massive multi-task learning, whereby we train our models on a wide variety of tasks, each of which can contribute some amount of information to the others. It’s a realization of something we’ve worked towards at Ntropy since founding: a Data Network. More importantly for the user, the takeaway is this:

Each custom model increases our core model’s understanding of transactions, which in turn raises the performance of all custom models.

Given that we understand a bit more about how the models work, let’s take a look at some actual experiments that quantify the previous claims.

Experiments, Benchmarking, and Performance

Before diving into the figures, we need to first decide what, exactly, we actually want to test. In previous blog posts (and more so in a future post), we spent a great deal of time explaining why transaction classification is difficult, when it can succeed, and when it fails. However, there is one crucial piece that we did not discuss, and it renders those discussions partially academic.

How much data does a classifier need to learn from?

That’s the million dollar question. As it turns out, understanding this problem is closely related to understanding which categories are good, which are ambiguous, and which are devoid of information. The next point is so important, I’m going to section it off and toss in emojis.

💡💡💡 There are generically two types of categories that exist, and it is absolutely critical to understand which type of category you have when thinking about performance.

The first are lower-level categories, that are well-defined simple things like gas station purchases and bowling alleys.

The second are higher-level categories, that carry with them human bias, and may require logic to understand. An example would be calling dinner with a client customer acquisition costs, as opposed to restaurant spend.

It shouldn’t be too hard to guess that lower-level categories are learned more quickly than higher-level ones. Indeed, that’s exactly what our results show. Before discussing our experiments, there’s one final point to clarify. When thinking about performance, we also need to think about groups of labels. Let’s say you have 4 well-defined labels like revenue, operating costs, loans, and rent. If you artificially condense these into two categories (revenue + operating costs) and (loans + rent), you’ve suddenly made the model much more difficult to train, as you now need to associate two very different things (revenues and operating costs) as belonging to the same category. We will discuss how learning performs below.


We test our Customization endpoint across 6 different tasks, each testing a different type of problem to learn. In Table 3, we present our results across each of the 6 datasets, and display the mean accuracy averaged across 20 random seeds, the total number of training samples, and the number of classes (categories) in the dataset.

Table 3. Performance of Ntropy customization across 6 different benchmarks. See text below for details of each dataset.

Keyword dataset: this consists of 59 transactions that each contain one of three keywords, Direct deposit, Online, or Recurring, somewhere within the transaction. We chose the keywords to contain potentially relevant information to categorization, but also to be sufficiently nondescript as to require learning. Results show that this task performs worst, for several reasons. First, the pattern is not semantic, but rather syntactical, whereas we have optimized our models to perform best on semantic tasks. Second, learning this pattern requires rewiring the attention patterns, and 59 examples is not enough data for that.

  • Key takeaway: when making a category that spans multiple Ntropy categories (e.g. Online could appear in many transactions with different labels), make sure to either have a unique pattern or distinguishing name. Otherwise, you’ll need a lot of data.

Person dataset: this consists of 764 transactions across all categories, where the goal is to detect whether or not a person exists in the transaction. Note that this model is trained independently of our named-entity extraction model, and so there is no leakage.

  • Key takeaway: customization can perform surprisingly well (82.4% accuracy) on tasks totally unrelated to what it was trained on. Feel free to try out other adventurous binary categorization tasks!

Similar dataset: this was constructed by gathering ~5000 transactions whose labels are covered by our hierarchy, and then assigning them to one of 6 categories: debt, revenue, operating expenses, financial services, tax, and other. We are testing how well our customized models can condense the 188 labels in our general hierarchy, into a digestible 6 categories. We will have more to say when we look at the training curves. For now, we just note that this (and the following dataset) are “higher-level” tasks.

  • Key takeaway: the best results are obtained when the customized set of labels is a subset of the Ntropy labels.

Dissimilar dataset: this is very similar to the Similar dataset, but now instead of creating condensed, consistent categories, we group together disparate categories into unnatural chunks. The chunks are (1) not_enough_information, (2) customer acquisition cost + government + insurance + revenues and inflows, (3) fees + employee spend + facilities + investment + gifts, (4) cost of goods sold + financial services + personnel, (5) tools + professional services + insurance + intellectual property + infrastructure. For this type of dataset, a clustering approach (where each new label is assigned to the cluster whose representative is most similar) would fail, whereas our approach succeeds, albeit with a slightly lower score than the Similar dataset.

  • Key takeaway: you can group together disparate labels into the same group, but you will need to feed more data to the model.

Novel dataset: part of this dataset is publicly available online, and is what we used in the first half of this post. Here we created new categories outside of our hierarchy, and tested how quickly the model could adapt to something it’s never seen before. This falls into the “lower-level” learning category, and the results confirm our intuition.

  • Key takeaway: for well-defined, clean examples of a never-before-seen class, our customization models can rapidly adapt, and reach high accuracies.

Case Study: here we present the results of a real-world case study with one of our beta users. The dataset consists of 21 classes spread across 12,558 data points. Of particular note, this dataset is extremely noisy (40% of entries are labeled incorrectly in the training set).

  • Key takeaway: noise can severely hinder performance, but even for very high levels of noise (40% error rate), we can achieve reasonably good accuracies (~75%) with enough data. Exploring noise robustness is on the roadmap for future releases!

Let’s now take a look at training dynamics; in particular, we will focus on 3 of those datasets.

Ok cool 👍, but how much data do I really need 🧐?

Of course, more data is (usually) better. But time is finite, and we all want to know: how much do we really need? The answer depends critically on whether you are learning lower or higher level categories. To investigate, let’s look at performance as a function of two variables: amount of data and amount of noise.

In Figure 2, we show the results of customization training for one lower-level dataset (Novel) and two higher level ones (Similar and Dissimilar).

Figure 2. Performance of our customized models as a function of data size, for varying levels of noise. Experiments were run with 20 random seeds. Lines represent the mean values, and shading the standard deviation.

Immediately, we notice a few things:

  • The lower-level dataset can learn rapidly. With just 6 samples per class we can already reach 90% accuracy!
  • The higher-level dataset Similar, which condenses 188 categories into 6, can reach 90% accuracy within 34 samples per class.
  • The higher-level dataset Dissimilar, which condenses 188 categories into 5 disparate buckets, reaches 90% accuracy at 185 samples per class, which is more than 5 times as many as the Similar dataset needs. This confirms our hypothesis that learning more complex patterns (grouping dissimilar elements) requires more data. What’s interesting is that, given enough data, the overall performances are within a couple of points of each other.

Another, maybe less interesting point, is that performance decreases with noise level, as expected. 10% noise is relatively tolerable, but beyond that things start to become quite difficult to learn.
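If you run your own data-size sweep, the bookkeeping behind numbers like these is short. The sketch below assumes you have already collected one accuracy per (samples per class, seed) pair; the values shown are invented for illustration and are not the results from this post.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Turn {samples_per_class: [accuracy per seed]} into sorted (n, mean, std)."""
    return [
        (n, mean(accs), stdev(accs) if len(accs) > 1 else 0.0)
        for n, accs in sorted(runs.items())
    ]


def samples_to_reach(summary, target=0.90):
    """Smallest samples-per-class whose mean accuracy meets the target, if any."""
    for n, avg, _ in summary:
        if avg >= target:
            return n
    return None


# Invented accuracies (two seeds per size), purely to show the mechanics.
example_runs = {
    2: [0.60, 0.70],
    6: [0.90, 0.92],
    16: [0.95, 0.96],
}

summary = summarize_runs(example_runs)
needed = samples_to_reach(summary, target=0.90)
```

Reporting the standard deviation alongside the mean matters most at small data sizes, where which few examples you happened to pick can swing accuracy considerably.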

As a side note, we remark that noise was introduced by first sampling latent confusion matrices for each class, and then randomly flipping classes. This way, the model cannot as easily detect noise, since the error rates have correlated patterns. This models actual worker errors much better than white noise.

Ok, so how is it that our models are able to learn so quickly, and so well? We’ve spoken a lot about how the customization models are built on top of our core model, but we can actually show what this looks like. Below are TSNE-produced 2-dimensional embeddings of the transactions in our training data before beginning customization.
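For readers who want to stress-test their own datasets in a similar way, here is a rough sketch of class-dependent label noise. It is not our exact procedure: drawing each class’s confusion row from gamma-distributed weights (a Dirichlet draw) and the concentration value are assumptions made for this example.

```python
import random


def sample_confusion_rows(classes, rng, concentration=1.0):
    """For each class, draw a random distribution over the *other* classes."""
    rows = {}
    for c in classes:
        others = [o for o in classes if o != c]
        # Normalized gamma draws give a Dirichlet-distributed probability row.
        weights = [rng.gammavariate(concentration, 1.0) for _ in others]
        total = sum(weights)
        rows[c] = (others, [w / total for w in weights])
    return rows


def add_label_noise(labels, classes, noise_rate, seed=0):
    """Flip each label with probability noise_rate, using its class's row."""
    rng = random.Random(seed)
    rows = sample_confusion_rows(classes, rng)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            others, probs = rows[y]
            noisy.append(rng.choices(others, weights=probs, k=1)[0])
        else:
            noisy.append(y)
    return noisy


classes = ["debt", "revenue", "tax", "other"]
clean = classes * 250
noisy = add_label_noise(clean, classes, noise_rate=0.4, seed=1)
flipped = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
```

Because each class has its own preferred wrong answers, the errors are correlated rather than uniform, which is closer to how a human labeler confuses, say, debt with financial services more often than with revenue.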

Figure 3. Initial TSNE embeddings before customization. Colors correspond to different classes. The classes are already quite separated in this space.

Different colors correspond to the different ground truth classes. What we see is that our core categorization models are already exceedingly good at understanding transactions, and bucketing them. This makes the classifier’s job much easier, and also explains why borderline cases are so important. Our models know enough about what’s going on to know that they should be confused. For a fun visual aid, we’ve also attached a GIF showing how these hidden states adapt over the course of training, ultimately becoming more clustered (note this is just for fun, and is not created using TSNE like above).

Figure 4. Evolution of a 2-dimensional bottleneck state over continuous training. Every 50 epochs we add more data to the training set, which is what results in the jitters.

Real world case study, putting it all together

Finally, let’s discuss a real-world case study. The biggest issue when deploying in the wild is that we can no longer guarantee quality controls on the data, so the main thing we need to contend with is label noise. For a practitioner, the natural question to ask is:

At what point does Customization make more sense than just making a direct label mapping from Ntropy labels?

Below, we present our results from the Case Study dataset, which consists of 12,558 transactions, 21 categories, and a noise rate of 40% of training data points mislabeled.

Figure 5. Real world case study showing customization in action. Training runs are averaged over 20 random seeds for data sizes below 100 samples per class, and 2 seeds above.

For reference, we’ve provided what we call an optimized label mapping, whereby we use the training data to construct a mapping from Ntropy labels to user labels, that will maximize accuracy. Note, this sometimes produces some really questionable mappings, like payroll mapping to investment equity.
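To make the baseline concrete, one way to build such a mapping is a majority vote: for each Ntropy label, pick the user label it co-occurs with most often in the training data. For a mapping that assigns a single user label to each Ntropy label, this choice maximizes training accuracy. The sketch below uses invented labels and is meant as an illustration of the idea rather than our exact implementation.

```python
from collections import Counter, defaultdict


def fit_label_mapping(ntropy_labels, user_labels):
    """Map each Ntropy label to the user label it most often co-occurs with."""
    votes = defaultdict(Counter)
    for src, dst in zip(ntropy_labels, user_labels):
        votes[src][dst] += 1
    return {src: counts.most_common(1)[0][0] for src, counts in votes.items()}


def mapping_accuracy(mapping, ntropy_labels, user_labels, default=None):
    """Fraction of transactions whose mapped label equals the user label."""
    correct = sum(
        mapping.get(src, default) == dst
        for src, dst in zip(ntropy_labels, user_labels)
    )
    return correct / len(user_labels)


# Invented training pairs: (core-model label, user-provided label).
ntropy = ["payroll", "payroll", "payroll", "saas", "saas"]
user = ["opex", "opex", "equity", "tools", "tools"]

mapping = fit_label_mapping(ntropy, user)
train_accuracy = mapping_accuracy(mapping, ntropy, user)
```

With noisy labels, a vote like this can lock in a questionable pairing whenever the wrong user label happens to dominate for a given Ntropy label, which is exactly the failure mode noted above.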

The main thing to note is the crossover point between the dashed line and the black curve. At just 6 samples per class, even with 40% noise, it becomes more advantageous to use the customization endpoint over a direct label mapping.

With the full training set, we find a +12% 🚀 increase in accuracy overall, which is quite a significant improvement on what is otherwise unusable data.

Parting Thoughts and Future Work

If you’ve made it this far, the first thing you should do is head over to and check out the job board, because we’re hiring and would love to have you on the team!

Besides that, if you haven’t already, you should check out our Colab, and actually try this out for yourself.

Though we’re stoked to launch our customization endpoint, the work is not done. We’ve touched on a few of these already, but there are several areas we are actively working on for future releases, in order of priority:

  • Support for Other and Not enough information categories
  • Large dataset training > 50,000 samples
  • Noise robust training
  • Borderline case detection
  • Customization for merchant and website identification
  • Customization for named entity recognition

Thanks for reading, we can’t wait to see what you will build 😄!

P.S.: the number of Kanye references in the publicly available data is 2, try to find them! 😝
