24 May 2022

The False Promise of General Transaction Categorization and the inadequacy of in-house models


Ntropy Team



Yes, for a company that provides transaction categorization, we know how controversial our title sounds. We’re not saying you can’t build a really good general transaction classifier (we can and we have!), but the truth is, no matter what you do, you can’t build a solution that will satisfy all people at all times.

“Fine,” you might say, “then we’ll just build a transaction classifier in-house.” Owning a pipeline end-to-end, with the ability to tweak and tune as you see fit, is enticing. Unfortunately, in-house models are inadequate and brittle. Transaction categorization is inherently a long-tail problem. If I were to hand you 1 million transactions, you might find that 500,000 of those belong to the same 300 companies, but the other 500,000 belong to 100,000 different companies. This means you have two options:

  1. Build something quick and easy that hits those top 300 companies. You’ll have an impressive model that quickly understands half of your transaction data, but understanding 75% of your data could take years to achieve.
  2. Identify transaction patterns and write in rules to predict on the long tail. What might seem like a strong internal model can fall apart precipitously as data distributions change. And, the reality is that we live in an interconnected economy that aggregates financial data across disparate sources. Insights are found when looking at transactions across different times, geographies, accounts, aggregators, and industries.
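
To make the long tail concrete, here is a minimal sketch using the illustrative numbers above. It assumes, purely for simplicity, that volume is spread evenly within the 300 head companies and within the 100,000 tail companies; real distributions are messier, and the function name is our own invention.

```python
# Illustrative numbers from the example above.
HEAD_MERCHANTS, HEAD_TXNS = 300, 500_000
TAIL_MERCHANTS, TAIL_TXNS = 100_000, 500_000


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def merchants_needed(target_txns: int) -> int:
    """Merchants a model must recognise to cover `target_txns` transactions.

    Assumption (for illustration only): transactions are spread evenly
    across merchants within the head and within the tail.
    """
    if target_txns <= HEAD_TXNS:
        return ceil_div(target_txns * HEAD_MERCHANTS, HEAD_TXNS)
    extra = target_txns - HEAD_TXNS
    return HEAD_MERCHANTS + ceil_div(extra * TAIL_MERCHANTS, TAIL_TXNS)


half = merchants_needed(500_000)            # 50% coverage
three_quarters = merchants_needed(750_000)  # 75% coverage
print(half, three_quarters)
```

Under these assumptions, going from 50% to 75% coverage means going from 300 recognised merchants to 50,300, which is why the second half of the data is so much harder than the first.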

Ironically, to build the best specialized model, you also need to build the best general model. Transaction classification requires access to expansive databases and knowledge graphs (continuously updated as businesses constantly come and go), as well as exposure to a diverse array of transaction formats and syntaxes. For any practitioner, this is a daunting task, leaving most with no choice but to opt for external general transaction categorization models, no matter how imperfect.

At Ntropy, we refuse to accept the status quo, and we’ve found a better way to understand transactions by discarding the premise that offering general transaction categorization is the only option. We’ve built and released Ntropy Custom Models, which allows users to build customized categorization models, combining both the power and expansiveness of the Ntropy General Models, with the finesse and precision of specialized in-house models. We encourage you to read more about how it works, get the main talking points, and see some code.

This post isn’t about customization though.

It’s about convincing you that general transaction categorization is the wrong way of understanding your financial data.

It’s about understanding the story that transactions tell.

It’s about understanding what it means to categorize a transaction.

By the end of this post, we hope you’ll understand why transaction categorization is hard, what information exists in a group of transactions, and how we can build smarter insights 💡 from transaction data.

The three sources of problems

At a high level, there are really three things that degrade performance of a transaction classifier.

  1. Schema-related problems. No general set of categories will be a perfect match for your particular use case. Translating between the two will always result in loss of information.
  2. Transaction-related problems. This includes everything related to building a good model, including data, databases, and machine learning tools.
  3. User-related problems. This has to do with how we define our categorization problem. Even with a perfect classifier, we can still define our categorization problem in a way that makes it unsolvable.

Schema-related problems

The typical user approach is to map the output of our general transaction classifier to their internal set of categories. This process is both inefficient and error prone.

  • Your categories are more specific than ours. We can’t split a single “restaurants” label into, say, Mexican or Italian restaurants.
  • Your categories are more general than ours. Our label of “software consulting” would not neatly fit into either “consulting” or “software” alone.
  • Your categories are poorly defined. For example, “special transactions”.
  • Our interpretation and your interpretation of a category can differ (do government benefits include payments into social security or only pulling funds from it?)

Label mappings (mapping every category in our hierarchy to one in yours) can also fail when they depend on more than just the label. An important example is the “loans” category. If the user has both “loan payments” and “loan disbursements” as categories, then our single label cannot be mapped to theirs on its own. To do so, they would need additional logic that checks the transaction entry type and assigns disbursement for incoming and payment for outgoing transactions.
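
The extra logic for the “loans” case could look something like the sketch below. The mapping table, the label strings, and the function name are all hypothetical; they are not taken from the Ntropy category hierarchy or API.

```python
from typing import Optional

# Hypothetical one-to-one mapping from general labels to a user's labels.
LABEL_MAP = {
    "restaurants": "dining",
    "software consulting": "consulting",
}


def map_label(general_label: str, entry_type: str) -> Optional[str]:
    """Map a general label to a user label, using entry type where needed."""
    if general_label == "loans":
        # The label alone is ambiguous; the direction of money decides.
        if entry_type == "incoming":
            return "loan disbursements"
        if entry_type == "outgoing":
            return "loan payments"
        return None  # unknown entry type: refuse to guess
    return LABEL_MAP.get(general_label)


print(map_label("loans", "incoming"))  # loan disbursements
print(map_label("loans", "outgoing"))  # loan payments
```

Every category that needs this kind of special case adds another branch, which is part of why label mappings become hard to maintain.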

From a practical standpoint, label mappings carry infrastructure burdens as well. How should code adapt to changes in the Ntropy set of categories? In general, we never change categories once they are public, but we will add categories. In some cases this could break workflows, such as when the new category is not the child of an existing category in the user’s mapping.
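
One defensive pattern, sketched here under our own assumptions, is to resolve a label by walking up the category hierarchy until a mapped ancestor is found, and to flag anything that never reaches one. The hierarchy, the labels, and the `"unmapped"` sentinel are hypothetical.

```python
# Hypothetical parent links in a general category hierarchy.
PARENT = {
    "mexican restaurants": "restaurants",
    "restaurants": "food and drink",
    "food and drink": None,
    "crypto exchange": None,  # a newly added top-level category
}

# Hypothetical user mapping, written before the new categories existed.
USER_MAP = {"restaurants": "dining"}


def resolve_user_label(label: str) -> str:
    """Return the user label of the nearest mapped ancestor, else 'unmapped'."""
    current = label
    while current is not None:
        if current in USER_MAP:
            return USER_MAP[current]
        current = PARENT.get(current)
    return "unmapped"


print(resolve_user_label("mexican restaurants"))  # dining
print(resolve_user_label("crypto exchange"))      # unmapped
```

A new child category degrades gracefully to its parent’s mapping, while a new category with no mapped ancestor surfaces as `"unmapped"` instead of silently breaking the workflow.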

Transaction-level problems

What are financial transactions?

They’re small records describing the exchange of money between two parties. A transaction can contain myriad types of data, but there are three principal fields that it absolutely must contain: description, amount and entry type. Here’s an example of a (fake) transaction:


Amount: $5000 USD

Entry type: Outgoing

which can be summarized as

“Door LLC sent $5000 to Wood LTD for building materials”.

Broadly, the goal of transaction categorization is to convert a raw record like the one above into its one-sentence summary. In doing so, there are three distinct steps. In the first step, we extract all of the relevant information, including the sender and receiver of money, amount, entry type, people’s names, dates, locations, and any natural language descriptions. In the second step, we use some kind of external database to figure out who and what the named entities we found are (e.g., we need to confirm with Google that Door LLC does in fact sell doors). Then, in the third and final step, we piece all of that information together into what makes the most sense logically and assign a category (in this case, payment for building materials).
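
The three steps can be sketched as a toy pipeline. Everything here is a stand-in: the description string is made up, the in-memory dictionaries play the role of the external database, and the naive substring match is nowhere near a real extractor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transaction:
    description: str
    amount: float
    entry_type: str  # "incoming" or "outgoing"


# Stand-ins for an external database (hypothetical entries).
ENTITY_DB = {"WOOD LTD": "building materials supplier"}
CATEGORY_BY_ENTITY_TYPE = {"building materials supplier": "building materials"}


def extract_counterparty(tx: Transaction) -> Optional[str]:
    """Step 1 (naive): find a known organization name in the description."""
    text = tx.description.upper()
    for name in ENTITY_DB:
        if name in text:
            return name
    return None


def categorize(tx: Transaction) -> str:
    name = extract_counterparty(tx)                              # step 1: extract
    if name is None:
        return "unknown"
    entity_type = ENTITY_DB[name]                                # step 2: resolve
    return CATEGORY_BY_ENTITY_TYPE.get(entity_type, "unknown")   # step 3: assign


tx = Transaction("ACH PAYMENT WOOD LTD 0042", 5000.0, "outgoing")
print(categorize(tx))  # building materials
```

Note that this toy leans on the lookup table even in step 1; a real system has to extract candidate names before it knows whether any database contains them.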

Why is transaction categorization tough?

Hopefully the above process was relatively straightforward. Unfortunately, nearly every single one of those steps can quickly become unsolvable. Here are three quick examples:

  1. Names can be obfuscated, corrupted, or truncated to the point that we can’t recover the original merchant name. For example: British Airways vs British Air vs British.
  2. There are over 200 million companies in the world, and not enough unique names for all of them. Acronyms are a great example. For example, consider a debit transaction for “ACH PYMT TO NRA 4905588409”. You might be quick to judge that NRA stands for the National Rifle Association, but for a restaurant, it’s probably more likely that NRA is the National Restaurant Association. Without additional information, we cannot make any assumptions about the transaction, no matter how obvious they may seem.
  3. Many organizations provide more than one good or service. We assumed Wood LTD sells wood. But, what if they also sold woodworking products? Suddenly we don’t know if the purchase was for materials or for manufacturing equipment!

Fortunately, we can solve or get around most of these problems. (a) Name corruptions have patterns that our models can learn (usually deletions that preserve the spoken sound, like amazon -> amzn). (b) Given an account holder industry, amount, location, and entry type, some results are much more likely. We can also filter based on the popularity of an organization. (c) Given the amount and transaction history, we can nudge predictions towards materials instead of equipment. However, it’s still not totally obvious which to choose.

All of these solutions depend on one absolutely critical component that will differ for every single user: context. Even in solution (a), we are making the assumption that our transactions are in English, which can fail spectacularly if not true. Suppose we have a transaction “PIX PMT”. If we knew the transaction were in Portuguese, we could reasonably infer that PIX is the Brazilian payment processor, but if the transaction were in English, it might be more probable that pix is a shorthand for the word “pictures”. In solution (b), Spectrum is a major cable company in the U.S., but in India it’s a major clothing company. And in (c), we are making the (highly probable) assumption that a door company sending money to a wood company is for supplies to build doors. But this is still an assumption, and all categorization models need to decide what qualifies as a reasonable assumption. There’s always the possibility that this transaction could be for something else, like wood to make facilities improvements.
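
A small sketch of how context can break ties between candidate entities follows. The candidate table, the context keys, and the scoring rule are assumptions for illustration (including the “firearms retail” industry hint); the important behaviour is that the function declines to answer when context is missing.

```python
from typing import Dict, Optional

# Hypothetical candidates: (entity, context hints that support it).
CANDIDATES = {
    "NRA": [
        ("National Rifle Association", {"industry": "firearms retail"}),
        ("National Restaurant Association", {"industry": "restaurant"}),
    ],
    "PIX": [
        ("PIX (Brazilian payment processor)", {"language": "pt"}),
        ("pictures (shorthand)", {"language": "en"}),
    ],
    "SPECTRUM": [
        ("Spectrum (cable company)", {"country": "US"}),
        ("Spectrum (clothing company)", {"country": "IN"}),
    ],
}


def resolve_entity(token: str, context: Dict[str, str]) -> Optional[str]:
    """Pick the candidate best supported by context; None if ambiguous."""
    scored = []
    for entity, hints in CANDIDATES.get(token.upper(), []):
        score = sum(1 for key, value in hints.items() if context.get(key) == value)
        scored.append((score, entity))
    if not scored:
        return None
    scored.sort(reverse=True)
    best_score, best_entity = scored[0]
    tied = len(scored) > 1 and scored[1][0] == best_score
    if best_score == 0 or tied:
        return None  # no supporting context: make no assumption
    return best_entity


print(resolve_entity("NRA", {"industry": "restaurant"}))
print(resolve_entity("NRA", {}))  # None
```

Returning `None` rather than the most famous candidate mirrors the point above: without additional information, we cannot make assumptions about the transaction.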

User-level ambiguities

The problem statement and the problematic statement

We still haven’t defined what it means to categorize a transaction. Let’s do that now. The problem is as follows:

Given a transaction consisting of a snippet of text as well as further pieces of metadata such as amount, entry type, date, account_holder, etc., label this transaction as one of several human-understandable categories.

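Three phrases in that statement cause trouble: “further pieces of metadata”, “one of”, and “human-understandable categories”. We’ll handle these case by case.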

Metadata (transactions usually have incomplete information):

More data equals better results. However, not all input fields are equally important. In order of most to least importance, the ranking looks something like:

Description => Entry Type => Amount => Account Holder Type (business, consumer,…) => ISO Currency Code and Country => Account Holder => Date.

Accuracy generally improves with every one of these fields you correctly supply, and, like a Jenga puzzle, as soon as you start removing items you risk collapsing the whole thing. We won’t discuss how to perform cleaning, deduplication, and error removal from input data; that deserves its own post. Here, we will focus on one sneaky field: account holder. Why is this field so sneaky? Because it provides the much-needed context that we discussed in the previous sections. Without the account holder ID, we can only work at the single-transaction level. This prevents us from properly finding things that need an account history, like recurrence or fraud.

Account holder is the most natural way to mark a particular set of transactions as different from another. It also suggests that the most accurate solutions should be customized at either the user or account holder level. Instead of supplying repetitive, but useful, metadata such as the account holder industry, it can be more effective and efficient to build custom classifiers per user/account holder.
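
To see why account history needs the account holder ID, consider a minimal recurrence check. The input shape, thresholds, and function name are our own assumptions; this is a sketch of the idea, not how any production system detects recurrence.

```python
from collections import defaultdict
from datetime import date
from typing import Iterable, Set, Tuple


def recurring_merchants(
    transactions: Iterable[Tuple[str, str, date]],
    min_occurrences: int = 3,
    tolerance_days: int = 3,
) -> Set[Tuple[str, str]]:
    """Return (account_holder_id, merchant) pairs with evenly spaced charges.

    Each transaction is (account_holder_id, merchant, date).
    """
    days_by_key = defaultdict(list)
    for holder, merchant, day in transactions:
        days_by_key[(holder, merchant)].append(day)

    recurring = set()
    for key, days in days_by_key.items():
        if len(days) < min_occurrences:
            continue
        days.sort()
        gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
        # Roughly constant spacing (e.g. monthly) counts as recurring.
        if max(gaps) - min(gaps) <= tolerance_days:
            recurring.add(key)
    return recurring


txns = [
    ("a1", "Netflix", date(2022, 1, 5)),
    ("a1", "Netflix", date(2022, 2, 5)),
    ("a1", "Netflix", date(2022, 3, 5)),
    ("a2", "Netflix", date(2022, 2, 17)),
]
print(recurring_merchants(txns))  # {('a1', 'Netflix')}
```

The grouping key is the whole point: drop the account holder ID and there is no principled way to decide which charges belong to the same history.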

One of (transactions don’t always fit neatly into one category)

Everybody wants one, and only one, category assigned to every transaction. It makes sense why; such a system is easy to understand, fits nicely in data pipelines, and allows one to unambiguously group transactions. The only problem is, it’s an impossible task.

In our first iteration of the Ntropy categorization API, this problem was our central concern. Our solution was to provide multi-label classification. Because labels compose, this greatly increased the expressiveness of our model. Instead of, say, C fixed classes, there were now 2^C possible classes, one for each combination of output labels (to put that in perspective, 100 classes would mean roughly 10³⁰ possible outputs). In practice, we truncated to 4 labels per transaction, so the number of classes scaled more like C⁴, which still runs into the millions. To see this in action, consider a Netflix subscription. This would be composed of three labels: television, subscription, entertainment.
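
A quick check of these counts for C = 100 is below. The exact figure for the truncated case depends on C and on how “up to 4 labels” is read; here we assume unordered combinations of one to four distinct labels, with C⁴ shown as the looser ordered upper bound.

```python
from math import comb

C = 100          # number of base labels
MAX_LABELS = 4   # labels allowed per transaction

# Every subset of labels is a distinct output: 2^C.
all_subsets = 2 ** C

# Unordered combinations of 1 to 4 distinct labels (our assumed reading).
truncated = sum(comb(C, k) for k in range(1, MAX_LABELS + 1))

# Looser bound if order mattered and repeats were allowed: C^4.
ordered_upper_bound = C ** MAX_LABELS

print(f"{all_subsets:.3e}")       # about 1.268e+30
print(f"{truncated:,}")           # 4,087,975
print(f"{ordered_upper_bound:,}")  # 100,000,000
```

Either way, truncating to four labels shrinks the output space enormously, yet it remains far too large for a person to reason about.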

There was a glaring issue with that approach though. Millions of possible labels is too many. Ain’t nobody got time for that. The point of categorization is to make things simpler, not more complex.

In our second iteration of the Ntropy categorization API, we had no choice but to adopt the single-label classifier. Sure, no such classifier can ever reach 100% accuracy, but it doesn’t mean we can’t get close. There are two ways to hack 100% accuracy though, and they give us insight into the different tradeoffs involved.

  1. Everything is the same class. Equivalent to saying “All transactions are transactions.” 100% accuracy, but zero specificity and zero information.
  2. Every transaction is its own class. Equivalent to just checking for unique entries. Infinite specificity but also zero information.

It may seem surprising that such opposite limits can yield “100% accuracy”, but it tells us a lot about what we hope to achieve. A good set of categories should be small enough to fit in a human’s memory (7 ± 2 seems to be the natural limit), while still carrying meaning. Perhaps Operating Expenses is too general and Electric bills too specific, but Facilities Expenses might be just right. With enough thought and effort, it’s possible to build a reasonably self-consistent, broad, and usable set of categories. However, once you have done so, you’ll eventually remember why the single-label method was only iteration two of the Ntropy API.

In a single-class categorization system, class overlaps are inevitable. Consider the two categories of wages and taxes. They seem rather distinct, until you come across a transaction like “GUSTO PAYRLL TAX 693557 5bmsdkgmmo MONEYMAN LLC.” (easter egg, grab some good headphones and look up that company ;)). This is payroll tax. Does that count as wages or as taxes? Is wages supposed to include the overall cost of an employee (which would include their taxes as well), or simply the after-tax compensation?

Thankfully, in this case, as in the majority of other cases, this problem is solvable.

To fix the category overlap problem, all we need to do is tell our models how to make the choice! This is where customization shines. Machine learning is incredibly adept at finding patterns in data, and so long as enough such examples of classifying payroll tax are supplied, our algorithms can pick up on the pattern. The question of “enough” can be tough (and will be addressed more quantitatively later), but intuitively, it helps to think about how many samples of transactions a human would need to understand what’s going on.

Human-understandable categories (everybody disagrees on how to interpret a label)

When we try to explain this concept, we usually start with an anecdote. Over the many customers we’ve worked with, there is one category that we’ve seen consistently pop up: Special Transactions.

The principal difficulty with this category is that you, we, and anyone else who didn’t create it have no idea what it means. One person’s trash is another person’s treasure, and one person’s “special transaction” is another person’s “yeah who cares about that”. There is no hope that a general transaction model could ever classify this category correctly, and even if it could, it would necessarily come at the cost of classifying it incorrectly for someone else!

We can quickly see that customization is the only solution that untangles things. However, this section doesn’t stop here, because even with model customization, such a category is still difficult to predict: it requires learning an unknown or indeterminate pattern.

To illustrate, suppose that I give you 300 plates of food, with 100 dishes of each of Japanese, Italian, and Moroccan cuisines. If I asked you to build a classifier that can taste 50 dishes from each cuisine and then predict the next 50, I imagine you would be able to build a pretty good model. The categories are well defined, and there should be patterns that distinguish one cuisine from another. Now let’s make it harder. Instead, let’s say I want you to additionally classify whether or not a dish is considered a “special dish”. Let’s consider three possible strategies for marking dishes as special, each of which will explain a different phenomenon:

A dish is special if I flip a coin 5 times, and it comes up heads 4 times.

  • This is just randomness. There is no pattern, so it will be impossible to ever predict this class. However, there is one important case where we can do something: if all of the dishes marked special are at least different from the others, we can define an Other/Unknown category. This trick only works once, but it’s as close as we can get to labeling transactions that have no connecting pattern among themselves.

Dorayaki, Lamb Tagine, and Cacio e Pepe are all considered special.

  • Example two is what we call indeterminate. It’s unclear whether there is a pattern or whether this is just memorization. And if there is a pattern, it’s unclear whether the samples we’ve seen are fully representative of all future samples (the generalization error). When building a set of categories, we can’t teach a model about examples similar to A and then expect it to successfully predict on examples similar to B, unless: (1) A and B are similar; (2) we tell the model that A and B are similar. Here are two examples of indeterminate transaction categories. (1) Special transactions: if special transactions are anything pertaining to airfare, Thai food, and auto loans, but we only show the model examples of airfare, it has no hope of predicting auto loans. (2) Company employees: this one is mostly memorization, but it’s possible that the set of relevant employees is (approximately) finite, and that there could be a pattern to how transactions look for internal employees vs. external contractors. In this case, we just need to show the model enough examples for it to figure out what to focus on.

A dish is special if its name consists of two, and only two, words.

  • Example three is closely related to the metadata problem discussed earlier. There is a clear pattern here, which is good. The only problem is that the data containing the pattern is never given to our model! You could imagine that “special transactions” corresponds to any transaction labeled by Bob from accounting. We don’t know Bob. Neither does our model. The only solution here is to add more input data (i.e. name of annotator) to the model.
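
The third strategy can be illustrated with a toy check for contradictory training data. Three dish names come from the example above; “Miso Soup”, the field names, and the helper functions are additions of ours for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

DISHES = [
    {"name": "Dorayaki", "cuisine": "Japanese"},
    {"name": "Miso Soup", "cuisine": "Japanese"},  # hypothetical extra dish
    {"name": "Lamb Tagine", "cuisine": "Moroccan"},
    {"name": "Cacio e Pepe", "cuisine": "Italian"},
]


def is_special(name: str) -> bool:
    """The hidden rule: special means the name has exactly two words."""
    return len(name.split()) == 2


def conflicting_inputs(rows: List[Dict[str, str]], visible: List[str]) -> List[Tuple[str, ...]]:
    """Inputs that look identical to the model but carry different labels."""
    labels_by_input = defaultdict(set)
    for row in rows:
        key = tuple(row[field] for field in visible)
        labels_by_input[key].add(is_special(row["name"]))
    return [key for key, labels in labels_by_input.items() if len(labels) > 1]


print(conflicting_inputs(DISHES, ["cuisine"]))  # [('Japanese',)]
print(conflicting_inputs(DISHES, ["name"]))     # []
```

When the model only sees the cuisine, two Japanese dishes carry opposite labels and no amount of training can separate them. Once the name is part of the input, the contradiction disappears and the rule becomes learnable.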

Away from general models and towards customization, realizing the Data Network

We’ve spent a lot of time trashing 🗑 general categorization models, and while it’s been fun, the point was to drive home why we’re so excited about our solution to all of the problems outlined above: model customization. Model customization is a realization of something we’ve long strived for at Ntropy, the Data Network. Customization brings us one step closer by allowing every user to create high-performance individual classifiers that can in turn be used to bump the performance of all other customized models. For more information, we suggest you check out our blogs.

If all of this sounds cool, you can also drop us a line, and we’d be happy to get you started with Ntropy Custom Models. Thanks for reading 👋!
