
05 Jul 2023

We asked GPT-4 to categorize our financial data. A glance into the future of transaction enrichment

Author

Ilia Zintchenko

Co-founder and CTO

Introduction

At Ntropy, we help companies understand their customers from their financial data.

Information hidden inside banking data is valuable and can unlock a new generation of products and services. Extracting this information with anywhere close to human accuracy only recently became possible, thanks to advances in natural language models. More on this here.

Financial data enrichment is a problem that may seem simple, but is actually quite complex to solve. If your goal is a basic solution, you can create your own rules-based approach in a few weeks, use ChatGPT, or use an enrichment engine that comes from a data aggregator. However, in such cases, the accuracy of categories and merchants will only be around 60-70%.

For more complex applications, such as underwriting, credit scoring, or fraud detection, a wrongly categorized transaction can be the difference between making a loss or a profit from a decision. Achieving anywhere close to human-level accuracy requires a different setup that would take years to build in-house from scratch. You will also need compute infrastructure (loads of GPUs!), a dedicated engineering team (that needs at least 2 large pizzas for lunch), a human labeling team (5-20 people), a solid merchant database (at least tens of millions of merchants) and loads of diverse transaction data (at least 100-200M transactions). Using a vendor specializing in financial data enrichment is nearly always the most cost- and time-effective solution, and it does not divert focus away from your main product.

In this post we showcase different parts of our transaction enrichment pipeline, including the models themselves, human labeling and QA, merchant databases and benchmarking. We will be as open as possible about our processes, and omit only some key proprietary details.

Transaction enrichment

A bank transaction is initiated by the sender or receiver of the money to satisfy a need. A sequence of such events reveals information about the consumer or business that owns the account. From each transaction, we can extract the merchant name, logo, website, location, MCC (merchant category code), transaction category, recurrence patterns, all counterparties, and much more. This information enables dynamic cash flow underwriting, fraud models that go beyond just capturing device IDs and metadata, personalized products and prices, creation of look-alike audiences, automated tax filing and much more.

The three dimensions of an approach to transaction enrichment are latency, accuracy and cost. An optimal enrichment engine is one that can enrich a typical transaction at the smallest cost, while staying within latency and accuracy requirements. The effective value is the difference between the benefit that the enrichment brings and its cost. Areas where low latency is essential typically also require high accuracy, and vice versa.
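
As a rough illustration, the selection logic described above can be sketched in Python. The engine names and all figures below are hypothetical placeholders, not measurements of any real system, and `pick_engine` and `effective_value` are names introduced here for the example only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Engine:
    name: str
    cost_per_tx: float  # USD per enriched transaction
    latency_s: float    # typical seconds per transaction
    accuracy: float     # fraction in [0, 1]


def pick_engine(engines: Sequence[Engine], max_latency_s: float,
                min_accuracy: float) -> Optional[Engine]:
    """Return the cheapest engine that meets both requirements, or None."""
    feasible = [e for e in engines
                if e.latency_s <= max_latency_s and e.accuracy >= min_accuracy]
    return min(feasible, key=lambda e: e.cost_per_tx, default=None)


def effective_value(benefit_per_tx: float, engine: Engine) -> float:
    """Benefit of an enriched transaction minus the cost of enriching it."""
    return benefit_per_tx - engine.cost_per_tx


# Hypothetical figures, for illustration only.
ENGINES = [
    Engine("rules", cost_per_tx=0.0001, latency_s=0.001, accuracy=0.65),
    Engine("slm", cost_per_tx=0.003, latency_s=0.1, accuracy=0.86),
    Engine("llm", cost_per_tx=0.12, latency_s=20.0, accuracy=0.88),
]

# A sub-200ms use case that needs at least 80% accuracy.
choice = pick_engine(ENGINES, max_latency_s=0.2, min_accuracy=0.80)
```

With these made-up numbers, the strict latency budget rules out the slowest engine and the accuracy floor rules out the cheapest one, which leaves the middle option.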

For example, for payment authorization and fraud detection, latency requirements are sub-200ms. In this regime, enrichment can only be done one transaction at a time and the value of each transaction is high. For user onboarding and financial planning, latencies are on the order of minutes and enrichment can be done in batches of thousands of transactions. In market research and accounting, time-scales are even longer. In this regime, the main metrics are market-aggregate statistics, where the value per transaction is small. However, volumes are also very high and there is less sensitivity to accuracy.

At scale, the majority of the cost is cloud GPUs and caches (for merchant information and transactions). At higher latencies and lower accuracies, enrichment can be done at a significantly lower cost. For example, batches can be scheduled at more favorable time slots when hardware is available, and simpler models that require less compute can be used.

Value vs. latency of transaction enrichment

Models

As we wrote in a previous blog post on data networks and general ML, the most powerful machine-learning models are also the most general ones. This is contrary to software, where specialized versions are better than off-the-shelf ones. The recent success of large language models has made this even more evident. Below, we describe how we combine humans, rules, small language models and large language models internally to maximize the performance of our transaction enrichment API.

Humans

For high-value cases, like underwriting of large business loans or accounting, humans have traditionally been the state of the art. The reason for this, in addition to the high cost of mistakes, is the accountability. When a human makes a decision, they can be held accountable for it. They can provide a rationale, learn from any mistakes, and face consequences if they act irresponsibly or negligently.

At Ntropy, we use a team of expert human labelers to create ground truth for our models to train on and for quality assurance of our production API. Our engineers are (still :) also humans and hold regular quality-check sessions that cover important areas for improvement.

Rule-based algorithms

The first iteration of a transaction enrichment pipeline is typically done with rules, lookup tables and humans-in-the-loop. Rules are quick and easy to create and get you from zero to a mediocre accuracy. They are cheap to run and explainable. However, in the US alone, close to 5M businesses are started every year. A separate set of rules needs to be maintained for each payment processor that each of these businesses uses. At scale, rules also grow in complexity and start to interact with each other, and the rule set quickly becomes infeasible to maintain.
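
To make this concrete, here is a minimal sketch of a first-match rule list in Python. The patterns, the `ACMEPAY *` processor prefix and the category names are all hypothetical, chosen only to show how quickly such rules become order-dependent and processor-specific.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Hypothetical rules as (pattern, category). The first match wins, so the
# order of the list matters and every new rule can interact with older ones.
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"\b(payroll|salary)\b", re.I), "income"),
    (re.compile(r"\batm\b.*\bwithdrawal\b", re.I), "cash withdrawal"),
    # Each payment processor formats descriptions differently and needs its
    # own rule; "ACMEPAY *" is a made-up processor prefix.
    (re.compile(r"^acmepay \*", re.I), "card purchase"),
]


def categorize_with_rules(description: str) -> Optional[str]:
    """Return the category of the first matching rule, or None if none match."""
    for pattern, category in RULES:
        if pattern.search(description):
            return category
    return None
```

Anything that no rule covers falls through to `None`, which is where lookup tables and humans-in-the-loop typically take over.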

At Ntropy, we use rules in three places: in our personalization endpoint, to adjust the output of the API in real time; in our QA team, to fix critical errors in production without having to wait for a new model version; and in dataset generation, to efficiently transfer fundamental financial knowledge to the model.

Small language models (SLMs)

We have been using stacks of SLMs (up to 1B params) from day one, as the main engine behind the Ntropy API. We have found SLMs to be the only approach that can understand bank transaction data at a near human level of accuracy, while maintaining solid latency guarantees and at a price point that allows us to scale to billions of transactions per month.

Large language models (LLMs)

We also use LLMs internally to augment some of the labels created by our labeling team. This enables us to increase the amount and diversity of our training data on-demand without increasing the size of our labeling team.

We have found that large LLMs (175B+ parameters) with a specialized prompt that includes specific instructions and optimal labeled examples are significantly more accurate than smaller fine-tuned and adapted open-source LLMs (7-30B parameters). Smaller LLMs, however, can be much faster.

The output of a general large-language model is entirely defined by the input prompt. The prompt that we use is composed of 3 parts:

Static part that describes the task
Variable part that contains a list of relevant, labeled examples
The input example itself

The algorithm that creates a prompt that maximizes accuracy and minimizes the latency of the model can be of similar complexity to a standalone enrichment model. For example, relevant labeled transactions in the prompt can boost the accuracy significantly. However, these examples need to be chosen carefully. The larger the set of pre-labeled transactions, the better the examples in the prompt can be, which will in turn increase overall accuracy. The model accuracy is also strongly correlated with the accuracy of these labels.
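
The three-part prompt can be sketched as follows. This is only an illustration under stated assumptions: the task wording, the labeled examples and the token-overlap `similarity` function are hypothetical stand-ins, and a production system would use a tuned similarity function rather than plain Jaccard overlap.

```python
from typing import List, Set, Tuple

# Part 1: static task description (hypothetical wording).
TASK = "Assign one category to the bank transaction. Answer with the category only."

# Hypothetical pool of pre-labeled transactions as (description, category).
LABELED: List[Tuple[str, str]] = [
    ("STREAMCO MONTHLY 866-555", "subscriptions"),
    ("CITY GAS 5521 AUSTIN TX", "fuel"),
    ("CITY GAS 0042 DALLAS TX", "fuel"),
    ("ACME CORP PAYROLL", "income"),
]


def _tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens, a stand-in for a tuned function."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def build_prompt(transaction: str, labeled: List[Tuple[str, str]], k: int = 2) -> str:
    """Assemble task description, the k most similar examples, and the input."""
    # Part 2: variable list of the most relevant labeled examples.
    examples = sorted(labeled, key=lambda ex: similarity(transaction, ex[0]),
                      reverse=True)[:k]
    lines = [TASK, ""]
    for text, label in examples:
        lines += [f"Transaction: {text}", f"Category: {label}", ""]
    # Part 3: the input example itself, left open for the model to complete.
    lines += [f"Transaction: {transaction}", "Category:"]
    return "\n".join(lines)


prompt = build_prompt("CITY GAS 7710 HOUSTON TX", LABELED)
```

Because the examples are picked per input, a larger and cleaner labeled pool directly improves what ends up in the prompt, while a larger `k` increases prompt length and therefore cost and latency.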

Cost vs. accuracy

As we discussed above, the cost per transaction is key to generating positive value. Any model will be updated regularly, so training costs are important too.

Training:

Human labelers need to learn about financial terms and general structure of transactions to get started. After the initial barrier, accuracy quickly improves over time.
Writing a few rules that handle some key patterns and result in mediocre accuracy is simple. However, to reach high accuracy, we need exponentially more rules to cover all edge cases and it quickly becomes unmanageable.
SLMs require lots of training data to start to produce decent results. To further increase the accuracy, even more data and better labels, as well as a number of other tricks around model architecture and pre-training are required.
As mentioned above, we are focusing only on prompt tuning for large LLMs in this post, as we have found this to result in higher accuracies than fine-tuning or adaptation of smaller models. An LLM is simple to get started with by just providing the transaction and required fields in the prompt. To increase accuracy further, we need similar tooling to SLMs, a large set of labeled examples and a tuned similarity function to pick relevant ones.

Inference:

Human labels are expensive. Great human labels are even more expensive.
Once deployed, rules have a very low computational footprint. More rules marginally increase this cost, as they can be used selectively, depending on the input.
SLMs are significantly more expensive than rules and require GPUs to reach an acceptable latency. This cost, however, is independent of accuracy, if the model size is kept the same.
LLMs are larger models than SLMs, and correspondingly more expensive to run. With a specialized prompt for higher accuracy, the cost and latency can be similar to those of a human labeler.

Data labeling and QA

Human labels are an essential component in our transaction enrichment pipeline. We use human labels to train all our models, monitor the quality of our production API, track model quality changes over time, and find errors that need fixing in our production data.

Labeling data can be a tedious process and consistency is paramount. The key metrics are accuracy, elasticity and cost. There are plenty of outsourced data labeling options on the market that provide elasticity similar to cloud compute: nearly arbitrary labeling volumes at any time. Amazon Mechanical Turk is the classic example. However, given our high data security and accuracy requirements, we have found an in-house labeling team that is specifically trained to understand banking data to be unmatched.

To bring data labeling entirely in-house, we have built an extensive set of frontend and backend tools specialized for financial data. We have also borrowed some of the approaches from mobile gaming to incentivize our labeling team and improve consistency and quality. Our in-house team has labeled hundreds of thousands of transactions with very high accuracy since its inception.

Merchant data

Categorizing a transaction requires 3 main stages:

Recognize the key parts of the transaction description, like merchant name, location, etc.
If there is a merchant present, find its website, description, MCC, full name, etc.
Based on all the information above, find a suitable category for the transaction.
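
The three stages above can be sketched as a toy pipeline. Everything here is an assumption made for illustration: the single description layout ("MERCHANT STORE-NUMBER CITY STATE"), the one-entry merchant cache and the MCC-to-category mapping are hypothetical, and a real system would use models rather than one regular expression.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Merchant:
    name: str
    website: str
    mcc: str


# Hypothetical merchant cache keyed by lowercase merchant name.
MERCHANTS: Dict[str, Merchant] = {
    "city gas": Merchant("City Gas Inc.", "citygas.example", "5541"),
}

# Hypothetical mapping from MCC to a category label.
MCC_CATEGORY: Dict[str, str] = {"5541": "fuel"}

# Stage 1 assumes one layout only: "MERCHANT STORE-NUMBER CITY STATE".
DESCRIPTION = re.compile(
    r"^(?P<merchant>[A-Za-z ]+?)\s+\d+\s+(?P<city>[A-Za-z ]+)\s+(?P<state>[A-Z]{2})$"
)


def parse(description: str) -> Dict[str, str]:
    """Stage 1: recognize the key parts of the description."""
    match = DESCRIPTION.match(description.strip())
    return match.groupdict() if match else {}


def lookup(merchant_name: str) -> Optional[Merchant]:
    """Stage 2: find merchant details in the cache."""
    return MERCHANTS.get(merchant_name.lower())


def categorize(description: str) -> Optional[str]:
    """Stage 3: derive a category from everything found so far."""
    parts = parse(description)
    merchant = lookup(parts.get("merchant", ""))
    return MCC_CATEGORY.get(merchant.mcc) if merchant else None
```

Each stage can fail independently (unparseable text, unknown merchant, unmapped MCC), which is why errors in the merchant data propagate directly into the final category.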

Accurate merchant data is critical for transaction understanding. Tens of thousands of new businesses are started daily in the world. A similar number shut down. This data therefore needs to be updated regularly to keep track of these changes. More on our approach to this here.

Maximizing accuracy, coverage and relevance of our merchant cache is essential. To achieve this, we combine multiple sources of data, including proprietary and public merchant DBs, search-engine result pages, data mined from LLMs and from transaction data. We maintain a cache of nearly 100M global merchants.

Merchant DBs

A merchant database can be obtained either by buying it from a vendor or by creating a custom one by scraping data from merchant aggregators with open access. There is no holy-grail merchant dataset around. Each entry in the database contains a subset of the website, location, MCC, merchant name, contact details, parent company, etc. Across proprietary merchant DBs on the market, we have found the fraction of entries where all fields are correct to be typically less than 50%, with only a few DBs scoring any higher than that. Overall coverage is usually far below that. We combine multiple such databases through probabilistic entity linkage to increase this to > 90% in the geographies that we cover.
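
The post does not describe the linkage model itself, so the following is only a generic sketch of probabilistic record linkage in the Fellegi-Sunter style: each field that agrees or disagrees contributes a log-likelihood ratio, and two records are linked when the total crosses a threshold. The field list, the probabilities and the threshold are hypothetical.

```python
import math
from typing import Dict, Optional, Tuple

Record = Dict[str, Optional[str]]

# Hypothetical per-field probabilities (m, u):
#   m = P(field agrees | records describe the same merchant)
#   u = P(field agrees | records describe different merchants)
FIELD_PROBS: Dict[str, Tuple[float, float]] = {
    "name": (0.90, 0.01),
    "website": (0.95, 0.001),
    "city": (0.85, 0.05),
}


def match_weight(a: Record, b: Record) -> float:
    """Sum of log2 likelihood ratios over fields present in both records."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        va, vb = a.get(field), b.get(field)
        if not va or not vb:
            continue  # a missing value carries no evidence either way
        if va.strip().lower() == vb.strip().lower():
            total += math.log2(m / u)              # agreement: positive weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement: negative weight
    return total


def same_merchant(a: Record, b: Record, threshold: float = 5.0) -> bool:
    """Link two records when the accumulated evidence passes the threshold."""
    return match_weight(a, b) >= threshold


db_a = {"name": "City Gas", "website": "citygas.example", "city": "Austin"}
db_b = {"name": "city gas", "website": "citygas.example", "city": None}
db_c = {"name": "Other Shop", "website": "other.example", "city": "Austin"}
```

Here `db_a` and `db_b` agree on name and website and would be linked despite the missing city, while `db_a` and `db_c` share only a city and would not. Once records are linked, fields missing in one source can be filled from another, which is how combining databases raises coverage.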

SERP

Search engine results pages are another source of merchant data. The index of search engines can take anywhere from a few days to a few weeks to update. This data is more real-time than merchant databases and is also popularity-sorted. However, lots of tricks are involved to get search engine results to resolve to the correct merchant in the transaction.

LLMs

The model weights of LLMs may act as a better ranking mechanism than traditional search engine algorithms, as we bypass the SEO of search engines, which tends to float up less relevant content, and can use the models' reasoning ability to help retrieve the correct information. Although only information that was available before the knowledge cutoff of the model can be retrieved this way, much of it is still relevant. As merchant information is discrete and non-compressible, we have found that only the largest models, whose size is similar to the number of tokens in the training set, work reliably for this. To optimize latency, we pre-mine and cache this information asynchronously.

Transaction data

Some merchant information can be inferred directly from transaction data. For example, from the amount and recurrence patterns of transactions to a merchant we can see if they are offering a subscription service and what category of product it may be.
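
As a minimal sketch of this idea, the function below flags a merchant as a likely subscription when payments to it arrive at a regular interval with a stable amount. The function name, the 30-day period and the tolerances are assumptions chosen for the example, not values from our pipeline.

```python
from datetime import date
from statistics import median
from typing import List, Tuple


def looks_like_subscription(payments: List[Tuple[date, float]],
                            period_days: int = 30,
                            day_tolerance: int = 3,
                            amount_tolerance: float = 0.05) -> bool:
    """Heuristic: regular gaps between payments and a near-constant amount."""
    if len(payments) < 3:
        return False  # too few payments to call it a pattern
    ordered = sorted(payments)
    dates = [d for d, _ in ordered]
    amounts = [a for _, a in ordered]
    # Days between consecutive payments.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    regular = all(abs(gap - period_days) <= day_tolerance for gap in gaps)
    # Amounts must stay within a relative tolerance of the median amount.
    typical = median(amounts)
    stable = all(abs(a - typical) <= amount_tolerance * typical for a in amounts)
    return regular and stable


monthly = [(date(2023, 1, 5), 9.99), (date(2023, 2, 5), 9.99),
           (date(2023, 3, 6), 9.99), (date(2023, 4, 5), 9.99)]
irregular = [(date(2023, 1, 5), 12.50), (date(2023, 1, 9), 48.00),
             (date(2023, 3, 1), 7.25)]
```

The typical amount is itself a hint about the product category: a small fixed monthly charge suggests something quite different from a large one.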

Benchmarking

Measuring how well a transaction enrichment solution performs is critical, whether it is a vendor you are evaluating or an internal solution. Without proper metrics, you can be blind to why products that are powered by the banking data are getting better or worse. However, benchmarking transaction enrichment APIs is a notoriously difficult and often ambiguous process.

Not surprisingly, private accuracy metrics are notorious for being completely fake. The vogue figure to use is somewhere in the 95-98% range. From what we have seen, this is the industry standard for what any vendor should say the accuracy is. 100% would not be believable. Much below 95% starts to look bad. 95-98% is just right. Part of the goal of this post is to help our current and future customers to navigate this space, rather than to muddy the waters with made-up numbers.

Below we will benchmark the accuracy of transaction categories returned by our API against a trained human labeler and GPT-4. The category of a transaction describes the purpose of the money transfer with respect to the needs of the account holder. Assigning a category to a transaction is hard even for a human and requires financial knowledge, information about the context of a transaction and a general understanding of the world. Categorization is the foundation behind bank-driven income verification, risk scoring based on cash flow, financial management, and other applications.

Note that there is no such thing as a 100% objective metric. For example, the categorization accuracy depends on the specific transactions in the test set, the category hierarchy (usually, the more coarse-grained the hierarchy, the higher the accuracy), the ground truth (sometimes multiple labels are valid for a single transaction, and even human labels are not 100% accurate), the version of the model (next week a better model for this set of transactions may ship), the metric (F1, accuracy, precision, recall, etc.), the account holder (which affects the transaction label in some cases), and the amount of leakage from the training set into the test set.
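
To illustrate how much the choice of metric alone can matter, the sketch below scores the same predictions with plain accuracy and with macro-averaged F1. The labels and predictions are made up for the example and are unrelated to the benchmark data.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that equal the ground-truth label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up test set: 8 common-category and 2 rare-category transactions,
# scored against a model that always predicts the common category.
y_true = ["groceries"] * 8 + ["rent"] * 2
y_pred = ["groceries"] * 10

acc = accuracy(y_true, y_pred)   # 0.8
f1 = macro_f1(y_true, y_pred)    # about 0.44
```

The same model looks acceptable at 80% accuracy and poor at roughly 0.44 macro F1, because it never gets the rare category right. A headline number is therefore only meaningful together with the metric and the test set behind it.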

The dataset is available in the benchmark repo. All names, numbers, locations and amounts have been anonymized to maintain privacy. This benchmark is made to be difficult on purpose, consisting mostly of long-tail merchants and transfer types. Here are the numbers:

	Ntropy	GPT-4-0314 (base prompt*)	GPT-4-0314 (optimal prompt**)	Human labeler
Price [per 1k txs]	$1-$5^	$1.62^^^	$120	$90-$130^^
Latency [s]	0.05-0.2***	2	20	30-60
Accuracy	86%	71%	88%	90%†

Categorization benchmark of US consumer transactions

* - basic prompt, zero-shot
** - advanced prompt with relevant labeled examples and additional instructions
*** - depending on caching
^ - depending on configuration
^^ - depending on transaction type
^^^ - this is by far the best case, assuming the batch is large enough that the instruction part of the prompt gets amortized away
† - this is the average accuracy across our in-house labeling team, specifically trained to label bank transactions. The ground truth for each example in the test set has been triple-checked to be 100% correct.

It is interesting to note that GPT-4 with optimal prompting is roughly similar to an average human labeler across all parameters, and even outperforms the Ntropy API on accuracy. However, the roughly 100x higher cost, 200x higher latency and lack of reliability guarantees are unlikely to be worth it for any real-world use case.
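
For readers who want to check the orders of magnitude, the ratios can be recomputed from the benchmark table above. The snippet only restates the table's figures; `ratio_range` is a helper introduced here for the example.

```python
from typing import Tuple

# Figures copied from the benchmark table above.
NTROPY_PRICE_PER_1K = (1.0, 5.0)   # USD, low and high end
GPT4_OPTIMAL_PRICE_PER_1K = 120.0  # USD
NTROPY_LATENCY_S = (0.05, 0.2)     # seconds, low and high end
GPT4_OPTIMAL_LATENCY_S = 20.0      # seconds


def ratio_range(value: float, baseline: Tuple[float, float]) -> Tuple[float, float]:
    """Smallest and largest ratio of value to a (low, high) baseline range."""
    low, high = baseline
    return value / high, value / low


cost_ratio = ratio_range(GPT4_OPTIMAL_PRICE_PER_1K, NTROPY_PRICE_PER_1K)
latency_ratio = ratio_range(GPT4_OPTIMAL_LATENCY_S, NTROPY_LATENCY_S)
```

Depending on which end of the Ntropy ranges is used, the cost ratio comes out between about 24x and 120x and the latency ratio between about 100x and 400x, consistent with the rounded figures quoted above.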

Conclusion

In this post, we discussed the key components of our financial data enrichment stack: models, data labeling and QA, merchant data and benchmarking. Humans can think outside the box, but are slow and expensive. Rules are cheap at a small scale, but are infeasible beyond prototyping. SLMs are scalable and accurate, but can be expensive to train and iterate on. LLM APIs can get you started quickly without much prior knowledge, but are not usable in real time and require large quantities of data, human labels and iterations to get right. At Ntropy, we combine all four to deliver maximum accuracy at the lowest latency and cost.

As is the culture at our company, we are as open as possible with our methods, metrics and processes. We believe this is key to continue pushing the boundaries of what is possible to do with financial data. We hope the rest of the industry will follow suit.

Reach out to us to learn more or get up and running in a few minutes through our self-serve portal.
