Evals Part One: Why you need to evaluate products built on top of language models

Author

Michael Jenkins

Product Marketing Lead

At Ntropy we have built our transaction enrichment capabilities on top of language models and we encourage our prospective customers to evaluate how our transaction enrichment models perform on real data.

However, some customers are misled by unachievable marketing claims and skip the evaluation process. The aim of this three-part series is to shine a light on evaluation processes and on the need to evaluate products that are built on top of models.

Part one gives an introduction to why customers need to run an evaluation process when selecting a provider. Part two covers some broad options for how to evaluate, and part three covers what to expect from the evaluation process at Ntropy specifically.

Whilst our expertise is in transaction enrichment, the same rules and learnings can be applied when evaluating any product built on top of language models.

Probabilistic outputs drive the need for evaluations

As building with language models or transformers becomes the default and their capabilities evolve rapidly, the need for evaluations becomes more and more prominent.

Language models have probabilistic outputs. Unlike a typical rule-based algorithm or system, or a dashboard product with a deterministic software stack behind it, a language model or a product powered by one might give a different answer every time you use it, and that answer is not guaranteed to be one hundred percent correct.

In a lot of cases there is no single source of truth. In these scenarios, particularly when evaluating vendors, benchmarking and evaluations are a necessary “evil”. Moreover, they are becoming critical.

What does it mean to evaluate?

Before we get into whether it is crucial to run evals and get them right, we should quickly explain what we mean by evaluating. We use this term to describe the process by which a prospective customer compares the performance of our product with that of other solutions: existing in-house workarounds, other vendors and, most importantly, a “ground truth”, which is usually a human-labelled and checked dataset that serves as a core point of reference.

A ground truth is the expected output that is known to be true. Typically in the transaction enrichment world, the ground truth is what a human would be able to achieve. We strive for human-like accuracy, but at scale and at a significantly lower cost. Even for this specific case, human decisions can vary and be subjective.
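As a minimal sketch of what comparing against a ground truth can look like, the snippet below scores a vendor's category labels against a human-checked set. The function name, the category labels and the five sample transactions are hypothetical and only for illustration. Exact-match accuracy is an assumption here; a real evaluation may score several fields and use looser matching.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], ground_truth: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the ground truth."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    matches = sum(p == t for p, t in zip(predictions, ground_truth))
    return matches / len(ground_truth)


# Hypothetical human-checked labels for five transactions.
ground_truth = ["groceries", "restaurants", "transport", "groceries", "utilities"]
# Hypothetical vendor output for the same five transactions.
vendor_output = ["groceries", "restaurants", "transport", "shopping", "utilities"]

print(f"accuracy: {accuracy(vendor_output, ground_truth):.0%}")
```

Because human decisions can vary, the same function can also be run on the labels of two different human reviewers to get a rough sense of how much they agree with each other before a vendor is measured against them.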

Lots of enrichment solutions publish very high accuracy statistics on their websites and in marketing materials. From our experience talking with customers, these numbers are very often not achievable, which leads nearly every prospective customer to run some form of evaluation process.

Why should buyers run an evaluation process?

The three main reasons why customers carry out evals are scepticism of self-reported accuracy metrics, the lack of applicability of those metrics to their specific use case, and the need for a like-for-like comparison with other solutions. Every finance team signing off a new vendor is going to look for an improvement relative to cost, and evaluate on a case-by-case basis.

This means enterprises choosing vendors that are offering language models as a service are going to disregard general benchmarks, published numbers and leaps made on common data sets and are going to focus on how a certain model performs on their use case, what the latency is and how much it costs. These dimensions are the key decision drivers.

Difficult to achieve publicly reported numbers

In our discussions with customers they often mention that they are sceptical of the 97%-99.9% accuracy numbers that are marketed. They seem too good to be true and artificially set high enough to be attractive but lower than 100% to be believable. Many prospective customers who are familiar with bank data and its intricacies have been jaded by previous experience.

Another common scenario is that a model's performance is biased towards a specific test set that it has been optimized for. Such a model can fail completely when presented with out-of-distribution cases, because it does not generalize. What this means is that as you acquire new customers who differ from your previous ones, your model will underperform if it was not trained on a broad enough dataset. This is why internal models and solutions often don’t perform well when expanding to new segments or geographies, despite being highly accurate in the beginning.

Accuracy is heavily dependent on data set

As most customers are aware and all learn very quickly, accuracy numbers depend on the data set and are not broadly applicable, so customers want to see the accuracy for the data that they typically receive.

Even if broadly the same, datasets can still be skewed. For example, not all customers with US consumer transactions have the same distribution of merchants and categories.

One customer might benchmark a provider and see 94% accuracy whilst another might only get 86% accuracy because the distribution of their US consumers is different. One dataset might come from consumers in a small region or from a specific subset of the population.
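The effect of the data mix can be sketched with a small weighted average. All numbers below are hypothetical and chosen only for illustration; they are not measurements of any provider. The sketch assumes a provider whose accuracy differs by category and two customers whose transactions are spread differently across those categories.

```python
from typing import Dict

# Hypothetical per-category accuracy for a single provider (illustrative only).
per_category_accuracy: Dict[str, float] = {
    "groceries": 0.98,
    "restaurants": 0.95,
    "subscriptions": 0.90,
    "long_tail": 0.60,
}

# Hypothetical share of each category in two customers' transaction data.
customer_a_mix = {"groceries": 0.40, "restaurants": 0.30, "subscriptions": 0.20, "long_tail": 0.10}
customer_b_mix = {"groceries": 0.20, "restaurants": 0.20, "subscriptions": 0.20, "long_tail": 0.40}


def overall_accuracy(mix: Dict[str, float], per_category: Dict[str, float]) -> float:
    """Weight each category's accuracy by its share of the customer's data."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("category shares must sum to 1")
    return sum(share * per_category[category] for category, share in mix.items())


print(f"customer A: {overall_accuracy(customer_a_mix, per_category_accuracy):.1%}")
print(f"customer B: {overall_accuracy(customer_b_mix, per_category_accuracy):.1%}")
```

With these made-up figures the same provider scores about 92% for customer A and about 81% for customer B, purely because customer B has more long-tail transactions.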

How do you stack up?

Different vendors optimize for different things and have their own strengths and weaknesses. Most players in this market pick a specific use case and ideal customer profile (ICP) and try to cater for that segment, making sure they get the best possible results. While in general this is a good approach, the returns are diminishing, as the task itself always involves a long tail of input data that is not a good fit for specialized approaches.

There are vendors that would return the right information 99.9% of the time but would only be able to handle 20-30% of the data. Others focus on coverage: they make sure you get an answer every single time and share internal confidence scores so that you can build decision engines on top of their outputs.
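The trade-off between accuracy and coverage can be made concrete with a confidence threshold. The sketch below is an assumption about how a buyer might use confidence scores, not a description of any vendor's API. The `Prediction` class, the scores and the labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Prediction:
    label: str         # what the vendor returned
    confidence: float  # vendor-reported confidence score in [0, 1]
    truth: str         # human-checked ground truth label


def coverage_and_accuracy(
    predictions: List[Prediction], threshold: float
) -> Tuple[float, Optional[float]]:
    """Accept only predictions at or above the threshold.

    Returns (coverage, accuracy on accepted predictions). Accuracy is None
    when nothing is accepted.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    accepted = [p for p in predictions if p.confidence >= threshold]
    coverage = len(accepted) / len(predictions)
    if not accepted:
        return coverage, None
    correct = sum(p.label == p.truth for p in accepted)
    return coverage, correct / len(accepted)


# Hypothetical predictions for five transactions.
sample = [
    Prediction("groceries", 0.99, "groceries"),
    Prediction("restaurants", 0.95, "restaurants"),
    Prediction("transport", 0.70, "transport"),
    Prediction("shopping", 0.55, "groceries"),
    Prediction("utilities", 0.40, "insurance"),
]

for threshold in (0.0, 0.6, 0.9):
    coverage, acc = coverage_and_accuracy(sample, threshold)
    print(f"threshold={threshold}: coverage={coverage:.0%}, accuracy={acc:.0%}")
```

In this toy data, raising the threshold improves accuracy on the answers that are kept while shrinking the share of transactions that get an answer at all, which is why a headline accuracy figure is hard to interpret without the coverage it was measured at.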

What we have learnt at Ntropy is that the best possible scenario for a customer, the so-called holy grail, is a larger model that can generalize across use cases, languages and data types. There is no other way to offer an end-to-end solution here and avoid vendor stacking and over-engineering internal systems.

Next Up - Part 2

The next part in this series will outline a few of the key ways that customers evaluate before we dig deeper into what that process looks like at Ntropy.
