Evals Part Three: What to expect when evaluating Ntropy

Author

Michael Jenkins

Product Marketing Lead

This is the final part of our three-part deep dive into benchmarking.

Part One covered some of the key reasons why customers should run an evaluation process when selecting providers that build their products or services on top of language models. To recap, these reasons include:

Difficulty in achieving publicly reported numbers
Accuracy depends heavily on your dataset

In Part Two we outlined some of the key considerations when running an evaluation. The main ones are:

Time/Resources
Availability of data

In this final part we give an in-depth overview of the evaluation process that we run with customers at Ntropy.

Kick-off

All evaluations begin with a kickoff call to align on what data you want to see as well as the goals and objectives for the process. Some customers are only interested in testing the accuracy of a few fields of our enrichment services, whilst others need the full works.

Aligning on the goals and objectives is crucial to establish what customers consider to be a correct result and how they will assess the results. One key part of that assessment is what they will compare our output to.

The results of enrichment can be subjective, so it is important to define what a correct result is. For example, in P2P transactions, our models by default return an empty “merchant” field but populate “intermediaries” (such as Cash App, Zelle, Venmo) and “person”. Some customers may consider this incorrect because they prefer to have the “intermediary” as the “merchant”, which is why it's important to discuss success criteria at the beginning.
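
To make the P2P example concrete, here is a minimal Python sketch of how an agreed success criterion could be written down as a scoring rule. The record shape and the function names are assumptions for illustration only; they are not Ntropy's actual output schema or scoring code.

```python
from typing import Optional

# Hypothetical record shape: the field names mirror the ones quoted above,
# but the exact output schema is an assumption.


def effective_merchant(record: dict, intermediary_as_merchant: bool) -> Optional[str]:
    """Return the value to score as the merchant under the agreed criteria."""
    merchant = record.get("merchant") or None
    if merchant is None and intermediary_as_merchant:
        # Customer preference: fall back to the first intermediary, if any.
        intermediaries = record.get("intermediaries") or []
        return intermediaries[0] if intermediaries else None
    return merchant


def normalise(value: Optional[str]) -> Optional[str]:
    """Compare names case-insensitively and ignore surrounding whitespace."""
    return value.strip().casefold() if value else None


def merchant_is_correct(record: dict, expected: Optional[str],
                        intermediary_as_merchant: bool) -> bool:
    actual = effective_merchant(record, intermediary_as_merchant)
    return normalise(actual) == normalise(expected)


p2p = {"merchant": None, "intermediaries": ["Venmo"], "person": "Jane Doe"}

# Default criteria: an empty merchant is the correct answer for a P2P payment.
print(merchant_is_correct(p2p, None, intermediary_as_merchant=False))
# Alternative criteria: the customer expects the intermediary as the merchant.
print(merchant_is_correct(p2p, "Venmo", intermediary_as_merchant=True))
```

The same enriched record passes or fails depending on which rule was agreed, which is why the rule needs to be fixed before any accuracy number is computed.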

Data Transfer

Once the kickoff call has taken place and we are aligned with the customer, they send us the data. We will enrich files of up to 5,000 transactions for an evaluation. The input fields we require are listed in our developer documentation.

Although it is listed as optional, the country of the transaction can have a meaningful impact on the accuracy of the output, so please send it if it is available.

Account holder type is also listed as optional: our enrichment solution works without it but is far more accurate if you let us know what kind of data you are sending. The options for this field are consumer, business or freelance. We have unique category hierarchies for each, so it is highly recommended to send us this field.

Additionally, many customers send us an Account Holder ID, which allows us to store the transactions for each business or consumer together so we can run higher-level analysis on them, such as our Recurring Spend product, which identifies which transactions are recurring. Without this ID, Recurring Spend is not possible.

Customers can send us data via secure file storage, email or Slack.
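
Before sending a file, a quick pre-flight check can catch the issues described above. The sketch below is illustrative only: the column names (country, account_holder_type, account_holder_id) are assumptions, so check the developer documentation for the real input field names.

```python
import csv
import io
from typing import List

MAX_ROWS = 5000  # evaluation file limit stated above
ACCOUNT_HOLDER_TYPES = {"consumer", "business", "freelance"}


def check_evaluation_file(handle) -> List[str]:
    """Return human-readable warnings for a CSV of transactions.

    Column names are hypothetical and used for illustration only.
    """
    rows = list(csv.DictReader(handle))
    warnings = []

    if len(rows) > MAX_ROWS:
        warnings.append(f"{len(rows)} rows exceeds the {MAX_ROWS}-transaction limit")

    def blank(row, column):
        return not (row.get(column) or "").strip()

    missing_country = sum(1 for r in rows if blank(r, "country"))
    if missing_country:
        warnings.append(f"{missing_country} rows have no country")

    bad_type = sum(
        1 for r in rows
        if (r.get("account_holder_type") or "").strip().lower()
        not in ACCOUNT_HOLDER_TYPES
    )
    if bad_type:
        warnings.append(f"{bad_type} rows have a missing or unknown account holder type")

    missing_id = sum(1 for r in rows if blank(r, "account_holder_id"))
    if missing_id:
        warnings.append(f"{missing_id} rows have no account holder ID")

    return warnings


sample = io.StringIO(
    "description,amount,country,account_holder_type,account_holder_id\n"
    "AMZN Mktp US,12.99,US,consumer,u1\n"
    "SQ *COFFEE,4.50,,consumer,\n"
)
for warning in check_evaluation_file(sample):
    print(warning)
```

The check returns warnings rather than raising errors because the optional fields are genuinely optional; the point is to know what accuracy and product coverage you are giving up before the evaluation starts.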

Enrichment

Once we have the data, we run it through the appropriate enrichment pipeline based on the needs and use case of the customer identified during the kickoff call.

Quality Checks

Once we have the enriched output, our QA team takes a sample of 500 transactions, either chosen at random or specified by you, and has them hand-labelled by our expert team of labellers to provide what we call a ground truth.

The output of our enrichment models is then compared with what our human labellers produced to provide accuracy metrics, scored against the previously agreed-upon criteria, which should represent the accuracy of the entire dataset.
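
The sampling and comparison steps can be sketched in Python as follows. This is a simplified illustration, not Ntropy's QA tooling: the function names and record shape are assumptions, exact-match scoring stands in for whatever criteria were agreed at kickoff, and the Wilson score interval is our own addition to show how much uncertainty a 500-transaction sample carries.

```python
import math
import random


def sample_ids(transaction_ids, k=500, seed=0):
    """Draw a reproducible random sample of transaction ids."""
    ids = list(transaction_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))


def field_accuracy(predicted, ground_truth, field):
    """Count exact matches on one field.

    Both arguments map transaction id -> record dict (hypothetical shape).
    Returns (number correct, number compared).
    """
    ids = list(ground_truth)
    correct = sum(
        1 for i in ids if predicted[i].get(field) == ground_truth[i].get(field)
    )
    return correct, len(ids)


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("cannot compute an interval for an empty sample")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Toy data: 2,000 enriched transactions, 500 sampled, 460 labelled as correct.
all_ids = [f"t{i}" for i in range(2000)]
sampled = sample_ids(all_ids, k=500, seed=42)
truth = {i: {"merchant": "Acme"} for i in sampled}
predicted = {
    i: {"merchant": "Acme" if n < 460 else "Other"}
    for n, i in enumerate(sampled)
}

correct, n = field_accuracy(predicted, truth, "merchant")
low, high = wilson_interval(correct, n)
print(f"merchant accuracy: {correct / n:.1%} (95% interval {low:.1%} to {high:.1%})")
```

The interval only describes the full dataset when the 500 transactions are drawn at random; a hand-picked sample measures accuracy on the cases you chose.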

Review

After our team has performed the quality checks, we set up a call with the customer to review the results and answer any questions.

After the review call, we send the output file to the customer so that they can perform their own checks on the results and compare them with any other providers or existing solutions.

Depending on the size of the benchmark, customers will typically do one of the following:

Eyeball a sample and compare it to intuition
Check a sample against a ground truth
Check all results, either by hand or against a ground truth

Some customers already have enrichment results from an existing solution and will use these as the ground truth.
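
When an existing solution serves as the ground truth, it helps to look at where the two outputs disagree rather than only at the headline agreement rate. The following Python sketch assumes both outputs are keyed by a transaction id; the record shape and the category labels are made up for illustration.

```python
from collections import Counter


def compare_outputs(new_output, existing_output, field):
    """Compare one field across two enrichment outputs.

    Both arguments map transaction id -> record dict (hypothetical shape).
    Returns the agreement rate and a list of (id, existing, new) differences.
    """
    shared = sorted(new_output.keys() & existing_output.keys())
    diffs = [
        (tid, existing_output[tid].get(field), new_output[tid].get(field))
        for tid in shared
        if existing_output[tid].get(field) != new_output[tid].get(field)
    ]
    agreement = 1 - len(diffs) / len(shared) if shared else 0.0
    return agreement, diffs


existing = {
    "t1": {"category": "groceries"},
    "t2": {"category": "transport"},
    "t3": {"category": "shopping"},
}
new = {
    "t1": {"category": "groceries"},
    "t2": {"category": "ride sharing"},
    "t3": {"category": "shopping"},
}

agreement, diffs = compare_outputs(new, existing, "category")
print(f"agreement: {agreement:.0%}")

# Group disagreements so the most frequent ones can be reviewed by hand first.
for (old, cur), count in Counter((o, c) for _, o, c in diffs).most_common():
    print(f"{count} x existing={old!r} new={cur!r}")
```

A disagreement is not automatically an error on either side. In the toy data the two labels differ mainly in granularity, which is exactly the kind of case the success criteria agreed at kickoff should settle.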

TLDR

Evaluating any product built using AI or language models is imperative. For transaction enrichment specifically, customers should not rely on the accuracy numbers that providers publish on their websites and in marketing materials, as these are difficult to reproduce on your own data.

Evaluations do take time and resources, but at Ntropy we do much of the heavy lifting for you and can run a benchmarking process in as little as 48 hours. We offer free, lightweight benchmarking via our free trial, which is suitable for smaller companies with fewer resources.

For larger customers we offer a full, comprehensive benchmarking option to provide confidence in the accuracy of our enrichment capabilities for your use case and the typical data your business deals with.