Technical

22 May 2023

Efficiency and Maintainability in Named Entity Recognition: A Trie-based Knowledge Base Approach

Author

Chady Dimachkie

Engineer

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP) that involves identifying and categorizing specific entities such as people, organizations, and locations within text data. However, traditional NER systems often struggle with complex patterns and distribution shifts during continuous updates, which can result in high maintenance needs and increased costs.

To address these challenges, we developed an architecture called Knowledge Base NER (KB-NER) which can be used to extend existing models (Transformer-based or any other). This architecture leverages a trie-based knowledge base (KB) containing hundreds of thousands of entities to improve the accuracy, speed, cost, and maintainability of developing a NER pipeline. By using the knowledge base as a source of hints, the model can inject hints into its prompts, resulting in better performance.

This article provides a comprehensive overview of the KB-NER architecture, including its unique features and advantages over traditional NER systems. We also discuss the training and testing procedures for the KB-NER model, and its use cases in the domain of financial transaction enrichment (although it can be applied to any domain). Finally, we explore the limitations and potential for further development of this architecture.

By the end of this article, you will have a better understanding of how KB-NER works, its potential applications, and how to possibly implement it for your own use case.

Quick Named Entity Recognition (NER) explanation

The following bank transaction contains recognizable entities:

amzn mktp: Amazon Marketplace

us: United States

amzn.com/bill: Amazon’s billing short URL

wa: Washington state in the US

The NER task recognizes and extracts these entities so they can later be used to find more information. In this case, we can use them to find information about the transaction such as the merchants, location, and any individuals involved, as well as the time of the transaction.

What is Knowledge Base NER (KB-NER)?

Knowledge Base NER (KB-NER) is an approach that allows a model to store information outside of its weights, which it can reuse later and also focus on other things internally, such as pattern matching instead of string matching, which is what we usually want our NER models to do.

KB-NER involves linking a knowledge base (KB) to a NER model, which stores important information outside of the model weights. The knowledge base can contain vast amounts of information related to the entities that the NER model should recognize.

The model is then trained on a combination of labeled data and the same data with information from the knowledge base injected into it. During inference, the model can use the KB to supplement its decision-making process, allowing it to recognize entities more accurately and efficiently.

Furthermore, KB-NER can also reduce the amount of computational resources required to train the model. By storing important information outside of the model, the model can be made more compact and accurate. It also enables the model to learn new information on the go without any retraining needed, thus minimizing distribution shifts.

Let’s talk architecture!

The ultimate goal of KB-NER was to extend a NER model in a way that would not affect inference time, would improve performance compared to the current production system, and significantly improve maintainability.

To achieve this goal, we introduced a trie-based knowledge base to the model.

General idea

We want to be able to store an arbitrary amount of data, it could be millions of items, and we want no latency impact to inference. We also want to be able to add any new items whenever we want with instant feedback to the model without having to retrain it.

A trie is a tree-like data structure that is used to store and retrieve data quickly. It is commonly used for storing strings and is particularly efficient for searching for strings using prefixes.

The approach involves injecting hints into the input prompt before feeding it to the model. By injecting hints into the prompt, we provide valuable information about specific spans and their potential types. These hints serve as guiding cues for the model, although it’s important to note that they can occasionally be incorrect if bad data happened to be added to the knowledge base. Later on, we’ll explore strategies to enhance the model’s resilience to such uncertainties during the training process of the model.

Exploring the KB Trie in Depth

Delving into the inner workings of the KB Trie reveals its simplicity in building and usability. By inserting items into the trie, they become immediately accessible. If needed, the KB can accommodate augmented variations of the items. For instance, when dealing with organization names, suffixes like LLC or INC can be added to aid in matching longer spans.

The process of the KB Trie can be broken down into three steps.

Step 1: Breaking Text into Sequential N-grams

The initial step involves breaking down the input text into sequential n-grams. To illustrate this process, let’s consider the input transaction: paypal * google new york us. This text will be decomposed into various n-grams, ranging from unigrams to a maximum of six-grams:

Unigrams:

Paypal
*
google
new
york
us

Bigrams:

paypal *
* google
google new
new york
york us

Trigrams:

paypal * google
* google new
google new york
new york us

4-grams:

paypal * google new
* google new york
google new york us

5-grams:

paypal * google new york
* google new york us

6-grams:

paypal * google new york us

Here is a possible quick implementation of this step using itertools (inspired by this SO thread):

from itertools import chain

def n_grams(seq: list[str], n: int=1):
    """Returns an iterator over the n-grams given a list of tokens"""

    shift_token = lambda i: ((j, el) for j, el in enumerate(seq) if j >= i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_n_grams = zip(*shifted_tokens)

    return tuple_n_grams

def range_ngrams(list_tokens: list[str], ngram_range: tuple[int, int]=(1, 2)):
    """Returns an iterator over all n-grams for n in range(ngram_range) given a list_tokens."""

    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

This step is very fast and allows us to extract candidates that will be searched for in the knowledge base (KB). The sequential nature of the n-grams optimizes the use of the trie structure.

For very long texts, it is advisable to limit the size of the n-grams. You can easily set this limit based on the longest entity in the KB. In this simple case, the longest entity is New York US, so we would only consider trigrams. It should also be possible to dynamically find an approximate n-gram size by probing the trie first based on unigrams from the input text and looking at branches in the trie.

Step 2: Searching for Matches in Reverse Order

In this step, we take the previously extracted n-grams and search for them in reverse order, starting from the longest span and moving toward the shortest span in terms of word count. We examine if any matches can be found. Whenever a match is found, it is recorded, allowing us to prune shorter spans. Since the most precise possible hint has already been identified, there is no need to search those shorter spans again.

Let’s consider the example text: paypal * google new york us. Assume we have a KB Trie that only contains paypal, google, new york, and new york us. Here's how the process unfolds:

The KB Trie decomposes the input text in the same way as shown in step 1. It starts by checking spans in reverse order, beginning with paypal * google new york us, then paypal * google new york, and so on. The process continues until it encounters the trigram new york us. Once this span is reached, any subspans contained within new and us, inclusive, are pruned. Meaning that the New York entity will never be matched with any of the span candidates.

The KB Trie proceeds to test other spans until matches are found, if any. This is why it is beneficial to perform the search from longest to shortest spans.

A possible implementation using datrie looks like the following:

def longest_prefix(trie: datrie.Trie, ngram: str):
    """Get the longest prefix matching the ngram from the Trie"""

    # Find the longest matching prefix entity to the ngram using datrie
    found = trie.longest_prefix_item(ngram[1], default=("", None))
    
    # The score could use levenshtein distance or anything else, here we use an exact match for the sake of a simple example
    # It is possible to implement fuzzy matching so the KB is robust to typos and more
    # A more robust approach would be to use a Levenshtein automaton when constructing the trie
    score = ngram[1] == found[0]
    
    return (ngram, score, found)

def find_matching_ngrams(tx_filtered: str, ngram_range: tuple):
    """Find n-grams that match items recovered from the KB Trie"""

    # Use your own tokenization here
    tx_ngrams_n = [
        (ngrams[0][0], " ".join([n for _, n in ngrams]))
        for ngrams in range_ngrams(tx_filtered, ngram_range=ngram_range)][::-1]
    
    found_elems = sorted(
        filter(
            lambda r: r[1],
      # Previous function from the last section
            map(longest_prefix, tx_ngrams_n),
        ),
        # Sort by:
        # - number of n-grams
        # - match score (in this example it's only 1 or 0 since exact matching is used)
        # - len in letters of the word
        key=lambda x: (x[0][0], x[1], -len(x[0][1])),
    )

    return found_elems

Step 3: Hint boundaries injection into the prompt

Now that we were able to find all the candidate spans that can benefit from hints, we need to enclose them with hint boundaries.

But first, what form should those hints take? There are 2 ways to implement them, either you choose the simplest approach that requires the least amount of modification to already existing architectures (like RoBERTa, DeBERTa, etc) which is to just add pure text <beg_org> ... <end_org> which will describe where a hint span for an Organization entity starts and ends. However, one potential drawback is that these tokens may not resemble the usual tokens in your tokenizer and could be decomposed into numerous subtokens, resulting in a substantial increase in your prompt's length. Of course, here we chose a specific format that looks like this <beg_X> and <end_X>, but it can be anything that is recognizable and distinct enough from natural language or any other symbols.

The recommended second approach involves appending new tokens to your model’s existing tokenizer. These additional tokens will be recognized as single tokens by the tokenizer and will have their own token embeddings.

If you are using HuggingFace Transformers, here’s how you can extend the tokenizer’s vocabulary to accommodate these new tokens:‍

# Imports
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Initialize the base tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Here's a list of possible labels for your NER application
labels = ["ORG", "PER", "LOC", "DATE"] # Add more here

# Create the <beg_X> and <end_X> hint token strings
new_tokens = [f"<beg_{l.lower()}>" for l in labels]
new_tokens += [f"<end_{l.lower()}>" for l in labels]

# Extend the tokenizer's vocabulary
# `add_tokens` returns the number of newly added tokens
num_tokens = tokenizer.add_tokens(new_tokens)

print(f"We added {num_tokens} new tokens to the tokenizer's vocabulary!")

We also need to adjust the token embeddings of the model to accommodate the new tokens by resizing them:‍

# Initialize the base model
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME)

# `resize_token_embeddings` needs the final size of the tokenizer vocabulary
model.resize_token_embeddings(len(tokenizer))

# Let's assume the model is a simple `roberta-base` for simplicity
# This will differ depending on the base model you load
print(print("New token embedding size:", model.roberta.embeddings.word_embeddings.weight.shape[0]))

And finally, we can save the modified model and tokenizer which we can now use instead of the base ones for the future model:‍

tokenizer.save_pretrained(TOKENIZER_PATH)
model.save_pretrained(MODEL_PATH)

After the model and the tokenizer are ready, we can move on to injecting hints.

Injecting hints is all about string matching:‍

def inject_hints_to_text(
    tx: str,
    ngram_range: tuple[int, int]=(1, 4),
):
    """This adds the final hints to a piece of text"""

    # Use your own tokenization here
    # We only split the text for this example, but a lot more complex processing
    # could be done like separating letters from symbols, etc
    tx_filtered = tx.split()

    found_elems = find_matching_ngrams(tx_filtered, ngram_range)

    items_already_seen = set()
    for found_items in found_elems:
        index_item, score, item_label_found = found_items

        # Skip items we have already seen
        if index_item[0] in items_already_seen or not score:
            continue

        index_query, item_query = index_item
        len_query = len(item_query.split())

        # Compute end of the queried n-gram span
        end_index = index_query + len(item_query.split()) - 1

        # Record the items we have already seen
        for i in range(len_query):
            items_already_seen.add(index_item[0] + i)

        item_found, label_found = item_label_found

        # Example:
        # [BEG_TOKEN] uber eats [END_TOKEN]
        # Note: you can use any token you want here
        # I'm using: <beg_X> and <end_X> where X will be org, loc, per, date, etc
        tx_filtered[index_query] = f"<beg_{label_found}> {tx_filtered[index_query]}"
        tx_filtered[end_index] = f"{tx_filtered[end_index]} <end_{label_found}>"

    return " ".join(tx_filtered)

You can find the implemented KBTrie class here.

Now that this is implemented, the final labels need to be generated and there are several approaches. You could mark the special tokens as having the Other label, let’s take the same example as before paypal * google new york us with a KB that contains new york us as a location:

We’ll be using the BIO schema:

This works quite well and enables us to enclose the tags so the model can learn that those special tokens need to be ignored since they’re labeled with Other.

There is however an issue with the above approach and we’ll demonstrate this with another example. Let’s now remove new york us from the KB and add only new york, this is what happens:

As you can see, we have now broken the BIO scheme by leaving the us part outside the hint scope! Of course, it’d be possible to just remove any labels related to the hint tokens, but it’s not the ideal solution. This issue will happen during training where it might confuse the model about the structure of the BIO scheme, making it think that it’s allowed to break up the labels and this can incite the model to generate broken labels that shouldn’t be there.

A solution to this is to ask the model to treat the special tokens as normal tokens and tag them as what they should be. Let’s see what this looks like:

The above method will enable the KB to use partial hints, meaning that if you have this sentence the city of new york us, and only new york in your KB then the model will still be able to fix the labels and integrate sub-hints into the bigger picture of the full sentence.

This helps the KB become a more general contextual tool with better handling of unexpected cases.

Now we have a model that is ready to use an external knowledge base. Let’s move on to training such a model

Training the KB-NER architecture

As explained earlier, we needed an architecture that required minimal changes to the actual model itself. We also wanted the final model to be able to work exactly like a model trained without a KB. This means that even if the KB is empty or unable to provide relevant hints for the data at hand, the model will still perform optimally, avoiding excessive reliance on the KB and potential overfitting to specific hint patterns. It was also crucial for the KB to be contextual, ensuring that hints are considered based on their usefulness in a given context while disregarding irrelevant or nonsensical hints.

It’s worth noting that the optimal usage scenario for the model would depend on the intended application. If the goal is to have a compact model that relies heavily on the KB, then the KB should contain only verified, ground-truth elements. On the other hand, if the objective is to have a robust model that can perform well with or without the KB, the KB should be used primarily to teach the model about new entities without requiring retraining. This approach would make the model more stable, more resistant to adversarial attacks, and easier to maintain.

We will be explaining the training regime for the latter approach in the next section.

The Training Recipe 🧑‍🍳

Here’s a suggested training regime but you should keep in mind that this recipe offers flexibility, allowing you to adjust certain parameters to suit your specific needs and preferences.

The recipe consists of three key parts each contributing to the model’s overall robustness and adaptability:

Base Model Training: 60% of the training data comprises normal train data, representing the typical NER task without any hints related to the KB. This portion establishes the model’s baseline performance and ensures its competence in standard NER scenarios.
KB Task Training: To instill an understanding of the KB’s utility without over-reliance on it, we introduce hints gradually. 30% of the overall training data includes correct hints that emphasize the value of the KB as a useful resource, while also highlighting that it should not be considered an immutable source of information. Similarly, to encourage the model to adapt to evolving KB content, we employ a mixed approach. Specifically, 20% of the 30% of KB data incorporates current KB hints (let’s say you already have a KB with 10k elements, the hints will be limited to these only), compelling the model to align its labels with those found within the KB. The remaining 80% utilizes the standard ground truth (GT) labels converted to hints, allowing the model to plan for dynamic KB scenarios and retain its ability to interpret regular NER inputs.
Building Resilience: To enhance the model’s robustness against incorrect KB information, 10% of the overall training data contains wrong hints. By replacing correct labels with incorrect ones while preserving the correct GT, the model learns to discern and discount unreliable KB signals. Furthermore, we introduce random labels alongside correct GT labels, further fortifying the model’s resilience against erroneous KB input.

Here’s a quick summary of what happens in practice:

By training the model in this manner, we try to produce a model that maintains its original performance, possesses a comprehensive understanding of the KB task, and exhibits resilience when encountering unfamiliar scenarios. It is then up to you to populate the KB with good enough hints.

Training Results

After using the above training recipe, here are the f1 scores on our internal test set with a KB of 1k items.

The performance of the new KB-trained model remains consistent with the non-KB production model when the KB is empty. However, it shows a noticeable improvement of 2 points when the KB contains a clean 1,000 items. These 1k items were chosen based on popular transactions that we have processed, meaning that even if the selected samples used to fill the KB were chosen to not overlap with this test set’s transactions, some entities used in the test set can still be found in different contexts that will themselves appear in the test set. Examples of popular entities can be amazon or paypal for example, which are very common merchants.

Now, let’s explore the interesting outcome of the model with a simulated perfect KB. In this case, a synthetic KB is created by incorporating all the ground truth from the test set as KB hints. It is important to note that the model is not trained on this data; rather, the KB is populated with these hints.

Interestingly, although the model achieves a very high score, it does not reach a perfect F1 score of 1.0. This indicates two things: first, the model has not completely overfitted to the KB task, and second, it still exhibits some uncertainty in certain hints. This cautiousness suggests that the model is being careful or that some hints may have ambiguous meanings, which is actually a positive sign. It implies that by expanding the KB with high-quality items while maintaining good item selection criteria, the model's scores can be significantly improved. Furthermore, it is worth mentioning that adding items to the KB is a very fast process since, in the above implementation, it’s equivalent to adding an item to a Python dictionary (datrie uses dictionaries to implement the trie).

Let’s look at a few interesting test examples:

All the below examples are anonymized

Emerging Behavior

The model has been trained to always doubt the KB and make sure it uses the best of both its own trained weights and the hints from the KB when producing its output. However, this can sometimes backfire in some cases where the model just doesn’t believe the hint you provide!

Let’s look at an interesting behavior that can help circumvent this.

Let’s show an example:

Surprisingly, the model struggles to accept that ‘Cleveland Clark’ could be a person's name. While there is a location called Clark Fulton, which is a neighborhood in Cleveland and could potentially be abbreviated as ‘Clark Fulton Cleveland’ or even further simplified as ‘Clark Cleveland’ (as our models often encounter truncated, shortened, or obfuscated names), the model remains skeptical.

Interestingly, we have observed that the model can be persuaded to believe you by emphasizing the hints. Instead of providing <beg_X> ... <end_X>, you can offer <beg_X> <beg_X> ... <end_X> <end_X>, or even <beg_X> <beg_X> <beg_X> ... <end_X> <end_X> <end_X> and so on, even though it was not specifically trained in such a manner. However, we are currently not leveraging this technique in our production systems, as it deviates from the expected behavior of the KB. Nevertheless, we find it fascinating as an emerging behavior and believe that further investigation is warranted. Of course, there are cases where even accentuating the hints five times won't convince the model to believe you 😀

Summary: Advantages and Considerations of this Method

Let’s summarize the pros and cons of this method.

This one doesn’t rely on NER’s suggestions, meaning it is unbiased when giving a hint. It also requires minimal changes to the base NER model and enables you to basically apply instant fixes if you see something bad going on in your model inference. For example, in production, if we received user feedback we can instantly apply a fix to the model without worrying it’ll affect the rest of the distributions the model has learned after several fine-tuning rounds, otherwise known as distribution shift.

For example, here’s a subset of our UI that enables instant injection of hints for instant fixes:

With such tools on top of your models, it becomes very easy to maintain customer happiness which is crucial for every business.

Now, the hardest part is to populate the KB with plenty of useful and clean items by scraping or other approaches which can be domain dependent. Filling the KB with mostly bad items will definitely harm the model since it might confuse it even if the model is trained to be robust to this. Especially as we are using neural network-based models that are affected by adversarial attacks which we want to avoid at all costs.

One last thing to mention is that even if the KB is contextual when being used by the model it is not when doing the injection of hints in the prompt. This could also be an issue if you’re injecting words that could be different entities depending on the context but the KB only contains one label for it. However, there are ways to circumvent this, for example by implementing patterns or regex that would help inject a hint as one or another entity label depending on typical words usually found. Fortunately, those are usually very specific cases and we can definitely rely on the model to understand which one it should be depending on the context, even if the wrong hint is injected. In the worst case, fine-tuning on this specific case might be needed.

Further Improvements or Usages

The proposed implementation is quite straightforward and there are many possible ways to improve it:

Implement pattern and regex detection for finer-grained hint injection in the prompt.– Example: Let’s say your model doesn’t know zip codes, you can inject patterns that recognize this range of numbers
Implement fuzzy matching of items– Using Levenshtein distance– Or ideally using a Levenshtein automata– Mixing several types of matching like the above and also vector search for other kinds of metadata
This method was developed a while ago but, in an era of LLMs, this method could be adapted to inject knowledge with actual sentences or metadata hints instead of just labels. Example:– You could inject data based on keywords that the LLM might not know about or just inject more context. This could look like the following (prompts could be further optimized 😀) :

Do NER for the following bank transaction: Notes on some entities in this text: - TOKEN TRAN is an organization - LUXEKEY LLC is an organization ACH DEBIT TOTDENTA2 TOKEN TRAN LUXEKEY LLC B 3331231455643 187054

Or even:

Do NER for the following bank transaction:

Notes on some entities in this text:
- Token Transit lets you ride public transit with ease and convenience
- LuxeKey is a local and professionally managed boutique vacation rental and property management operation based in Paradise Valley and Scottsdale

ACH DEBIT TOTDENTA2 TOKEN TRAN LUXEKEY LLC B 3331231455643 187054

The End

Thank you for taking the time to read this article. We hope that it has provided valuable insights into why this method works well for us and could be useful to you as well.

Should you have any further questions or suggestions, please feel free to reach out!

Facebook

Twitter

Copy link

The unreasonable effectiveness of combining datasets

24 Apr 2020