Thymo ter Doest
Machine Learning Engineer
At Ntropy, we deal with a large volume of financial transactions represented as plain text. We train models on these transactions from hundreds of businesses and financial institutions, across a range of languages and from countries around the world. Accordingly, these transactions follow a multitude of different standards and formats. One technique that proved to be useful in our Machine Learning pipeline, is large scale sentence clustering. It can be used to
- Label data faster
- Reduce the number of labels needed
- Debug structural model mistakes
- Monitor data drift
- Help clean up noise from labelled datasets
In this article, we will first go into more detail on these practical use cases of sentence clustering in an ML setting. Then, we will explain how sentence clustering works and how we were able to scale up an existing sentence clustering algorithm that didn’t scale beyond 100k+ sentences, to one that can handle millions of sentences.
How to leverage sentence clustering
Many natural language processing tasks are related to data that is represented as a sentence or a short piece of text. For example, in our context, this is the textual description of a financial transaction. In the case of chatbots, this is a user question and in the context of search engines, these are user queries. Often, we need to train supervised learning models on this data. It can be incredibly useful to leverage sentence clustering before, during and after training these supervised models. With sentence clustering, we refer here to finding groups of semantically similar sentences.
A randomly ordered set of sentences transforms into a grouped set of semantically similar sentences.
Imagine we are building a classification model where we want to assign user search queries to a predefined list of topics. We could take an off-the-shelf language model and finetune a classification head on it. For any supervised model like this, we need labelled data. In general the more (high quality) labelled data, the better model performance. Labeling data is time-consuming and expensive; sentence clustering can help make this faster and more efficient.
Faster data labeling
One of the most time-consuming but important processes of any machine learning workflow is data labeling. Raw unlabelled data is widely available but high-quality data, annotated in an exact way for the specific task at hand, is scarce. Many ML teams either label data in-house or outsource it to an external labeling organization that distributes this to a pool of annotators. A naive approach is to annotate the data one by one, without any particular order, in a data labeling interface.
A smarter approach when dealing with sentence data is to first cluster the sentences and then show them in order, so that semantically similar sentences will be seen after each other. The assumption here is that semantically similar sentences (i.e. paraphrases) often have similar labels. Even if that assumption does not always hold, it will help a human annotator to label faster as there are fewer mental context switches.
An even faster approach to label in these cases would be a labeling UI that allows annotating multiple samples at the same time instead of one by one. Again, the clustering does not have to be perfect as a human annotator will still inspect the samples manually.
A potentially even faster way of labeling can be achieved by using (weak) labeling functions. This technique of creating heuristic rules to automatically label samples was popularized by the Snorkel framework. With this framework, annotators do not have to label samples individually but can leverage multiple rules of thumb, which can be noisy. For example, if the sentence contains the word “egg”, always assign it to the category “food”. A probabilistic Snorkel model will then be trained to solve conflicts and noise within these heuristic rules and can then be used to automatically annotate a large amount of unlabelled data.
Sentence clustering can be useful here too as it helps annotators find potential patterns that can be transformed in heuristics rules (i.e. weak labeling functions).
Reduction in the number of labels needed
The methods described above are ways to get a full unlabelled dataset annotated faster. However, maybe we do not need to label all samples from this dataset to get good model performance. Sentence clustering can be used as a way of active learning. It can help prioritize the data that should be labelled to have the highest impact for model training.
A naive but effective approach is to limit the number of sentences per cluster that should be annotated initially. For example, by picking the 5 most diverse sentences within each cluster. Especially for models that are based on large pre-trained language models, there is no need for hundreds of labels for semantically similar sentences to generalize well. The added value of labeling these paraphrases becomes marginal and can be easily avoided by using sentence clustering, saving a substantial amount of time and money. We also observed training times can be faster without losing performance, as there is less duplicate labelled data.
Building machine learning models is not a linear but cyclical process. A model can always be improved and needs to be re-trained when new (labelled) data comes in. Sentence clustering can help in a manual error analysis of the model assessing whether model mistakes are structural errors (i.e. recurring for similar patterns) or not. By manually checking the performance of the model on evaluation samples and grouping them by semantic similarity, we can potentially find “hard” clusters. This can then inform us about new model improvement ideas such as different model architectures or pre- or post-processing steps.
Data and model drift monitoring
An important issue with machine learning models that are used in production is data and model drift. This occurs when the distribution of the production data shifts relatively to the training data distribution. This could happen for many reasons, such as new types of customers, change in user behavior, or change in pre-processing pipelines of incoming data. This can be harmful as the models were not trained on this new type of data and might not be able to generalize well. It can often go unnoticed if not monitored correctly as ML teams often rely on fixed test sets which are not updated frequently to account for this distribution shift. Sentence clustering can be helpful here too. By checking whether sentence clusters in the production data align with the clusters in the training and test sets we have some proxy for detecting distribution shift.
The last application where sentence clustering proved to be useful for us is data cleaning. As machine learning engineers we often consider our labelled data as the ground truth or gold standard labels. However, in many cases, this ground truth data contains many mistakes and inconsistencies that could confuse our model during training. This is only natural as humans, like machine learning models, make mistakes. Often, large datasets are labelled at multiple time periods by distinct annotators, leading to inconsistencies.
We can employ noise-robust training techniques that try to reduce the impact of noise in a labelled dataset, but sometimes it is better to inspect our labelled dataset manually and remove or correct noisy labels. By grouping our labelled dataset by sentence similarity, we can quickly inspect thousands of samples and look for inconsistencies.
Now that we have considered some useful applications of sentence clustering, let’s see how this can be achieved.
To cluster our sentences we need to define a similarity function between two sentences. For this, it is useful to first get a numerical representation for our sentences. There are many text embedding techniques out there, but for our use case, the Sentence-BERT (SBERT) model seemed to perform the best.
SBERT is a modification of the pre-trained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity
SBERT is trained in such a way that it maps sentences to a vector space such that semantically similar sentences lie close together in this space. We can then leverage simple similarity metrics such as cosine-similarity to assess whether two sentences are semantically similar.
The SBERT authors provide an easy to use sentence_transformers library that can be used out of the box to embed sentences. It provides a large amount of pre-trained models with different model sizes, architectures and language support. We found the standard recommended all-MiniLM-L6-v2 to give good results, while being computationally fast.
Now that we have sentence embeddings, we can go on and try to find clusters of these embeddings. Often, we do not know the exact number of sentence clusters we have in our dataset, so the K-means algorithm might not be optimal to use here. The sentence_transformers library provides an example of a community detection algorithm that quickly finds local communities of similar sentences in a large list of sentence embeddings. It has two hyper-parameters that can be used to control the tightness and minimum size of the clusters: the threshold t and minimum community size k.
The algorithm works as follows; it starts by comparing the similarity of all sentences with all other sentences by calculating the pairwise cosine similarities of the sentence embeddings. Then, for each sentence, it selects all other sentences that have a cosine similarity score bigger than the predefined threshold t. It then sorts all sentences by cluster size (i.e. the number of other sentences within the threshold) and keeps the biggest clusters that do not overlap with an existing cluster and have at least k sentences in it. The first sentence in each cluster is the most central sentence of that cluster. All other sentences have a cosine similarity of at least t with this sentence.
Clustering millions of sentences
The out-of-the box clustering algorithm described above worked well initially for our use case when we were dealing with a small amount of data (less than 50k samples). Once we tried to cluster more data, we kept on hitting the maximum amount of RAM available on our instances. To cluster 100k samples we needed to use the ml.m5.16xlarge AWS instance that was barely able to run this with 256 GiB of RAM.
Clearly, the original community detection algorithm provided by the sentence_transformers library did not scale well. As this algorithm calculates the full pairwise similarity matrix, it scales quadratically O(n²) with n being the number of samples. We wanted to cluster millions of sentences, so we adjusted the original algorithm to be able to scale up to millions of sentences. We came up with an iterative and agglomerative approach that each iteration follows the steps outlined below:
- Split up the full dataset in random batches of batch size 2500 and apply the original algorithm on each batch.
- Find and merge overlapping clusters by clustering the most central sentence of each obtained cluster (cluster head)
- For each merged cluster, filter out all sentences that are not within the threshold of the cluster head anymore
- For all unclustered sentences, do a nearest neighbor search with all the cluster heads and assign to the nearest cluster if it is within the threshold
This iterative and agglomerative approach will not result in the globally optimal clustering assignments. However, in practice, it seems to work well and is relatively fast as we can parallelize all the steps over multiple cores. We provide a notebook here with the exact implementation of this scalable sentence clustering algorithm, and an example of clustering 1 million Bing queries from the MS Marco dataset.
Sentence clustering is not a holy grail, but is a very useful tool to have in your arsenal as a machine learning engineer. Now you can also easily cluster millions of sentences and optimize your ML workflow.
Btw, we’re hiring!