In machine learning, the quality of the data used to train your model directly affects its performance.
To get high-performing models, you need training data that is relevant, consistent, and free of errors. The keys to achieving high data quality are being aware of the common quality errors, implementing processes to surface those errors, and iterating to fix them.
What can go wrong during data labeling differs from project to project. However, knowing the common pitfalls and making a conscious effort to avoid them can meaningfully improve the quality, and therefore the usefulness, of your data.
Instructions that are not written in clear, direct language can cause annotators to misinterpret what to annotate and how to annotate it. Creating crystal-clear instructions typically requires multiple iterations. We’ve shared our tips on writing good instructions here.
Edge and corner cases are examples of real-world domain variability that sit at the long tail of the distribution. Because they are rare, they show up when you least expect them, and your instructions may not cover them in enough detail for annotators to label them accurately.
There is no ultimate labeling tool that solves every use case. Sometimes poor label quality simply comes from a tool that cannot deliver the precision your task requires. It is always good practice to run a test batch when evaluating a new tool or solution provider.
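If you do run a test batch, one useful exercise is to score the tool's output against a small set of reference labels you trust. The sketch below does this for bounding boxes using intersection-over-union; the box format, file names, and values are illustrative assumptions rather than output from any particular tool.

```python
# Sketch: compare a test batch of bounding boxes from a new tool against
# reference boxes you trust, using IoU. The (x_min, y_min, x_max, y_max)
# box format and the example data are assumptions for illustration.

def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Hypothetical test batch: one reference box and one tool-produced box per item.
reference = {"img_001.jpg": (10, 10, 110, 210), "img_002.jpg": (50, 40, 150, 140)}
tool_output = {"img_001.jpg": (12, 8, 108, 205), "img_002.jpg": (70, 60, 160, 150)}

scores = {name: iou(reference[name], tool_output[name]) for name in reference}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: IoU = {score:.2f}")  # low scores flag items worth reviewing
```

Items with low scores are a good starting point for deciding whether the tool, the instructions, or the data itself is the problem.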
The demographics of your annotators, such as their education, background, and experience with the domain, can introduce bias into the labeled data. Having a diverse and well-trained workforce can help reduce this bias.
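One lightweight way to spot this kind of bias is to compare each annotator's label distribution against the overall distribution. The sketch below assumes a simple list of (annotator, label) records; the names and labels are made up for illustration.

```python
# Sketch: compare each annotator's label distribution with the overall
# distribution to spot systematic skew. Records and labels are illustrative.
from collections import Counter, defaultdict

# Hypothetical records: (annotator_id, assigned_label)
records = [
    ("ann_01", "positive"), ("ann_01", "negative"), ("ann_01", "positive"),
    ("ann_02", "negative"), ("ann_02", "negative"), ("ann_02", "negative"),
    ("ann_03", "positive"), ("ann_03", "neutral"),  ("ann_03", "negative"),
]

overall = Counter(label for _, label in records)
per_annotator = defaultdict(Counter)
for annotator, label in records:
    per_annotator[annotator][label] += 1

total = sum(overall.values())
for annotator, counts in per_annotator.items():
    n = sum(counts.values())
    print(annotator)
    for label in overall:
        observed = counts[label] / n
        expected = overall[label] / total
        # A large gap between observed and expected may point to annotator bias.
        print(f"  {label}: {observed:.2f} vs overall {expected:.2f}")
```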
Post-labeling, you might find that your dataset does not have the class distribution you expected, for example a heavy imbalance toward a few classes, which can hurt your model's performance.
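A quick class-count check right after labeling can catch this early. The sketch below assumes the labels are available as a plain Python list; the classes and the 10% threshold are illustrative.

```python
# Sketch: check class balance after labeling. The label list and the 10%
# threshold are assumptions; adjust both to your own schema and tolerance.
from collections import Counter

labels = ["cat", "dog", "dog", "dog", "cat", "dog",
          "bird", "dog", "dog", "dog", "cat", "dog"]  # illustrative labels

counts = Counter(labels)
total = len(labels)
for cls, count in counts.most_common():
    share = count / total
    flag = "  <-- under-represented" if share < 0.10 else ""
    print(f"{cls}: {count} ({share:.0%}){flag}")
```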
Even with an understanding of the common errors, your data can still suffer quality issues if you have no systematic approach to surfacing them. Below are a few methods used in the industry to bring label noise to light:
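As one example of such a method, measuring inter-annotator agreement on items labeled by more than one person is a common way to quantify label consistency. The sketch below uses Cohen's kappa via scikit-learn; the labels are made up for illustration.

```python
# Sketch: measure inter-annotator agreement with Cohen's kappa on items that
# two annotators both labeled. The example labels are made up for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
annotator_b = ["spam", "ham",  "ham", "ham", "spam", "ham", "spam", "spam"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # low agreement suggests noisy or ambiguous labels
```

Classes or batches with low agreement are usually the first place to look when deciding which instructions to rewrite or which items to send for review.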
With the insights gathered from conducting the quality assessments above, there are multiple actions you can take to improve the quality of your labeled data in subsequent iterations.
With so many factors affecting data labeling quality, thoroughly reviewing your labeled data to discover actionable insights for quality improvement can be a manual and tedious chore (especially if it’s a large dataset!).
We highlighted a few common data labeling errors and methods for surfacing label noise so that you can iterate on your labeling instructions and workflows for better quality.
Improving data labeling quality requires feedback and multiple iterations. This is why we built SUPABOLT, a data labeling platform whose Quality Insights tools make this iterative process easier and faster for you.
Start a test project for free today and discover new ways of improving your labeled data quality.
Level up your data-centric AI journey with quality insights.
Contact Us