
The definitive guide to data labeling automation


00 Contents
01 The case for data labeling automation
02 Three types of labeling automation
    Data unaware
    Data semi-aware
    Data fully aware
03 Pre-labeling: A labeling automation strategy proven to reduce time and effort
    Pre-labeling for ML projects across maturity levels
    Real-world examples
04 Automating beyond the labeling process
    Queueing data for labeling
    Creating custom workflows
    Seamless data transfers: SDKs & APIs
05 Why a data engine is essential for automating your AI processes
    Automated labeling services: Risk factors to consider
06 Labeling automation with Labelbox
01 The case for data labeling
automation
Developing and maintaining a performant service using supervised machine learning
requires a vast amount of high-quality training data. Labeling this data is often the most
time-consuming task in the machine learning process.

Research has shown that models perform better when they are trained on more data, iteration after iteration. Preparing and labeling all the necessary training data, however, can stretch a single iteration into weeks. Because a model's performance improves as it learns new information, faster iterations are key to a highly performant model.

Machine learning projects are more likely to succeed when they iterate quickly, as this allows teams to better identify and correct for biases in their datasets, and to add new datasets as use cases expand or as changes in the real world shift previous data distributions.

Labeling training data also requires a lot of human effort and expertise, so this part
of the process is typically also the most expensive. As the chart below illustrates,
costs increase linearly with the number of annotations, but model performance sees
diminishing returns as the number of labels increases. By way of example, if you can
get your model to 80% accuracy with five thousand labeled images, you might need an
additional ten thousand images to get the model to 90% accuracy.
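The shape of this trade-off can be sketched with a toy power-law accuracy curve. The constants below are invented purely so the curve roughly matches the 80%/90% example above; they are not measurements from any real project.

```python
def labels_needed(target_acc, a=42.8, b=0.63):
    """Invert a toy accuracy curve acc(n) = 1 - a * n**(-b) to find
    how many labeled examples reach a target accuracy. The constants
    are illustrative, not measured from a real project."""
    # acc = 1 - a * n**(-b)  =>  n = (a / (1 - acc)) ** (1 / b)
    return (a / (1.0 - target_acc)) ** (1.0 / b)

n80 = labels_needed(0.80)
n90 = labels_needed(0.90)
# The last ten accuracy points cost far more labels than the first 80.
print(f"~{n80:,.0f} labels for 80%, ~{n90:,.0f} for 90%")
```

Under these assumed constants, reaching 90% takes roughly three times the labels that reaching 80% did, which is the diminishing-returns pattern the chart below illustrates.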

Chart: labeling cost increases linearly with the number of human annotations, while model performance sees diminishing returns as the number of examples grows.

Labeling vast quantities of data quickly, efficiently, and accurately is an immense challenge, but advanced
machine learning teams have found a way to cut both labeling time and costs with an innovative solution:
labeling automation.

Ideally, every iteration should improve your model as much as possible, but traditional
workflows often have teams labeling a lot of data for small gains in model performance.
Advanced AI teams are finding ways to reduce the amount of training data required with
techniques like active learning and building data engines that can produce high quality
training data quickly. Automating parts of the labeling process is a key component of an
efficient data engine.

In this guide, you will learn about three proven categories of automation and how
each of them can improve — and sometimes hinder — the labeling process. We’ll also
cover pre-labeling, a promising automation strategy that’s already benefiting advanced
machine learning teams across industries.

02 Three types of labeling automation
There are three general categories of labeling automation based on how much the
method knows about the data you need to label. These three categories represent the
most promising commercially proven methods to automate data labeling.

1. Data unaware methods have no model of reality informing their understanding of the target domain; they're usually just tools that help create labels and annotations
2. Data semi-aware methods take into consideration a basic understanding of the task at hand through generic, public datasets
3. Data fully aware methods have an application-specific model of reality informed by ground truths — that is, they know exactly what your model needs to be trained on

There are myriad underlying algorithms that can be used for each method in each of
these categories. In many cases, multiple algorithms are used together to increase the
effectiveness of the method.

When we say that a method has an “awareness” of data, we mean how much the underlying algorithm(s) “know” about the data and the patterns of interest that the human labeler perceives and annotates — in other words, the extent to which the method understands the end state of the labeled data, i.e. the ground truth label.

Data unaware
Data unaware methods typically automate the process of drawing segmentation subsections between contrasted color boundaries. The algorithms used for this method, such as graph cut, are unaware of what the data actually is or what the labeler wants to label; they are entirely concerned with the numeric values of contiguous pixels.
This type of tool can make the process go a little faster for your labeling team, though
ML teams should be wary of spending more time correcting automatically generated
segments than actually drawing them from scratch.

Often, data unaware methods don’t account for several important factors, including
the skill of the labeler, the optimal segment type, and how well the labels translate
to production data. For ML teams and labelers confident in their skills and ontology,
however, a data unaware solution can help accelerate some of the repetitive tasks in the
labeling process. Though data unaware automation techniques do not always increase
efficiency over time, they can still be a great starting point for teams of any maturity
level looking to accelerate their labeling process.

Here are some common implementations of the data unaware method:

• Auto-segmentation
• Superpixel
• Extreme clicking
• Watershed

The performance of the data unaware method varies depending on the nature of the data and how closely the labels align to contrasted regions. For a given dataset, performance will also vary on each data point — and this variance is completely independent of the order in which the data is labeled. The average performance of the data unaware method is therefore static on a given dataset: as more data is labeled, the performance of this method remains the same.
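To make the idea concrete, here is a minimal, hypothetical sketch of a data unaware segmenter: a plain flood fill that groups contiguous bright pixels. It is not the algorithm behind any particular tool, and like all data unaware methods it consults nothing but pixel values.

```python
from collections import deque

def unaware_segments(image, threshold):
    """Group contiguous above-threshold pixels into segments with a
    4-connected flood fill. The method knows nothing about what the
    pixels depict; it only compares numeric values."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    segments = []
    for y in range(h):
        for x in range(w):
            if image[y][x] > threshold and not seen[y][x]:
                queue, segment = deque([(y, x)]), []
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    segment.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx]
                                and image[ny][nx] > threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                segments.append(segment)
    return segments

# Two bright 2x2 blobs on a dark 8x8 background.
img = [[0] * 8 for _ in range(8)]
for y, x in [(1, 1), (1, 2), (2, 1), (2, 2), (5, 5), (5, 6), (6, 5), (6, 6)]:
    img[y][x] = 1
print(len(unaware_segments(img, 0.5)))  # 2
```

Whether those two segments correspond to anything a labeler cares about is exactly what this class of method cannot know.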

Data semi-aware
Data semi-aware algorithms are loosely structured for your use case. These tools typically automate labeling for a new dataset by drawing on a generic or public set of labeled data (like COCO) that resembles the data your model needs. The labeling team can then correct or adjust the generic labels instead of doing all the labeling on their own — saving hours of time.

Model performance is usually better when machine learning teams use a data semi-
aware algorithm as opposed to a data unaware one, but each image in the training
dataset still needs to be adjusted for your model’s exact needs, and this process will
still take more labeling time as the model iterates, since it will require exponentially
more training data with every iteration.

In practice, the data semi-aware method can provide immediate performance for some data and enable ML teams to leap over some of the initial labeling work required when beginning a new project. With this approach, automation performance is usually static over the data, or varies (sometimes unfavorably) after a few iterations as the labeling task moves away from common object labeling and toward the machine learning system's evolving needs. The method can be useful for getting your model to a passable level of performance — say, 80% accuracy — but improving it beyond that might require a more data-aware automation strategy.
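A semi-aware setup can be sketched as follows. The `generic_model` callable stands in for any off-the-shelf predictor (a COCO-pretrained detector, say); the names and the confidence threshold are assumptions of this sketch, not a specific product's API.

```python
def pre_label(assets, generic_model, confidence_floor=0.5):
    """Draft labels with a generic model so humans correct rather
    than draw from scratch. `generic_model` is any callable that
    returns (label, confidence) pairs for an asset."""
    drafts = {}
    for asset in assets:
        # Keep only confident predictions; everything else is left
        # for the labeler to annotate manually.
        drafts[asset] = [label for label, conf in generic_model(asset)
                         if conf >= confidence_floor]
    return drafts

def stub_model(asset):
    # Stand-in for an off-the-shelf detector pretrained on COCO.
    return [("vehicle", 0.9), ("person", 0.3)]

drafts = pre_label(["img_001.jpg"], stub_model)
print(drafts)  # {'img_001.jpg': ['vehicle']}
```

Because the generic model never learns your ontology, the share of drafts needing correction stays roughly constant, which is why this method's gains plateau.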

Data fully aware
We’ve established that the more aware of your data your labeling automation method
is, the more successful it will be — reducing time by hundreds of hours, and cutting
significant costs. The data fully aware method uses algorithms that have been
trained on your dataset. Data fully aware methods are the most adept at improving model performance with less human input over time.

Typically, an ML model is continuously retrained on the labeled data with each iteration,
so that automation performance goes up over time. With data unaware and data
semi-aware methods, the human time required to label data stays the same iteration
after iteration; with a data fully aware method, human effort reduces as the number of
iterations increases.

Chart: human work required vs. number of human annotations for the data unaware, data semi-aware, and data fully aware methods.

As the number of iterations increases, the data fully aware method is the only automation method that consistently reduces human work.

This is particularly useful when labeling requires expensive expertise. For example, to
train a model to detect cancer cells, a pathologist might mark a contiguous group of
cancer cells on an image as a first step, and then a less-skilled group of workers might
outline the malignant cells.

In this case, a predictive model trained on the pathologist-labeled data can learn to pre-
label groups of cells as cancerous, speeding up the work of the pathologist. A second
predictive model trained on the data in which groups of cells have been outlined, can
pre-draw segmentation boundaries around those cells, speeding up the second step as
well.

Here’s another example of the data fully aware method: a computer vision model built
to diagnose lung cancer might need to be trained on lung scans labeled by experienced
radiologists. The labeling task is expensive and time consuming for the first dataset, but
over time, the model itself can be used to pre-label the scans. Once the pre-labeling
is accurate enough, the task can be turned over to less skilled – and less expensive –
workers to verify the automated labels, greatly reducing time and expenses.

Both of these examples show the benefits of pre-labeling, where models are used to
train models. Models used for pre-labeling can be previous versions of the model you
are currently training or any model trained to perform a similar task. As long as the pre-
labels they generate reduce the amount of time a labeler has to spend on an asset, they
can help accelerate the labeling process.
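The dynamic described above can be sketched numerically. Every constant here is invented for illustration; the point is only the shape of the curve, not the specific values.

```python
def labeling_iterations(n_iters, initial_accuracy=0.3, gain=0.2):
    """Toy model of why fully aware pre-labeling cuts human work.

    Assumes the share of pre-labels a human must fix equals the
    model's error rate, and that each retrain closes `gain` of the
    remaining gap. All numbers are invented for illustration."""
    accuracy, human_work = initial_accuracy, []
    for _ in range(n_iters):
        # Humans only correct the pre-labels the model got wrong.
        human_work.append(1.0 - accuracy)
        # Retrain on the corrected labels; accuracy improves.
        accuracy += gain * (1.0 - accuracy)
    return human_work

work = labeling_iterations(4)
print([round(w, 2) for w in work])  # shrinks every iteration
```

With data unaware or semi-aware methods, the equivalent list would be flat: the human share of work never declines because the automation never learns from the corrections.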

03 Pre-labeling: a labeling
automation strategy proven
to reduce time and effort
The benefits of pre-labeling are clear and significant.

1. It’s faster. With each iteration, the human work required for labeling decreases
even as the amount of training data required increases
2. It’s less expensive. In use cases that require specialized experts to label data, teams can cut costs by 50% to 70%
3. It delivers better model performance. Pre-labeling is the only automation
method that speeds up the labeling process over time, enabling faster
iterations and leading to a more accurate, performant model

Pre-labeling is the most effective labeling automation method. We know that labeling
effort scales linearly with the number of annotations, and that the amount of labeled
data required increases exponentially with each iteration of the model — presenting a
clear incentive for ML teams to increase labeling efficiency. But no automation method
is going to know your specific use case better than the model you’re building.

Your model, trained on your own data, will train faster and become much more performant than any off-the-shelf algorithm trained on a generic dataset.
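A back-of-the-envelope cost comparison makes the compounding effect visible. The batch sizes, assist fraction, and decay rate below are illustrative assumptions chosen to land inside the 50% to 70% savings range cited above, not measured figures.

```python
def cumulative_cost(batches, cost_per_label, assist=0.0, decay=0.0):
    """Total human labeling cost over successive batches.

    With assistance, the effective per-label cost starts `assist`
    lower and shrinks by `decay` each batch as pre-labels improve.
    The numbers are illustrative only."""
    total, rate = 0.0, cost_per_label * (1.0 - assist)
    for batch_size in batches:
        total += batch_size * rate
        rate *= 1.0 - decay
    return total

batches = [1000, 2000, 4000, 8000]  # data needs grow each iteration
manual = cumulative_cost(batches, cost_per_label=1.0)
assisted = cumulative_cost(batches, 1.0, assist=0.3, decay=0.3)
print(f"savings: {1 - assisted / manual:.0%}")
```

Because later batches are both larger and cheaper per label under assistance, most of the savings accrue in exactly the iterations that would otherwise cost the most.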

Charts: cumulative labeling cost vs. number of human annotations for unassisted, editor-assisted, and model-assisted labeling; and model performance vs. labeling cost when pre-labeling with successive model versions (v0, v1, v2).
Pre-labeling increases model performance, reduces the number of human annotations required, and
reduces costs with each iteration, unlike less aware automation methods.

Pre-labeling for ML projects across maturity levels
While pre-labeling might seem at the outset to be a viable solution only for advanced ML teams who already have models in production, it can actually be used for any project. Even programmatically generated labels — in which a software program makes an initial attempt at the labeling task — can shave off seconds or minutes from labeling time per data asset, translating to hours of time savings on labeling a whole dataset. ML teams can also use an off-the-shelf model created for a similar or less complex version of their own use case as a starting point for generating pre-labels. Once your team has an initial model, you can swap out pre-labels created via the generic model or program for labels generated by the model in training.

Real-world examples
Pre-labeling with heuristics: crops vs. weeds
A large agtech enterprise is training a machine learning model to differentiate between
types of crops and weeds based on images of fields. These images usually also show
dirt and rocks. At first, to produce segmentation masks, they had a labeling workforce
do a first pass to label the crops and weeds in the images. This was followed by a
second pass by a team of botanists, who reviewed and corrected the labels.

This soon became an extremely time consuming process as the model required more
and more labeled data, so they adopted two pre-labeling solutions to speed up the
process, using Labelbox’s model-assisted labeling workflow.

First, they imported a mask generated by their NDVI sensor, which picks up organic
plant matter indiscriminately. Then they had their workforce correct and further
differentiate between weeds and crops, as they now knew where to focus their efforts.

Second, they trained a model using this labeled data and imported these model
predictions into Labelbox. Their labelers and botanists were then able to complete final
reviews and fine edits, allowing their model to improve even more.

Diagram: an NDVI sensor mask seeds human labeling and review; each model release then feeds its predictions back as pre-labels for the next round.

This agtech company used pre-labeling, via the model-assisted labeling workflow in Labelbox, to speed up
the labeling process and cut their costs in half.

Today, they continue to import these model predictions back into Labelbox, so that the
labeling process goes faster with every iteration as their model becomes more accurate.
By using model-assisted labeling, this agtech company was able to cut their labeling
costs in half.

Pre-labeling with related models: ships in the sea
An ML team was training a model to detect moving ships in satellite images of large
swaths of the ocean. At the beginning, their labeling team was presented with huge
images, where ships appeared as tiny, hard-to-spot dots. The task of labeling them was
tedious and difficult.

They soon realized, however, that moving ships leave wakes, which form large white
patches on a sea of blue, making them easier to spot than the ships themselves. Using
a combination of heuristics (like finding a white patch on blue) and human labeling, the
ML team was able to produce weak labels to find wakes on the images, which they used
to train a wake detection model.
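A heuristic like "white patch on blue" can be sketched in a few lines. The threshold below is an invented placeholder, not the team's actual value, and a real implementation would group pixels into patches rather than flag them individually.

```python
def wake_pixels(image, white_min=200):
    """Weak-label heuristic: flag whitish pixels (a wake) against the
    darker sea. The threshold is illustrative, not the team's value."""
    hits = []
    for y, row in enumerate(image):
        for x, (r, g, b) in enumerate(row):
            if r >= white_min and g >= white_min and b >= white_min:
                hits.append((y, x))
    return hits

# A 3x3 patch of sea with one bright wake pixel.
sea = [[(30, 80, 200)] * 3 for _ in range(3)]
sea[1][1] = (255, 255, 255)
print(wake_pixels(sea))  # [(1, 1)]
```

Labels produced this way are noisy by design; their job is only to be good enough, at scale, to bootstrap the wake detection model.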

With this model labeling wakes on the satellite images, the labelers only had to confirm
that there was a ship at the end of the wake. This strategy helped the team drastically
cut down labeling time and improve accuracy.

The team then used this data to train their ship detection model, which they in turn used to pre-label data for the next iterations, speeding up the labeling process even more.

Diagram: human and heuristic labeling trains the wake detection model, whose pre-labels feed human labeling and review for the ship detection model across successive model releases.

In the example above, this ML team first trained their ship-finding model with another, similar model — and
then used their own model to generate pre-labeled data for further iterations.

04 Automating beyond the labeling
process
While automating your labeling process, especially with pre-labeling, can result in
significant time and cost savings for your AI team, it’s not the only kind of automation
that you can leverage as you label data. There are several steps in the process that
often require manual intervention:

• Getting unstructured data in front of labelers


• Incorporating quality management methods such as consensus, benchmarking,
and SME reviews
• Pulling data from storage for labeling and moving labeled data into your model
training backend environment

All these processes can be optimized and accelerated with automation that eliminates the need for human intervention. By ensuring that people only interact with data when they absolutely must, and that these interactions are easy and efficient, AI teams can realize even more time and cost savings.

Queueing data for labeling


Many AI teams rely on filesharing practices that quickly become unwieldy as their data
labeling efforts scale. While sending data to your labeling team via a flash drive and a
spreadsheet for tracking can work for small datasets, the extra manual effort required
will be exorbitant for larger AI projects. Even with small datasets, however, such a
workflow will often have your labelers idling as they wait for their next dataset to label.

An automated queueing system can assign work to labelers in real time without
requiring any manual filesharing and ensure that labelers will be assigned a new
labeling task as soon as they finish one, eliminating extra time spent waiting for
assignments.
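The core of such a queue fits in a few lines. This is a deliberately minimal sketch (names invented here), not the design of any particular labeling platform.

```python
from collections import deque

class LabelQueue:
    """Minimal automated labeling queue: labelers pull work on demand
    instead of waiting for manually shared files (sketch only)."""

    def __init__(self, assets):
        self._pending = deque(assets)
        self._assigned = {}

    def next_task(self, labeler):
        """Hand the labeler a new asset the moment they are free."""
        if not self._pending:
            return None
        asset = self._pending.popleft()
        self._assigned[asset] = labeler
        return asset

queue = LabelQueue(["row_1", "row_2"])
print(queue.next_task("alice"))  # row_1
print(queue.next_task("bob"))    # row_2
print(queue.next_task("alice"))  # None: nothing left to label
```

The key property is that assignment happens at pull time, so no labeler ever sits idle waiting for a dataset handoff.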

Creating custom workflows


The queueing system can further be optimized with customizable workflows for the
specific quality assurance practices required for each AI project. Teams relying on
consensus, for example, can have one datarow assigned to multiple labelers. Those who require reviews by subject matter experts can incorporate these reviewers into the queueing system as well. Different slices of data — for example, all datarows from
a particular source — may require special treatment for quality assurance. A queueing
system that enables custom workflows for these data slices can also help AI teams
produce high-quality training data without the extra delays that such requirements
would typically cause.
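Consensus itself reduces to a small voting rule. This is a generic sketch of the idea, with an invented agreement threshold, rather than any platform's specific scoring logic.

```python
from collections import Counter

def consensus_label(votes, min_agreement=2):
    """Resolve one datarow labeled by several labelers.

    Returns the majority label once at least `min_agreement` labelers
    agree; otherwise None so the row can be escalated to an expert."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None

print(consensus_label(["cat", "cat", "dog"]))  # cat
print(consensus_label(["cat", "dog"]))         # None: escalate for review
```

Routing the unresolved rows to a subject matter expert queue is exactly the kind of custom workflow step described above.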

Diagram: labeling task → review task 1 → review task 2 → done, with rejected datarows routed back to a rework task.

Custom workflows can help teams produce high-quality training data, fast.

Custom workflows provide a systematic way to understand the status of datarows and can help teams easily find which datarows are ready to be used for model training,
eliminating the need for manually tracking this information with spreadsheets or
other ad-hoc tools. Teams can also create multiple groups — for example, your core
labeling team, or subject matter experts — and present data to them at specific stages
in the workflow by configuring custom review steps. Datarows that are rejected during
review tasks can automatically be sent into a rework queue without any further manual
intervention, so teams will only need to manually revisit a review task if there’s a
specific issue that arises.
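The rejection-to-rework routing is simple to express as a tiny state machine; the class below is a hypothetical sketch of the behavior, not a vendor API.

```python
from collections import deque

class Workflow:
    """Sketch of a review workflow: reviewed rows either move on or
    drop into a rework queue automatically, with no manual routing."""

    def __init__(self):
        self.rework = deque()
        self.done = []

    def review(self, datarow, approved):
        if approved:
            self.done.append(datarow)
        else:
            # Rejected rows queue for rework without human intervention.
            self.rework.append(datarow)

flow = Workflow()
flow.review("row_7", approved=True)
flow.review("row_8", approved=False)
print(flow.done, list(flow.rework))  # ['row_7'] ['row_8']
```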

Seamless data transfers: SDKs & APIs


Manually transferring data — uploading it into your labeling tool, distributing it to
your labelers, downloading it from the labeling tool and into your model training
backend — can be slow. Worse, it’s a task that no data scientist or ML engineer wants
to be stuck with. While a data queueing system takes care of distributing data to your
labelers, using SDKs and APIs to programmatically move data seamlessly through
your AI infrastructure is integral to accelerating your data labeling and model training
processes.
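Programmatic transfer usually means serializing predictions into a bulk import format such as newline-delimited JSON. The field names below are illustrative; a real SDK or API defines its own schema.

```python
import json

def to_ndjson(predictions):
    """Serialize model predictions as newline-delimited JSON for a
    bulk pre-label import. Field names are illustrative; a real SDK
    or API will define its own schema."""
    lines = []
    for pred in predictions:
        lines.append(json.dumps({
            "dataRow": {"id": pred["data_row_id"]},
            "name": pred["label"],
            "bbox": pred["bbox"],
        }))
    return "\n".join(lines)

payload = to_ndjson([{
    "data_row_id": "dr_123",
    "label": "ship",
    "bbox": {"top": 10, "left": 20, "height": 5, "width": 8},
}])
print(payload)
```

One line per annotation keeps the payload streamable, which matters when an iteration uploads pre-labels for tens of thousands of assets.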

05 Why a data engine is essential for
automating your AI processes
While the benefits of automating parts of the labeling process are clear, it does present
its own challenges. A machine learning team that chooses this method must find a way
to connect their ML systems securely with the labeling tools and workflows, so that
labeled data can be easily fed to the model and the model’s output data can be brought
back to the labelers.

Diagram: pre-label assets → send to labeling team → complete labels → train model → get model results, with model-assisted labels flowing back to the pre-labeling step.

The model-assisted labeling workflow pulls model output back into the labeling process for review and
correction.

This is why labeling automation, including pre-labeling, is best achieved through the use
of a data engine, the foundational infrastructure for how team members interface with
data and models in order to build better AI products. A data engine effectively creates
a loop between the labeling process and machine learning systems, which results in a
quick and efficient iteration cycle — a key component to improving model performance
and gaining competitive advantage through powerful AI solutions.

A data engine enables and accelerates three vital functions: data prioritization, labeling
and annotation, and model evaluation and training.

Data prioritization. A data engine empowers AI teams with a way to visualize all
their data, both labeled and raw. Visualizing your data can help your AI team innovate
new use cases, identify patterns and gaps, and find the data that can best improve
your model at its current performance level. Pairing a carefully curated dataset with
automation techniques like pre-labeling and auto-segmentation can save significant
amounts of time — not only for your labelers, but for your AI training process as a whole.

Labeling and annotation. A data engine provides seamless queueing with multiple
customizable workflows, intuitive labeling editors with built-in collaboration tools for
any modality, and live metrics for quality management and labeler performance. These
features not only ensure a smoother workflow for the entire labeling team — they also
make it possible for teams to quickly generate high-quality training data for all AI
projects and use cases.

Model evaluation and training. A key part of training an ML model is to understand its
strengths and weaknesses after every iteration. A data engine empowers AI teams with
powerful model diagnosis tools to visualize and pinpoint areas of low confidence so you
can find and fix labeling errors.

A data engine helps machine learning teams manage their training data workflows in
one place. For the purpose of labeling automation, this essential solution provides two
important benefits:

• Fast and easy setup for a pre-labeling workflow. Creating this workflow can take
hours or even days of effort without the right technology. A data engine can have
your pre-labeling workflow set up seamlessly within a matter of minutes.
• Smart labeling tools like automated segmentation. Whenever your labelers need
to create new segments or correct pre-labeled ones, these smart tools can help
them do it quickly and easily.

No matter how you train your model — whether it’s with traditional supervision, active
learning, weak supervision via pre-labels, or another method — using a data engine will
help your data labeling and model training processes move faster and more efficiently.

Model training methods

Traditional supervision: Subject matter experts hand-label training data. This method is expensive and time consuming — but usually results in higher quality training data and a more performant model.

Active learning: Train the model only on the most informative data — usually, the examples that the model is having the most difficulty with.

Transfer learning: A model developed for one task is reused as the starting point for a model on a second, related task or domain.

Weak supervision: A way to use lower quality labels. ML teams could:
1. Have non-experts create labels, which are then checked over by experts
2. Use heuristics to speed up the initial labeling
3. Use another model (off-the-shelf or a previous version of the model in training) to pre-label data, which may then be corrected by labelers/experts.

Automated labeling services: risk factors to consider
It is possible to do pre-labeling without using a data engine. Many vendors now offer
automated labeling services built around data semi-aware automation methods that
quickly become fully aware as their model learns more about your labeling needs. While
these services often seem like efficient solutions, they present risks that ML teams
should be vigilant about.

When a service automates labeling for your model, your model will only ever become as performant as theirs — meaning that it's limited by the level of service they provide. By using disparate tools throughout the training and iteration process, your team might also be unintentionally making it harder than necessary to put quality control measures such as benchmarking and reviewing in place. They may need to manually deliver the labeled data from the automated service to the reviewers, or to a separate labeling team for benchmarking, and then transfer it again to the model itself.

Your data and model are important pieces of IP. An automated labeling service will have access to this data, and if the vendor is training their model on your dataset, they will effectively be creating another instance of your model that you have no control over or visibility into. While speeding up the training process and reducing costs are important factors for ML teams to consider, it's even more important to secure your work against potential competitors.

Using a data engine to implement model-assisted automation will ensure that your
model’s performance will never depend on the tools or services you use. It will also
give your team full visibility and control over every aspect of the training process, from
the dataset to the labeling team to the new pre-labeled data generated by your model.
Neither the data engine nor the labeling team will ever get a copy of your model.

06 Labeling automation with Labelbox
Labelbox makes it simple and fast to integrate labeling automation with your existing
training data processes. Our state-of-the-art labeling UI includes editor-assisted tools
that can jump start labeling, saving significant labeling time and costs.


The auto-segmentation tool allows labelers to create segments in seconds. Examining and correcting these
automatically created segments instead of drawing them from scratch saves valuable labeler time.

The Labelbox pen tool makes image segmentation quick and easy.

Using the auto-segmentation and pen tools can save minutes of labeling time, which can translate to hours
or days when applied to large labeling projects.

It’s also easy to implement a pre-labeling workflow with Labelbox. This data engine offers a flexible model-assisted labeling solution that integrates the process into your labeling operation in three simple steps.

1. Bulk upload pre-labeled data into Labelbox using our Python SDK with just a few
lines of code.
2. Your labeling team will then have access to the new dataset and can adjust or
correct the annotations.
3. Loop the training dataset with edited or approved labels back into training your
model. Then repeat.
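The three steps can be sketched end to end. The client below is a stub, and the method names are illustrative stand-ins rather than the actual Labelbox SDK API; see the Labelbox documentation for the real calls.

```python
class StubClient:
    """Stand-in for an SDK client; real method names will differ."""
    def __init__(self):
        self.uploaded, self.corrections = [], {}

    def upload_annotations(self, pre_labels):
        self.uploaded.extend(pre_labels)

    def export_labels(self):
        # Pretend labelers corrected every uploaded pre-label.
        return [self.corrections.get(p, p) for p in self.uploaded]

def mal_round(client, model, assets):
    client.upload_annotations([model(a) for a in assets])  # 1. bulk upload
    corrected = client.export_labels()                     # 2. adjust/correct
    return corrected            # 3. feed these back into model training

client = StubClient()
client.corrections = {"ship?": "ship"}
print(mal_round(client, lambda a: "ship?", ["tile_9"]))  # ['ship']
```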

With every iteration of your model, your labelers will have less work to do as the pre-
labels become more accurate and precise.

Many Labelbox customers have also implemented creative uses for model-assisted
labeling. Some AI teams, such as Blue River Technology and VirtuSense, have
developed models specifically for generating pre-labels. The former was able to cut
their labeling time in half. The latter saved 20% in labeling time and were empowered
to significantly scale up their labeling operations while maintaining only a small internal
team of labelers.

You can find more detailed instructions on the process in the Labelbox documentation.

Alongside enabling model-assisted labeling, Labelbox also provides a suite of tools and
services for all your training data needs.

Diagram of the Labelbox data engine. Annotate: label data across all data modalities, with AI-assisted labeling, quality control, and internal teams & external vendors. Model: improve your data and your models with data versioning, active learning & error analysis, training integration, and data management. Catalog: search, explore, visualize, and analyze data, metadata, and model predictions across all data sources.

The Labelbox data engine delivers all the capabilities you need to manage your collaborators and
workflows, annotate your datasets, and iterate on your model.

Try Labelbox today


Get started for free or see how Labelbox can
fit your specific needs by requesting a demo.

Try Labelbox for free

