A Survey on Data Collection for Machine Learning: A Big Data – AI Integration Perspective
Yuji Roh, Geon Heo, Steven Euijong Whang, Senior Member, IEEE
Abstract—Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are
largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we
are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep
learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts
of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and
computer vision communities, but also from the data management community due to the importance of handling large amounts of data.
In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely
consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these
operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of
machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration
and opens many opportunities for new research.
1 INTRODUCTION
Fig. 1: A high level research landscape of data collection for machine learning. The topics that are at least partially
contributed by the data management community are highlighted using blue italic text. Hence, to fully understand the
research landscape, one needs to look at the literature from the viewpoints of both the machine learning and data
management communities.
the research landscape where the topics that have contributions from the data management community are highlighted with blue italic text. Traditionally, labeling data has been a natural focus of research for machine learning tasks. For example, semi-supervised learning is a classical problem where model training is done on a small amount of labeled data and a larger amount of unlabeled data. However, as machine learning needs to be performed on large amounts of training data, data management issues including how to acquire large datasets, how to perform data labeling at scale, and how to improve the quality of large amounts of existing data become more relevant. Hence, to fully understand the research landscape of data collection, one needs to understand the literature from both the machine learning and data management communities.

While there are many surveys on data collection that are either limited to one discipline or a class of techniques, to our knowledge, this survey is the first to bridge the machine learning (including natural language processing and computer vision) and data management disciplines. We contend that a machine learning user needs to know the techniques on all sides to make informed decisions on which techniques to use when. In fact, data management plays a role in almost all aspects of machine learning [4], [5]. We note that many sub-topics including semi-supervised learning, active learning, and transfer learning are large enough to have their own surveys. The goal of this survey is not to go into all the depths of these sub-topics, but to focus on breadth and identify what data collection techniques are relevant for machine learning purposes and what research challenges exist. Hence, we will only cover the most representative work of the sub-topics, which are either the best-performing or most recent ones. The key audience of this survey can be researchers or practitioners that are starting to use data collection for machine learning and need an overall landscape introduction. Since the data collection techniques come from different disciplines, some may involve relational data while others non-relational data (e.g., images and text). Sometimes the boundary between operations (e.g., data acquisition and data labeling) is not clear cut. In those cases, we will clarify that the techniques are relevant in multiple operations.

Motivating Example To motivate the need to explore the techniques in Figure 1, we present a running example on data collection based on our experience collaborating with the industry on a smart factory application. Suppose that Sally is a data scientist who works on product quality for a smart factory. The factory may produce manufacturing components like gears where it is important for them not to have scratches, dents, or any foreign substance. Sally may want to train a model on images of the components, which can be used to automatically classify whether each product has defects or not. This application scenario is depicted in Figure 3. A general decision flow chart of the data collection techniques that Sally can use is shown in Figure 2. Although the chart may look complicated at first glance, we contend that it is necessary to understand the entire research landscape to make informed decisions for data collection. In comparison, recent commercial tools [6]–[8] only cover a subset of all the possible data collection techniques. When using the chart, one can quickly narrow down the options in two steps by deciding whether to perform one of data acquisition, data labeling, or existing data improvements,
Fig. 2: A decision flow chart for data collection. From the top left, Sally can start by asking whether she has enough data.
The following questions lead to specific techniques that can be used for acquiring data, labeling data, or improving existing
data or models. This flow chart does not cover all the details in this survey. For example, data labeling techniques like self
learning and crowdsourcing can be performed together as described in Section 3.2.1. Also, some questions (e.g., “Enough
labels for self learning?”) are not easy to answer and may require an in-depth understanding of the application and data.
There are also techniques specific to the data type (images and text), which we detail in the body of the paper.
TABLE 1: A classification of data acquisition techniques. Some of the techniques can be used together. For example, data
can be generated while augmenting existing data.
2.1 Data Discovery

Data discovery can be viewed as two steps. First, the generated data must be indexed and published for sharing. Many collaborative systems are designed to make this process easy. However, other systems are built without the intention of sharing datasets. For these systems, a post-hoc approach must be used where metadata is generated after the datasets are created, without the help of the dataset owners. Next, someone else can search the datasets for their machine learning tasks. Here the key challenges include how to scale the searching and how to tell whether a dataset is suitable for a given machine learning task. While most of the data discovery literature comes from the data management community for data science and data analytics, the techniques are also relevant in a machine learning context. However, another challenge in machine learning is data labeling, which we cover in Section 3.

2.1.1 Data Sharing

We study data systems that are designed with dataset sharing in mind. These systems may focus on collaborative analysis, publishing on the Web, or both.

Collaborative Analysis In an environment where data scientists are collaboratively analyzing different versions of datasets, DataHub [9]–[11] can be used to host, share, combine, and analyze them. There are two components: a dataset version control system inspired by Git (a version control system for code) and a hosted platform on top of it, which provides data search, data cleaning, data integration, and data visualization. A common use case of DataHub is where individuals or teams run machine learning tasks on their own versions of a dataset and later merge with other versions if necessary.

Web A different approach to sharing datasets is to publish them on the Web. Google Fusion Tables [12]–[14] is a cloud-based service for data management and integration. Fusion Tables enables users to upload structured data (e.g., spreadsheets) and provides tools for visually analyzing, filtering, and aggregating the data. The datasets that are published through Fusion Tables on the Web can be crawled by search engines and show up in search results. The datasets are therefore primarily accessible through Web search. Fusion Tables has been widely used in data journalism for creating interactive maps of data and adding them in articles. In addition, there are many data marketplaces including CKAN [15], Quandl [16], and DataMarket [17] where users can buy and sell datasets or find public datasets.

Collaborative and Web More recently, we are seeing a merging of collaborative and Web-based systems. For example, Kaggle [18] makes it easy to share datasets on the Web and even host data science competitions for models trained on the datasets. A Kaggle competition host posts a dataset along with a description of the challenge. Participants can then experiment with their techniques and compete with each other. After the deadline passes, a prize is given to the winner of the competition. Kaggle currently has thousands of public datasets and code snippets (called kernels) from competitions. In comparison to DataHub and Fusion Tables, the Kaggle datasets are coupled with competitions and are thus more readily usable for machine learning purposes.

2.1.2 Data Searching

While the previous data systems are platforms for sharing datasets, as a next logical step, we now explore systems that are mainly designed for searching datasets. This setting is common within large companies or on the Web.

Data Lake Data searching systems have become more popular with the advent of data lakes [19], [75] in corporate environments where many datasets are generated internally, but they are not easily discoverable by other teams or individuals within the company. Providing a way to search datasets and analyze them has significant business value because the teams or individuals do not have to make redundant efforts to re-generate the datasets for their machine learning tasks. Most of the recent data lake systems have come from the industry. In many cases, it is not feasible for all the dataset owners to publish datasets through one system. Instead, a post-hoc approach becomes necessary where datasets are
processed for searching after they are created, and no effort is required on the dataset owner's side.

As an early solution for data lakes, IBM proposed a system [19] that enables datasets to be curated and then searched. IBM estimates that 70% of the time spent on analytic projects is concerned with discovering, cleaning, and integrating datasets that are scattered among many business applications. Thus, IBM takes the stance of creating, filling, maintaining, and governing the data lake where these processes are collectively called data wrangling. When analyzing data, users do not perform the analytics directly on the data lake, but extract datasets and store them separately. Before this step, the users can do a preliminary exploration of datasets, e.g., visualizing them to determine if the dataset is useful and does not contain anomalies that need further investigation. While supporting data curation in the data lake saves users from processing raw data, it does limit the scalability of how many datasets can be indexed.

More recently, scalability has become a pressing issue for handling data lakes that consist of most datasets in a large company. Google Data Search (GOODS) [20] is a system that catalogues the metadata of tens of billions of datasets from various storage systems within Google. GOODS infers various metadata including owner information and provenance information (by looking up job logs), analyzes the contents of the datasets, and collects input from users. At the core is a central catalog, which contains the metadata and is indexed for data searching. Due to Google's scale, there are many technical challenges including scaling to the number of datasets, supporting a variety of data formats where the costs for extracting metadata may differ, updating the catalog entries due to the frequent churn of datasets, dealing with uncertainty in metadata discovery, computing dataset importance for search ranking, and recovering dataset semantics that are missing. To find datasets, users can use keyword queries on the GOODS frontend and view profile pages of the datasets that appear in the search results. In addition, users can track the provenance of a dataset to see which datasets were used to create the given dataset and those that rely on it.

Finally, expressive queries are also important for searching a data lake. While GOODS scales, one downside is that it only supports simple keyword queries. This approach is similar to keyword search in databases [76], [77], but the purpose is to find datasets instead of tuples. The DATA CIVILIZER system [21], [22] complements GOODS by focusing more on the discovery aspect of datasets. Specifically, DATA CIVILIZER consists of a module for building a linkage graph of data. Assuming that datasets have schema, the nodes in the linkage graph are columns of tables while edges are relationships like primary key-foreign key (PK-FK) relationships. A data discovery module then supports a rich set of discovery queries on the linkage graph, which can help users more easily discover the relevant datasets. DATAMARAN [23] specializes in extracting structured data from semi-structured log datasets in data lakes automatically by learning patterns. AURUM [78], [79] supports data discovery queries on semantically-linked datasets.

Web As the Web contains large numbers of structured datasets, there have been significant efforts to automatically extract the useful ones [32]–[34]. One of the most successful systems is WebTables [24], [25], which automatically extracts structured data that is published online in the form of HTML tables. For example, WebTables extracts all Wikipedia infoboxes. Initially, about 14.1 billion HTML tables are collected from the Google search web crawl. Then a classifier is applied to determine which tables can be viewed as relational database tables. Each relational table consists of a schema that describes the columns and a set of tuples. In comparison to the above data lake systems, WebTables collects structured data from the Web.

As Web data tends to be much more diverse than, say, the data in a corporate environment, the table extraction techniques have been extended in multiple ways as well. One direction is to extend table extraction beyond identifying HTML tags by extracting relational data in the form of vertical tables and lists and leveraging knowledge bases [27], [28]. Table searching also evolved where, in addition to keyword searching, row-subset queries, entity-attribute queries, and column search were introduced [29]. Finally, techniques for enhancing the tables [30], [31] were proposed where entities or attribute values are added to make the tables more complete.

Recently, a service called Google Dataset Search [26] was launched for searching repositories of datasets on the Web. The motivation is that there are thousands of data repositories on the Web that contain millions of datasets that are not easy to search. Dataset Search lets dataset providers describe their datasets using various metadata (e.g., author, publication date, how the data was collected, and terms for using the data) so that they become more searchable. In comparison to the fully-automatic WebTables, dataset providers may need to do some manual work, but have the opportunity to make their datasets more searchable. In comparison to GOODS, Dataset Search targets the Web instead of a data lake.

2.2 Data Augmentation

Another approach to acquiring data is to augment existing datasets with external data. In the machine learning community, adding pre-trained embeddings is a common way to increase the features to train on. In the data management community, entity augmentation techniques have been proposed to further enrich existing entity information. Data integration is a broad topic and can be considered as data augmentation if we are extending existing datasets with newly-acquired ones.

2.2.1 Deriving Latent Semantics

A common data augmentation is to derive latent semantics from data. A popular technique is to generate and use embeddings that represent words, entities, or knowledge. In particular, word embeddings have been successfully used to solve many problems in natural language processing (NLP). Word2vec [35] is a seminal work where, given a text corpus, a word is represented by a vector of real numbers that captures the linguistic context of the word in the corpus. The word vectors can be generated by training a shallow two-layer neural network to reconstruct the surrounding words in the corpus. There are two possible models for training word vectors: Continuous Bag-of-Words (CBOW) and Skip-gram. While CBOW predicts a word based on its surrounding words, Skip-gram does the opposite and predicts the surrounding words based on a given word. As a result, two words that occur in similar contexts tend to have similar word vectors. A fascinating application of word vectors is performing arithmetic operations on the word vectors. For example, the result of subtracting the word vector of "queen" from that of "king" is similar to the result of subtracting the word vector of "woman" from that of "man". Since word2vec was proposed, there have been many extensions including GloVe [36], which improves word vectors by also taking into account global corpus statistics, and Doc2Vec [80], which generates representations of documents.
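As a concrete illustration of the above, the following minimal sketch trains word vectors and performs vector arithmetic with the gensim library; gensim, the toy corpus, and all hyperparameters are assumptions made for illustration and are not part of the surveyed work.

# Minimal word2vec sketch using gensim (gensim >= 4.0 API assumed; illustrative only).
from gensim.models import Word2Vec

# A toy corpus: each sentence is a list of tokens. A real corpus would be far larger.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "walks", "in", "the", "city"],
    ["a", "woman", "walks", "in", "the", "city"],
]

# sg=1 selects the Skip-gram model; sg=0 would select CBOW.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

# Arithmetic on word vectors: king - man + woman is expected to be close to queen
# (only meaningful with a large corpus; shown here just to illustrate the API).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))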
Another technique for deriving latent semantics is latent topic modeling. For example, Latent Dirichlet Allocation (LDA) [37] is a generative model that can be used to explain why certain parts of the data are similar using unobserved groups.

2.2.2 Entity Augmentation

In many cases, datasets are incomplete and need to be filled in by gathering more information. The missing information can either be values or entire features. An early system called Octopus [30] composes Web search queries using keys of the table containing the entities. Then all the Web tables in the resulting Web pages are clustered by schema, and the tables in the most relevant cluster are joined with the entity table. More recently, InfoGather [31] takes a holistic approach using Web tables on the Web. The entity augmentation is performed by filling in missing values of attributes in some or all of the entities by matching multiple Web tables using schema matching. To help the user decide which attributes to fill in, InfoGather identifies synonymous attributes in the Web tables.

2.2.3 Data Integration

Data integration can also be considered as data augmentation, especially if we are extending existing datasets with other acquired ones. Since this discipline is well established, we point the readers to some excellent surveys [40], [41]. More recently, an interesting line of work relevant to machine learning [42]–[44] observes that in practice, many companies use relational databases where the training data is divided into smaller tables. However, most machine learning toolkits assume that a training dataset is a single file and ignore the fact that there are typically multiple tables in a database due to normalization. The key question is whether joining the tables and augmenting the information is useful for model training. The Hamlet system [38] and its subsequent Hamlet++ systems [39] address this problem by determining if key-foreign key (KFK) joins are necessary for improving the model accuracy for various classifiers (linear, decision trees, non-linear SVMs, and artificial neural networks) and propose decision rules to predict when it is safe to avoid joins and, as a result, significantly reduce the total runtime. A surprising result is that joins can often be avoided without negatively influencing the model's accuracy. Intuitively, a foreign key determines the entire record of the joining table, so the features brought in by a join do not add a lot more information.
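As a small illustration of the single-table assumption discussed above, the following sketch denormalizes two tables with a key-foreign key join using pandas before model training; the table and column names are hypothetical and are not taken from the Hamlet papers.

# Illustrative sketch: denormalizing two tables with a key-foreign key join before
# training (table and column names are hypothetical).
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer_id": [10, 10, 20],
                       "label": [0, 1, 0]})
customers = pd.DataFrame({"customer_id": [10, 20],
                          "region": ["EU", "US"],
                          "segment": ["retail", "wholesale"]})

# The foreign key customer_id functionally determines region and segment, which is
# why such joins can sometimes be avoided without hurting model accuracy.
training_table = orders.merge(customers, on="customer_id", how="left")
print(training_table)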
2.3 Data Generation

If there are no existing datasets that can be used for training, then another option is to generate the datasets either manually or automatically. For manual construction, crowdsourcing is the standard method where human workers are given tasks to gather the necessary bits of data that collectively become the generated dataset. Alternatively, automatic techniques can be used to generate synthetic datasets. Note that data generation can also be viewed as data augmentation if there is existing data where some missing parts need to be filled in.

2.3.1 Crowdsourcing

Crowdsourcing is used to solve a wide range of problems, and there are many surveys as well [81]–[84]. One of the earliest and most popular platforms is Amazon Mechanical Turk [85] where tasks (called HITs) are assigned to human workers, and workers are compensated for finishing the tasks. Since then, many other crowdsourcing platforms have been developed, and research on crowdsourcing has flourished in the areas of data management, machine learning, and human computer interaction. There is a wide range of crowdsourcing tasks from simple ones like labeling images up to complex ones like collaborative writing that involves multiple steps [86], [87]. Another important usage of crowdsourcing is data labeling (e.g., the ImageNet project), which we discuss in Section 3.2.

In this section, we narrow the scope and focus on crowdsourcing techniques that are specific to data generation tasks. A recent survey [88] provides an extensive discussion on the challenges for data crowdsourcing. Another survey [89] touches on the theoretical foundations of data crowdsourcing. According to both surveys, data generation using crowdsourcing can be divided into two steps: gathering data and preprocessing data.

Gathering data One way to categorize data gathering techniques is whether the tasks are procedural or declarative. A procedural task is where the task creator defines explicit steps and assigns them to workers. For example, one may write a computer program that issues tasks to workers. TurKit [45] allows users to write scripts that include HITs using a crash-and-return programming model where a script can be re-executed without re-running costly functions with side effects. AUTOMAN [46] is a domain-specific language embedded in Scala where crowdsourcing tasks can be invoked like conventional functions. DOG [47] is a high-level programming language that compiles into MapReduce tasks that can be performed by humans or machines. A declarative task is when the task creator specifies high-level data requirements, and the workers provide data that satisfies them. For example, a database user may pose an SQL query like "SELECT title, director, genre, rating FROM MOVIES WHERE genre = 'action'" to gather movie ratings data for a recommendation system. DECO [48] uses a simple extension of SQL and defines precise semantics for arbitrary queries on stored data and data collected by the crowd. CrowdDB [49] focuses on the systems aspect of using crowdsourcing to answer queries that cannot be answered automatically.

Another way to categorize data gathering is whether the data is assumed to be closed-world or open-world. Under a closed-world assumption, the data is assumed to be "known" and entirely collectable by asking the right questions. ASKIT! [51] uses this assumption and focuses on the problem of determining which questions should be directed to which users, in order to minimize the uncertainty of the collected data. In an open-world assumption, there is no longer a guarantee that all the data can be collected. Instead, one must estimate if enough data was collected. Statistical tools [52] have been proposed for scanning a single table with predicates like "SELECT FLAVORS FROM ICE CREAM." Initially, many flavors can be collected, but the rate of new flavors will inevitably slow down, and statistical methods are used to estimate the future rate of new values.

Data gathering is not limited to collecting entire records of a table. CrowdFill [53] is a system for collecting parts of structured data from the crowd. Instead of posing specific questions to workers, CrowdFill shows a partially-filled table. Workers can then fill in the empty cells and also upvote or downvote data entered by other workers. CrowdFill provides a collaborative environment and allows the specification of constraints on values and mechanisms for resolving conflicts when workers are filling in values of the same record. ALFRED [54] uses the crowd to train extractors that can then be used to acquire data. ALFRED asks simple yes/no membership questions on the contents of Web pages to workers and uses the answers to infer the extraction rules. The quality of the rules can be improved by recruiting multiple workers.

Preprocessing data Once the data is gathered, one may want to preprocess the data to make it suitable for machine learning purposes. While many possible crowd operations have been proposed, the ones that are relevant include data curation, entity resolution, and joining datasets. Data Tamer [55] is an end-to-end data curation system that can clean and transform datasets and semantically integrate them with other datasets. Data Tamer has a crowdsourcing component (called Data Tamer Exchange), which assigns tasks to workers. The supported operations are attribute identification (i.e., determine if two attributes are the same) and entity resolution (i.e., determine if two entities are the same). Corleone [56] is a hands-off crowdsourcing system, which crowdsources the entire workflow of entity resolution to workers. CrowdDB [49] and Qurk [50] are systems for aggregating, sorting, and joining datasets.

For both gathering and preprocessing data, quality control is an important challenge as well. The issues include designing the right interface to maximize worker productivity, managing workers who may have different levels of skills (or may even be spammers), and decomposing problems into smaller tasks and aggregating them. Several surveys [82]–[84] cover these issues in detail.

2.3.2 Synthetic Data Generation

Generating synthetic data along with labels is increasingly being used in machine learning due to its low cost and flexibility [90]. A simple method is to start from a probability distribution and generate a sample from that distribution using tools like scikit-learn [91]. In addition, there are more advanced techniques like Generative Adversarial Networks (GANs) [2], [57], [61], [62] and application-specific generation techniques. We first provide a brief introduction of GANs and present synthetic data generation techniques on relational data. We then introduce recent augmentation techniques using policies. Finally, we introduce image and text data generation techniques due to their importance.

GANs The key approach of a GAN is to train two contesting neural networks: a generative network and a discriminative network. The generative network learns to map from a latent space to a data distribution, and the discriminative network distinguishes the candidates produced by the generative network from examples of the true distribution. The training of a GAN can be formalized as:

min_G max_D V(D, G), where
V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]

where p_data(x) is the distribution of the real data, p_z(z) is the distribution of the generator, G(z) is the generative network, and D(x) is the discriminative network.

The objective of the generative network is to increase the error rate of the discriminative network. That is, the generative network attempts to fool the discriminative network into thinking that its candidates are from the true distribution. GANs have been used to generate synthetic images and videos that look realistic in many applications.
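To make the two-player objective above concrete, here is a minimal, illustrative PyTorch sketch that trains a generator and a discriminator on a toy one-dimensional Gaussian; the architectures, learning rates, and data distribution are assumptions for illustration only.

# Minimal GAN training sketch in PyTorch on toy 1-D data (illustrative settings).
import torch
import torch.nn as nn

real_dist = torch.distributions.Normal(4.0, 1.25)   # the "true" data distribution p_data
latent_dim = 8

G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 1))        # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = real_dist.sample((64, 1))
    z = torch.randn(64, latent_dim)
    fake = G(z)

    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: fool D, i.e., push D(G(z)) toward 1.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# The mean of generated samples should drift toward 4.0 as training progresses.
print(G(torch.randn(1000, latent_dim)).mean().item())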
GANs have recently been used to generate synthetic relational data. MEDGAN [58] generates synthetic patient records with high-dimensional discrete variable (binary or count) features based on real patient records. While GANs can only learn to approximate discrete patient records, the novelty is to also use an autoencoder to project these records into a lower dimensional space and then project them back to the original space. TABLE-GAN [59] also synthesizes tables that are similar to the real ones, but with a focus on privacy preservation. In particular, a metric for information loss is defined, and two parameters are provided to adjust the information loss. The higher the loss, the more privacy the synthetic table has. TGAN [60] focuses on simultaneously generating values for a mixture of discrete and continuous features.

Policies Another recent approach is to use human-defined policies [63], [64] to apply transformations to the images as long as they remain realistic. This criterion can be enforced by training a reinforcement learning model on a separate validation set.

Data-specific We now introduce data-specific techniques for generation. Synthetic image generation is a heavily-studied topic in the computer vision community. Given the wide range of vision problems, we are not aware of a comprehensive survey on synthetic data generation and will only focus on a few representative problems. In object detection, it is possible to learn 3D models of objects and give variations (e.g., rotate a car 90 degrees) to generate another realistic image [65], [66]. If the training data is a rapid sequence of image frames in time [67], the objects of a frame can be assumed to move in a linear trajectory between consecutive frames. Text within images is another application where one can vary the fonts, sizes, and colors of the text to generate large amounts of synthetic text images [68], [69].

An alternative approach to generating image datasets is to start from a large set of noisy images and select the clean ones. Xia et al. [70] search the Web for images with noise and then use a density-based measure to cluster the images and remove outliers. Bai et al. [71] exploit large click-through logs, which contain queries of users and the images that were clicked by those users. A deep neural network is used to learn representations of the words and images and compute word-word and image-word similarities. The noisy images that have low similarities to their categories are then removed.

Generating synthetic text data has also been studied in the natural language processing community. Paraphrasing [72] is a classical problem of generating alternative expressions that have the same semantic meaning. For example, "What does X do for a living?" is a paraphrase of "What is X's job?". We briefly cover two recent methods – one syntax-based and the other semantics-based – that use paraphrasing to generate large amounts of synthetic text data. Syntactically controlled paraphrase networks [73] (SCPNs) can be trained to produce paraphrases of a sentence with different sentence structures. Semantically equivalent adversarial rules for text [74] (SEARs) have been proposed for perturbing input text while preserving its semantics. SEARs can be used to debug a model by applying them on training data and seeing if the re-trained model changes its predictions. In addition, there are many paraphrasing techniques that are not covered in this survey.

3 DATA LABELING

Once enough data has been acquired, the next step is to label individual examples. For instance, given an image dataset of industrial components in a smart factory application, workers can start annotating if there are any defects in the components. In many cases, data acquisition is done along with data labeling. When extracting facts from the Web and constructing a knowledge base, each fact is assumed to be correct and thus implicitly labeled as true. When discussing the data labeling literature, it is easier to separate it from data acquisition as the techniques can be quite different.

We believe the following categories provide a reasonable view for understanding the data labeling landscape:

• Use existing labels: An early idea of data labeling is to exploit any labels that already exist. There is an extensive literature on semi-supervised learning where the idea is to learn from the labels to predict the rest of the labels.

• Crowd-based: The next set of techniques are based on crowdsourcing. A simple approach is to label individual examples. A more advanced technique is to use active learning where questions to ask are more carefully selected. More recently, many crowdsourcing techniques have been proposed to help workers become more effective in labeling.

• Weak labels: While it is desirable to generate correct labels all the time, this process may be too expensive. An alternative approach is to newly generate less than perfect labels (i.e., weak labels), but in large quantities to compensate for the lower quality. Recently, the latter approach has gained more popularity as labeled data is scarce in many new applications.

Table 2 shows where different labeling approaches fit into the categories. In addition, each labeling approach can be further categorized as follows:

• Machine learning task: In supervised learning, the two categories are classification (e.g., determining whether a piece of text has a positive sentiment) and regression (e.g., estimating the salary of a person). Most of the data labeling research has been focused on classification problems rather than regression problems, possibly because data labeling is simpler in a classification setting.

• Data type: Depending on the data type (e.g., text, images, and graphs), the data labeling techniques differ significantly. For example, fact extraction from text is very different from object detection on images.

3.1 Utilizing existing labels

A common setting in machine learning is to have a small amount of labeled data, which is expensive to produce with humans, along with a much larger amount of unlabeled data. Semi-supervised learning techniques [143] exploit both labeled and unlabeled data to make predictions. In a transductive learning setting, the entire unlabeled data is available while in an inductive learning setting, some unlabeled data is available, but the predictions must be on unseen data. Semi-supervised learning is a broad topic, and we focus on a smaller branch of research called self-labeled techniques [96] where the goal is to generate more labels by trusting one's own predictions. Since the details are in the survey, we only provide a summary here. In addition to the general techniques, there are graph-based label propagation techniques that are specialized for graph data.

3.1.1 Classification

For semi-supervised learning techniques for classification, the goal is to train a model that returns one of multiple possible classes for each example using labeled and unlabeled datasets. We consider the best-performing techniques in a survey that focuses on labeling data [96], which are summarized in Figure 4. The performance results are similar regardless of using transductive or inductive learning.

The simplest class of semi-supervised learning techniques trains one model using one learning algorithm on one set of features. For example, Self-training [92] initially trains a model on the labeled examples. The model is then applied to all the unlabeled data where the examples are ranked by the confidences in their predictions. The most confident predictions are then added into the labeled examples. This process repeats until all the unlabeled examples are labeled.

The next class trains multiple classifiers by sampling the training data several times and training a model for each sample. For example, Tri-training [93] initially trains three models on the labeled examples using Bagging for the ensemble learning algorithm. Then each model is updated
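As an illustration of the self-training loop described above, the following sketch promotes the most confident predictions into the labeled set using scikit-learn; the classifier, promotion rule, and data split are illustrative assumptions (scikit-learn also ships a SelfTrainingClassifier wrapper that implements this idea).

# Illustrative self-training loop with scikit-learn (settings are arbitrary choices).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                       # pretend only 50 examples are labeled
y_train = y.copy()

while not labeled.all():
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y_train[labeled])
    proba = model.predict_proba(X[~labeled])
    conf = proba.max(axis=1)
    # Promote the most confident predictions (top 10% of the remaining pool).
    idx = np.where(~labeled)[0]
    promote = idx[np.argsort(-conf)[:max(1, len(idx) // 10)]]
    y_train[promote] = model.predict(X[promote])
    labeled[promote] = True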
TABLE 2: A classification of data labeling techniques. Some of the techniques can be used for the same application. For
example, for classification on graph data, both self-labeled techniques and label propagation can be used.
image is labeled as a dog, then similar images down the graph can also be labeled as dogs with some probability. The further the distance, the lower the probability of label propagation. Graph-based label propagation has applications in computer vision, information retrieval, social networks, and natural language processing. Zhu et al. [101] proposed a semi-supervised learning method based on a Gaussian random field model where the unlabeled and labeled examples form a weighted graph. The mean of the field is characterized in terms of harmonic functions and can be efficiently computed using matrix methods or belief propagation. The MAD-Sketch algorithm [102] was proposed to further reduce the space and time complexities of graph-based SSL algorithms using count-min sketching. In particular, the space complexity per node is reduced from O(m) to O(log m) under certain conditions where m is the number of distinct labels, and a similar improvement is achieved for the time complexity. Recently, a family of algorithms called EXPANDER [100] was proposed to further reduce the space complexity per node to O(1) and compute the MAD-Sketch algorithm in a distributed fashion.

3.2 Crowd-based techniques

The most accurate way to label examples is to do it manually. A well known use case is the ImageNet image classification dataset [146] where tens of millions of images were organized according to a semantic hierarchy by WordNet using Amazon Mechanical Turk. However, ImageNet is an ambitious project that took years to complete, which most machine learning users cannot afford for their own applications. Traditionally, active learning has been a key technique in the machine learning community for carefully choosing the right examples to label and thus minimize cost. More recently, crowdsourcing techniques for labeling have been proposed where there can be many workers who are not necessarily experts in labeling. Hence, there is more emphasis on how to assign tasks to workers, what interfaces to use, and how to ensure high quality labels. Recent commercial tools vary in what services they provide for labeling. For example, Amazon SageMaker [8] supports labeling based on active learning, Google Cloud AutoML [6] provides a manual labeling service, and Microsoft Custom Vision [7] requires labels from the user. While crowdsourcing data labeling is closely related to crowdsourcing data acquisition, the individual techniques are different.

3.2.1 Active Learning

Active learning focuses on selecting the most "interesting" unlabeled examples to give to the crowd for labeling. The workers are expected to be very accurate, so there is less emphasis on how to interact with those with less expertise. While some references view active learning as a special case of semi-supervised learning, the key difference is that there is a human-in-the-loop. The key challenge is choosing the right examples to ask given a limited budget. One downside of active learning is that the examples are biased to the training algorithm and cannot be reused. Active learning is covered extensively in other surveys [105], [147], and we only cover the most prominent techniques here.

Uncertain Examples Uncertainty Sampling [103] is the simplest active learning strategy and chooses the next unlabeled example for which the model prediction is most uncertain. For example, if the model is a binary classifier, uncertainty sampling chooses the example whose probability is nearest to 0.5. If there are three or more class labels, we could choose the example whose prediction is the least confident. The downside of this approach is that it throws away the information of all the other possible labels. An improved version called margin sampling is to choose the example whose probability difference between the most and second-most probable labels is the smallest. This method can be further generalized using entropy as the uncertainty measure where entropy is an information-theoretic measure for the amount of information needed to encode a distribution.
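The three uncertainty measures described above can be computed directly from a model's predicted class probabilities, as in the following illustrative sketch; the probability matrix stands in for the output of a real model's predict_proba on unlabeled examples.

# Illustrative uncertainty measures over predicted class probabilities.
import numpy as np

proba = np.array([[0.55, 0.40, 0.05],      # one row per unlabeled example
                  [0.34, 0.33, 0.33],
                  [0.90, 0.05, 0.05]])

least_confident = 1.0 - proba.max(axis=1)                 # query the largest score
sorted_p = np.sort(proba, axis=1)[:, ::-1]
margin = sorted_p[:, 0] - sorted_p[:, 1]                  # query the smallest margin
entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)    # query the largest entropy

print(least_confident.argmax(), margin.argmin(), entropy.argmax())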
Query-by-Committee [104] extends uncertainty sampling by training a committee of models on the same labeled data. Each model can vote when labeling each example, and the most informative example is considered to be the one where the most models disagree with each other. More formally, this approach minimizes the version space, which is the space of all possible classifiers that give the same classification results as (and are thus consistent with) the labeled data. The challenge is to train models that represent different regions of the version space and have some amount of disagreement. Various methods have been proposed [105], but there does not seem to be a clear winner. One general method is called query-by-bagging [106], which uses bagging as an ensemble learning algorithm and trains models on bootstrap samples. There is no general agreement on the best number of models to train, which is application-specific.

Because both uncertainty sampling and query-by-committee focus on individual examples instead of the entire set of examples, they run into the danger of choosing examples that are outliers according to the example distribution. Density weighting [107] is a way to improve the above techniques by choosing instances that are not only uncertain or disagreeing, but also representative of the example distribution.

Decision Theoretic Approaches Another line of active learning follows decision-theoretic approaches. Decision theory is a framework for making decisions under uncertainty using states and actions to optimize some objective function. In the context of active learning, the objective could be to choose an example that maximizes the estimated model accuracy [108]. Another possible objective is reducing generalization error [109], which is estimated as follows: if the measure is log loss, then the entropy of the predicted class distribution is considered the error rate; if the measure is 0-1 loss, the maximum probability among all classes is the error rate. Each example to label is chosen by taking a sample of the unlabeled data and choosing the example that minimizes the estimated error rate.

Regression Active learning techniques can also be extended to regression problems. For uncertainty sampling, instead of computing the entropy of classes, one can compute the output variance of the predictions and select the examples with the highest variance. Query-by-committee can also be extended to regression [110] by training a committee of models and selecting the examples where the variance among the committee's predictions is the largest. This approach is said to work well when the bias of the models is small. Also, this approach is said to be robust to overspecification, which reduces the chance of overfitting.

Self and Active Learning Combined The data labeling techniques we consider are complementary to each other and can be used together. In fact, semi-supervised learning and active learning have a history of being used together [111]–[114], [148]. A key observation is that the two techniques solve opposite problems: semi-supervised learning finds the predictions with the highest confidence and adds them to the labeled examples, while active learning finds the predictions with the lowest confidence (using uncertainty sampling, query-by-committee, or density-weighted methods) and sends them for manual labeling.

There are various ways semi-supervised learning can be used with active learning. McCallum and Nigam [111] improve the Query-By-Committee (QBC) technique and combine it with Expectation-Maximization (EM), which effectively performs semi-supervised learning. Given a set of documents for training data, active learning is done by selecting the documents that are closer to others (and thus representative), but have committee disagreement, for labeling. In addition, the EM algorithm is used to further infer the rest of the labels. The active learning and EM can either be done separately or interleaved. Tomanek and Hahn [112] propose semi-supervised active learning (SeSAL) for sequence labeling tasks, which include POS tagging, chunking, and named entity recognition (NER). Here the examples are sequences of text. The idea is to use active learning for the subsequences that have the highest training utility within the selected sentences and use semi-supervised learning to automatically label the rest of the subsequences. The utility of a subsequence is highest when the current model is least confident about the labeling.

Zhou et al. [113] propose the semi-supervised active image retrieval (SSAIR) approach where the focus is on image retrieval. SSAIR is inspired by the co-training method where initially two classifiers are trained from the labeled data. Then each learner passes the most relevant/irrelevant images to the other classifier. The classifiers are then re-trained with the additional labels, and their results are combined. The images that still have low confidence are selected to be labeled by humans.

Zhu et al. [114] combine semi-supervised and active learning under a Gaussian random field model. The labeled and unlabeled examples are represented as vertices in a graph where edges are weighted by similarities between examples. This framework enables one to compute the next question that minimizes the expected generalization error efficiently for active learning. Once the new labels are added to the labeled data, semi-supervised learning is performed using harmonic functions.

3.2.2 Crowdsourcing

In comparison to active learning, the crowdsourcing techniques here are more focused on running tasks with many workers who are not necessarily labeling experts. As a result, workers may make mistakes, and there is a heavy literature [81]–[84], [115], [149], [150] on improving the interaction with workers, evaluating workers so they are reliable, reducing any bias that the workers may have, and aggregating the labeling results while resolving any ambiguities among them.

User Interaction A major challenge in user interaction is to effectively provide instructions to workers on how to perform the labeling. The traditional approach is to provide some guidelines for labeling to the workers up front and then let them make a best effort to follow them. However, the guidelines are often incomplete and do not cover all possible scenarios, leaving the workers in the dark. Revolt [115] is a system that attempts to fix this problem through collaborative crowdsourcing. Here workers work in three steps: Voting, where workers vote just like in traditional labeling; Explaining, where workers justify their rationale for labeling; and Categorizing, where workers review explanations from other workers and tag any conflicting labels. This information can then be used to make post-hoc judgements of the label decision boundaries. Another approach is to provide better tools to assist workers in organizing their concepts, which may evolve as more examples are labeled [117].

In addition, providing the right labeling interface is critical for workers to perform well. The challenge is that each application may have a different interface that works best. We will not cover all the possible applications, but instead illustrate a line of research for the problem of entity resolution where the goal is to find records in a database that refer to the same real-world entity. Here the label is whether two (or more) records are the same or not. Just for this problem, there is a line of research on providing the best interface for comparisons. CrowdER [118] provides two types of interfaces to compare records: pair-based and cluster-based. Qurk [50] uses a mapping interface where multiple records on one side are matched with records on the other side. Qurk uses a combination of comparison or rating tasks to accelerate labeling.

Quality control Controlling the quality of data labeling by the crowd is important because the workers may vary significantly in their abilities to provide labels. A simple way to ensure quality is to repeatedly label the same example using multiple workers and perhaps take a majority vote at the end. However, there are more sophisticated approaches as well. Get another label [119] and Crowdscreen [120] actively solicit labels while Karger et al. [121] passively collect labeled data and run the expectation maximization algorithm. Vox Populi [122] proposes techniques for pruning low-quality workers that can achieve better labeling quality without having to repeatedly label examples.
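As a minimal illustration of the simple repeated-labeling strategy above, the following sketch aggregates redundant worker answers by majority vote; the worker answers are made up, and more sophisticated schemes (e.g., EM-based worker weighting) are not shown.

# Illustrative majority-vote aggregation of redundant crowd labels
# (ties are broken arbitrarily; answers below are made up).
from collections import Counter

# worker_labels[i] holds the answers different workers gave for example i.
worker_labels = [
    ["defect", "defect", "ok"],
    ["ok", "ok", "ok"],
    ["defect", "ok", "defect", "defect"],
]

aggregated = [Counter(answers).most_common(1)[0][0] for answers in worker_labels]
print(aggregated)   # ['defect', 'ok', 'defect']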
Scalability Scaling up crowdsourced labeling is another important challenge. While traditional active learning techniques were proposed for this purpose, more recently the data management community has started to apply systems techniques for further scaling the algorithms to large datasets. In particular, Mozafari et al. [116] propose active learning algorithms that can run in parallel. One algorithm (called Uncertainty) selects examples that the current classifier is most uncertain about. A more sophisticated algorithm (called MinExpError) combines the current model's accuracy with the uncertainty. A key idea is the use of bootstrap theory, which makes the algorithms applicable to any classifier and also enables embarrassingly-parallel processing.

Regression In comparison to crowdsourcing research for classification tasks, less attention has been given to regression tasks. Marcus et al. [123] solve the problem of selectivity estimation in a crowdsourced database. The goal is to estimate the fraction of records that satisfy some property by asking workers questions.

3.3 Weak supervision

As machine learning is used in a wider range of applications, it is often the case that there is not enough labeled data. For example, in a smart factory setting, any new product will have no labels for training a model for quality control. As a result, weak supervision techniques [151]–[153] have become increasingly popular where the idea is to semi-automatically generate large quantities of labels that are not as accurate as manual labels, but good enough for the trained model to obtain a reasonably-high accuracy. This approach is especially useful when there are large amounts of data, and manual labeling becomes infeasible. In the next sections, we discuss the recently proposed data programming paradigm and fact extraction techniques.

3.3.1 Data Programming

As data labeling at scale becomes more important especially for deep learning applications, data programming [126] has been proposed as a solution for generating large amounts of weak labels using multiple labeling functions instead of individual labeling. Figure 5 illustrates how data programming can be used for Sally's smart factory application. A labeling function can be any computer program that either generates a label for an example or refrains from doing so. For example, a labeling function that checks if a tweet has a positive sentiment may check if certain positive words appear in the text. Since a single labeling function by itself may not be accurate enough or not be able to generate labels for all examples, multiple labeling functions are implemented and combined into a generative model, which is then used to generate large amounts of weak labels with reasonable quality. Alternatively, voting methods like majority voting can be used to combine the labeling functions. Finally, a noise-aware discriminative model is trained on the weak labels. Data programming has been implemented in the state-of-the-art Snorkel system [127], which is becoming increasingly popular in the industry [128], [129].

Fig. 5: A workflow of using data programming for a smart factory application. In this scenario, Sally is using crowdsourcing to annotate defects on component images. Next, the annotations can be automatically converted to labeling functions. Then the labeling functions are combined either into a generative model or using majority voting. Finally, the combined model generates weak labels that are used to train a discriminative model.

Data programming has advantages both in terms of model accuracy and usability. A key observation for generating weak labels is that training a discriminative model on large amounts of weak labels may result in higher accuracy than training with fewer manual labels. In terms of usability, implementing labeling functions can be an intuitive process for humans compared to feature engineering in traditional machine learning [154].

What makes data programming effective is the way it combines multiple labeling functions into the generative model by fitting a probabilistic graphical model. A naïve approach to combining labeling functions is to take a majority vote. However, this approach cannot handle pathological cases where many labeling functions are near identical, which defeats the purpose of majority voting. Instead, labeling functions that are more correlated with each other will have less influence on the predicted label. In addition, labeling functions that are outliers are also trained to have less influence in order to cancel out the noise. Theoretical analysis [3] shows that, if the labeling functions are reasonably accurate, then the predictions made by the generative model become arbitrarily close to the true labels.
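The following sketch illustrates labeling functions for Sally's defect-detection scenario combined by majority voting; the feature names and thresholds are hypothetical, and Snorkel would replace the majority vote with a learned generative model.

# Illustrative labeling functions for the defect-detection scenario, combined by
# majority vote (feature names and thresholds are made up).
DEFECT, OK, ABSTAIN = 1, 0, -1

def lf_scratch_score(x):            # e.g., output of an off-the-shelf edge detector
    return DEFECT if x["scratch_score"] > 0.8 else ABSTAIN

def lf_dark_spots(x):
    return DEFECT if x["num_dark_spots"] >= 3 else ABSTAIN

def lf_passed_weight_check(x):
    return OK if x["weight_ok"] else ABSTAIN

LFS = [lf_scratch_score, lf_dark_spots, lf_passed_weight_check]

def weak_label(x):
    votes = [v for v in (lf(x) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)    # majority vote over non-abstaining LFs

print(weak_label({"scratch_score": 0.9, "num_dark_spots": 4, "weight_ok": True}))  # 1 (DEFECT)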
can be used as seed labels for distant supervision [156] Task Techniques
when generating weak labels. It is worth mentioning that Data Cleaning [158]–[166]
Improve Data
fact extraction roots from the broader topic of informa- Re-labeling [119]
tion extraction where the goal is to extract structured data Robust Against Noise [167]–[171]
Improve Model
from the Web [157]. Early work include RoadRunner [139], Transfer Learning [172]–[178]
which compares HTML pages to generates wrappers and
KnowItAll [140], which uses extraction rules and a search TABLE 3: A classification of techniques for improving exist-
engine to identify and rank facts. Since then, the subsequent ing data and models.
works have become more sophisticated and also attempt to
organize the extract facts into knowledge bases.
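As an illustration of distant supervision with knowledge base facts, the following sketch (with hypothetical facts and sentences) weakly labels any sentence that mentions both entities of a fact as a positive example of the relation; real pipelines would use named entity recognition and entity linking rather than substring matching.

```python
# Minimal distant supervision sketch: a fact such as <Germany, capital,
# Berlin> weakly labels sentences mentioning both entities as positive
# examples of the "capital" relation.
kb_facts = {("Germany", "Berlin"), ("France", "Paris")}  # hypothetical seed facts

sentences = [
    "Berlin is the capital and largest city of Germany.",
    "Paris hosted the 1900 Summer Olympics.",
    "The Rhine flows through Germany.",
]

weak_examples = []
for sent in sentences:
    for subj, obj in kb_facts:
        if subj in sent and obj in sent:
            weak_examples.append((sent, subj, obj, 1))  # weak positive label

print(weak_examples)
# [('Berlin is the capital and largest city of Germany.', 'Germany', 'Berlin', 1)]
```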
We now focus on fact extraction techniques for knowledge bases. If precision is critically important, then manual curation should be part of the knowledge base construction as in Freebase [131] and Google Knowledge Graph. Otherwise, the extraction techniques depend on the data source. YAGO [132], [133] extracts facts from Wikipedia using classes in WordNet. Ollie [134], ReVerb [135], and ReNoun [136] are open information extraction systems that apply patterns to Web text. Knowledge Vault [137] also extracts from Web content, but combines facts from text, tabular data, page structure, and human annotations. Biperpedia [138] extracts the attributes of entities from a query stream and Web text.

The Never-Ending Language Learner (NELL) system [141], [142] continuously extracts structured information from the unstructured Web and constructs a knowledge base that consists of entities and relationships. Initially, NELL starts with a seed ontology, which contains entities of classes (e.g., person, fruit, and emotion) and relationships among the entities (e.g., playsOnTeam(athlete, sportsTeam) and playsInstrument(musician, instrument)). NELL analyzes hundreds of millions of Web pages and identifies new entities in the given classes as well as entity pairs of the relationships by matching patterns on their surrounding phrases. The resulting entities and relationships can then be used as the next training data for constructing even more patterns. NELL has been collecting facts continuously since 2010. The extraction techniques can be viewed as distant supervision generating weak labels.
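The following is a highly simplified bootstrapping loop in the spirit of NELL's pattern-based extraction (not its actual implementation): seed entities of a class yield textual patterns, and the patterns in turn extract new candidate entities for the next round.

```python
# Toy bootstrapping sketch: learn patterns from seed entities, then use
# the patterns to extract new entities, which become the next seeds.
import re

corpus = [
    "fruits such as apple are healthy",
    "fruits such as mango are sweet",
    "cities such as Paris are crowded",
]
seeds = {"apple"}
patterns = set()

for _ in range(2):                               # two bootstrapping rounds
    # 1) learn patterns: the words immediately preceding a known entity
    for doc in corpus:
        for ent in seeds:
            idx = doc.find(" " + ent)
            if idx > 0:
                patterns.add(doc[:idx].strip())
    # 2) apply patterns to extract new candidate entities
    for doc in corpus:
        for pat in patterns:
            m = re.match(re.escape(pat) + r" (\w+)", doc)
            if m:
                seeds.add(m.group(1))

print(seeds)   # {'apple', 'mango'}: the pattern "fruits such as" found a new fruit
```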
4 USING EXISTING DATA AND MODELS

An alternative approach to acquiring new data and labeling it is to improve the labeling of any existing datasets or to improve the model training. This approach makes sense for a number of scenarios. First, it may be difficult to find new datasets because the application is too novel or non-trivial for others to have produced datasets. Second, simply adding more data may not significantly improve the model's accuracy anymore. In this case, re-labeling or cleaning the existing data may be the faster way to increase the accuracy. Alternatively, the model training can be made more robust to noise and bias, or the model can be trained from an existing model using transfer learning techniques. In the following sections, we explore techniques for improving existing labels and improving existing models. The techniques are summarized in Table 3.

Task          | Techniques
Improve Data  | Data Cleaning [158]–[166]; Re-labeling [119]
Improve Model | Robust Against Noise [167]–[171]; Transfer Learning [172]–[178]

TABLE 3: A classification of techniques for improving existing data and models.

4.1 Improving Existing Data

A major problem in machine learning is that the data can be noisy and the labels incorrect. This problem occurs frequently in practice, so production machine learning platforms like TensorFlow Extended (TFX) [179] have separate components [180] to reduce data errors as much as possible through analysis and validation. In case the labels are also noisy, re-labeling the examples becomes necessary as well. We explore recent advances in data cleaning with a focus on machine learning and then techniques for re-labeling.

4.1.1 Data Cleaning

It is common for the data itself to be noisy. For example, some values may be out of range (e.g., a latitude value is beyond [-90, 90]) or use different units by mistake (e.g., some intervals are in hours while others are in minutes). There is a heavy literature on various integrity constraints (e.g., domain constraints, referential integrity constraints, and functional dependencies) that can improve data quality as well. HoloClean [158] is a state-of-the-art data cleaning system that uses quality rules, value correlations, and reference data to build a probabilistic model that captures how the data was generated. HoloClean then generates a probabilistic program for repairing the data. In addition, various interactive data cleaning tools [162]–[165] have been proposed to convert data into a better form for machine learning.
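As a minimal illustration of the kind of domain constraint these tools enforce, the sketch below flags latitude values outside [-90, 90] in a small pandas DataFrame with hypothetical column names; systems such as HoloClean learn repairs probabilistically instead of applying a single hand-written rule.

```python
# Minimal domain-constraint check and repair on hypothetical sensor data.
import pandas as pd

df = pd.DataFrame({
    "sensor_id": [1, 2, 3],
    "latitude": [37.56, 127.02, -45.3],   # 127.02 violates the domain constraint
    "interval_min": [30, 45, 2],           # 2 may have been recorded in hours
})

# Flag violations of the domain constraint -90 <= latitude <= 90.
violations = df[(df["latitude"] < -90) | (df["latitude"] > 90)]
print(violations)

# One possible repair: set the offending values to missing so they can be
# re-collected or imputed later.
df.loc[violations.index, "latitude"] = None
```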
An interesting line of recent work is cleaning techniques with the explicit intention of improving machine learning results. ActiveClean [159] is a model training framework that iteratively suggests samples of data to clean based on how much the cleaning improves the model accuracy and the likelihood that the data is dirty. An analyst can then perform transformations and filtering to clean each sample. ActiveClean treats the training and cleaning as a form of stochastic gradient descent and uses convex-loss models (SVMs, linear and logistic regression) to guarantee global solutions for clean models. BoostClean [160] solves an important class of inconsistencies where an attribute value is outside an allowed domain. BoostClean takes as input a dataset, a set of functions that can detect these errors, and repair functions that can fix them. Each pair of detection and repair functions can produce a new model trained on the cleaned data. BoostClean uses statistical boosting to find the best ensemble of pairs that maximizes the final model's accuracy. Recently, TARS [161] was proposed to solve the problem of cleaning crowdsourced labels using oracles. TARS provides two pieces of advice. First, given test data with noisy labels, it uses an estimation technique to predict how well the model may perform on the true labels. The estimation is shown to be unbiased, and confidence intervals are computed to bound the error. Second, given training data with noisy labels, TARS determines which examples to send to an oracle in order to maximize the expected model improvement of cleaning each noisy label. More recently, MLClean [166] has been proposed to integrate three data operations: traditional data cleaning, model unfairness mitigation where the goal is to remove data bias that causes model unfairness, and data sanitization where the goal is to remove data poisoning.
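The sketch below illustrates the spirit of BoostClean's detection and repair function pairs (it is not BoostClean's actual interface): each pair produces a cleaned copy of the training data, and the pair whose model performs best on validation data is kept.

```python
# Sketch of detect/repair pairs evaluated by downstream model accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression

def detect_out_of_range(X):            # rows with an out-of-domain feature value
    return np.abs(X[:, 0]) > 90

def repair_clip(X, y, mask):           # clip offending values into the domain
    X = X.copy()
    X[mask, 0] = np.clip(X[mask, 0], -90, 90)
    return X, y

def repair_drop(X, y, mask):           # alternatively, drop the offending rows
    return X[~mask], y[~mask]

rng = np.random.default_rng(0)
X = rng.normal(0, 30, size=(200, 3))
y = (X[:, 1] > 0).astype(int)
X[:10, 0] = 500                        # inject dirty values
X_val = rng.normal(0, 30, size=(50, 3))
y_val = (X_val[:, 1] > 0).astype(int)

mask = detect_out_of_range(X)
candidates = {"clip": repair_clip(X, y, mask), "drop": repair_drop(X, y, mask)}
scores = {name: LogisticRegression().fit(Xc, yc).score(X_val, y_val)
          for name, (Xc, yc) in candidates.items()}
print(scores)                          # keep the pair with the best validation accuracy
```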
4.1.2 Re-labeling

Trained models are only as good as their training data, and it is important to obtain high quality data labels. Simply labeling more data may not improve the model accuracy further. Indeed, Sheng et al. [119] show that, if the labels are noisy, then the model accuracy plateaus at some point and does not increase further, no matter how much more labeling is done. The solution is to improve the quality of existing labels. The authors show that repeated labeling using workers of certain individual qualities can significantly improve model accuracy, where a straightforward round robin approach already gives substantial improvements, and being more selective in labeling gives even better results.
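A minimal form of repeated labeling is to collect several labels per example and aggregate them, e.g., by majority vote, as sketched below with hypothetical worker labels.

```python
# Repeated labeling sketch: aggregate multiple noisy worker labels per
# example with a majority vote, the simplest quality improvement in [119].
from collections import Counter

worker_labels = {                      # hypothetical crowdsourced labels
    "img_001": [1, 1, 0],
    "img_002": [0, 0, 0, 1],
    "img_003": [1],
}

aggregated = {ex: Counter(labels).most_common(1)[0][0]
              for ex, labels in worker_labels.items()}
print(aggregated)                      # {'img_001': 1, 'img_002': 0, 'img_003': 1}
```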
4.2 Improving Models

In addition to improving the data, there are also ways to improve the model training itself. Making the model training more robust against noise or bias is an active area of research. Another popular approach is to use transfer learning where previously-trained models are used as a starting point to train the current model.

4.2.1 Robust Against Noise and Bias

A common scenario in machine learning is that there is a large number of noisy or even adversarial labels and a relatively smaller number of clean labels. Simply discarding the noisy labels will result in reduced training data, which is not desirable for complex models. Hence, there has been extensive research (see the survey [181]) on how to make the model training still use noisy labels by becoming more robust. For specific techniques, Xiao et al. [167] propose a general framework for training convolutional neural networks on images with a small number of clean labels and many noisy labels. The idea is to model the relationships between images, class labels, and label noise with a probabilistic graphical model and integrate it into the model training. Label noise is categorized into two types: confusing noise, which is caused by confusing content in the images, and pure random noise, which is caused by technical bugs like mismatches between images and their surrounding text. The true labels and noise types are treated as latent variables, and an EM algorithm is used for inference. Webly supervised learning [168] is a technique for training a convolutional neural network on clean and noisy images from the Web. First, the model is trained on top-ranked images from search engines, which tend to be clean because they are highly-ranked, but also biased in the sense that objects tend to be centered in the image with a clean background. Then relationships are discovered among the clean images, which are then used to adapt the model to noisier images that are harder to classify. This method suggests that it is worth training on easy and hard data separately.

Goodfellow et al. [171] take a different approach where they explain why machine learning models including neural networks may misclassify adversarial examples. While previous research attempts to explain this phenomenon by focusing on nonlinearity and overfitting, the authors show that it is the model's linear behavior in high-dimensional spaces that makes it vulnerable. That is, making many small changes on the features of an example can result in a large change to the output prediction. As a result, generating large amounts of adversarial examples becomes easier using linear perturbation.
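The linearity argument can be sketched with a linear model: perturbing every feature by a tiny amount in the direction of the weights (for a network, the sign of the input gradient, i.e., the fast gradient sign method of [171]) shifts the output by an amount that grows with the dimensionality.

```python
# Linear-perturbation sketch: tiny per-feature changes add up to a large
# change in the output of a high-dimensional linear model.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=1000)              # weights of a high-dimensional linear model
x = rng.normal(size=1000)              # an input example
eps = 0.01                             # imperceptibly small change per feature

x_adv = x + eps * np.sign(w)           # move every feature slightly along sign(w)
print(w @ x_adv - w @ x)               # output shift ~ eps * sum(|w|) ~ 8, not 0.01
```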
Even if the labels themselves are clean, it may be the case that the labels are imbalanced. SMOTE [170] performs over-sampling for minority classes that need more examples. Simply replicating examples may lead to overfitting, so the over-sampling is done by generating synthetic examples using the minority examples and their nearest neighbors. He and Garcia [169] provide a comprehensive survey on learning from imbalanced data.
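The core SMOTE step can be sketched in a few lines: a synthetic minority example is created by interpolating between a minority example and one of its nearest minority-class neighbors.

```python
# One SMOTE-style synthetic example; the full algorithm in [170] repeats
# this for k neighbors and many minority examples.
import numpy as np

minority = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2]])

i = 0                                           # pick a minority example
dists = np.linalg.norm(minority - minority[i], axis=1)
nn = np.argsort(dists)[1]                       # its nearest minority neighbor
gap = np.random.rand()                          # random point on the segment
synthetic = minority[i] + gap * (minority[nn] - minority[i])
print(synthetic)                                # a new, synthetic minority example
```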
4.2.2 Transfer Learning

Transfer learning is a popular approach for training models when there is not enough training data or time to train from scratch. A common technique is to start from an existing model that is well trained (also called a source task) and incrementally train a new model (a target task) that already performs well. For example, convolutional neural networks like AlexNet [182] and VGGNet [183] can be used to train a model for a different, but related vision problem. Recently, Google announced TensorFlow Hub [174], which enables users to easily re-use an existing model to train an accurate model, even with a small dataset. Also, Google Cloud AutoML [6] provides transfer learning as a service. From a data management perspective, an interesting question is how these existing tools can be extended to index the metadata of models and provide search as a service, just like datasets. The metadata for models may be quite different from metadata for data because one needs to determine if a model can be used for transfer learning in one's own application. In addition to using pre-trained models, another popular technique mainly used in Computer Vision is few-shot learning [175] where the goal is to extend existing models to handle new classes using zero or more examples. Since transfer learning is primarily a machine learning topic that does not significantly involve data management, we only summarize the high-level ideas based on surveys [172], [173]. There are studies of transfer learning techniques in the context of NLP [176], Computer Vision [177], and deep learning [184] as well.
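A typical fine-tuning recipe is sketched below with Keras; the number of classes and the (commented-out) target dataset are placeholders, and the ImageNet-pretrained VGG16 backbone is frozen so that only a small new head is trained on the target task.

```python
# Fine-tuning sketch: reuse a pretrained backbone and train a new head.
import tensorflow as tf

num_classes = 4                                     # e.g., product defect types
base = tf.keras.applications.VGG16(weights="imagenet",
                                   include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                              # keep source-task features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(target_images, target_labels, epochs=5)  # small target dataset
```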
An early survey of transfer learning [172] identifies three main research issues in transfer learning: what to transfer, how to transfer, and when to transfer. That is, we need to decide what part of knowledge can be transferred, what methods should be used to transfer the knowledge, and whether transferring this knowledge is appropriate and does not have any negative effect. Inductive transfer learning is used when the source task and target task are different while the two domains may or may not be the same. Here a task can be categorizing a document while a domain could be a set of university webpages to categorize. Transductive transfer learning is used when the source and target tasks are the same, but the domains are different. Unsupervised transfer learning is similar to inductive transfer learning where the source and target tasks are different, but uses unsupervised learning tasks like clustering and dimensionality reduction.

The three approaches above can also be divided based on what to transfer. Instance-based transfer learning assumes that the examples of the source can be re-used in the target by re-weighting them. Feature-representation transfer learning assumes that the features that represent the data of the source task can be used to represent the data of the target task. Parameter transfer learning assumes that the source and target tasks share some parameters or prior distributions that can be re-used. Relational knowledge transfer learning assumes that certain relationships within the data of the source task can be re-used in the target task.

More recent surveys [173], [178] classify most of the traditional transfer learning techniques as homogeneous transfer learning where the feature spaces of the source and target tasks are the same. In addition, the surveys identify a relatively new class of techniques called heterogeneous transfer learning where the feature spaces are different, but the source and target examples are extracted from the same domain. Heterogeneous transfer learning largely falls into two categories: asymmetric and symmetric transformation. In an asymmetric approach, features of the source task are transformed to the features of the target task. In a symmetric approach, the assumption is that there is a common latent feature space that unifies the source and target features. Transfer learning has been successfully used in many applications including text sentiment analysis, image classification, human activity classification, software defect detection, and multi-language text classification.

5 PUTTING EVERYTHING TOGETHER

We now return to Sally's scenario and provide an end-to-end guideline for data collection (summarized as the workflow in Figure 2). If there is no or little data to start with, then Sally would need to acquire datasets. She can either search for relevant datasets on the Web or within the company data lake, or decide to generate a dataset herself by installing camera equipment for taking photos of the products within the factory. If the products also had some metadata, Sally could also augment that data with external information about the product.

Once the data is available, then Sally can choose among the labeling techniques using the categories discussed in Section 3. If there are enough existing labels, then self labeling using semi-supervised learning is an attractive option. There are many variants of self labeling depending on the assumptions on the model training as we studied. If there are not enough labels, Sally can decide to generate some using the crowd-based techniques using a budget. If there are only a few experts available for labeling, active learning may be the right choice, assuming that the important examples that influence the model can be narrowed down. If there are many workers who do not necessarily have expertise, general crowdsourcing methods can be used. If Sally does not have enough budget for crowd-based methods or if it is simply not worth the cost, and if the model training can tolerate weak labels, then weak supervision techniques like data programming and label propagation can be used.

If Sally has existing labels, she may also want to make sure whether they can be improved in quality. If the data is noisy or biased, then the various data cleaning techniques can be used. If there are existing models for product quality through tools like TensorFlow Hub [174], they can be used to further improve the model using transfer learning.

Through our experience, we also realize that it is not always easy to determine if there is enough data and labels. For example, even if the dataset is small or there are few labels, as long as the distribution of data is easy to learn, then automatic approaches like semi-supervised learning will do the job better than manual approaches like active learning. Another hard-to-measure factor is the amount of human effort needed. When comparing active learning versus data programming, we need to compare the tasks of labeling examples and implementing labeling functions, which are quite different. Depending on the application, implementing a program on examples can range from trivial (e.g., look for certain keywords) to almost impossible (e.g., general object detection). Hence, even if data programming is an attractive option, one must determine the actual effort of programming, which cannot be determined with a few yes or no questions.

Another thing to keep in mind is how the labeling techniques trade off accuracy and scalability. Manual labeling is obviously the most accurate, but least scalable. Active learning scales better than the manual approach, but is still limited to how fast humans can label. Data programming produces weak labels, which tend to have lower accuracy than manual labels. On the other hand, data programming can scale better than active learning assuming that the initial cost of implementing labeling functions and debugging them is reasonable. Semi-supervised learning obviously scales the best with automatic labeling. The labeling accuracy depends on the accuracy of the model trained on existing labels. Combining self labeling with active learning is a good example of taking the best of both worlds.

6 FUTURE RESEARCH CHALLENGES

Although data collection was traditionally a topic in machine learning, as the amount of training data is increasing, data management research is becoming just as relevant, and we are observing a convergence of the two disciplines. As such, there needs to be more awareness on how the research landscape will evolve for both communities and more effort to better integrate the techniques.

Data Evaluation An open question is how to evaluate whether the right data was collected with sufficient quantity. First, it may not be clear if we have found the best datasets for a machine learning task and whether the amount of data is enough to train a model with sufficient accuracy. In some cases, there may be too many datasets, and simply collecting and integrating all of them may have a negative effect on model training. As a result, selecting the right datasets becomes an important problem. Moreover, if the datasets are dynamic (e.g., they are streams of signals from sensors) and change in quality, then the choice of datasets may have to change dynamically as well. Second, many data discovery tools rely on dataset owners to annotate their datasets for better discovery, but more automatic techniques for understanding and extracting metadata from the data are needed.

While most of the data collection work assumes that the model training comes after the data collection, another important avenue is to augment or improve the data based on how the model performs. While there is a heavy literature on model interpretation [185], [186], it is not clear how to address feedback on the data level. In the model fairness literature [187], one approach to reducing unfairness is to fix the data. In data cleaning, ActiveClean and BoostClean are interesting approaches for fixing the data to improve model accuracy. A key challenge is analyzing the model, which becomes harder as models become more complicated.

Performance Tradeoff While traditional labeling techniques focus on accuracy, there is a recent push towards generating large amounts of weak labels. We need to better understand the tradeoffs of accuracy versus scalability to make informed decisions on which approach to use when. For example, simply having more weak labels does not necessarily mean the model's accuracy will eventually reach a perfect accuracy. At some point, it may be worth investing in humans or using transfer learning to make additional improvements. Such decisions can be made through some trial and error, but an interesting question is whether there is a more systematic way to do such evaluations.

Crowdsourcing Despite the many efforts in crowdsourcing, leveraging humans is still a non-trivial task. Dealing with humans involves designing the right tasks and interfaces, ensuring that the worker quality is good enough, and setting the right price for tasks. The recent data programming paradigm introduces a new set of challenges where workers now have to implement labeling functions instead of providing labels themselves. One idea is to improve the quality of such collaborative programming by making the programming of labeling functions drastically easier, say by introducing libraries or templates for programming.

Much of the research effort has been focused on classification tasks and much less on regression tasks. An interesting question is which classification techniques can also be extended to regression. It is also worth exploring if application-specific techniques can be generalized further. For example, the NELL system continuously extracts facts from the Web indefinitely. This idea can possibly be applied to collecting any type of data from any source, although the technical details may differ. Finally, given the variety of techniques for data collection, there needs to be more research on end-to-end solutions that combine techniques for data acquisition, data labeling, and improvements of existing data and models.

7 CONCLUSION

As machine learning becomes more widely used, it becomes more important to acquire large amounts of data and label data, especially for state-of-the-art neural networks. Traditionally, the machine learning, natural language processing, and computer vision communities have contributed to this problem – primarily on data labeling techniques including semi-supervised learning and active learning. Recently, in the era of Big data, the data management community is also contributing to numerous subproblems in data acquisition, data labeling, and improvement of existing data. In this survey, we have investigated the research landscape of how all these techniques complement each other and have provided guidelines on deciding which technique can be used when. Finally, we have discussed interesting data collection challenges that remain to be addressed. In the future, we expect the integration of Big data and AI to happen not only in data collection, but in all aspects of machine learning.

ACKNOWLEDGMENTS

This research was supported by the Engineering Research Center Program through the National Research Foundation of Korea (NRF) funded by the Korean Government MSIT (NRF-2018R1A5A1059921), by SK Telecom, and by a Google AI Focused Research Award.
[10] A. P. Bhardwaj, S. Bhattacherjee, A. Chavan, A. Deshpande, A. J. Elmore, S. Madden, and A. G. Parameswaran, “Datahub: Collaborative data science & dataset version management at scale,” in CIDR, 2015.
[11] S. Bhattacherjee, A. Chavan, S. Huang, A. Deshpande, and A. Parameswaran, “Principles of dataset versioning: Exploring the recreation/storage tradeoff,” PVLDB, vol. 8, no. 12, pp. 1346–1357, Aug. 2015.
[12] A. Y. Halevy, “Data publishing and sharing using fusion tables,” in CIDR, 2013.
[13] H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, and W. Shen, “Google fusion tables: data management, integration and collaboration in the cloud,” in SoCC, 2010, pp. 175–180.
[14] H. Gonzalez, A. Y. Halevy, C. S. Jensen, A. Langen, J. Madhavan, R. Shapley, W. Shen, and J. Goldberg-Kidon, “Google fusion tables: web-centered data management and collaboration,” in SIGMOD, 2010, pp. 1061–1066.
[15] “Ckan,” http://ckan.org.
[16] “Quandl,” https://www.quandl.com.
[17] “Datamarket,” https://datamarket.com.
[18] “Kaggle,” https://www.kaggle.com/.
[19] I. G. Terrizzano, P. M. Schwarz, M. Roth, and J. E. Colino, “Data wrangling: The challenging journey from the wild to the lake,” in CIDR, 2015.
[20] A. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang, “Goods: Organizing google’s datasets,” in SIGMOD, 2016, pp. 795–806.
[21] R. Castro Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang, “A demo of the data civilizer system,” in SIGMOD, 2017, pp. 1639–1642.
[22] D. Deng, R. C. Fernandez, Z. Abedjan, S. Wang, M. Stonebraker, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, and N. Tang, “The data civilizer system,” in CIDR, 2017.
[23] Y. Gao, S. Huang, and A. G. Parameswaran, “Navigating the data lake with DATAMARAN: automatically extracting structure from log datasets,” in SIGMOD, 2018, pp. 943–958.
[24] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “Webtables: Exploring the power of tables on the web,” PVLDB, vol. 1, no. 1, pp. 538–549, 2008.
[25] M. J. Cafarella, A. Y. Halevy, H. Lee, J. Madhavan, C. Yu, D. Z. Wang, and E. Wu, “Ten years of webtables,” PVLDB, vol. 11, no. 12, pp. 2140–2149, 2018.
[26] “Google dataset search,” https://www.blog.google/products/search/making-it-easier-discover-datasets/.
[27] H. Elmeleegy, J. Madhavan, and A. Halevy, “Harvesting relational tables from lists on the web,” The VLDB Journal, vol. 20, no. 2, pp. 209–226, Apr. 2011.
[28] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam, “Tegra: Table extraction by global record alignment,” in SIGMOD, 2015, pp. 1713–1728.
[29] K. Chakrabarti, S. Chaudhuri, Z. Chen, K. Ganjam, and Y. He, “Data services leveraging bing’s data assets,” IEEE Data Eng. Bull., vol. 39, no. 3, pp. 15–28, 2016.
[30] M. J. Cafarella, A. Halevy, and N. Khoussainova, “Data integration for the relational web,” PVLDB, vol. 2, no. 1, pp. 1090–1101, Aug. 2009.
[31] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, “Infogather: Entity augmentation and attribute discovery by holistic matching with web tables,” in SIGMOD, 2012, pp. 97–108.
[32] N. N. Dalvi, R. Kumar, and M. A. Soliman, “Automatic wrappers for large scale web extraction,” PVLDB, vol. 4, no. 4, pp. 219–230, 2011.
[33] P. Bohannon, N. N. Dalvi, Y. Filmus, N. Jacoby, S. S. Keerthi, and A. Kirpal, “Automatic web-scale information extraction,” in SIGMOD, 2012, pp. 609–612.
[34] R. Baumgartner, W. Gatterbauer, and G. Gottlob, “Web data extraction system,” in Encyclopedia of Database Systems, Second Edition, 2018.
[35] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
[36] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in EMNLP, 2014, pp. 1532–1543.
[37] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[38] A. Kumar, J. Naughton, J. M. Patel, and X. Zhu, “To join or not to join?: Thinking twice about joins before feature selection,” in SIGMOD, 2016, pp. 19–34.
[39] V. Shah, A. Kumar, and X. Zhu, “Are key-foreign key joins safe to avoid when learning high-capacity classifiers?” PVLDB, vol. 11, no. 3, pp. 366–379, Nov. 2017.
[40] M. Stonebraker and I. F. Ilyas, “Data integration: The current status and the way forward,” IEEE Data Eng. Bull., vol. 41, no. 2, pp. 3–9, 2018.
[41] A. Doan, A. Y. Halevy, and Z. G. Ives, Principles of Data Integration. Morgan Kaufmann, 2012.
[42] S. Li, L. Chen, and A. Kumar, “Enabling and optimizing non-linear feature interactions in factorized linear algebra,” in SIGMOD, 2019, pp. 1571–1588.
[43] L. Chen, A. Kumar, J. F. Naughton, and J. M. Patel, “Towards linear algebra over normalized data,” PVLDB, vol. 10, no. 11, pp. 1214–1225, 2017.
[44] A. Kumar, J. F. Naughton, J. M. Patel, and X. Zhu, “To join or not to join?: Thinking twice about joins before feature selection,” in SIGMOD, 2016, pp. 19–34.
[45] G. Little, L. B. Chilton, M. Goldman, and R. C. Miller, “Turkit: Human computation algorithms on mechanical turk,” in UIST, 2010, pp. 57–66.
[46] D. W. Barowy, C. Curtsinger, E. D. Berger, and A. McGregor, “Automan: A platform for integrating human-based and digital computation,” in OOPSLA, 2012, pp. 639–654.
[47] S. Ahmad, A. Battle, Z. Malkani, and S. Kamvar, “The jabberwocky programming environment for structured social computing,” in UIST, 2011, pp. 53–64.
[48] H. Park, R. Pang, A. G. Parameswaran, H. Garcia-Molina, N. Polyzotis, and J. Widom, “Deco: A system for declarative crowdsourcing,” PVLDB, vol. 5, no. 12, pp. 1990–1993, 2012.
[49] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “Crowddb: Answering queries with crowdsourcing,” in SIGMOD, 2011, pp. 61–72.
[50] A. Marcus, E. Wu, S. Madden, and R. C. Miller, “Crowdsourced databases: Query processing with people,” in CIDR, 2011, pp. 211–214.
[51] R. Boim, O. Greenshpan, T. Milo, S. Novgorodov, N. Polyzotis, and W. C. Tan, “Asking the right questions in crowd data sourcing,” in ICDE, 2012, pp. 1261–1264.
[52] M. J. Franklin, B. Trushkowsky, P. Sarkar, and T. Kraska, “Crowdsourced enumeration queries,” in ICDE, 2013, pp. 673–684.
[53] H. Park and J. Widom, “Crowdfill: collecting structured data from the crowd,” in SIGMOD, 2014, pp. 577–588.
[54] V. Crescenzi, P. Merialdo, and D. Qiu, “Crowdsourcing large scale wrapper inference,” vol. 33, pp. 1–28, 2014.
[55] M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu, “Data curation at scale: The data tamer system,” in CIDR, 2013.
[56] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu, “Corleone: Hands-off crowdsourcing for entity matching,” in SIGMOD, 2014, pp. 601–612.
[57] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
[58] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Generating multi-label discrete patient records using generative adversarial networks,” in MLHC, 2017, pp. 286–305.
[59] N. Park, M. Mohammadi, K. Gorde, S. Jajodia, H. Park, and Y. Kim, “Data synthesis based on generative adversarial networks,” PVLDB, vol. 11, no. 10, pp. 1071–1083, 2018.
[60] L. Xu and K. Veeramachaneni, “Synthesizing tabular data using generative adversarial networks,” CoRR, vol. abs/1811.11264, 2018.
[61] I. J. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” CoRR, vol. abs/1701.00160, 2017.
[62] K. Kurach, M. Lucic, X. Zhai, M. Michalski, and S. Gelly, “The GAN landscape: Losses, architectures, regularization, and normalization,” CoRR, vol. abs/1807.04720, 2018.
[63] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le, “Autoaugment: Learning augmentation policies from data,” CoRR, vol. abs/1805.09501, 2018.
[64] A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré, “Learning to compose domain-specific transformations for data augmentation,” in NIPS, 2017, pp. 3239–3249.
[65] X. Peng, B. Sun, K. Ali, and K. Saenko, “Learning deep object detectors from 3d models,” in ICCV, 2015, pp. 1278–1286.
[66] S. Kim, G. Choe, B. Ahn, and I. Kweon, “Deep representation of industrial components using simulated images,” in ICRA, 2017, pp. 2003–2010.
[67] T. Oh, R. Jaroensri, C. Kim, M. A. Elgharib, F. Durand, W. T. Freeman, and W. Matusik, “Learning-based video motion magnification,” CoRR, vol. abs/1804.02684, 2018.
[68] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” CoRR, vol. abs/1406.2227, 2014.
[69] A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in CVPR, June 2016.
[70] Y. Xia, X. Cao, F. Wen, and J. Sun, “Well begun is half done: Generating high-quality seeds for automatic image dataset construction from web,” in ECCV, 2014, pp. 387–400.
[71] Y. Bai, K. Yang, W. Yu, C. Xu, W. Ma, and T. Zhao, “Automatic image dataset construction from click-through logs using deep neural network,” in MM, 2015, pp. 441–450.
[72] J. Mallinson, R. Sennrich, and M. Lapata, “Paraphrasing revisited with neural machine translation,” in EACL. Association for Computational Linguistics, 2017, pp. 881–893.
[73] M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer, “Adversarial example generation with syntactically controlled paraphrase networks,” CoRR, vol. abs/1804.06059, 2018.
[74] M. T. Ribeiro, S. Singh, and C. Guestrin, “Semantically equivalent adversarial rules for debugging nlp models,” in ACL, 2018.
[75] A. Y. Halevy, F. Korn, N. F. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang, “Managing google’s data lake: an overview of the goods system,” IEEE Data Eng. Bull., vol. 39, no. 3, pp. 5–14, 2016.
[76] J. X. Yu, L. Qin, and L. Chang, “Keyword search in relational databases: A survey,” IEEE Data Eng. Bull., vol. 33, no. 1, pp. 67–78, 2010.
[77] S. Chaudhuri and G. Das, “Keyword querying and ranking in databases,” PVLDB, vol. 2, no. 2, pp. 1658–1659, 2009.
[78] R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and M. Stonebraker, “Aurum: A data discovery system,” in ICDE, 2018, pp. 1001–1012.
[79] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang, “Seeping semantics: Linking datasets using word embeddings for data discovery,” in ICDE, 2018, pp. 989–1000.
[80] Q. V. Le and T. Mikolov, “Distributed representations of sentences and documents,” in ICML, 2014, pp. 1188–1196.
[81] “Crowdsourced data management: Industry and academic perspectives,” Foundations and Trends in Databases, vol. 6, 2015.
[82] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar, “Quality control in crowdsourcing systems: Issues and directions,” IEEE Internet Computing, vol. 17, no. 2, pp. 76–81, March 2013.
[83] F. Daniel, P. Kucherbaev, C. Cappiello, B. Benatallah, and M. Allahbakhsh, “Quality control in crowdsourcing: A survey of quality attributes, assessment techniques, and assurance actions,” ACM Comput. Surv., vol. 51, no. 1, pp. 7:1–7:40, Jan. 2018.
[84] G. Li, J. Wang, Y. Zheng, and M. J. Franklin, “Crowdsourced data management: A survey,” IEEE TKDE, vol. 28, no. 9, pp. 2296–2319, Sept 2016.
[85] “Amazon mechanical turk,” https://www.mturk.com.
[86] J. Kim, S. Sterman, A. A. B. Cohen, and M. S. Bernstein, “Mechanical novel: Crowdsourcing complex work through reflection and revision,” in CSCW, 2017, pp. 233–245.
[87] N. Salehi, J. Teevan, S. T. Iqbal, and E. Kamar, “Communicating context to the crowd for complex writing tasks,” in CSCW, 2017, pp. 1890–1901.
[88] H. Garcia-Molina, M. Joglekar, A. Marcus, A. Parameswaran, and V. Verroios, “Challenges in data crowdsourcing,” IEEE TKDE, vol. 28, no. 4, pp. 901–911, Apr. 2016.
[89] Y. Amsterdamer and T. Milo, “Foundations of crowd data sourcing,” SIGMOD Rec., vol. 43, no. 4, pp. 5–14, Feb. 2015.
[90] N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,” in DSAA, 2016, pp. 399–410.
[91] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011.
[92] D. Yarowsky, “Unsupervised word sense disambiguation rivaling supervised methods,” in ACL, 1995, pp. 189–196.
[93] Z.-H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data using three classifiers,” IEEE TKDE, vol. 17, no. 11, pp. 1529–1541, Nov. 2005.
[94] Y. Zhou and S. A. Goldman, “Democratic co-learning,” in IEEE ICTAI, 2004, pp. 594–602.
[95] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” in COLT, 1998, pp. 92–100.
[96] I. Triguero, S. García, and F. Herrera, “Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study,” Knowl. Inf. Syst., vol. 42, no. 2, pp. 245–284, 2015.
[97] U. Brefeld, T. Gärtner, T. Scheffer, and S. Wrobel, “Efficient co-regularised least squares regression,” in ICML, 2006, pp. 137–144.
[98] V. Sindhwani and P. Niyogi, “A co-regularized approach to semi-supervised learning with multiple views,” in Proceedings of the ICML Workshop on Learning with Multiple Views, 2005.
[99] Z.-H. Zhou and M. Li, “Semi-supervised regression with co-training,” in IJCAI, 2005, pp. 908–913.
[100] S. Ravi and Q. Diao, “Large scale distributed semi-supervised learning using streaming approximation,” in AISTATS, 2016, pp. 519–528.
[101] X. Zhu, Z. Ghahramani, and J. Lafferty, “Semi-supervised learning using gaussian fields and harmonic functions,” in ICML, 2003, pp. 912–919.
[102] P. P. Talukdar and W. W. Cohen, “Scaling graph-based semi supervised learning to large number of labels using count-min sketch,” in AISTATS, 2014, pp. 940–947.
[103] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” in SIGIR, 1994, pp. 3–12.
[104] H. S. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in COLT, 1992, pp. 287–294.
[105] B. Settles, Active Learning, ser. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.
[106] N. Abe and H. Mamitsuka, “Query learning strategies using boosting and bagging,” in ICML, 1998, pp. 1–9.
[107] B. Settles and M. Craven, “An analysis of active learning strategies for sequence labeling tasks,” in EMNLP, 2008, pp. 1070–1079.
[108] B. Settles, M. Craven, and S. Ray, “Multiple-instance active learning,” in NIPS, 2007, pp. 1289–1296.
[109] N. Roy and A. McCallum, “Toward optimal active learning through sampling estimation of error reduction,” in ICML, 2001, pp. 441–448.
[110] R. Burbidge, J. J. Rowland, and R. D. King, “Active learning for regression based on query by committee,” in IDEAL, 2007, pp. 209–218.
[111] A. McCallum and K. Nigam, “Employing em and pool-based active learning for text classification,” in ICML, 1998, pp. 350–358.
[112] K. Tomanek and U. Hahn, “Semi-supervised active learning for sequence labeling,” in ACL, 2009, pp. 1039–1047.
[113] Z.-H. Zhou, K.-J. Chen, and Y. Jiang, “Exploiting unlabeled data in content-based image retrieval,” in ECML, 2004, pp. 525–536.
[114] X. Zhu, J. Lafferty, and Z. Ghahramani, “Combining active learning and semi-supervised learning using gaussian fields and harmonic functions,” in ICML 2003 workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003, pp. 58–65.
[115] J. C. Chang, S. Amershi, and E. Kamar, “Revolt: Collaborative crowdsourcing for labeling machine learning datasets,” in CHI, 2017, pp. 2334–2346.
[116] B. Mozafari, P. Sarkar, M. Franklin, M. Jordan, and S. Madden, “Scaling up crowd-sourcing to very large datasets: A case for active learning,” PVLDB, vol. 8, no. 2, pp. 125–136, Oct. 2014.
[117] T. Kulesza, S. Amershi, R. Caruana, D. Fisher, and D. X. Charles, “Structured labeling for facilitating concept evolution in machine learning,” in CHI, 2014, pp. 3075–3084.
[118] J. Wang, T. Kraska, M. J. Franklin, and J. Feng, “Crowder: Crowdsourcing entity resolution,” PVLDB, vol. 5, no. 11, pp. 1483–1494, 2012.
[119] V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label? improving data quality and data mining using multiple, noisy labelers,” in KDD, 2008, pp. 614–622.
[120] A. G. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom, “Crowdscreen: algorithms for filtering data with humans,” in SIGMOD, 2012, pp. 361–372.
[121] D. R. Karger, S. Oh, and D. Shah, “Iterative learning for reliable crowdsourcing systems,” in NIPS, 2011, pp. 1953–1961.
[122] O. Dekel and O. Shamir, “Vox populi: Collecting high-quality labels from a crowd,” in COLT, 2009.
[123] A. Marcus, D. R. Karger, S. Madden, R. Miller, and S. Oh, “Counting with the crowd,” PVLDB, vol. 6, no. 2, pp. 109–120, 2012.
[124] C. Zhang, “Deepdive: A data management system for automatic knowledge base construction,” Ph.D. dissertation, 2015.
[125] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Ré, “Data programming with ddlite: putting humans in a different part of the loop,” in HILDA@SIGMOD, 2016, p. 13.
[126] A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré, “Data programming: Creating large training sets, quickly,” in NIPS, 2016, pp. 3567–3575.
[127] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré, “Snorkel: Rapid training data creation with weak supervision,” PVLDB, vol. 11, no. 3, pp. 269–282, Nov. 2017.
[128] S. H. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi, R. Kuchhal, C. Ré, and R. Malkin, “Snorkel drybell: A case study in deploying weak supervision at industrial scale,” in SIGMOD, 2019, pp. 362–375.
[129] E. Bringer, A. Israeli, Y. Shoham, A. Ratner, and C. Ré, “Osprey: Weak supervision of imbalanced extraction problems without code,” in DEEM@SIGMOD, 2019, pp. 4:1–4:11.
[130] A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. Ré, “Snorkel metal: Weak supervision for multi-task learning,” in DEEM@SIGMOD, 2018, pp. 3:1–3:4.
[131] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: A collaboratively created graph database for structuring human knowledge,” in SIGMOD, 2008, pp. 1247–1250.
[132] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: A core of semantic knowledge,” in WWW, 2007, pp. 697–706.
[133] F. Mahdisoltani, J. Biega, and F. M. Suchanek, “YAGO3: A knowledge base from multilingual wikipedias,” in CIDR, 2015.
[134] Mausam, M. Schmitz, R. Bart, S. Soderland, and O. Etzioni, “Open language learning for information extraction,” in EMNLP-CoNLL, 2012, pp. 523–534.
[135] N. Sawadsky, G. C. Murphy, and R. Jiresal, “Reverb: Recommending code-related web pages,” in ICSE, 2013, pp. 812–821.
[136] M. Yahya, S. Whang, R. Gupta, and A. Y. Halevy, “Renoun: Fact extraction for nominal attributes,” in EMNLP, 2014, pp. 325–335.
[137] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, “Knowledge vault: A web-scale approach to probabilistic knowledge fusion,” in KDD, 2014, pp. 601–610.
[138] R. Gupta, A. Halevy, X. Wang, S. E. Whang, and F. Wu, “Biperpedia: An ontology for search applications,” PVLDB, vol. 7, no. 7, pp. 505–516, Mar. 2014.
[139] V. Crescenzi, G. Mecca, and P. Merialdo, “Roadrunner: Towards automatic data extraction from large web sites,” in VLDB, 2001, pp. 109–118.
[140] O. Etzioni, M. J. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, “Web-scale information extraction in knowitall: (preliminary results),” in WWW, 2004, pp. 100–110.
[141] T. M. Mitchell, W. W. Cohen, E. R. H. Jr., P. P. Talukdar, J. Betteridge, A. Carlson, B. D. Mishra, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. A. Platanios, A. Ritter, M. Samadi, B. Settles, R. C. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling, “Never-ending learning,” in AAAI, 2015, pp. 2302–2310.
[142] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell, “Toward an architecture for never-ending language learning,” in AAAI, 2010.
[143] X. Zhu, “Semi-Supervised Learning Literature Survey,” Computer Sciences, University of Wisconsin-Madison, Tech. Rep.
[144] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,” 2017.
[145] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, and S. García, “Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework.” Multiple-Valued Logic and Soft Computing, no. 2-3, pp. 255–287.
[146] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
[147] F. Ricci, L. Rokach, and B. Shapira, Eds., Recommender Systems Handbook. Springer, 2015.
[148] Y. Gu, Z. Jin, and S. C. Chiu, “Combining active learning and semi-supervised learning using local and global consistency,” in Neural Information Processing, C. K. Loo, K. S. Yap, K. W. Wong, A. Teoh, and K. Huang, Eds. Springer International Publishing, 2014, pp. 215–222.
[149] V. Crescenzi, P. Merialdo, and D. Qiu, “Crowdsourcing large scale wrapper inference,” Distrib. Parallel Databases, vol. 33, no. 1, pp. 95–122, Mar. 2015.
[150] M. Schaekermann, J. Goh, K. Larson, and E. Law, “Resolvable vs. irresolvable disagreement: A study on worker deliberation in crowd work,” PACMHCI, vol. 2, no. CSCW, pp. 154:1–154:19, 2018.
[151] “Weak supervision: The new programming paradigm for machine learning,” https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html.
[152] A. J. Ratner, B. Hancock, and C. Ré, “The role of massively multi-task and weak supervision in software 2.0,” in CIDR, 2019.
[153] Z.-H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, 08 2017.
[154] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Ré, “Data programming with ddlite: Putting humans in a different part of the loop,” in HILDA@SIGMOD, 2016, pp. 13:1–13:6.
[155] A. J. Ratner, S. H. Bach, H. R. Ehrenberg, and C. Ré, “Snorkel: Fast training set generation for information extraction,” in SIGMOD, 2017, pp. 1683–1686.
[156] M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision for relation extraction without labeled data,” in ACL, 2009, pp. 1003–1011.
[157] E. Ferrara, P. D. Meo, G. Fiumara, and R. Baumgartner, “Web data extraction, applications and techniques: A survey,” Knowl.-Based Syst., vol. 70, pp. 301–323, 2014.
[158] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré, “Holoclean: Holistic data repairs with probabilistic inference,” PVLDB, vol. 10, no. 11, pp. 1190–1201, 2017.
[159] S. Krishnan, J. Wang, E. Wu, M. J. Franklin, and K. Goldberg, “Activeclean: Interactive data cleaning for statistical modeling,” PVLDB, vol. 9, no. 12, pp. 948–959, 2016.
[160] S. Krishnan, M. J. Franklin, K. Goldberg, and E. Wu, “Boostclean: Automated error detection and repair for machine learning,” CoRR, vol. abs/1711.01299, 2017.
[161] M. Dolatshah, M. Teoh, J. Wang, and J. Pei, “Cleaning Crowdsourced Labels Using Oracles For Statistical Classification,” School of Computer Science, Simon Fraser University, Tech. Rep., 2018.
[162] V. Raman and J. M. Hellerstein, “Potter’s wheel: An interactive data cleaning system,” in VLDB, 2001, pp. 381–390.
[163] S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer, “Wrangler: interactive visual specification of data transformation scripts,” in CHI, 2011, pp. 3363–3372.
[164] S. Kandel, R. Parikh, A. Paepcke, J. M. Hellerstein, and J. Heer, “Profiler: integrated statistical analysis and visualization for data quality assessment,” in AVI, 2012, pp. 547–554.
[165] W. R. Harris and S. Gulwani, “Spreadsheet table transformations from examples,” in PLDI, 2011, pp. 317–328.
[166] K. H. Tae, Y. Roh, Y. H. Oh, H. Kim, and S. E. Whang, “Data cleaning for accurate, fair, and robust models: A big data - AI integration approach,” in DEEM@SIGMOD, 2019.
[167] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification.” in CVPR. IEEE Computer Society, pp. 2691–2699.
[168] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in ICCV, 2015, pp. 1431–1439.
[169] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE TKDE, vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[170] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: Synthetic minority over-sampling technique,” J. Artif. Int. Res., vol. 16, no. 1, pp. 321–357, Jun. 2002.
[171] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” CoRR, vol. abs/1412.6572, 2014.
[172] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE TKDE, vol. 22, no. 10, pp. 1345–1359, Oct. 2010.