Edge-based Discovery of Training Data for Machine Learning
Ziqiang (Edmond) Feng, Shilpa George, Jan Harkes, Padmanabhan Pillai†, Roberta Klatzky, Mahadev Satyanarayanan
Carnegie Mellon University and †Intel Labs
(Cartoon: The New Yorker magazine, April 20, 2018, p. 41)
The Deep Learning Recipe
Collect a large amount of data and label it
Select a model and train a DNN
Deploy the DNN for inference
TPOD @ CMU
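Today's libraries and services make the last two steps routine. As a rough illustration only (the dataset path, model choice and hyperparameters below are placeholders, not anything from the paper), fine-tuning and deploying a detector can be a few lines of PyTorch:

```python
import torch
from torchvision import datasets, transforms, models

# Step 2: select a model and train (fine-tune) a DNN on the labeled data.
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("labeled_examples/", transform=tfm)   # placeholder path
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, len(train_set.classes))
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()

# Step 3: deploy the trained DNN for inference.
model.eval()
torch.save(model.state_dict(), "detector.pt")
```

The hard part is the first step: producing the labeled examples that feed this loop.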
DNNs for Domain Experts
Valuable in ecology, military intelligence, medical diagnosis, etc.
• Low base rate (prevalence) in the data
• Requires expertise to identify
Masked palm civet (Paguma larvata): transmitter of SARS during its 2003 outbreak in China.
BUK-M1: believed to have shot down MH17 in 2014, killing 298 people.
Nuclear atypia in cancer.
Building a Training Set Is Hard
 Crowds are not experts
Crowd-sourcing (e.g., Amazon Mechanical Turk) is not applicable
 Restricted access to data
Patient privacy, business policy, national security, etc.
In the worst case, a single domain expert has to generate an entire training set of 10^3 to 10^4 examples.
Masked palm civet / Red panda / Raccoon (only an expert can reliably tell them apart)
Our Contribution: Eureka
 A system for efficient discovery of training examples from data sources dispersed over the Internet (focus on images in this paper)
 Goal: to effectively utilize an expert’s time and attention
 Key concepts:
 Early discard
 Iterative discovery workflow
 Edge computing
Eureka’s Architecture
[Figure: an expert with a domain-specific GUI connects over the Internet to several cloudlets; each cloudlet has LAN access to an archival data source or a live video feed.]
Cloudlets have high-bandwidth, low-latency access to the data and execute early-discard code to drop clearly irrelevant data.
Only a tiny fraction of the data, along with meta-data, is transmitted and shown to the user, consuming little Internet bandwidth.
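A minimal sketch of what such early-discard code might look like on a cloudlet (a toy colour filter with made-up thresholds; not Eureka's actual filters or interfaces): items that fail any filter never leave the LAN.

```python
from pathlib import Path
from PIL import Image
import numpy as np

def rgb_histogram_filter(path, min_target_fraction=0.05):
    """Toy colour filter: pass images with enough pixels in a hand-picked colour range."""
    img = np.asarray(Image.open(path).convert("RGB").resize((64, 64)), dtype=np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    in_range = (r > 90) & (r < 200) & (g > 60) & (g < 160) & (b < 120)   # made-up "deer brown"
    return in_range.mean() >= min_target_fraction

FILTERS = [rgb_histogram_filter]   # cheapest filters first; stronger ones are added in later iterations

def early_discard(data_dir, send_to_user):
    """Runs on the cloudlet, next to the data; only surviving items cross the Internet."""
    for path in Path(data_dir).glob("*.jpg"):
        if all(f(path) for f in FILTERS):
            send_to_user(path.read_bytes(), {"source": str(path)})
```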
Example GUI: Finding Deer
[Screenshot: the GUI shows the list of early-discard filters and the images that passed them.]
Iterative Discovery Workflow
Explicit features, manual weights (RGB histogram, SIFT, perceptual hashing)
Explicit features, learned weights (HOG + SVM)
Shallow transfer learning (MobileNet + SVM)
Deep transfer learning (Faster R-CNN fine-tuning)
Deep learning
[Figure: accuracy (not to scale) vs. number of examples (log scale, 10^0 to 10^4); each successive step needs more examples but reaches higher accuracy.]
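For instance, the third step ("shallow transfer learning") could be realised as a frozen pretrained MobileNet feature extractor plus an SVM trained on the few dozen examples collected so far. The following is a sketch of that generic technique under assumed libraries (torchvision and scikit-learn), not Eureka's implementation:

```python
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
from sklearn.svm import SVC

# Frozen pretrained MobileNet used only as a feature extractor.
backbone = models.mobilenet_v2(pretrained=True).features.eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def embed(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return backbone(x).mean(dim=[2, 3]).squeeze(0).numpy()   # global-average-pooled features

def train_filter(positive_paths, negative_paths):
    """Train an SVM on the handful of examples discovered in earlier iterations."""
    X = np.stack([embed(p) for p in positive_paths + negative_paths])
    y = np.array([1] * len(positive_paths) + [0] * len(negative_paths))
    return SVC(kernel="linear", probability=True).fit(X, y)

def svm_filter(svm, path, threshold=0.5):
    """Early-discard predicate: keep the image only if the SVM is confident enough."""
    return svm.predict_proba(embed(path).reshape(1, -1))[0, 1] >= threshold
```

As more positives are collected, the same workflow escalates to fine-tuning a deep detector.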
Finding Deer (after a few iterations)
System Design and Implementation
 Software generality: allow the use of CV code written in different languages, libraries and frameworks (e.g., Python, Matlab, C++, TensorFlow, PyTorch, Scikit-learn)
 Empower experts with the newest CV innovations quickly
 Encapsulate filters in Docker containers
 Runtime efficiency: rapidly process and discard large volumes of data
 Exploit specialized hardware on cloudlets (e.g., GPUs)
 Cache filter results to exploit temporal locality
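The result cache can be pictured as a small content-addressed store keyed by the item's hash and the filter's identity, so that re-running an unchanged filter over already-seen data in a later iteration is nearly free. This is an illustrative sketch, not Eureka's actual cache design:

```python
import hashlib
import sqlite3

class FilterCache:
    """Memoize (item, filter) -> pass/fail decisions across Eureka iterations."""

    def __init__(self, db_path="filter_cache.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS results (item_hash TEXT, filter_id TEXT, passed INTEGER,"
            " PRIMARY KEY (item_hash, filter_id))"
        )

    def run(self, filter_id, filter_fn, item_bytes):
        h = hashlib.sha256(item_bytes).hexdigest()
        row = self.db.execute(
            "SELECT passed FROM results WHERE item_hash=? AND filter_id=?", (h, filter_id)
        ).fetchone()
        if row is not None:                      # temporal locality: same data, same filter
            return bool(row[0])
        passed = filter_fn(item_bytes)           # cache miss: run the (possibly expensive) filter
        self.db.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)", (h, filter_id, int(passed)))
        self.db.commit()
        return passed
```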
Matching System to User
The system should deliver images to the user at the rate the user can inspect them.
Too fast: wasting computation and precious Internet bandwidth 
Suggestions:
1. Restrict the search to fewer cloudlets
2. Bias filters towards precision rather than recall
Matching System to User (cont’d)
Too slow: wasting expert time 
Obvious solution: scale out to more cloudlets (edge computing is your friend)
Risk: “junk” (false positives) causes user annoyance and dissatisfaction
Rule of thumb: focus on reducing the false positive rate before scaling out
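One simple way to bias a filter towards precision rather than recall is to raise its score threshold until it meets a precision target on a small held-out set. The sketch below is illustrative (it assumes a hypothetical scored filter; it is not from the paper); higher thresholds show the expert fewer, cleaner images at the cost of missed positives.

```python
def choose_threshold(scores, labels, target_precision=0.9):
    """Pick the lowest score threshold whose precision on held-out data meets the target.

    scores: filter confidence per validation image; labels: 1 for true positive, 0 otherwise.
    A higher threshold delivers fewer (but cleaner) images, protecting the expert's attention
    and the Internet link; a lower one increases recall at the cost of more junk.
    """
    for t in sorted(set(scores)):
        kept = [(s, y) for s, y in zip(scores, labels) if s >= t]
        if kept and sum(y for _, y in kept) / len(kept) >= target_precision:
            return t
    return max(scores)   # fall back to keeping almost nothing
```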
Evaluation: Setup
Dataset: YFCC100M, 99.2 million Flickr photos with a real-life distribution of objects, evenly partitioned over the cloudlets.
Edge: 8 cloudlets with Nvidia GPUs, accessing data from local SSDs.
Client: connected to the cloudlets via the Internet.
Evaluation: Case Studies
                              Deer         Taj Mahal    Fire hydrant
Estimated base rate           0.07%        0.02%        0.005%
Collected positives           111          105          74
Images viewed by user         7,447        4,791        15,379
Images discarded by Eureka    2,104,076    2,542,889    2,734,070
Eureka vs. Brute-force
[Figure: number of images the user viewed to collect ~100 true positives (log scale, 1,000 to 1,000,000) for Deer, Taj Mahal and Fire hydrant, comparing Brute-force, Single-iteration Eureka and Eureka.]
Brute-force: the user views every image.
Single-iteration Eureka: early discard without iterative improvement.
Please refer to our paper for detailed results of each case study.
Iteratively Improving Productivity: the case of deer
[Figure: productivity (new true positives per minute) over five Eureka iterations: 0.40, 0.36, 1.49, 4.24, 4.77, roughly a 10X improvement.]
Compute Must Co-locate with Data
[Figure: machine processing throughput (items/sec, 0 to 1,000) of an RGB histogram filter while throttling the bandwidth between the cloudlet and its data source to 10 Mbps, 25 Mbps, 100 Mbps and 1 Gbps. For reference, US average connectivity in 2017 was 18.7 Mbps.]
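A back-of-envelope calculation (my own numbers; the average image size is an assumption, not a figure from the paper) shows why the data path dominates at WAN speeds:

```python
def max_images_per_second(bandwidth_mbps, avg_image_mb=0.5):
    """Upper bound on filter throughput when every image must first cross the link."""
    return (bandwidth_mbps / 8) / avg_image_mb    # MB/s divided by MB per image

for bw in (10, 18.7, 25, 100, 1000):
    print(f"{bw:>6} Mbps -> at most {max_images_per_second(bw):7.1f} images/sec")
# At the 2017 US-average WAN speed (~18.7 Mbps) the link alone caps a filter at a few
# images per second, regardless of how fast the cloudlet's CPU or GPU is; LAN-class
# bandwidth is needed before the filter, rather than the data path, becomes the limit.
```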
More in the Paper
• Detailed system design and implementation
• An analytic model relating user wait time to base rate,
filter accuracy, cloudlet processing speed, etc.
• Detailed results of individual case studies
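The paper's analytic model is not reproduced here. As a purely illustrative back-of-envelope relation in the same spirit (my own simplification, with assumed variable names), the expert's wait time between true positives can be tied to those same quantities:

```python
def expected_seconds_between_hits(base_rate, filter_pass_rate, filter_recall,
                                  items_per_second, user_seconds_per_image):
    """Rough time between true positives reaching the expert, under simple independence assumptions.

    Cloudlets surface items at items_per_second * filter_pass_rate, of which a fraction
    base_rate * filter_recall / filter_pass_rate are true positives; the expert also spends
    user_seconds_per_image inspecting each surfaced item.
    """
    surfaced_per_sec = items_per_second * filter_pass_rate
    true_pos_per_surfaced = base_rate * filter_recall / filter_pass_rate
    machine_limited = 1.0 / (surfaced_per_sec * true_pos_per_surfaced)   # = 1 / (items/s * base_rate * recall)
    user_limited = user_seconds_per_image / true_pos_per_surfaced
    return max(machine_limited, user_limited)    # whichever side is the bottleneck dominates
```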
Conclusion
Eureka combines early discard, iterative discovery workflow
and edge computing to help domain experts efficiently
discover training examples of rare phenomena from data
sources on the edge.
Eureka reduces human labeling effort by two orders of magnitude compared to a brute-force approach.
Thank you!
I will also present related ideas at tomorrow’s PhD Forum.
Editor's Notes
  • #2: Cartoon: New Yorker magazine April 20, 2018, p. 41
  • #3: Deep learning has become the gold standard in many areas, especially computer vision, due to its superb accuracy. This slide shows the high-level recipe for applying deep learning to a problem: you collect a large amount of data and label it, then you select a model and train a DNN, and finally you deploy the DNN for inference. Nowadays, many software libraries, frameworks, cloud services and web-based tools let you do the last two steps with great convenience. Virtually all the painstaking effort is in the very first step, and it can sometimes be the showstopper for applying deep learning to your problem.
  • #4: In this work, we focus on DNNs used by domain experts. Here are some examples. This animal was the transmitter of the SARS disease in China in 2003; you can imagine how valuable an accurate DNN detector would be in a public health effort. Likewise, this is a weapon that shot down an airliner, and this is a pathology image of nuclear atypia in cancer. In all these cases, the target has a low base rate: it is rare in the data you are examining. And all of them require expertise to identify correctly. https://en.wikipedia.org/wiki/Masked_palm_civet#Connection_with_SARS
  • #5: Building a training set for this kind of target is hard. First, obviously, crowds are not experts, so crowd-sourcing methods like Amazon Mechanical Turk are not applicable in these domains; for example, only an expert can reliably and accurately distinguish between these animals. Second, there may be access restrictions on the data, due to patient privacy, business policy or national security. In the worst case, a single domain expert has to generate an entire training set of thousands to tens of thousands of examples.
  • #6: In this paper, we describe a system called Eureka, for efficient discovery of training examples from data sources dispersed over the Internet. The goal of Eureka is to optimally utilize an expert’s time and attention. It combines three key concepts to achieve its goal: early discard, iterative discovery workflow and edge computing, which I will describe next.
  • #7: This slide shows Eureka’s architecture. An expert user runs a GUI on her own computer. The GUI connects to a number of cloudlets across the Internet. These cloudlets are LAN-connected to their associated data sources, which may be archival or live depending on the specific use case. As the arrows indicate, connections between cloudlets and data sources are high-bandwidth and low-latency, while those over the Internet are the opposite. This high-bandwidth access is used to execute early-discard code on the cloudlets to drop clearly irrelevant data. Only a tiny fraction of the data, along with meta-data, is transmitted and shown to the user, consuming little Internet bandwidth.
  • #8: This shows an example of using the GUI to find images of deer in an unlabeled dataset. You specify a list of early-discard filters, and only images passing all of the filters are transmitted and displayed. You see many false positives because the filters used here are very weak color and texture filters. (If more time: 1. extend to general logical expressions; 2. 500 more – efficient use of user attention)
  • #9: To improve the efficacy of early discard, we introduce the iterative discovery workflow. Here you see a spectrum of computer vision algorithms and machine learning models, from simple on the left, such as RGB histograms and SIFT, to sophisticated on the right, such as deep learning. The x-axis is the number of example images you have, and the y-axis is the accuracy. While these numbers are not meant to be precise, the idea is that different models require different amounts of data to work properly and give you different levels of accuracy. When using Eureka, instead of creating a set of filters and searching for your target in one go, you iteratively change and improve your filters as you collect more examples. In the beginning you have very few examples, so you can only use explicit features like RGB histograms or SIFT. With these weak filters you may be able to find a few more positives, which lets you escalate to a slightly more advanced filter, like an SVM. The SVM is considerably more accurate, making it easier to find more positives in a reasonable amount of time. So you iterate and climb up the stairs as you gain sufficient data, both using more and more sophisticated filters and growing the training set you collect.
  • #10: This again shows the case of finding deer, but after a few iterations of using Eureka, with an SVM now in use. You can see the filter has become much more accurate.
  • #11: When designing and implementing Eureka, we have two major concerns. First is software generality: we want to allow the use of computer vision code written in a diversity of languages, libraries and frameworks, so that we can empower experts with the newest computer vision innovations quickly. To do so, we encapsulate filters in Docker containers. Second is runtime efficiency: Eureka needs to be able to rapidly discard large volumes of data. To do so, we exploit specialized hardware such as GPUs on cloudlets when available, and we cache filter results to exploit temporal locality in typical Eureka workloads.
  • #12: Another interesting problem is matching the Eureka system to the user. We propose that, ideally, the system should deliver images at the rate the user can inspect them. If the system delivers too fast, you are pumping lots of results into the network that the user may never see, which wastes computation and precious Internet bandwidth. Our suggestion in this case is to restrict the search to fewer cloudlets, or to bias the filters towards precision rather than recall.
  • #13: On the other hand, if the system delivers too slowly, you are basically forcing the user to wait, and wasting an expert’s time is a really bad thing to do. An obvious solution is to scale out to more cloudlets, but there is a risk: showing more “junk” to the user causes annoyance and dissatisfaction. So you need to strike a balance between avoiding user wait time and avoiding too many false positives. Our rule of thumb in this scenario is to focus on reducing the false positive rate before scaling out to many cloudlets.
  • #14: To evaluate Eureka, we used 99 million Flickr images from the YFCC100M dataset. On the edge we have 8 cloudlets with local access to the data, and the client GUI connects to the cloudlets over the Internet.
  • #15: We conducted three case studies using these three chosen targets – deer, Taj Mahal and fire hydrant. As you can see from the base rate, these are fairly rare objects in Flickr photos. We used Eureka to collect about 100 positive examples of each. Here you can see the number of images viewed by the user, and images discarded by Eureka in the whole process. You can see how effective Eureka is in reducing the amount of data the user needs to look at and label.
  • #16: We compare Eureka with a brute-force method, where the user goes through the images one by one and labels them; that is basically how many datasets are curated today. For reference, we also compare with what we call “single-iteration Eureka”, which uses early discard but without iterative improvement. The y-axis shows how many images the user viewed in order to collect the same number of positives. Compared with brute force, single-iteration Eureka gives up to an order of magnitude of improvement, showing the efficacy of early discard. On top of that, full Eureka gives another order of magnitude of improvement, showing the benefit of the iterative workflow.
  • #17: We show how Eureka iteratively improves the user’s productivity, in the case of deer. We measure productivity as new true positives found per minute in each Eureka iteration. Over five iterations, productivity increases from 0.4 to 4.77, a more than 10X improvement.
  • #18: Finally, we show the importance of edge computing: when the data is at the edge, the compute must also be at the edge for Eureka to be efficient. We throttled the bandwidth between the cloudlet and the data source and measured the machine processing throughput of an RGB histogram filter. The result shows that LAN connectivity at 1 Gbps is needed to deliver sufficiently high throughput; if the data is shipped over the wide-area network, processing slows down by about 10X.
  • #20: In conclusion, … (….) Our evaluation shows ….
  • #30: Why is it hard? Most importantly, crowds are not experts, so crowd-sourcing approaches like Amazon Mechanical Turk are not applicable in these domains; only an expert can reliably classify these animals. Besides, these interesting phenomena are usually rare, making it difficult to find positive examples in unlabeled data. Finally, there may be access restrictions on the data, such as patient privacy, business policy and national security. In the worst case, a single expert has to generate an entire training set of thousands to tens of thousands of examples.
  • #31: This slide shows the execution model. On the cloudlet, a component called the itemizer reads data in whatever raw format it is stored in and emits individual items. Items are independent units of early discard. Items are then fed into the item processor, where a chain of filters evaluates each item and tries to drop it. We encapsulate filters in Docker containers to achieve the software generality I just mentioned, and we cache filter results to improve efficiency. Finally, the filters also attach key-value attributes to each item; these attributes facilitate both communication between filters at run time and post-analysis after items are sent back to the user. A minimal sketch of this model follows below.
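That execution model can be sketched in a few lines (hypothetical names, not Eureka's code): an itemizer yields items, and an item processor runs each item through the filter chain, with filters free to attach attributes along the way.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List

@dataclass
class Item:
    """Independent unit of early discard, with key-value attributes attached by filters."""
    data: bytes
    attributes: Dict[str, str] = field(default_factory=dict)

def itemizer(raw_records: Iterator[bytes]) -> Iterator[Item]:
    """Read data in whatever raw format the source provides and emit individual items."""
    for record in raw_records:
        yield Item(data=record)

def item_processor(items: Iterator[Item],
                   filters: List[Callable[[Item], bool]]) -> Iterator[Item]:
    """Run each item through the filter chain; any filter may drop it.

    A filter may also write into item.attributes (e.g. a score), which later filters
    and the user-side GUI can inspect.
    """
    for item in items:
        if all(f(item) for f in filters):
            yield item   # survivors are sent back to the user together with their attributes
```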
  • #32: Finally, we study the importance of edge computing for Eureka. Specifically, how necessary is high-bandwidth access to data. Here we throttle the bandwidth between the cloudlet and the data source, and measure the machine processing throughput of three filters, including cheap ones and expensive ones. As you can see, when we decrease the bandwidth, the throughput drops significantly. Under 25 Mbps, there is basically no difference between cheap filters and expensive filters, because data access time becomes the bottleneck. So we see high-bandwidth access is crucial to the efficacy of Eureka.