Edge-based Discovery of Training Data for Machine Learning
Ziqiang (Edmond) Feng, Shilpa George, Jan Harkes, Padmanabhan Pillai†, Roberta Klatzky, Mahadev Satyanarayanan
Carnegie Mellon University and †Intel Labs
(Cartoon: The New Yorker magazine, April 20, 2018, p. 41)
The Deep Learning Recipe
Collect a large amount of data and label it
Select a model and train a DNN
Deploy the DNN for inference
TPOD @ CMU
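Today's libraries and services make the last two steps routine. As a rough illustration only (the dataset path, model choice and hyperparameters below are placeholders, not anything from the paper), fine-tuning and deploying a detector can be a few lines of PyTorch:

```python
import torch
from torchvision import datasets, transforms, models

# Step 2: select a model and train (fine-tune) a DNN on the labeled data.
tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("labeled_examples/", transform=tfm)   # placeholder path
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, len(train_set.classes))
opt = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()

# Step 3: deploy the trained DNN for inference.
model.eval()
torch.save(model.state_dict(), "detector.pt")
```

The hard part is the first step: producing the labeled examples that feed this loop.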
DNNs for Domain Experts
Valuable in ecology, military intelligence, medical diagnosis, etc.
• Low base rate (prevalence) in the data
• Requires expertise to identify
Masked palm civet (Paguma larvata): transmitter of SARS during its 2003 outbreak in China.
BUK-M1: believed to have shot down MH17 in 2014, killing 298 people.
Nuclear atypia in cancer.
Building a Training Set Is Hard
 Crowds are not experts
Crowd-sourcing (e.g., Amazon Mechanical Turk) is not applicable
 Restricted access to data
Patient privacy, business policy, national security, etc.
In the worst case, a single domain expert has to generate an entire training set of 10^3 to 10^4 examples.
Masked palm civet / Red panda / Raccoon (only an expert can reliably tell them apart)
Our Contribution: Eureka
 A system for efficient discovery of training examples from data sources dispersed over the Internet (focus on images in this paper)
 Goal: to effectively utilize an expert’s time and attention
 Key concepts:
 Early discard
 Iterative discovery workflow
 Edge computing
Eureka’s Architecture
[Figure: an expert with a domain-specific GUI connects over the Internet to several cloudlets; each cloudlet has LAN access to an archival data source or a live video feed.]
Cloudlets have high-bandwidth, low-latency access to the data and execute early-discard code to drop clearly irrelevant data.
Only a tiny fraction of the data, along with meta-data, is transmitted and shown to the user, consuming little Internet bandwidth.
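A minimal sketch of what such early-discard code might look like on a cloudlet (a toy colour filter with made-up thresholds; not Eureka's actual filters or interfaces): items that fail any filter never leave the LAN.

```python
from pathlib import Path
from PIL import Image
import numpy as np

def rgb_histogram_filter(path, min_target_fraction=0.05):
    """Toy colour filter: pass images with enough pixels in a hand-picked colour range."""
    img = np.asarray(Image.open(path).convert("RGB").resize((64, 64)), dtype=np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    in_range = (r > 90) & (r < 200) & (g > 60) & (g < 160) & (b < 120)   # made-up "deer brown"
    return in_range.mean() >= min_target_fraction

FILTERS = [rgb_histogram_filter]   # cheapest filters first; stronger ones are added in later iterations

def early_discard(data_dir, send_to_user):
    """Runs on the cloudlet, next to the data; only surviving items cross the Internet."""
    for path in Path(data_dir).glob("*.jpg"):
        if all(f(path) for f in FILTERS):
            send_to_user(path.read_bytes(), {"source": str(path)})
```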
Example GUI: Finding Deer
[Screenshot: the GUI shows the list of early-discard filters and the images that passed them.]
Iterative Discovery Workflow
Explicit features, manual weights (RGB histogram, SIFT, perceptual hashing)
Explicit features, learned weights (HOG + SVM)
Shallow transfer learning (MobileNet + SVM)
Deep transfer learning (Faster R-CNN fine-tuning)
Deep learning
[Figure: accuracy (not to scale) vs. number of examples (log scale, 10^0 to 10^4); each successive step needs more examples but reaches higher accuracy.]
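For instance, the third step ("shallow transfer learning") could be realised as a frozen pretrained MobileNet feature extractor plus an SVM trained on the few dozen examples collected so far. The following is a sketch of that generic technique under assumed libraries (torchvision and scikit-learn), not Eureka's implementation:

```python
import numpy as np
import torch
from torchvision import models, transforms
from PIL import Image
from sklearn.svm import SVC

# Frozen pretrained MobileNet used only as a feature extractor.
backbone = models.mobilenet_v2(pretrained=True).features.eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def embed(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return backbone(x).mean(dim=[2, 3]).squeeze(0).numpy()   # global-average-pooled features

def train_filter(positive_paths, negative_paths):
    """Train an SVM on the handful of examples discovered in earlier iterations."""
    X = np.stack([embed(p) for p in positive_paths + negative_paths])
    y = np.array([1] * len(positive_paths) + [0] * len(negative_paths))
    return SVC(kernel="linear", probability=True).fit(X, y)

def svm_filter(svm, path, threshold=0.5):
    """Early-discard predicate: keep the image only if the SVM is confident enough."""
    return svm.predict_proba(embed(path).reshape(1, -1))[0, 1] >= threshold
```

As more positives are collected, the same workflow escalates to fine-tuning a deep detector.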
Finding Deer (after a few iterations)
System Design and Implementation
 Software generality: allow the use of CV code written in different languages, libraries and frameworks (e.g., Python, Matlab, C++, TensorFlow, PyTorch, Scikit-learn)
 Empower experts with the newest CV innovations quickly
 Encapsulate filters in Docker containers
 Runtime efficiency: rapidly process and discard large volumes of data
 Exploit specialized hardware on cloudlets (e.g., GPUs)
 Cache filter results to exploit temporal locality
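The result cache can be pictured as a small content-addressed store keyed by the item's hash and the filter's identity, so that re-running an unchanged filter over already-seen data in a later iteration is nearly free. This is an illustrative sketch, not Eureka's actual cache design:

```python
import hashlib
import sqlite3

class FilterCache:
    """Memoize (item, filter) -> pass/fail decisions across Eureka iterations."""

    def __init__(self, db_path="filter_cache.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS results (item_hash TEXT, filter_id TEXT, passed INTEGER,"
            " PRIMARY KEY (item_hash, filter_id))"
        )

    def run(self, filter_id, filter_fn, item_bytes):
        h = hashlib.sha256(item_bytes).hexdigest()
        row = self.db.execute(
            "SELECT passed FROM results WHERE item_hash=? AND filter_id=?", (h, filter_id)
        ).fetchone()
        if row is not None:                      # temporal locality: same data, same filter
            return bool(row[0])
        passed = filter_fn(item_bytes)           # cache miss: run the (possibly expensive) filter
        self.db.execute("INSERT OR REPLACE INTO results VALUES (?, ?, ?)", (h, filter_id, int(passed)))
        self.db.commit()
        return passed
```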
Matching System to User
The system should deliver images to the user at the rate the user can inspect them.
Too fast: wasting computation and precious Internet bandwidth 
Suggestions:
1. Restrict the search to fewer cloudlets
2. Bias filters towards precision rather than recall
Matching System to User (cont’d)
Too slow: wasting expert time 
Obvious solution: scale out to more cloudlets (edge computing is your friend)
Risk: “junk” (false positives) causes user annoyance and dissatisfaction
Rule of thumb: focus on reducing the false positive rate before scaling out
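One simple way to bias a filter towards precision rather than recall is to raise its score threshold until it meets a precision target on a small held-out set. The sketch below is illustrative (it assumes a hypothetical scored filter; it is not from the paper); higher thresholds show the expert fewer, cleaner images at the cost of missed positives.

```python
def choose_threshold(scores, labels, target_precision=0.9):
    """Pick the lowest score threshold whose precision on held-out data meets the target.

    scores: filter confidence per validation image; labels: 1 for true positive, 0 otherwise.
    A higher threshold delivers fewer (but cleaner) images, protecting the expert's attention
    and the Internet link; a lower one increases recall at the cost of more junk.
    """
    for t in sorted(set(scores)):
        kept = [(s, y) for s, y in zip(scores, labels) if s >= t]
        if kept and sum(y for _, y in kept) / len(kept) >= target_precision:
            return t
    return max(scores)   # fall back to keeping almost nothing
```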
Evaluation: Setup
Dataset: YFCC100M, 99.2 million Flickr photos with a real-life distribution of objects, evenly partitioned over the cloudlets.
Edge: 8 cloudlets with Nvidia GPUs, accessing data from local SSDs.
Client: connected to the cloudlets via the Internet.
Evaluation: Case Studies
                              Deer         Taj Mahal    Fire hydrant
Estimated base rate           0.07%        0.02%        0.005%
Collected positives           111          105          74
Images viewed by user         7,447        4,791        15,379
Images discarded by Eureka    2,104,076    2,542,889    2,734,070
Eureka vs. Brute-force
[Figure: number of images the user viewed to collect ~100 true positives (log scale, 1,000 to 1,000,000) for Deer, Taj Mahal and Fire hydrant, comparing Brute-force, Single-iteration Eureka and Eureka.]
Brute-force: the user views every image.
Single-iteration Eureka: early discard without iterative improvement.
Please refer to our paper for detailed results of each case study.
Iteratively Improving Productivity: the case of deer
[Figure: productivity (new true positives per minute) over five Eureka iterations: 0.40, 0.36, 1.49, 4.24, 4.77, roughly a 10X improvement.]
Compute Must Co-locate with Data
[Figure: machine processing throughput (items/sec, 0 to 1,000) of an RGB histogram filter while throttling the bandwidth between the cloudlet and its data source to 10 Mbps, 25 Mbps, 100 Mbps and 1 Gbps. For reference, US average connectivity in 2017 was 18.7 Mbps.]
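A back-of-envelope calculation (my own numbers; the average image size is an assumption, not a figure from the paper) shows why the data path dominates at WAN speeds:

```python
def max_images_per_second(bandwidth_mbps, avg_image_mb=0.5):
    """Upper bound on filter throughput when every image must first cross the link."""
    return (bandwidth_mbps / 8) / avg_image_mb    # MB/s divided by MB per image

for bw in (10, 18.7, 25, 100, 1000):
    print(f"{bw:>6} Mbps -> at most {max_images_per_second(bw):7.1f} images/sec")
# At the 2017 US-average WAN speed (~18.7 Mbps) the link alone caps a filter at a few
# images per second, regardless of how fast the cloudlet's CPU or GPU is; LAN-class
# bandwidth is needed before the filter, rather than the data path, becomes the limit.
```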
More in the Paper
• Detailed system design and implementation
• An analytic model relating user wait time to base rate,
filter accuracy, cloudlet processing speed, etc.
• Detailed results of individual case studies
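The paper's analytic model is not reproduced here. As a purely illustrative back-of-envelope relation in the same spirit (my own simplification, with assumed variable names), the expert's wait time between true positives can be tied to those same quantities:

```python
def expected_seconds_between_hits(base_rate, filter_pass_rate, filter_recall,
                                  items_per_second, user_seconds_per_image):
    """Rough time between true positives reaching the expert, under simple independence assumptions.

    Cloudlets surface items at items_per_second * filter_pass_rate, of which a fraction
    base_rate * filter_recall / filter_pass_rate are true positives; the expert also spends
    user_seconds_per_image inspecting each surfaced item.
    """
    surfaced_per_sec = items_per_second * filter_pass_rate
    true_pos_per_surfaced = base_rate * filter_recall / filter_pass_rate
    machine_limited = 1.0 / (surfaced_per_sec * true_pos_per_surfaced)   # = 1 / (items/s * base_rate * recall)
    user_limited = user_seconds_per_image / true_pos_per_surfaced
    return max(machine_limited, user_limited)    # whichever side is the bottleneck dominates
```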
Conclusion
Eureka combines early discard, iterative discovery workflow
and edge computing to help domain experts efficiently
discover training examples of rare phenomena from data
sources on the edge.
Eureka reduces human labeling effort by two orders of magnitude compared to a brute-force approach.
Thank you!
I will also present related ideas at tomorrow’s PhD Forum.
Editor's Notes
  • #2: Cartoon: New Yorker magazine April 20, 2018, p. 41
  • #3: Deep learning has become the gold standard in many areas, especially computer vision, due to its superb accuracy. This slide shows the high-level recipe for applying deep learning to a problem: you collect a large amount of data and label it, then you select a model and train a DNN, and finally you deploy the DNN for inference. Nowadays, many software libraries, frameworks, cloud services and web-based tools let you do the last two steps with great convenience. Virtually all the painstaking effort is in the very first step, and it can sometimes be the showstopper for applying deep learning to your problem.
  • #4: In this work, we focus on DNNs used by domain experts. Here are some examples. This animal was the transmitter of the SARS disease in China in 2003; you can imagine how valuable an accurate DNN detector would be in a public health effort. Likewise, this is a weapon that shot down an airliner, and this is a pathology image of nuclear atypia in cancer. In all these cases, the target has a low base rate: it is rare in the data you are examining. And all of them require expertise to identify correctly. https://en.wikipedia.org/wiki/Masked_palm_civet#Connection_with_SARS
  • #5: Building a training set for this kind of target is hard. First, obviously, crowds are not experts, so crowd-sourcing methods like Amazon Mechanical Turk are not applicable in these domains; for example, only an expert can reliably and accurately distinguish between these animals. Second, there may be access restrictions on the data, due to patient privacy, business policy or national security. In the worst case, a single domain expert has to generate an entire training set of thousands to tens of thousands of examples.
  • #6: In this paper, we describe a system called Eureka, for efficient discovery of training examples from data sources dispersed over the Internet. The goal of Eureka is to optimally utilize an expert’s time and attention. It combines three key concepts to achieve its goal: early discard, iterative discovery workflow and edge computing, which I will describe next.
  • #7: This slide shows Eureka’s architecture. An expert user runs a GUI on her own computer. The GUI connects to a number of cloudlets across the Internet. These cloudlets are LAN-connected to their associated data sources, which may be archival or live depending on the specific use case. As the arrows indicate, connections between cloudlets and data sources are high-bandwidth and low-latency, while those over the Internet are the opposite. This high-bandwidth access is used to execute early-discard code on the cloudlets to drop clearly irrelevant data. Only a tiny fraction of the data, along with meta-data, is transmitted and shown to the user, consuming little Internet bandwidth.
  • #8: This shows an example of using the GUI to find images of deer in an unlabeled dataset. You specify a list of early-discard filters, and only images passing all of the filters are transmitted and displayed. You see many false positives because the filters used here are very weak color and texture filters. (If more time: 1. extend to general logical expressions; 2. 500 more – efficient use of user attention)
  • #9: To improve the efficacy of early discard, we introduce the iterative discovery workflow. Here you see a spectrum of computer vision algorithms and machine learning models, from simple on the left, such as RGB histograms and SIFT, to sophisticated on the right, such as deep learning. The x-axis is the number of example images you have, and the y-axis is the accuracy. While these numbers are not meant to be precise, the idea is that different models require different amounts of data to work properly and give you different levels of accuracy. When using Eureka, instead of creating a set of filters and searching for your target in one go, you iteratively change and improve your filters as you collect more examples. In the beginning you have very few examples, so you can only use explicit features like RGB histograms or SIFT. With these weak filters you may be able to find a few more positives, which lets you escalate to a slightly more advanced filter, like an SVM. The SVM is considerably more accurate, making it easier to find more positives in a reasonable amount of time. So you iterate and climb up the stairs as you gain sufficient data, both using more and more sophisticated filters and growing the training set you collect.
  • #10: This again shows the case of finding deer, but after a few iterations of using Eureka, with an SVM now in use. You can see the filter has become much more accurate.
  • #11: When designing and implementing Eureka, we have two major concerns. First is software generality: we want to allow the use of computer vision code written in a diversity of languages, libraries and frameworks, so that we can empower experts with the newest computer vision innovations quickly. To do so, we encapsulate filters in Docker containers. Second is runtime efficiency: Eureka needs to be able to rapidly discard large volumes of data. To do so, we exploit specialized hardware such as GPUs on cloudlets when available, and we cache filter results to exploit temporal locality in typical Eureka workloads.
  • #12: Another interesting problem is matching the Eureka system to the user. We propose that, ideally, the system should deliver images at the rate the user can inspect them. If the system delivers too fast, you are pumping lots of results into the network that the user may never see, which wastes computation and precious Internet bandwidth. Our suggestion in this case is to restrict the search to fewer cloudlets, or to bias the filters towards precision rather than recall.
  • #13: On the other hand, if the system delivers too slowly, you are basically forcing the user to wait, and wasting an expert’s time is a really bad thing to do. An obvious solution is to scale out to more cloudlets, but there is a risk: showing more “junk” to the user causes annoyance and dissatisfaction. So you need to strike a balance between avoiding user wait time and avoiding too many false positives. Our rule of thumb in this scenario is to focus on reducing the false positive rate before scaling out to many cloudlets.
  • #14: To evaluate Eureka, we used 99 million Flickr images from the YFCC100M dataset. On the edge we have 8 cloudlets with local access to the data, and the client GUI connects to the cloudlets over the Internet.
  • #15: We conducted three case studies using these three chosen targets – deer, Taj Mahal and fire hydrant. As you can see from the base rate, these are fairly rare objects in Flickr photos. We used Eureka to collect about 100 positive examples of each. Here you can see the number of images viewed by the user, and images discarded by Eureka in the whole process. You can see how effective Eureka is in reducing the amount of data the user needs to look at and label.
  • #16: We compare Eureka with a brute-force method, where the user goes through the images one by one and labels them; that is basically how many datasets are curated today. For reference, we also compare with what we call “single-iteration Eureka”, which uses early discard but without iterative improvement. The y-axis shows how many images the user viewed in order to collect the same number of positives. Compared with brute force, single-iteration Eureka gives up to an order of magnitude of improvement, showing the efficacy of early discard. On top of that, full Eureka gives another order of magnitude of improvement, showing the benefit of the iterative workflow.
  • #17: We show how Eureka iteratively improves the user’s productivity, in the case of deer. We measure productivity as new true positives found per minute in each Eureka iteration. Over five iterations, productivity increases from 0.4 to 4.77, a more than 10X improvement.
  • #18: Finally, we show the importance of edge computing: when the data is at the edge, the compute must also be at the edge for Eureka to be efficient. We throttled the bandwidth between the cloudlet and the data source and measured the machine processing throughput of an RGB histogram filter. The result shows that LAN connectivity at 1 Gbps is needed to deliver sufficiently high throughput; if the data is shipped over the wide-area network, processing slows down by about 10X.
  • #20: In conclusion, … (….) Our evaluation shows ….
  • #30: Why is it hard? Most importantly, crowds are not experts, so crowd-sourcing approaches like Amazon Mechanical Turk are not applicable in these domains; only an expert can reliably classify these animals. Besides, these interesting phenomena are usually rare, making it difficult to find positive examples in unlabeled data. Finally, there may be access restrictions on the data, such as patient privacy, business policy and national security. In the worst case, a single expert has to generate an entire training set of thousands to tens of thousands of examples.
  • #31: This slide shows the execution model. On the cloudlet, a component called the itemizer reads data in whatever raw format it is stored in and emits individual items. Items are independent units of early discard. Items are then fed into the item processor, where a chain of filters evaluates each item and tries to drop it. We encapsulate filters in Docker containers to achieve the software generality I just mentioned, and we cache filter results to improve efficiency. Finally, the filters also attach key-value attributes to each item; these attributes facilitate both communication between filters at run time and post-analysis after items are sent back to the user. A minimal sketch of this model follows below.
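That execution model can be sketched in a few lines (hypothetical names, not Eureka's code): an itemizer yields items, and an item processor runs each item through the filter chain, with filters free to attach attributes along the way.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List

@dataclass
class Item:
    """Independent unit of early discard, with key-value attributes attached by filters."""
    data: bytes
    attributes: Dict[str, str] = field(default_factory=dict)

def itemizer(raw_records: Iterator[bytes]) -> Iterator[Item]:
    """Read data in whatever raw format the source provides and emit individual items."""
    for record in raw_records:
        yield Item(data=record)

def item_processor(items: Iterator[Item],
                   filters: List[Callable[[Item], bool]]) -> Iterator[Item]:
    """Run each item through the filter chain; any filter may drop it.

    A filter may also write into item.attributes (e.g. a score), which later filters
    and the user-side GUI can inspect.
    """
    for item in items:
        if all(f(item) for f in filters):
            yield item   # survivors are sent back to the user together with their attributes
```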
  • #32: Finally, we study the importance of edge computing for Eureka. Specifically, how necessary is high-bandwidth access to data. Here we throttle the bandwidth between the cloudlet and the data source, and measure the machine processing throughput of three filters, including cheap ones and expensive ones. As you can see, when we decrease the bandwidth, the throughput drops significantly. Under 25 Mbps, there is basically no difference between cheap filters and expensive filters, because data access time becomes the bottleneck. So we see high-bandwidth access is crucial to the efficacy of Eureka.