The document presents a framework for categorizing machine learning projects based on data type (structured vs. unstructured) and dataset size (small vs. large). It emphasizes that best practices differ significantly between these categories, particularly in terms of data labeling and augmentation strategies. The author suggests that advice from those experienced in the same quadrant of machine learning problems is often more applicable than generalized advice.


I'd like to share with you a useful framework for thinking about the different major types of machine learning projects. It turns out that the best practices for organizing data for one type can be quite different from the best practices for a totally different type. Let's take a look at where these major types of machine learning projects fall in this two-by-two grid. One axis will be whether your machine learning problem uses unstructured data or structured data. I've found that the best practices for these are very different, mainly because humans are great at processing unstructured data, like images, audio, and text, and not as good at processing structured data, like database records. The second axis is the size of your data set.
Do you have a relatively small data set, or do you have a very large data set? There is no precise definition of what exactly is small and what is large, but I'm going to use, as a slightly arbitrary threshold, whether or not you have over 10,000 examples. Clearly this boundary is a little bit fuzzy, and the transition from small to big data sets is a gradual one. But I've found that best practices when you have, say, 100 or 1,000 examples, a smaller data set, are pretty different than when you have a very large data set. The reason I chose the number 10,000 is that it's roughly the size beyond which it becomes quite painful to examine every single example yourself. If you have 1,000 examples, you could probably examine every example yourself. But when you have 10,000, 100,000, or a million examples, it becomes very time-consuming for you as an individual, or even a couple of machine learning engineers, to manually look at every example. So that affects the best practices as well. Let's look at some examples.
If you are training a manufacturing visual inspection system from just 100 examples of scratched phones, that's unstructured data, because this is image data, and it's a pretty small data set. If you are trying to predict housing prices based on the size of the house and other features of the house from just 52 examples, then that's a structured data set, with just real-valued features and a relatively small data set. If you are carrying out speech recognition with 50 million training examples, that's unstructured data, but you have a lot of it. And if you are trying to recommend products, say online shopping recommendations, and you have a million users in your database, then that's a structured problem with a relatively large amount of data.
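As a concrete, purely illustrative way to see the grid, here is a minimal sketch in Python; the function name, the 10,000-example cutoff, and the example calls are assumptions made for this sketch, not anything standard.

# A minimal sketch of the two-by-two framework described above.
# The 10,000-example cutoff is the lecture's slightly arbitrary threshold.
def quadrant(is_unstructured: bool, num_examples: int) -> str:
    """Return which of the four quadrants a project falls into."""
    data_type = "unstructured" if is_unstructured else "structured"
    size = "big" if num_examples >= 10_000 else "small"
    return f"{data_type} / {size} data"

# The four examples from this lecture:
print(quadrant(True, 100))           # visual inspection  -> unstructured / small data
print(quadrant(False, 52))           # housing prices     -> structured / small data
print(quadrant(True, 50_000_000))    # speech recognition -> unstructured / big data
print(quadrant(False, 1_000_000))    # recommendations    -> structured / big data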
For a lot of unstructured data problems, people can help you label the data, and data augmentation can help too, such as synthesizing new images or synthesizing new audio. There are some emerging techniques for synthesizing new text as well. So for manufacturing visual inspection, you can use data augmentation to generate more pictures of scratched phones.
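To make that concrete, here is a minimal, purely illustrative sketch of image augmentation using numpy; the array shapes and jitter ranges are assumptions for this sketch, and a real pipeline would more likely use a library such as torchvision or albumentations.

# A minimal sketch of image data augmentation, assuming images are
# numpy arrays of shape (height, width, channels) with values in [0, 255].
import numpy as np

def augment_image(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]              # random horizontal flip
    scale = rng.uniform(0.8, 1.2)          # random brightness jitter
    return np.clip(out.astype(np.float32) * scale, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
phone = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in image
more_examples = [augment_image(phone, rng) for _ in range(10)]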
For speech recognition, data augmentation can help you synthesize audio clips with different background noise.
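And here is a similarly hedged sketch of the background-noise idea: mix a clean speech waveform with a noise clip at a random signal-to-noise ratio. The arrays, sample rate, and SNR range are invented for illustration.

# A minimal sketch of audio augmentation by mixing in background noise.
# speech and noise are 1-D numpy float arrays at the same sample rate.
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]                  # trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mix has the requested signal-to-noise ratio.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled

rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)       # stand-in for one second of speech
cafe_noise = rng.standard_normal(16_000)   # stand-in background noise clip
augmented = mix_noise(speech, cafe_noise, snr_db=rng.uniform(5, 20))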
In contrast, for structured data problems, it can be harder to obtain more data and also harder to use data augmentation. If only 50 houses have been sold recently in that geography, it's hard to synthesize new houses that don't exist. Or if you have a million users in your database, again, it's hard to synthesize new users that don't really exist. It's also harder, though not impossible and still worth trying, to get humans to label the data; it may or may not be possible. So I find that the best practices for unstructured versus structured data are quite different. The second axis is the size of your data set.
When you have a relatively small data set, having clean labels is critical. If you have 100 training examples and just one of them is mislabeled, that's 1% of your data set. And because the data set is small enough for you or a small team to go through efficiently, it may well be worth your while to go through those 100 examples and make sure that every one of them is labeled in a clean and consistent way, meaning according to a consistent labeling standard.
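One lightweight way to act on that, sketched below with invented file names and labels: check every label in a small data set against the agreed standard and flag anything that violates it for human review.

# A minimal sketch of auditing a small data set against a labeling standard.
# Each example is a (filename, label) pair; the allowed labels are whatever
# convention your team agreed on (made up here for illustration).
ALLOWED_LABELS = {"scratch", "no_defect"}

dataset = [
    ("phone_001.png", "scratch"),
    ("phone_002.png", "no_defect"),
    ("phone_003.png", "Scratch "),   # inconsistent casing and whitespace
]

for filename, label in dataset:
    if label not in ALLOWED_LABELS:
        print(f"review {filename}: label {label!r} violates the standard")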
In contrast, if you have a million data points, it can be harder, maybe impossible, for a small machine learning team to manually go through every example. Having clean labels is still very helpful, don't get me wrong; even when you have a lot of data, clean labels are better than noisy ones. But because of the difficulty of having the machine learning engineers go through every example, the emphasis shifts to data processes: how you collect and label the data, and the labeling instructions you may write for a large team of crowdsourced labelers. And once you have executed some data process, such as asking a large team of labelers to label a large set of audio clips, it can also be much harder to go back, change your mind, and get everything relabeled. So let's summarize.
For unstructured data problems, you may or may not have a huge collection of unlabeled examples x. Maybe in your factory you actually took many thousands of images of smartphones, but you just haven't bothered to label all of them yet. This is also common in the self-driving car industry, where many self-driving car companies have collected tons of images of cars driving around but have not yet gotten that data labeled. For these unstructured data problems, you can sometimes get more data by taking your unlabeled data x and asking humans to label more of it. This doesn't apply to every problem, but for the problems where you do have tons of unlabeled data, this can be very helpful. And as we have already mentioned, data augmentation can also be helpful. For structured data problems, it is usually harder to obtain more data, because you only have so many users or so many houses that you can collect data from. And human labeling is, on average, also harder, although there are some exceptions, such as in the last video, where you saw that we could try to ask people to label examples for the user ID merge problem. But in many cases where we ask humans to label structured data, even when it is completely worthwhile to ask people to label, say, whether two records are the same person, there is likely to be a bit more ambiguity, and even a human labeler sometimes finds it hard to be sure what the correct label is.
Lastly, let's look at small versus big data, where I used the slightly arbitrary threshold of whether you have more or fewer than, say, 10,000 training examples. For small data sets, clean labels are critical, and the data set may be small enough for you to manually look through the entire data set and fix any inconsistent labels. Further, the labeling team is probably not that large; it may be one or two or just a handful of people that created all the labels. So if you discover an inconsistency in the labels, say one person labels iguanas one way and a different person labels iguanas a different way, you can just get the two or three labelers together, have them talk to each other, and hash out and agree on one labeling convention.
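Of course, you first have to notice the inconsistency. Here is a small, purely illustrative sketch (the records and names are made up) that groups labels by example and surfaces any example whose labelers disagree.

# A minimal sketch for surfacing labeler disagreements on a small data set.
# Each record is (example_id, labeler, label); the data is invented.
from collections import defaultdict

records = [
    ("img_17.png", "alice", "iguana"),
    ("img_17.png", "bob", "lizard"),   # a disagreement to hash out
    ("img_42.png", "alice", "iguana"),
    ("img_42.png", "bob", "iguana"),
]

labels_by_example = defaultdict(set)
for example_id, labeler, label in records:
    labels_by_example[example_id].add(label)

for example_id, labels in labels_by_example.items():
    if len(labels) > 1:
        print(f"{example_id}: labelers disagree {sorted(labels)}")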
For very large data sets, the emphasis has to be on data processes. If you have 100 labelers or even more, it's just harder to get 100 people into a room to all talk to each other and hash out a process. So you might have to rely on a smaller team to establish a consistent label definition and then share that definition with all, say, 100 or more labelers and ask them all to implement the same process.
I want to leave you with one last thought, which is that I've found this categorization of problems, into unstructured versus structured and small versus big data, to be helpful for predicting not just whether data processes generalize from one problem to another, but also whether other machine learning ideas generalize from one problem to another. So one tip: if you are working on a problem from one of these four quadrants, then on average, advice from someone that has worked on problems in the same quadrant will probably be more useful than advice from someone that's worked in a different quadrant. I've also found, when hiring machine learning engineers, that someone who has worked in the same quadrant as the problem I'm trying to solve will usually be able to adapt more quickly to working on other problems in that quadrant, because the instincts and decisions are more similar within one quadrant than if you shift to a totally different quadrant in this chart.
I've sometimes heard people give advice like: if you are building a computer vision system, always get at least 1,000 labeled examples. I think people that give advice like that are well-meaning, and I appreciate that they're trying to give good advice, but I've found that advice not to be useful for all problems. Machine learning is very diverse, and it's hard to find one-size-fits-all advice like that. I've seen computer vision systems built with 100 examples, or 100 examples per class, and computer vision systems built with 100 million examples. So if you are looking for advice on a machine learning project, try to find someone that's worked in the same quadrant as the problem you are trying to solve. Now, we've talked about one formulation of the different types of machine learning problems. There's one aspect I'd like to dive into with you in the next video, which is how, for small data problems, having clean data is especially important. Let's take a look in the next video at why this is true.
