The document presents a framework for categorizing machine learning projects based on data type (structured vs. unstructured) and dataset size (small vs. large). It emphasizes that best practices differ significantly between these categories, particularly in terms of data labeling and augmentation strategies. The author suggests that advice from those experienced in the same quadrant of machine learning problems is often more applicable than generalized advice.


I'd like to share with you a useful framework for thinking about the different major types of machine learning projects. It turns out that the best practices for organizing data for one type can be quite different from the best practices for a totally different type. Let's take a look at where these major types of machine learning projects fall in this two-by-two grid. One axis will be whether your machine learning problem uses unstructured data or structured data. I've found that the best practices for these are very different, mainly because humans are great at processing unstructured data, like images, audio, and text, and not as good at processing structured data, like database records. The second axis is the size of your data set.
Do you have a relatively small data set, or do you have a very large data set? There is no precise definition of what exactly is small and what is large, but I'm going to use, as a slightly arbitrary threshold, whether or not you have over 10,000 examples. Clearly this boundary is a little bit fuzzy, and the transition from small to big data sets is a gradual one. But I've found that best practices when you have, say, 100 or 1,000 examples, a smaller data set, are pretty different than when you have a very large data set. The reason I chose the number 10,000 is that it's roughly the size beyond which it becomes quite painful to examine every single example yourself. If you have 1,000 examples, you could probably examine every example yourself. But when you have 10,000, 100,000, or a million examples, it becomes very time-consuming for you as an individual, or even a couple of machine learning engineers, to manually look at every example. So that affects the best practices as well. Let's look at some examples.
If you are training a manufacturing visual inspection system from just 100 examples of scratched phones, that's unstructured data, because this is image data, and it's a pretty small data set. If you are trying to predict housing prices based on the size of the house and other features of the house from just 52 examples, then that's a structured data set, with just real-valued features and a relatively small data set. If you are carrying out speech recognition with 50 million training examples, that's unstructured data, but you have a lot of it. And if you are trying to recommend products, say online shopping recommendations, and you have a million users in your database, then that's a structured problem with a relatively large amount of data.
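As a concrete, purely illustrative way to see the grid, here is a minimal sketch in Python; the function name, the 10,000-example cutoff, and the example calls are assumptions made for this sketch, not anything standard.

# A minimal sketch of the two-by-two framework described above.
# The 10,000-example cutoff is the lecture's slightly arbitrary threshold.
def quadrant(is_unstructured: bool, num_examples: int) -> str:
    """Return which of the four quadrants a project falls into."""
    data_type = "unstructured" if is_unstructured else "structured"
    size = "big" if num_examples >= 10_000 else "small"
    return f"{data_type} / {size} data"

# The four examples from this lecture:
print(quadrant(True, 100))           # visual inspection  -> unstructured / small data
print(quadrant(False, 52))           # housing prices     -> structured / small data
print(quadrant(True, 50_000_000))    # speech recognition -> unstructured / big data
print(quadrant(False, 1_000_000))    # recommendations    -> structured / big data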
For a lot of unstructured data problems, people can help you label the data, and data augmentation can help too, such as synthesizing new images or synthesizing new audio. There are some emerging techniques for synthesizing new text as well. So for manufacturing visual inspection, you can use data augmentation to generate more pictures of scratched phones.
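To make that concrete, here is a minimal, purely illustrative sketch of image augmentation using numpy; the array shapes and jitter ranges are assumptions for this sketch, and a real pipeline would more likely use a library such as torchvision or albumentations.

# A minimal sketch of image data augmentation, assuming images are
# numpy arrays of shape (height, width, channels) with values in [0, 255].
import numpy as np

def augment_image(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]              # random horizontal flip
    scale = rng.uniform(0.8, 1.2)          # random brightness jitter
    return np.clip(out.astype(np.float32) * scale, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
phone = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in image
more_examples = [augment_image(phone, rng) for _ in range(10)]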
For speech recognition, data augmentation can help you synthesize audio clips with different background noise.
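And here is a similarly hedged sketch of the background-noise idea: mix a clean speech waveform with a noise clip at a random signal-to-noise ratio. The arrays, sample rate, and SNR range are invented for illustration.

# A minimal sketch of audio augmentation by mixing in background noise.
# speech and noise are 1-D numpy float arrays at the same sample rate.
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(speech)]                  # trim noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the mix has the requested signal-to-noise ratio.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled

rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)       # stand-in for one second of speech
cafe_noise = rng.standard_normal(16_000)   # stand-in background noise clip
augmented = mix_noise(speech, cafe_noise, snr_db=rng.uniform(5, 20))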
In contrast, for structured data problems, it can be harder to obtain more data and also harder to use data augmentation. If only 50 houses have been sold recently in that geography, it's hard to synthesize new houses that don't exist. Or if you have a million users in your database, again, it's hard to synthesize new users that don't really exist. It's also harder, though not impossible and still worth trying, to get humans to label the data; it may or may not be possible. So I find that the best practices for unstructured versus structured data are quite different. The second axis is the size of your data set.
When you have a relatively small data set, having clean labels is critical. If you have 100 training examples and just one of them is mislabeled, that's 1% of your data set. And because the data set is small enough for you or a small team to go through efficiently, it may well be worth your while to go through those 100 examples and make sure that every one of them is labeled in a clean and consistent way, meaning according to a consistent labeling standard.
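One lightweight way to act on that, sketched below with invented file names and labels: check every label in a small data set against the agreed standard and flag anything that violates it for human review.

# A minimal sketch of auditing a small data set against a labeling standard.
# Each example is a (filename, label) pair; the allowed labels are whatever
# convention your team agreed on (made up here for illustration).
ALLOWED_LABELS = {"scratch", "no_defect"}

dataset = [
    ("phone_001.png", "scratch"),
    ("phone_002.png", "no_defect"),
    ("phone_003.png", "Scratch "),   # inconsistent casing and whitespace
]

for filename, label in dataset:
    if label not in ALLOWED_LABELS:
        print(f"review {filename}: label {label!r} violates the standard")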
In contrast, if you have a million data points, it can be harder, maybe impossible, for a small machine learning team to manually go through every example. Having clean labels is still very helpful, don't get me wrong; even when you have a lot of data, clean labels are better than noisy ones. But because of the difficulty of having the machine learning engineers go through every example, the emphasis shifts to data processes: how you collect and label the data, and the labeling instructions you may write for a large team of crowdsourced labelers. And once you have executed some data process, such as asking a large team of labelers to label a large set of audio clips, it can also be much harder to go back, change your mind, and get everything relabeled. So let's summarize.
For unstructured data problems, you may or may not have a huge collection of unlabeled examples x. Maybe in your factory you actually took many thousands of images of smartphones, but you just haven't bothered to label all of them yet. This is also common in the self-driving car industry, where many self-driving car companies have collected tons of images of cars driving around but have not yet gotten that data labeled. For these unstructured data problems, you can sometimes get more data by taking your unlabeled data x and asking humans to label more of it. This doesn't apply to every problem, but for the problems where you do have tons of unlabeled data, this can be very helpful. And as we have already mentioned, data augmentation can also be helpful. For structured data problems, it is usually harder to obtain more data, because you only have so many users or so many houses that you can collect data from. And human labeling is, on average, also harder, although there are some exceptions, such as in the last video, where you saw that we could try to ask people to label examples for the user ID merge problem. But in many cases where we ask humans to label structured data, even when it is completely worthwhile to ask people to label, say, whether two records are the same person, there is likely to be a bit more ambiguity, and even a human labeler sometimes finds it hard to be sure what the correct label is.
Lastly, let's look at small versus big data, where I used the slightly arbitrary threshold of whether you have more or fewer than, say, 10,000 training examples. For small data sets, clean labels are critical, and the data set may be small enough for you to manually look through the entire data set and fix any inconsistent labels. Further, the labeling team is probably not that large; it may be one or two or just a handful of people that created all the labels. So if you discover an inconsistency in the labels, say one person labels iguanas one way and a different person labels iguanas a different way, you can just get the two or three labelers together, have them talk to each other, and hash out and agree on one labeling convention.
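Of course, you first have to notice the inconsistency. Here is a small, purely illustrative sketch (the records and names are made up) that groups labels by example and surfaces any example whose labelers disagree.

# A minimal sketch for surfacing labeler disagreements on a small data set.
# Each record is (example_id, labeler, label); the data is invented.
from collections import defaultdict

records = [
    ("img_17.png", "alice", "iguana"),
    ("img_17.png", "bob", "lizard"),   # a disagreement to hash out
    ("img_42.png", "alice", "iguana"),
    ("img_42.png", "bob", "iguana"),
]

labels_by_example = defaultdict(set)
for example_id, labeler, label in records:
    labels_by_example[example_id].add(label)

for example_id, labels in labels_by_example.items():
    if len(labels) > 1:
        print(f"{example_id}: labelers disagree {sorted(labels)}")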
For very large data sets, the emphasis has to be on data processes. If you have 100 labelers or even more, it's just harder to get 100 people into a room to all talk to each other and hash out a process. So you might have to rely on a smaller team to establish a consistent label definition and then share that definition with all, say, 100 or more labelers and ask them all to implement the same process.
I want to leave you with one last thought, which is that I've found this categorization of problems, into unstructured versus structured and small versus big data, to be helpful for predicting not just whether data processes generalize from one problem to another, but also whether other machine learning ideas generalize from one problem to another. So one tip: if you are working on a problem from one of these four quadrants, then on average, advice from someone that has worked on problems in the same quadrant will probably be more useful than advice from someone that's worked in a different quadrant. I've also found, when hiring machine learning engineers, that someone who has worked in the same quadrant as the problem I'm trying to solve will usually be able to adapt more quickly to working on other problems in that quadrant, because the instincts and decisions are more similar within one quadrant than if you shift to a totally different quadrant in this chart.
I've sometimes heard people give advice like: if you are building a computer vision system, always get at least 1,000 labeled examples. I think people that give advice like that are well-meaning, and I appreciate that they're trying to give good advice, but I've found that advice not to be useful for all problems. Machine learning is very diverse, and it's hard to find one-size-fits-all advice like that. I've seen computer vision systems built with 100 examples, or 100 examples per class, and computer vision systems built with 100 million examples. So if you are looking for advice on a machine learning project, try to find someone that's worked in the same quadrant as the problem you are trying to solve. Now, we've talked about one formulation of the different types of machine learning problems. There's one aspect I'd like to dive into with you in the next video, which is how, for small data problems, having clean data is especially important. Let's take a look in the next video at why this is true.
