Unexpected Challenges in Large Scale Machine Learning by Charles Parker

Unexpected Challenges in Large
Scale Machine Learning
Charles Parker, BigML, Inc.

Who Am I?

Ph.D. from Oregon State University, 2007

Four years with Eastman Kodak Research Labs
− Data mining
− Computer vision/image processing

Currently with BigML
− Developing a scalable, available, and beautiful
platform for machine learning
− Launched private beta in March
− Still early days (Nine employees in
Europe/U.S.)

Brief Summary

Introduce you to BigML

Review some of the recent research in the
large-scale ML community

Pose some research questions that may not be
on the Big Data radar

This is all very, very preliminary (comments
appreciated)

A Little Bit More about BigML

Right now, only decision trees (more to come)

Going for a wide range of users

Goals
− All resources can be created and retrieved via our
REST API

Programmatic model creation

Downloadable, white-box models
− A compelling front-end interface
− Ease-of-use: As few clicks as possible; easy to
understand visualizations

A brief demo

Benefits

It's all in the cloud
− Easy to share with others
− Can “deploy” the model to anywhere
− Can trigger learning from anywhere
(couch-based machine learning)

Learns at scale
− Up to 64 GB (and counting)
− No specialized hardware or software required

Where's The Big?

Among our users, we find that very few have
data larger than 100 MB

Why is this?
− Takes too long?
− Inadequate infrastructure?
− Don't have the right algorithms?

Maybe it's something else . . .

Research Direction #1
Algorithms

Speed, speed, speed
− Langford's Vowpal Wabbit
− PEGaSoS

Parallelism
− Domingos, 2001
− Bekkerman's tutorial at KDD '11

Research Direction #2
Tools

Setting up clusters for large scale, parallel
execution of jobs
− Hadoop
− Storm

Languages allowing for hardware-independent
specification of parallel algorithms
− Spark
− Scalops

Using the GPU

The Benefits of Big Data

Tools for processing of massive data are crucial

Often, worse learners can be “fixed” by more data

If the hypothesis space is large enough, accuracy improvement
can be log-linear even to billions of examples
(Banko and Brill, 2001)

Is This What is Needed?

Processing big data is crucial, but many
interesting ML algorithms are trivially parallel

Is the focus on parallelism and architecture
really necessary, or just popular?

For most jobs, multi-machine architectures are
probably not necessary

“No one ever got fired for using Hadoop on a
cluster”
http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf

If not parallel architectures, then what?

Some Old Assumptions

Much of the current work in large scale learning
makes the standard assumptions about the
data:

That it is drawn i.i.d. from a stationary
distribution

That linear time algorithms are cheap

That super-linear time algorithms are expensive

Big Data, Assumption Breaker

Could easily be non-i.i.d.
− Even shuffling is expensive
− What if it's not all there?
− For many common large datasets, the
distribution is almost certainly not stationary
as the world itself isn't stationary

The easy solutions . . .
− Make a pass over the data to shuffle it
− Wait for it all to be there

. . . both break responsiveness.

The New Complexity

Network latency and disk read times may
dominate the cost of some learning algorithms
− One pass over the data is expensive
− Multiple passes may be out of the question

Because reading the data dominates costs, we
can do intensive computation in a given locality
without significantly impacting cost
− Read the data once into memory, do several
hundred passes, read the next block, . . .
− Super-linear algorithms aren't so bad?
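
A minimal sketch of the "read once, pass many times" idea above. It is an illustration only, not the speaker's implementation: the names `read_blocks` and `train_blockwise`, the linear model, and the step size are all assumptions, and an in-memory list stands in for the expensive disk or network read.

```python
import random


def read_blocks(rows, block_size):
    """Yield consecutive blocks; stands in for one expensive sequential read."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def train_blockwise(rows, block_size=100, passes_per_block=50, lr=0.01):
    """Fit y ~ w * x + b with SGD, making many cheap passes per in-memory block."""
    w, b = 0.0, 0.0
    reads = 0
    for block in read_blocks(rows, block_size):
        reads += 1  # each block is fetched exactly once
        for _ in range(passes_per_block):  # the extra passes cost no I/O
            for x, y in block:
                err = (w * x + b) - y
                w -= lr * err * x
                b -= lr * err
    return w, b, reads


# Hypothetical noise-free data generated from y = 3x + 1.
rng = random.Random(0)
data = [(x, 3.0 * x + 1.0) for x in (rng.uniform(-1, 1) for _ in range(1000))]
w, b, reads = train_blockwise(data)
```

Here the data is touched by the "reader" only ten times, while the learner performs fifty passes over each block, which is the trade the slide suggests when I/O dominates the cost.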

Example:
The “Slow Arrival” Problem

A lot of big data doesn't arrive all at once
− Transactional data
− Sensor data
− Economic data

We only get a chunk of the data every so often

The distribution may be non-stationary

Some Simple Solutions

Streaming algorithm, incremental updates
− Good, but limits our options somewhat
− Typically have to make choices about how long it
takes for data to “expire” (e.g., learning rate)
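
One way to picture the "expiry" choice is an exponentially weighted running mean, sketched below. The class name `DecayingMean` and the two learning rates are hypothetical; the point is only that the learning rate decides how quickly old observations stop mattering.

```python
class DecayingMean:
    """Incremental mean in which old observations fade at a fixed rate."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = float(x)  # the first observation initializes the estimate
        else:
            # Move a fixed fraction of the way toward the new observation.
            self.value += self.learning_rate * (x - self.value)
        return self.value


fast = DecayingMean(0.5)   # old data expires after a handful of updates
slow = DecayingMean(0.01)  # old data lingers for hundreds of updates

# The stream changes level after 200 observations.
for x in [10.0] * 200 + [20.0] * 20:
    fast.update(x)
    slow.update(x)
```

After the change, `fast` has essentially reached the new level while `slow` is still close to the old one; neither setting is right in general, which is the limitation the slide points at.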

Lazy accumulators / Reservoir sampling
− Lazy algorithms limit options
− Reservoir sampling isn't using all data
− Implicit expiry of data is “never”
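
For reference, a sketch of standard reservoir sampling (often called Algorithm R). The function name and signature are illustrative; the technique itself keeps a uniform sample of size k from a stream of unknown length.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # item i is kept with probability k / (i + 1)
    return reservoir


sample = reservoir_sample(range(10_000), 100, random.Random(0))
```

Every item, however old, has the same chance of being in the sample, which is why the implicit expiry is "never".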

Window-based retraining
− Completely forgets past data
− Window size is an explicit choice
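
A sketch of window-based retraining, assuming a trivially simple "model" (the mean of the retained values) so that the forgetting behaviour is easy to see. The class name `WindowedMeanModel` is hypothetical.

```python
from collections import deque


class WindowedMeanModel:
    """Retrains from scratch on only the most recent `window_size` blocks."""

    def __init__(self, window_size):
        # A deque with maxlen silently drops the oldest block when full.
        self.blocks = deque(maxlen=window_size)
        self.prediction = None

    def add_block(self, block):
        self.blocks.append(list(block))
        values = [v for b in self.blocks for v in b]
        self.prediction = sum(values) / len(values)  # "retrain" on the window
        return self.prediction


model = WindowedMeanModel(window_size=2)
for block in ([1.0, 1.0], [3.0, 3.0], [5.0, 5.0]):
    model.add_block(block)
```

After the third block arrives, the first block has no influence at all on the prediction, and the only tuning knob is the explicit `window_size`.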

Related Research #1:
Theory

Strong “Mixing Conditions” – Analysis of time-series
data whose observations become asymptotically
independent as they grow farther apart in time

Block-wise Stationarity – The data is drawn
from the same distribution for some period of
time before the distribution changes

Concept Drift – When the concept learned by a
classifier becomes invalid due to changes in the
generating distributions of either the input or the
output

Some Slow Arrival Data

Simulated traffic data (closely mirrors some of our user data)
− Cars per minute on a busy street
− Predict: Number of cars that will be on the street in a given
minute on a given date

Varies by time of day
− Rush hours have more traffic
− Night time has little

Varies by month of year
− Less weekend travel in the winter
− Less weekday travel in the summer

Gaussian noise added to make it interesting
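
A sketch of how data with these properties could be simulated. The slides do not give the actual generator, so every number and functional form below (rush-hour peaks at 8:00 and 17:30, the seasonal multipliers, the noise level) is an assumption chosen only to match the qualitative description.

```python
import math
import random
from datetime import datetime


def expected_cars(ts):
    """Noise-free cars per minute at timestamp ts (all constants are assumed)."""
    hour = ts.hour + ts.minute / 60.0
    base = 2.0  # a little traffic even at night
    daytime = 10.0 if 6 <= hour <= 21 else 0.0
    # Two Gaussian-shaped rush-hour bumps.
    rush = 30.0 * (math.exp(-((hour - 8.0) ** 2) / 2.0)
                   + math.exp(-((hour - 17.5) ** 2) / 2.0))
    level = base + daytime + rush
    weekend = ts.weekday() >= 5
    if weekend and ts.month in (12, 1, 2):
        level *= 0.7  # less weekend travel in the winter
    if not weekend and ts.month in (6, 7, 8):
        level *= 0.8  # less weekday travel in the summer
    return level


def simulate_cars(ts, rng, noise_sd=2.0):
    """Expected traffic plus Gaussian noise, clipped at zero."""
    return max(0.0, expected_cars(ts) + rng.gauss(0.0, noise_sd))


rng = random.Random(0)
rush_hour = simulate_cars(datetime(2012, 3, 14, 8, 0), rng)
```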

Algorithm and Strawmen

Basic algorithm:
− Given: Classifier at time n and the data
− When a new block arrives at n + 1, train a classifier on half
of the data
− Use the other half to estimate performance of the new
classifier vs. the old
− Resample according to the amount of “drift” detected, train
new classifier
− Repeat
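
The steps above leave the drift measure and the resampling rule unspecified. The sketch below fills them in with assumptions: the "classifier" is just a mean predictor, drift is the relative error reduction of the new model over the old one on the held-out half, and each historical point is dropped with probability equal to that drift. It is one possible reading, not the speaker's algorithm.

```python
import random


def train_mean(values):
    """Stand-in learner: predicts the mean of its training targets."""
    return sum(values) / len(values)


def mse(model, values):
    return sum((v - model) ** 2 for v in values) / len(values)


def adaptive_update(old_model, history, new_block, rng):
    """One step of the loop; returns (new_model, training_set, drift)."""
    shuffled = list(new_block)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train_half, holdout = shuffled[:half], shuffled[half:]

    candidate = train_mean(train_half)  # trained on half of the new block
    err_old = mse(old_model, holdout)   # the other half scores old vs. new
    err_new = mse(candidate, holdout)

    # Assumed drift score in [0, 1]: 0 if the old model is at least as good.
    drift = 0.0 if err_old <= err_new else 1.0 - err_new / err_old

    # Assumed resampling rule: drop each historical point with probability drift.
    kept = [v for v in history if rng.random() >= drift]
    training_set = kept + list(new_block)
    return train_mean(training_set), training_set, drift


rng = random.Random(0)
history = [rng.gauss(10.0, 1.0) for _ in range(200)]
old_model = train_mean(history)
shifted_block = [rng.gauss(50.0, 1.0) for _ in range(100)]
new_model, training_set, drift = adaptive_update(old_model, history, shifted_block, rng)
```

With a dramatic shift like the one in this example, the drift score is close to 1, almost all of the history is discarded, and the new model tracks the new block. When the new block resembles the history, the drift score stays near 0 and the history is retained.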

Compare with
− Reservoir sampling
− Training only on last block
− Training on last four blocks

Some Results #1:
Regular Seasonal Effects

Training on the last n blocks does well in the present
but not in general

Reservoir sampling trades a little present performance
for better performance in the general case

Adaptive resampling does more or less the same

Some Results #2:
Dramatic Changes

Sampling fails completely as history outside of
the current block doesn't matter

Adaptive resampling is able to detect the uselessness
of the history and maintain performance

Summary

Processing big data quickly is important

But it isn't everything!
− Big data brings new problems
− Some of these might be new learning settings
that are scientifically interesting

“Slow Arrival” data is one of these
− Seems general enough to be broadly
interesting
− Benefits from something more than the naïve
approach

Try BigML!

We're still in private beta, but go to:
www.bigml.com
And request an invitation!
