Cornell CS578: Introduction
Grading

4 credit course
25% take-home mid-term (late-October)
25% open-book final (????)
30% homework assignments (3 assignments)
20% course project (teams of 1-4 people)
late penalty: one letter grade per day
90-100 = A-, A, A+
80-90 = B-, B, B+
70-80 = C-, C, C+

Homeworks

short programming and experiment assignments
– e.g., implement backprop and test on a dataset
– goal: get familiar with a variety of learning methods
two or more weeks to complete each assignment
C, C++, Java, Perl, shell scripts, or Matlab
must be done individually
hand in code with summary and analysis of results
– emphasis on understanding and analysis of results, not generating a pretty report
short course in Unix and writing shell scripts
Statistics, Machine Learning, and Data Mining

Fun Stuff
Pre-Statistics: Ptolemy-1850

First “Data Sets” created before statistics
– Positions of Mars in orbit: Tycho Brahe (1546-1601)
– Star catalogs
  – Tycho catalog had 777 stars with 1-2 arcmin precision
– Messier catalog (100+ “dim fuzzies” that look like comets)
– Triangulation of meridian in France
Not just raw data - processing is part of data
– Tychonic System: anti-Copernican, many epicycles
No theory of errors - human judgment
– Kepler knew Tycho’s data was never in error by 8 arcmin
Few models of data - just learning about modeling
– Kepler’s Breakthrough: Copernican model and 3 laws of orbits
Statistics: 1850-1950

Calculations done manually
– manual decision making during analysis
– Mendel’s genetics
– human calculator pools for “larger” problems
Simplified models of data to ease computation
– Gaussian, Poisson, …
– Keep computations tractable
Get the most out of precious data
– careful examination of assumptions
– outliers examined individually

Analysis of errors in measurements
– What is the most efficient estimator of some value?
– How much error in that estimate?
Hypothesis testing:
– is this mean larger than that mean?
– are these two populations different?
Regression:
– what is the value of y when x = xi or x = xj?
How often does some event occur?
– p(fail(part1)) = p1; p(fail(part2)) = p2; p(crash(plane)) = ? (worked out in the sketch below)
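A worked version of that last question, as a minimal sketch. The independence assumption and the numeric failure rates below are illustrative, not from the slides: if the plane crashes whenever either part fails and the two failures are independent, then p(crash) = 1 - (1 - p1)(1 - p2).

# Illustrative only: assumes a crash occurs when either part fails and that
# the two failures are independent; the numbers are made up.
p1 = 0.001                            # assumed P(fail(part1))
p2 = 0.002                            # assumed P(fail(part2))
p_crash = 1 - (1 - p1) * (1 - p2)     # P(at least one part fails)
print(p_crash)                        # about 0.002998, slightly less than p1 + p2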
Machine Learning: 1950-2000...

Medium size data sets become available
– 100 to 100,000 records
– Higher dimension: 5 to 250 dimensions (more if vision)
– Fit in memory
Exist in computer, usually not on paper
Too large for humans to read and fully understand
Data not clean
– Missing values, errors, outliers
– Many attribute types: boolean, continuous, nominal, discrete, ordinal
– Humans can’t afford to understand/fix each point

Computers can do very complex calculations on medium size data sets
Models can be much more complex than before
Empirical evaluation methods instead of theory
– don’t calculate expected error, measure it from sample
– cross validation (see the sketch at the end of this slide)
– e.g., 95% confidence interval from data, not Gaussian model
Fewer statistical assumptions about data
Make machine learning as automatic as possible
Don’t know right model => OK to have multiple models (vote them)
Support Vector Machines (SVMs)
Ensemble Methods: Bagging and Boosting
Clustering

[Figure: medical prediction example using pre-hospital and in-hospital attributes such as Age, Gender, Blood Pressure, Chest X-Ray, RBC Count, Albumin, Blood pO2, and White Count]
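The cross validation bullet above can be made concrete with a minimal sketch. This is not the course's code: it assumes the caller supplies a train_fn that fits some model on a training set and an error_fn that measures that model's error on held-out data.

import random

def k_fold_cv(data, train_fn, error_fn, k=10, seed=0):
    # Estimate error empirically instead of deriving it: split the data into
    # k folds, train on k-1 folds, measure error on the held-out fold, and
    # average the k estimates.
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]                # k roughly equal folds
    errors = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                           # fit on k-1 folds
        errors.append(error_fn(model, held_out))          # measure, don't derive
    return sum(errors) / k                                # cross-validated error

For instance, a trivial "model" that always predicts the training mean could use train_fn = lambda d: sum(d) / len(d) and error_fn = lambda m, test: sum((x - m) ** 2 for x in test) / len(test).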
ML: Autonomous Vehicle Navigation
[Figure: autonomous vehicle navigation; predicted Steering Direction]

Can’t yet buy cars that drive themselves, and few hospitals use artificial neural nets yet to make critical decisions about patients.
Data Mining: 1995-20??
Huge data sets collected fully automatically
– large scale science: genomics, space probes, satellites
– Cornell’s Arecibo Radio Telescope Project:
  – terabytes per day
  – petabytes over the life of the project
  – too much data to move over the internet -- they use FedEx!

Protein Folding
Data Mining: 1995-20??

Huge data sets collected fully automatically
– large scale science: genomics, space probes, satellites
– consumer purchase data
– web: > 500,000,000 pages of text
– clickstream data (Yahoo!: terabytes per day!)
– many heterogeneous data sources
High dimensional data
– “low” of 45 attributes in astronomy
– 100’s to 1000’s of attributes common
– linkage makes many 1000’s of attributes possible

Data exists only on disk (can’t fit in memory)
Experts can’t see even modest samples of data (see the sampling sketch below)
Calculations done completely automatically
– large computers
– efficient (often simplified) algorithms
– human intervention difficult
Models of data
– complex models possible
– but complex models may not be affordable (Google)
Get something useful out of massive, opaque data
– data “tombs”
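One way to read the bullets about data living only on disk and experts not seeing even a modest sample: a fixed-size uniform random sample can still be drawn in a single pass with O(k) memory. A minimal reservoir-sampling sketch follows; the file name and one-record-per-line format in the usage comment are assumptions, not from the slides.

import random

def reservoir_sample(records, k, seed=0):
    # Keep a uniform random sample of k records from a stream of unknown
    # length: single pass, O(k) memory, the data never has to fit in memory.
    rng = random.Random(seed)
    sample = []
    for n, record in enumerate(records, start=1):
        if n <= k:
            sample.append(record)        # fill the reservoir first
        else:
            j = rng.randrange(n)         # record n survives with probability k/n
            if j < k:
                sample[j] = record
    return sample

# Hypothetical usage on a file far too large to load at once:
# with open("purchases.log") as f:
#     peek = reservoir_sample(f, k=1000)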
Data Mining: 1995-20??

What customers will respond best to this coupon?
Who is it safe to give a loan to?
What products do consumers purchase in sets?
What is the best pricing strategy for products?
Are there unusual stars/galaxies in this data?
Do patients with gene X respond to treatment Y?
What job posting best matches this employee?
How do proteins fold?

New Problems:
– Data too big
– Algorithms must be simplified and very efficient (linear in size of data if possible, one scan is best! See the one-scan sketch below)
– Reams of output too large for humans to comprehend
– Very messy uncleaned data
– Garbage in, garbage out
– Heterogeneous data sources
– Ill-posed questions
– Privacy
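As a hedged illustration of the "linear in the data, one scan" constraint (not from the slides): Welford's one-pass update computes a running mean and variance over an arbitrarily large stream without storing it or scanning it twice.

def one_scan_mean_variance(stream):
    # Welford's algorithm: mean and sample variance in a single scan,
    # O(1) memory, no second pass over the data.
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)         # uses the updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

# The values could just as well come from one scan of a terabyte-scale log.
print(one_scan_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))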
ML/DM Here to Stay