Cornell CS578: Introduction
Grading

4 credit course
25% take-home mid-term (late-October)
25% open-book final (????)
30% homework assignments (3 assignments)
20% course project (teams of 1-4 people)
late penalty: one letter grade per day
90-100 = A-, A, A+
80-90 = B-, B, B+
70-80 = C-, C, C+

Homeworks

short programming and experiment assignments
– e.g., implement backprop and test on a dataset
– goal: get familiar with a variety of learning methods
two or more weeks to complete each assignment
C, C++, Java, Perl, shell scripts, or Matlab
must be done individually
hand in code with summary and analysis of results
– emphasis on understanding and analysis of results, not generating a pretty report
short course in Unix and writing shell scripts
Statistics, Machine Learning, and Data Mining

Fun Stuff
Pre-Statistics: Ptolemy-1850

First “Data Sets” created before statistics
– Positions of Mars in orbit: Tycho Brahe (1546-1601)
– Star catalogs
  – Tycho catalog had 777 stars with 1-2 arcmin precision
– Messier catalog (100+ “dim fuzzies” that look like comets)
– Triangulation of meridian in France
Not just raw data - processing is part of data
– Tychonic System: anti-Copernican, many epicycles
No theory of errors - human judgment
– Kepler knew Tycho’s data was never in error by 8 arcmin
Few models of data - just learning about modeling
– Kepler’s Breakthrough: Copernican model and 3 laws of orbits
Statistics: 1850-1950

Calculations done manually
– manual decision making during analysis
– Mendel’s genetics
– human calculator pools for “larger” problems
Simplified models of data to ease computation
– Gaussian, Poisson, …
– Keep computations tractable
Get the most out of precious data
– careful examination of assumptions
– outliers examined individually

Analysis of errors in measurements
– What is the most efficient estimator of some value?
– How much error in that estimate?
Hypothesis testing:
– is this mean larger than that mean?
– are these two populations different?
Regression:
– what is the value of y when x = xi or x = xj?
How often does some event occur?
– p(fail(part1)) = p1; p(fail(part2)) = p2; p(crash(plane)) = ? (worked out in the sketch below)
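A worked version of that last question, as a minimal sketch. The independence assumption and the numeric failure rates below are illustrative, not from the slides: if the plane crashes whenever either part fails and the two failures are independent, then p(crash) = 1 - (1 - p1)(1 - p2).

# Illustrative only: assumes a crash occurs when either part fails and that
# the two failures are independent; the numbers are made up.
p1 = 0.001                            # assumed P(fail(part1))
p2 = 0.002                            # assumed P(fail(part2))
p_crash = 1 - (1 - p1) * (1 - p2)     # P(at least one part fails)
print(p_crash)                        # about 0.002998, slightly less than p1 + p2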
Machine Learning: 1950-2000...

Medium size data sets become available
– 100 to 100,000 records
– Higher dimension: 5 to 250 dimensions (more if vision)
– Fit in memory
Exist in computer, usually not on paper
Too large for humans to read and fully understand
Data not clean
– Missing values, errors, outliers
– Many attribute types: boolean, continuous, nominal, discrete, ordinal
– Humans can’t afford to understand/fix each point

Computers can do very complex calculations on medium size data sets
Models can be much more complex than before
Empirical evaluation methods instead of theory
– don’t calculate expected error, measure it from sample
– cross validation (see the sketch at the end of this slide)
– e.g., 95% confidence interval from data, not Gaussian model
Fewer statistical assumptions about data
Make machine learning as automatic as possible
Don’t know right model => OK to have multiple models (vote them)
Support Vector Machines (SVMs)
Ensemble Methods: Bagging and Boosting
Clustering

[Figure: medical prediction example using pre-hospital and in-hospital attributes such as Age, Gender, Blood Pressure, Chest X-Ray, RBC Count, Albumin, Blood pO2, and White Count]
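The cross validation bullet above can be made concrete with a minimal sketch. This is not the course's code: it assumes the caller supplies a train_fn that fits some model on a training set and an error_fn that measures that model's error on held-out data.

import random

def k_fold_cv(data, train_fn, error_fn, k=10, seed=0):
    # Estimate error empirically instead of deriving it: split the data into
    # k folds, train on k-1 folds, measure error on the held-out fold, and
    # average the k estimates.
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]                # k roughly equal folds
    errors = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                           # fit on k-1 folds
        errors.append(error_fn(model, held_out))          # measure, don't derive
    return sum(errors) / k                                # cross-validated error

For instance, a trivial "model" that always predicts the training mean could use train_fn = lambda d: sum(d) / len(d) and error_fn = lambda m, test: sum((x - m) ** 2 for x in test) / len(test).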
ML: Autonomous Vehicle Navigation
[Figure: autonomous vehicle navigation; predicted Steering Direction]

Can’t yet buy cars that drive themselves, and few hospitals use artificial neural nets yet to make critical decisions about patients.
Data Mining: 1995-20??
Huge data sets collected fully automatically
– large scale science: genomics, space probes, satellites
– Cornell’s Arecibo Radio Telescope Project:
  – terabytes per day
  – petabytes over the life of the project
  – too much data to move over the internet -- they use FedEx!

Protein Folding
Data Mining: 1995-20??

Huge data sets collected fully automatically
– large scale science: genomics, space probes, satellites
– consumer purchase data
– web: > 500,000,000 pages of text
– clickstream data (Yahoo!: terabytes per day!)
– many heterogeneous data sources
High dimensional data
– “low” of 45 attributes in astronomy
– 100’s to 1000’s of attributes common
– linkage makes many 1000’s of attributes possible

Data exists only on disk (can’t fit in memory)
Experts can’t see even modest samples of data (see the sampling sketch below)
Calculations done completely automatically
– large computers
– efficient (often simplified) algorithms
– human intervention difficult
Models of data
– complex models possible
– but complex models may not be affordable (Google)
Get something useful out of massive, opaque data
– data “tombs”
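One way to read the bullets about data living only on disk and experts not seeing even a modest sample: a fixed-size uniform random sample can still be drawn in a single pass with O(k) memory. A minimal reservoir-sampling sketch follows; the file name and one-record-per-line format in the usage comment are assumptions, not from the slides.

import random

def reservoir_sample(records, k, seed=0):
    # Keep a uniform random sample of k records from a stream of unknown
    # length: single pass, O(k) memory, the data never has to fit in memory.
    rng = random.Random(seed)
    sample = []
    for n, record in enumerate(records, start=1):
        if n <= k:
            sample.append(record)        # fill the reservoir first
        else:
            j = rng.randrange(n)         # record n survives with probability k/n
            if j < k:
                sample[j] = record
    return sample

# Hypothetical usage on a file far too large to load at once:
# with open("purchases.log") as f:
#     peek = reservoir_sample(f, k=1000)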
Data Mining: 1995-20??

What customers will respond best to this coupon?
Who is it safe to give a loan to?
What products do consumers purchase in sets?
What is the best pricing strategy for products?
Are there unusual stars/galaxies in this data?
Do patients with gene X respond to treatment Y?
What job posting best matches this employee?
How do proteins fold?

New Problems:
– Data too big
– Algorithms must be simplified and very efficient (linear in size of data if possible, one scan is best! See the one-scan sketch below)
– Reams of output too large for humans to comprehend
– Very messy uncleaned data
– Garbage in, garbage out
– Heterogeneous data sources
– Ill-posed questions
– Privacy
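As a hedged illustration of the "linear in the data, one scan" constraint (not from the slides): Welford's one-pass update computes a running mean and variance over an arbitrarily large stream without storing it or scanning it twice.

def one_scan_mean_variance(stream):
    # Welford's algorithm: mean and sample variance in a single scan,
    # O(1) memory, no second pass over the data.
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)         # uses the updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

# The values could just as well come from one scan of a terabyte-scale log.
print(one_scan_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))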
ML/DM Here to Stay