Intro to Data Science
Erin LeDell, Ph.D.
Statistician & Machine Learning Scientist
H2O.ai
H2O World 2015
Download our app, “H2O World 2015”
H2O World 2015
• I have H2O installed
• I have Python installed
• I have R installed
• I have the H2O World data sets
Pick up stickers or get install help at the
information booth
Intro to Data Science
• What is Data Science?
• The Data Scientist
• The Data Science Team
• Data Science Tools
• What is Machine Learning?
• What is Deep Learning?
• What is Ensemble Learning?
• Data Science Resources
What is Data Science?
One of the earliest uses of the
term "data science" occurred in
the title of the 1996 International
Federation of Classification
Societies conference in Kobe,
Japan.
What is Data Science?
• The term re-emerged and was popularized in 2001 by William Cleveland, then at Bell Labs, when he published “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.”
• This publication describes a plan to enlarge the major areas of technical work of the field of statistics. Dr. Cleveland states, “Since the plan is ambitious and implies substantial change, the altered field will be called Data Science.”
What is Data Science?
What is Data Science?
Problem Formulation
• Identify a data task or prediction problem
• Collect relevant data
Data Processing
• Clean, transform, filter, aggregate, impute
• Convert into X and Y
Machine Learning
• Train models
• Evaluate models
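The data processing step (impute missing values, then convert into X and Y) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the column names (`age`, `income`, `churned`) and the helper names `impute_mean` and `to_xy` are hypothetical and do not come from the slides, and mean imputation is just one of several possible strategies.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "churned": 0},
    {"age": None, "income": 61000.0, "churned": 1},
    {"age": 45, "income": None, "churned": 0},
    {"age": 29, "income": 48000.0, "churned": 1},
]

def impute_mean(records, column):
    """Replace missing values in `column` with the mean of the observed values."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]

def to_xy(records, features, response):
    """Split records into a feature matrix X and a response vector y."""
    X = [[r[f] for f in features] for r in records]
    y = [r[response] for r in records]
    return X, y

# Impute each feature column, then build X (predictors) and y (response).
clean = rows
for col in ("age", "income"):
    clean = impute_mean(clean, col)

X, y = to_xy(clean, ["age", "income"], "churned")
```

After this step, `X` holds one row of predictors per record and `y` holds the matching response, which is the form most machine learning libraries expect.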
The Data Science Venn Diagram
Drew Conway (2010)
The Data Scientist
The Data Scientist “Unicorn”
Survey of Data Scientists on LinkedIn
The number of data scientists has doubled
over the last 4 years.
The top five skills listed by data scientists:
1. Data Analysis
2. R
3. Python
4. Data Mining
5. Machine Learning
From Data Unicorns to Data Teams
Data Science Teams
Data Analysts
• Strong data skills and the ability to use existing data analysis tools
• Able to communicate and tell a story using data
Data Engineers
• Usually a background in computer science or engineering
• Very good programming and DevOps skills
Data Scientists
• Strong math/stats background in addition to programming ability
• Understanding of machine learning algorithms
Data Science Teams
Data Science in the Enterprise
• Data Science teams develop “actionable insights” for business.
• They provide decision makers with information, guidance and confidence in the decision-making process.
• Competitive advantage
• Cost minimization
• Data-driven products
Data Science in the Enterprise
Don’t be a data dinosaur.
Embrace the data!
Data Science Tools
2013 was the year of the data science “language wars.”
Data Science Tools
In 2015, we have evolved beyond this…
We are too busy doing actual data science!
Data Science Tools
We are headed toward language-agnostic data science, where
friendly APIs connect to powerful data processing engines.
What is Machine Learning?
Unlike rules-based systems, which require a human
expert to hard-code domain knowledge directly into
the system, a machine learning algorithm learns how
to make decisions from the data alone.
“Field of study that gives computers the
ability to learn without being
explicitly programmed.”
– Arthur Samuel, 1959
Machine Learning Tasks
Regression
• Predict a real-valued response (e.g. viral load, price)
• Gaussian, Gamma, Poisson, etc. distributed response
• Evaluate with MSE or R^2
Classification
• Multi-class or binary classification
• Ranking (e.g. Google Search results order)
• Evaluate with Classification Error or AUC
Clustering
• Unsupervised learning (no training labels)
• Partition the data; identify clusters or sub-populations
• Evaluate with AIC, BIC or Total Sum of Squares
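Three of the evaluation metrics named on this slide (MSE and R^2 for regression, classification error for classification) are simple enough to write out directly. The sketch below uses plain Python and made-up example values; the function names are illustrative only, and in practice a library implementation would normally be used instead.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """R^2: 1 minus (residual sum of squares / total sum of squares)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def classification_error(y_true, y_pred):
    """Fraction of predictions that do not match the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

# Made-up regression example.
reg_true = [3.0, 5.0, 7.0, 9.0]
reg_pred = [2.5, 5.0, 7.5, 9.0]
reg_mse = mse(reg_true, reg_pred)
reg_r2 = r_squared(reg_true, reg_pred)

# Made-up classification example.
cls_true = ["a", "b", "a", "b"]
cls_pred = ["a", "b", "b", "b"]
cls_err = classification_error(cls_true, cls_pred)
```

Lower is better for MSE and classification error; for R^2, values closer to 1 indicate that the model explains more of the variance in the response.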
Train, Validation and Test Set
• If you plan on doing any model tuning, you should split your dataset into three parts: Train, Validation and Test
• There is no general rule for how you should partition the data and it will depend on how strong the signal in your data is, but an example could be: 50% Train, 25% Validation and 25% Test
• The validation set is used strictly for model tuning (via validation of models with different parameters) and the test set is used to make a final estimate of the generalization error
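The 50% / 25% / 25% example above can be sketched as a shuffled three-way split. This is a minimal illustration in plain Python; the function name `train_valid_test_split`, the fixed seed and the use of integers 0..99 as stand-in rows are assumptions made for the example.

```python
import random

def train_valid_test_split(rows, fractions=(0.5, 0.25, 0.25), seed=42):
    """Shuffle rows and cut them into train, validation and test parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains goes to test
    return train, valid, test

# Stand-in dataset: 100 row identifiers.
train, valid, test = train_valid_test_split(range(100))
```

Models with different parameters would be fit on `train` and compared on `valid`; `test` is touched once, at the end, to estimate how the chosen model generalizes.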
K-fold Cross-validation
• K-fold Cross-validation (CV) is used to evaluate the performance of machine learning algorithms.
• CV will give you the most “mileage” on your training data.
• Performance metrics are averaged across k folds.
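The mechanics of k-fold cross-validation (hold out one fold, train on the rest, average the metric across folds) can be sketched as follows. To keep the example self-contained it makes two simplifying assumptions that are not from the slides: the "model" is just the mean of the training responses, and the folds are contiguous blocks without shuffling (real data would usually be shuffled first). All function names are illustrative.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, holdout_idx) pairs for k roughly equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, holdout
        start += size

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_validate_mean_model(y, k):
    """Return (average MSE, per-fold MSEs) for a predict-the-mean model."""
    scores = []
    for train_idx, holdout_idx in k_fold_indices(len(y), k):
        train_y = [y[i] for i in train_idx]
        prediction = sum(train_y) / len(train_y)  # "fit" on k-1 folds
        truth = [y[i] for i in holdout_idx]
        preds = [prediction] * len(truth)         # predict on the held-out fold
        scores.append(mse(truth, preds))
    return sum(scores) / len(scores), scores

avg_mse, fold_mses = cross_validate_mean_model([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3)
```

Every observation is used for training in k-1 folds and for evaluation in exactly one fold, which is where the extra “mileage” comes from.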
Machine Learning Workflow
Training and Prediction in machine learning
What is Deep Learning?
• Deep neural networks have more than one hidden layer in their architecture. That’s why they are called “deep” neural networks.
• Very useful for complex input data such as images, video, audio.
“A branch of machine learning based on a set of
algorithms that attempt to model high-level
abstractions in data by using model architectures,
composed of multiple non-linear transformations.”
– Wikipedia (2015)
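To make “more than one hidden layer” and “multiple non-linear transformations” concrete, here is a sketch of a forward pass through a tiny network with two hidden layers. It is an illustration only: the layer sizes, the fixed weights and the choice of ReLU and sigmoid activations are assumptions for the example, and no training is performed.

```python
import math

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), unit by unit."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fixed weights (a trained network would learn these).
hidden1_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]   # 3 units, 2 inputs
hidden1_b = [0.0, 0.1, -0.1]
hidden2_w = [[0.2, 0.5, 0.3], [0.7, 0.1, -0.4]]      # 2 units, 3 inputs
hidden2_b = [0.05, 0.0]
output_w = [[1.0, -1.0]]                             # 1 unit, 2 inputs
output_b = [0.0]

def forward(x):
    """Input -> hidden layer 1 -> hidden layer 2 -> output probability."""
    h1 = dense(x, hidden1_w, hidden1_b, relu)
    h2 = dense(h1, hidden2_w, hidden2_b, relu)
    return dense(h2, output_w, output_b, sigmoid)[0]

p = forward([1.0, 2.0])
```

Each layer applies a linear map followed by a non-linearity; stacking two or more such hidden layers is what makes the network “deep.”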
What is Deep Learning?
• Deep learning architectures, specifically artificial neural networks (ANNs), have been around since 1980.
• However, there were breakthroughs in training techniques that led to their recent resurgence in the mid-2000s.
• Combined with modern computing power, they are quite effective.
What is Ensemble Learning?
• Random Forests and Gradient Boosting Machines (GBM) are both ensembles of decision trees.
• Stacking, or Super Learning, is a technique for combining various learners into a single, powerful learner using a second-level metalearning algorithm.
“Ensemble methods use multiple learning
algorithms to obtain better predictive
performance than could be obtained from
any of the constituent learning algorithms.”
– Wikipedia (2015)
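The idea of a second-level metalearner can be sketched with a deliberately small example. Here two base learners' predictions are combined as w * p1 + (1 - w) * p2, and the metalearner picks the weight w by least squares. This is a simplified stand-in, not the Super Learner algorithm itself: the base predictions are made-up numbers, the metalearner is restricted to a single weight, and in real stacking the base predictions fed to the metalearner should be cross-validated (out-of-fold) predictions.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_blend_weight(y, p1, p2):
    """Least-squares weight w for the blend w * p1 + (1 - w) * p2."""
    num = sum((t - b) * (a - b) for t, a, b in zip(y, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return num / den

def blend(w, p1, p2):
    """Apply the learned weight to combine two prediction vectors."""
    return [w * a + (1.0 - w) * b for a, b in zip(p1, p2)]

# Made-up held-out responses and base-learner predictions.
y_holdout = [1.0, 2.0, 3.0, 4.0]
pred_a = [1.5, 2.5, 3.5, 4.5]   # hypothetical learner A: predicts too high
pred_b = [0.5, 1.5, 2.5, 3.5]   # hypothetical learner B: predicts too low

w = fit_blend_weight(y_holdout, pred_a, pred_b)
ensemble_pred = blend(w, pred_a, pred_b)

base_mse_a = mse(y_holdout, pred_a)
base_mse_b = mse(y_holdout, pred_b)
ensemble_mse = mse(y_holdout, ensemble_pred)
```

In this toy case the two learners err in opposite directions, so the blend beats both of them; with real learners the gain depends on how different their errors are.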
No Free Lunch
• No general purpose algorithm to solve all problems.
• No right answer on optimal data preparation.
• Some algorithms may have such strong biases that they can only learn certain kinds of functions.
“Even after the observation of the frequent or
constant conjunction of objects, we have no reason
to draw any inference concerning any object beyond
those of which we have had experience.”
– David Hume (1711-1776)
Where to Learn More?
• H2O Online Training (free): http://learn.h2o.ai
• H2O Slidedecks: http://www.slideshare.net/0xdata
• H2O Video Presentations: https://www.youtube.com/user/0xdata
• H2O Community Events & Meetups: http://h2o.ai/events
• Machine Learning & Data Science courses: http://coursebuffet.com

H2O World - Intro to Data Science with Erin Ledell
