intro
1
Introduction to Data Science
2
What is data
science?
3
What is data
science?
• Google: Google's search engine is a testament to the power of data science. The
PageRank algorithm revolutionized web search by ranking pages by importance,
as inferred from the links between them.
• The Human Genome Project, completed in 2003, marked a pivotal moment in data-driven medicine.
The project mapped the human genome, generated massive amounts of data, and
spurred advances in personalized medicine.
5
Asking the Right Question
6
“A problem well defined
is a problem half
solved.”
Charles F. Kettering
7
Evaluate a new
product
Sales figures
8
Detect equipment
failure
Sensor data
Sensor data
Sensor data
9
Better targeted marketing
Customer data
Marketing data
10
Assess the Situation
Risks
Benefits
Contingencies
Regulations
Resources
Requirements
11
Define Goals
Objectives
Criteria
12
Formulate the Question
Define the Problem
Define Goals
13
Steps in the Data Science Process
14
ACQUIRE PREPARE ANALYZE REPORT ACT
15
ACQUIRE PREPARE ANALYZE REPORT ACT
Preliminary
analysis
18
ACQUIRE PREPARE ANALYZE REPORT ACT
Iterative process
23
Step 1: Acquiring Data
24
ACQUIRE PREPARE ANALYZE REPORT ACT
• Identify datasets
• Retrieve datasets
• Query data
25
Where’s the data?
26
Data comes from many places…
27
Traditional databases
28
Text files
Scripting languages
29
Remote data
SOAP
REST
WebSocket
Web Services
30
NoSQL storage
31
Acquiring data related to wildfires…
• Historical weather: SQL
• Current weather: WebSocket
• Real-time tweets near fires: REST
32
Traditional databases: SQL and query browsers
Text files: Scripting languages
Remote data: Web Services
NoSQL storage: Web Services, Programming Interfaces
33
Step 2-A: Exploring Data
34
ACQUIRE PREPARE ANALYZE REPORT ACT
Preliminary
analysis
35
Why explore?
36
Why explore?
Correlations
37
Why explore? Outliers
General trends
Correlations
38
Describe Your Data
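A minimal sketch of what "describing" a variable could look like, using only Python's standard library. The values are the residual sugar column of the wine example table that appears later in these slides; the `describe` helper is a hypothetical name chosen for illustration, not part of any library.

```python
import statistics

# Residual sugar values copied from the wine example table in these slides.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 2.0, 2.8, 1.8]


def describe(values):
    """Return basic summary statistics for a list of numbers (hypothetical helper)."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = describe(residual_sugar)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

With a library such as Pandas, a similar summary is usually available through a single call on a data frame.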
39
Visualize Your Data
• Heat Maps
• Scatter plots
• Boxplots
40
Data Exploration
Data Understanding
Informed Analysis
41
Step 2-B: Pre-processing Data
42
ACQUIRE PREPARE ANALYZE REPORT ACT
Clean Transform
43
Real-world data is messy!
• Inconsistent values
• Duplicate records
• Missing values
• Invalid data
• Outliers
44
Addressing Data Quality Issues
Domain
Knowledge
45
Getting Data in Shape
46
Data Munging
• Scaling
• Transformation
• Feature Selection
• Dimensionality Reduction
• Data Manipulation
47
Scaling
Scaled Values
Weight
Height
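One common way to scale features such as weight and height is min-max scaling, which maps each feature to the range [0, 1]. The sketch below is one possible illustration, not necessarily the method shown on the slide; the weight and height values and the `min_max_scale` helper are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical measurements on very different scales.
weights_kg = [50.0, 60.0, 80.0, 100.0]
heights_m = [1.50, 1.60, 1.80, 2.00]

scaled_weights = min_max_scale(weights_kg)
scaled_heights = min_max_scale(heights_m)
print(scaled_weights)
print(scaled_heights)
```

After scaling, both features lie in the same range, so neither dominates a distance computation simply because of its unit.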
48
Transformation
Original Transformed
Data Data
49
Feature Selection
• Remove feature
• Combine features
• Add feature
50
Dimensionality Reduction
3D 2D
51
Data Manipulation
52
Always Remember!
Data preparation is
very important for
meaningful analysis!
53
Step 3: Analyze Data
54
ACQUIRE PREPARE ANALYZE REPORT ACT
55
Build Model
Input Data
56
Categories of Analysis Techniques
• Classification
• Regression
• Clustering
• Association Analysis
• Graph Analytics
57
Classification
Goal: Predict category
Example categories: Sunny, Windy, Rainy, Cloudy
58
Regression
Goal: Predict numeric value
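As a small illustration of predicting a numeric value, the sketch below fits a straight line by ordinary least squares. It is only one of many regression techniques; the data points and the `fit_line` and `predict` helpers are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict the numeric target for a new input x."""
    return slope * x + intercept


# Hypothetical training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = predict(5.0, slope, intercept)
print(slope, intercept, prediction)
```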
59
Clustering
Adults
Teenagers
60
Association Analysis
Goal: Find rules to capture
associations between items
61
Graph Analytics
Goal: Use graph structures to
find connections between entities
62
Modeling
• Select technique
• Build model
• Validate model
63
Evaluation of Results
64
Classification and Regression
Predicted value vs. correct value
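Comparing predicted values with correct values can be sketched with two simple metrics: accuracy for classification and mean squared error for regression. These are common choices rather than the only ones; the example predictions below are hypothetical.

```python
def accuracy(predicted, correct):
    """Fraction of predictions that match the correct category."""
    matches = sum(p == c for p, c in zip(predicted, correct))
    return matches / len(correct)


def mean_squared_error(predicted, correct):
    """Average squared difference between predicted and correct values."""
    return sum((p - c) ** 2 for p, c in zip(predicted, correct)) / len(correct)


# Hypothetical classification results (weather categories).
predicted_labels = ["sunny", "rainy", "cloudy", "windy"]
correct_labels = ["sunny", "rainy", "sunny", "windy"]
acc = accuracy(predicted_labels, correct_labels)

# Hypothetical regression results.
predicted_values = [2.0, 4.0, 6.0]
correct_values = [3.0, 4.0, 5.0]
mse = mean_squared_error(predicted_values, correct_values)

print(acc, mse)
```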
65
Clustering
66
Association Analysis and Graph Analytics
Investigate Validate
67
Determine Next Steps
Repeat analysis?
Act on results?
68
Select technique Build model Evaluate
• Classification
• Regression
• Clustering
• Association Analysis
• Graph Analytics
69
Step 4: Reporting Insights
70
ACQUIRE PREPARE ANALYZE REPORT ACT
71
What to Present
72
How to Present
74
Visualization Tools
75
Present
with
using
76
Step 5: Turning Insights into Action
77
ACQUIRE PREPARE ANALYZE REPORT ACT
78
Database NoSQL Files
GIS
Social Sensor
Action
79
Implementation
Process
Action
Automation Stakeholders
80
Assess Impact
Monitor
Action
Measure
Evaluation
81
Action
Determine
Next Steps
Evaluation
Favorable results?
Revisit?
Further opportunities?
82
Action
Evaluation
Real-time
Action?
Favorable results?
Revisit?
Further opportunities?
83
Data Engineering Computational Data Science
84
Data
Data Science
Insight
Action
85
Insight Data Product
Analysis
Data Insight
Question
86
Book Recommendations
• Customer demographics
• Previous purchases
• Book reviews
What kind of books does this customer like?
Book recommendations
88
Find Potential Audience for a Book
Model of customer’s
book preferences
Who is likely to like
this book?
New book
information
89
Market a New Book
Who is likely to like this book?
Action to market the book to the right audience
90
Market a New Book
Insight Action
91
Actionable Information
Prediction
92
Prediction
Action
93
Data
94
A recent growth
Data
► Growth in the amount of available data
► Need to extract value from this resource
Infra
► Increasing computational power (GPUs)
► Parallel computing
► Low-cost data centers
95
A recent growth
Data + Infra
► Academic maturity (Stats, Computer science, machine
learning)
► Industrial maturity (GAFAM, Silicon Valley)
⇒ value-added
► Create knowledge from data
► Exploit this knowledge: from products to customers
96
How Much Data Is Big Data?
97
The Data - A variety of data
► Text
► Audio
► Videos
► Graphs . . .
Data Deluge
“We are drowning in
information and starving for
knowledge”
– John Naisbitt
Source: Megatrends, 1982
99
Complex data
► Bioinformatics
► biochips, DNA sequencing . . .
► Medical science
► MRI, medical cases, drug tests . . .
► Astrophysics
► Geophysics
► Education
► ...
100
Different kinds of Data
101
Terms to Describe Data
102
Other Names for ‘Sample’
instance
sample
row
observation
record
example
103
Other Names for ‘Variable’
dimension
feature
variable
column
attribute
field
104
Data Types
• Most common: Numeric, Categorical
• Others: String, Date, …
105
Numeric Variables
• Values are numbers
• Also called ‘quantitative’
$7 \times 10^5$
1
-0.4902
163.92
106
Categorical Variables
• Values are labels, names, or categories
• Also called ‘qualitative’ or ‘nominal’
107
Variable
• Feature
• Field
• Column
• …
Sample
• Instance
• Record
• Row
• Observation
• …
Categorical: Qualitative, Nominal
Numeric: Quantitative
108
Dataset
Data
► Data is a set of samples encoded by d features
► We generally consider N samples
Features
► Feature: an elementary descriptor of an entity
► Similar to attribute, characteristic, label, variable, . . .
Sample
► A sample is an entity encoding an object
► It’s composed of features
► Similar to point, instance, vector, . . .
► Usually encoded as a vector in $\mathbb{R}^d$
109
Data : an Example
      citric acid   residual sugar   chlorides   sulfur dioxide
 1    0             1.9              0.076       11
 2    0             2.6              0.098       25
 3    0.04          2.3              0.092       15
 4    0.56          1.9              0.075       17
 5    0             1.9              0.076       11
 6    0             1.8              0.075       13
 7    0.06          1.6              0.069       15
 8    0.02          2                0.073       9
 9    0.36          2.8              0.071       17
10    0.08          1.8              0.097       15
[Figure: plot of the samples in the table above; one axis is labeled "Variable 4 : Sulfur"]
110
Our Notation
Dataset
Consider a dataset composed of N samples, each sample being encoded by d features.
► Data: $X \in \mathbb{R}^{N \times d}$
► Sample: $x \in \mathbb{R}^d$
► $x(j)$: feature $j$ of $x$
► ⇒ $X(i,:) = x_i^T$, $X(i,j) = x_i(j)$
Labels
Property associated with each $x_i$
► Domain: $\mathcal{Y} = \{0, 1\}$ for binary classification
► $y_i \in \mathcal{Y}$
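The notation can be made concrete with a short sketch in plain Python, using the wine example table from these slides as the data matrix. Note that Python indexes from 0, so sample $x_i$ on the slides corresponds to `X[i - 1]` here. The binary labels `y` are hypothetical and only illustrate the domain {0, 1}.

```python
# Feature names and values copied from the wine example table (N = 10, d = 4).
feature_names = ["citric acid", "residual sugar", "chlorides", "sulfur dioxide"]
X = [
    [0.00, 1.9, 0.076, 11.0],
    [0.00, 2.6, 0.098, 25.0],
    [0.04, 2.3, 0.092, 15.0],
    [0.56, 1.9, 0.075, 17.0],
    [0.00, 1.9, 0.076, 11.0],
    [0.00, 1.8, 0.075, 13.0],
    [0.06, 1.6, 0.069, 15.0],
    [0.02, 2.0, 0.073, 9.0],
    [0.36, 2.8, 0.071, 17.0],
    [0.08, 1.8, 0.097, 15.0],
]

N = len(X)     # number of samples (rows)
d = len(X[0])  # number of features (columns)

x_3 = X[2]         # third sample, a vector with d entries
x_3_1 = X[2][0]    # feature 1 (citric acid) of the third sample

# Hypothetical binary labels, one per sample, for illustration only.
y = [0, 1, 1, 0, 0, 0, 1, 0, 1, 1]

print(N, d, x_3, x_3_1)
```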
111
Pre-Processing
An important step
► Cleaning and filtering data
► Managing incomplete data
► Detect and process outliers
► ...
Encoding
► From raw data to vectors
► Requires data embedding
► Extract relevant features for each
sample
112
Statistical Analysis
What for?
► Detect irrelevant data
► Identify bias between different
subsets
113
Statistical Analysis
Multivariate Statistics
► Bivariate analysis
► Correlation coefficient
► ...
What for?
► Identify links between features
► Determine useless features
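A minimal sketch of a bivariate analysis: the Pearson correlation coefficient between two features, written in plain Python. The two columns are taken from the wine example table in these slides; the `pearson` helper is a hypothetical name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (norm_x * norm_y)


# Columns copied from the wine example table.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 2.0, 2.8, 1.8]
sulfur_dioxide = [11, 25, 15, 17, 11, 13, 15, 9, 17, 15]

r = pearson(residual_sugar, sulfur_dioxide)
print(f"correlation: {r:.3f}")
```

A coefficient close to 1 or -1 suggests a strong linear link between two features, which may indicate that one of them is redundant; a value near 0 suggests no linear link.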
114
Data scaling
Standardisation
► Different methods:
► Classic: subtract the mean and divide by the standard deviation
What for?
► Same order of magnitude for each feature
► "Symmetrization" of the feature distributions, . . .
115
Data Standardisation: Illustration
Before Scaling
After Scaling
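The classic standardisation (subtract the mean, divide by the standard deviation) can be sketched as follows with Python's standard library. The values are the sulfur dioxide column of the wine example table; the `standardize` helper is a hypothetical name, and the population standard deviation is an assumption (some tools use the sample standard deviation instead).

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Sulfur dioxide column copied from the wine example table.
sulfur_dioxide = [11, 25, 15, 17, 11, 13, 15, 9, 17, 15]

z_scores = standardize(sulfur_dioxide)
print([round(z, 2) for z in z_scores])
```

After this transformation the feature has mean 0 and standard deviation 1, so features measured in different units become comparable.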
Modern Data Science Skills
• Programming in Python
• Statistics
• Machine Learning
• Scalable Big Data Analysis
117
Why Python?
• Easy-to-read and learn
• Vibrant community
• Growing and evolving set of libraries
• Data management
• Analytical processing
• Visualization
• Applicable to each step in the data science process
• Notebooks
118
What to look forward to!
• Jupyter notebooks
• NumPy
• Pandas
• Matplotlib
• Scikit-Learn
• BeautifulSoup
119
Case Study 1: Introduction
Notebook on moodle
120
Case Study 2: Soccer Data Analysis
121
But first, let's take a look at Structured
Query Language (SQL)
122
Working With Databases:
Relational Data Model
123
A Collection of Tables
ID   FName   LName     Department  Title                Salary
202  John    Gonzales  IT          DB Specialist        104750
203  Mary    Roberts   Research    Director             175400
204  Janaki  Rao       HR          Financial Analyst    63850
205  Alex    Knight    IT          Security Specialist  123500
206  Pamela  Ziegler   IT          Programmer           85600
207  Harry   Dawson    HR          Director             115450
124
No Duplicates
125
Dissimilar Tuples Disallowed
ID    FName   LName     Department      Title                Salary
202   John    Gonzales  IT              DB Specialist        104750
203   Mary    Roberts   Research        Director             175400
204   Janaki  Rao       HR              Financial Analyst    63850
205   Alex    Knight    IT              Security Specialist  123500
206   Pamela  Ziegler   IT              Programmer           85600
207   Harry   Dawson    HR              Director             115450
Jane  Doe     208       Res. Associate  65800                Research
126
Foreign Keys
EmpSalaries.EmpID references Employees.ID
EmpSalaries
EmpID  Date       Salary
202    1/1/2016   104750
203    2/15/1016  175400
204    6/1/2015   63850
205    9/15/2015  123500
206    10/1/2015  85600
207    4/15/2015  115450
202    9/15/2014  101250
204    3/1/2015   48000
207    9/15/2013  106900
205    10/1/2014  113400
127
Joining Relations
Employees
ID   FName   LName
202  John    Gonzales
203  Mary    Roberts
204  Janaki  Rao
205  Alex    Knight
206  Pamela  Ziegler
207  Harry   Dawson
EmpSalaries
EmpID  Date       Salary
202    1/1/2016   104750
203    2/15/1016  175400
204    6/1/2015   63850
205    9/15/2015  123500
206    10/1/2015  85600
207    4/15/2015  115450
202    9/15/2014  101250
204    3/1/2015   48000
207    9/15/2013  106900
205    10/1/2014  113400
Join result
ID   FName   LName     Date       Salary
202  John    Gonzales  1/1/2016   104750
202  John    Gonzales  9/15/2014  101250
203  Mary    Roberts   2/15/1016  175400
204  Janaki  Rao       6/1/2015   63850
204  Janaki  Rao       3/1/2015   48000
205  Alex    Knight    9/15/2015  123500
205  Alex    Knight    10/1/2014  113400
206  Pamela  Ziegler   10/1/2015  85600
207  Harry   Dawson    4/15/2015  115450
207  Harry   Dawson    9/15/2013  106900
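The join of the two relations can be reproduced with a short sketch using Python's built-in `sqlite3` module and an in-memory database. The table and column names follow the slides (Employees, EmpSalaries, EmpID); the values are copied as they appear there. The choice of SQLite and the `ORDER BY` clause are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the reference

conn.execute(
    "CREATE TABLE Employees (ID INTEGER PRIMARY KEY, FName TEXT, LName TEXT)"
)
conn.execute(
    """CREATE TABLE EmpSalaries (
           EmpID INTEGER REFERENCES Employees(ID),
           "Date" TEXT,
           Salary INTEGER
       )"""
)

conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (202, "John", "Gonzales"), (203, "Mary", "Roberts"),
        (204, "Janaki", "Rao"), (205, "Alex", "Knight"),
        (206, "Pamela", "Ziegler"), (207, "Harry", "Dawson"),
    ],
)
conn.executemany(
    "INSERT INTO EmpSalaries VALUES (?, ?, ?)",
    [
        (202, "1/1/2016", 104750), (203, "2/15/1016", 175400),
        (204, "6/1/2015", 63850), (205, "9/15/2015", 123500),
        (206, "10/1/2015", 85600), (207, "4/15/2015", 115450),
        (202, "9/15/2014", 101250), (204, "3/1/2015", 48000),
        (207, "9/15/2013", 106900), (205, "10/1/2014", 113400),
    ],
)

# Match every salary record with the employee it references.
joined = conn.execute(
    """SELECT e.ID, e.FName, e.LName, s."Date", s.Salary
       FROM Employees AS e
       JOIN EmpSalaries AS s ON s.EmpID = e.ID
       ORDER BY e.ID, s.Salary DESC"""
).fetchall()

for row in joined:
    print(row)
conn.close()
```

Each employee appears once per matching salary record, which is why the joined relation has ten rows although there are only six employees.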
128
Summary
129
Working With Databases:
Structured Query Language
130
What is Data Retrieval?
• Data retrieval
• The way in which the desired data is specified and retrieved from a
data store
• Our focus
• How to specify a data request
• The internal mechanism of data retrieval
131
Structured Query Language
• The standard for structured data
• Example Database Schema
Bars(name, addr, license)
Beers(name, manf)
Sells(bar, beer, price)
Bars
name                addr                             license
Great American Bar  363 Main St., SD, CA 92390       41-437844098
Beer Paradise       6450 Mango Drive, SD, CA 92130   41-973428319
Have a Good Time    8236 Adams Avenue, SD, CA 92116  32-032263401
132
SELECT-FROM-WHERE
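A SELECT-FROM-WHERE query names the columns to return (SELECT), the table to read (FROM), and the condition rows must satisfy (WHERE). The sketch below runs one such query with Python's built-in `sqlite3` module against the Bars table from the example schema. The specific condition (licenses starting with "41-") is a hypothetical query chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bars (name TEXT, addr TEXT, license TEXT)")
conn.executemany(
    "INSERT INTO Bars VALUES (?, ?, ?)",
    [
        ("Great American Bar", "363 Main St., SD, CA 92390", "41-437844098"),
        ("Beer Paradise", "6450 Mango Drive, SD, CA 92130", "41-973428319"),
        ("Have a Good Time", "8236 Adams Avenue, SD, CA 92116", "32-032263401"),
    ],
)

# SELECT: which columns; FROM: which table; WHERE: which rows.
query = """
    SELECT name, addr
    FROM Bars
    WHERE license LIKE '41-%'
    ORDER BY name
"""
matching_bars = conn.execute(query).fetchall()

for name, addr in matching_bars:
    print(name, "|", addr)
conn.close()
```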
133
name addr license
134
Summary
135
Case Study 2: Soccer Data Analysis
136
Understanding the Benefits
Insight
Action
Data
Ask yourself:
“What insights do I expect to get?”
ACTIONS
• Coach can design programs that improve
these areas in teams
137
Basic Steps in a Data Science Project
ACQUIRE • Import raw dataset into your analytics platform
PREPARE • Explore & Visualize
• Perform Data Cleaning
ANALYZE • Feature Selection
• Model Selection
• Analyze the results
REPORT • Present your findings
ACT • Use them
138
Data Preparation: Explore using Statistics
139
Data Cleaning
• Why do we need to clean data?
• Missing entries
• Garbage values
• NULLs
• How do we clean data?
• Remove the entries
• Impute these entries with a counterpart
• Ex. Average values of the column
• Ex. Assign 0, -1, etc
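The cleaning options listed above can be sketched in plain Python, with `None` standing in for a missing entry. The helper names and the example column are hypothetical; with Pandas, comparable operations are usually done on a data frame instead.

```python
def drop_missing(column):
    """Remove the missing entries altogether."""
    return [v for v in column if v is not None]


def impute_mean(column):
    """Replace missing entries with the average of the observed values."""
    observed = drop_missing(column)
    if not observed:
        raise ValueError("cannot compute a mean: every entry is missing")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def impute_constant(column, fill=0):
    """Replace missing entries with a fixed value such as 0 or -1."""
    return [fill if v is None else v for v in column]


# Hypothetical column with two missing entries.
ratings = [1.0, None, 3.0, None, 5.0]

print(drop_missing(ratings))
print(impute_mean(ratings))
print(impute_constant(ratings, fill=-1))
```

Which option is appropriate depends on the data: removing entries loses information, while imputed values can distort statistics if many entries are missing.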
140
Visualization
• Notebooks
• Cholera Case Study
• The Russian Campaign of 1812
141
Visualization
142
Visualization
143
Visualization
• The Russian Campaign of 1812
• Charles Joseph Minard
• Successive Losses in Men for the French Army in the Russian Campaign, 1812
through 1813
144