
intro

The document provides an introduction to data science, defining it as a multidisciplinary field that utilizes scientific methods to extract insights from data across various industries. It outlines the data science process, which includes steps such as acquiring, preparing, analyzing, reporting, and acting on data, while emphasizing the importance of asking the right questions and ethical considerations. Additionally, it discusses the roles of data scientists and the significance of data quality and preparation in achieving meaningful analysis.

Uploaded by

vagifsamadov2003
Copyright
© © All Rights Reserved

Introduction to Data Science

1
Introduction to Data Science

• What is data science?


• The data science process
• Asking the right question
• Steps in the Data science process
• The role of data scientists
• Ethical considerations in data science

2
What is data science?

• Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data.
• Data scientists use their skills and knowledge to solve real-world problems in a variety of industries, including healthcare, finance, technology, and retail.

3
What is data science?

• Healthcare → predicting disease outbreaks, recommending personalized treatment plans, improving patient outcomes, ...
• Finance → fraud detection, algorithmic trading, risk assessment, predicting market trends or risks to aid investment decisions, ...
• E-commerce → recommendation engines that analyze user behavior to suggest products

4
What is data science?

• Google → The Google search engine is a testament to the power of data science. Its PageRank algorithm revolutionized web search by ranking pages based on their importance and relevance.
• The Human Genome Project → Completed in 2003, it marked a pivotal moment in data-driven medicine. The project mapped the human genome, generated massive amounts of data, and spurred advances in personalized medicine.

5
Asking the Right Question

6
“A problem well defined
is a problem half
solved.”

Charles F. Kettering

Define the Problem

7
Evaluate a new
product

Sales figures

Call center logs

8
Detect equipment
failure

Sensor data

Sensor data

Sensor data

9
Better targeted marketing

Customer data

Marketing data

10
Assess the Situation

Risks
Benefits
Contingencies
Regulations
Resources
Requirements

11
Define Goals

Objectives

Criteria

12
Formulate the Question
Define the Problem

Assess the Situation

Define Goals

13
Steps in the Data Science Process

14
ACQUIRE PREPARE ANALYZE REPORT ACT

15

Step 1: Acquire Data

Identify data sets


Retrieve data
Query data

16

Step 2: Prepare Data

Step 2-A: Explore


Step 2-B: Pre-process

17

Step 2-A: Explore Data


Understand nature of data

Preliminary analysis

18

Step 2-B: Pre-process Data

Clean Integrate Package

19

Step 3: Analyze Data


Select analytical techniques
Build models

20

Step 4: Communicate Results

21

Step 5: Apply Results

22

Iterative process

23
Step 1: Acquiring Data

24

Step 1: Acquire Data

• Identify datasets
• Retrieve datasets
• Query data

25
Where’s the data?

• Identify suitable data


• Acquire all available data

26
Data comes from many places…

…with many ways to access it

27
Traditional databases

SQL and query browsers

28
Text files

Scripting languages

29
Remote data
SOAP
REST
WebSocket

Web Services

30
NoSQL storage

API Web Services

31
Acquiring data related to wildfires…
Historical weather → SQL
Current weather → WebSocket
Real-time tweets near fires → REST

32
Traditional databases → SQL and query browsers
Text files → Scripting languages
Remote data → Web Services
NoSQL storage → Web Services, Programming Interfaces
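
Two of the access paths listed above (a traditional database queried with SQL, and a text file read by a scripting language) can be sketched in Python roughly as follows. This is only an illustration: the `weather` table, its columns, the CSV content, and every value are made up, and an in-memory SQLite database and an in-memory string stand in for real sources.

```python
import csv
import io
import sqlite3

# Hypothetical historical-weather table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (day TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [("2020-08-01", 31.5), ("2020-08-02", 35.0), ("2020-08-03", 28.0)],
)

# Query data: retrieve only the hot days with SQL.
hot_days = conn.execute(
    "SELECT day, temp_c FROM weather WHERE temp_c > 30 ORDER BY day"
).fetchall()
conn.close()

# Text file: parse CSV content with the standard csv module.
# io.StringIO stands in for an open file on disk.
csv_text = "station,wind_kmh\nA,12\nB,40\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
wind_by_station = {r["station"]: int(r["wind_kmh"]) for r in rows}

print(hot_days)
print(wind_by_station)
```

Remote data (REST, WebSocket) follows the same pattern with an HTTP or socket client in place of the database connection.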

33
Step 2-A: Exploring Data

34

Step 2-A: Explore Data


Understand nature of data

Preliminary analysis

35
Why explore?

Goal: Understand your data

36
Why explore?

Correlations

37
Why explore?

Correlations
General trends
Outliers

38
Describe Your Data

39
Visualize Your Data

• Histograms
• Line graphs
• Scatter plots
• Boxplots
• Heat maps

40
Data Exploration → Data Understanding → Informed Analysis

41
Step 2-B: Pre-processing Data

42

Step 2-B: Pre-process Data

Clean Transform

43
Real-world data is messy!
• Inconsistent values
• Duplicate records
• Missing values
• Invalid data
• Outliers

44
Addressing Data Quality Issues

•Remove data with missing values


•Merge duplicate records
•Generate best estimate for invalid values
•Remove outliers

Domain
Knowledge
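
The first two remedies above (removing data with missing values and merging duplicate records) could look roughly like this in Python. The records and field layout are invented for the example, and "merging" is simplified here to dropping exact duplicates; which remedy is appropriate in practice depends on domain knowledge.

```python
# Hypothetical records: (id, name, age); None marks a missing value.
records = [
    (1, "Ana", 34),
    (2, "Ben", None),   # missing age
    (1, "Ana", 34),     # exact duplicate of the first record
    (3, "Cy", 29),
]

# Remove data with missing values.
complete = [r for r in records if None not in r]

# Merge duplicate records: keep the first occurrence, preserve order.
seen = set()
deduplicated = []
for r in complete:
    if r not in seen:
        seen.add(r)
        deduplicated.append(r)

print(deduplicated)
```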

45
Getting Data in Shape

46
Data Munging

• Scaling
• Transformation
• Feature Selection
• Dimensionality Reduction
• Data Manipulation

47
Scaling

Scaled Values

Weight
Height

48
Transformation

Original Transformed
Data Data

49
Feature Selection

• Remove feature
• Combine features
• Add feature

50
Dimensionality Reduction

3D 2D

51
Data Manipulation

52
Always Remember!

Garbage in = Garbage out

Data preparation is
very important for
meaningful analysis!

53
Step 3: Analyze Data

54

Step 3: Analyze Data


Select analytical techniques
Build models

55
Build Model

Input Data + Analysis Technique → Model → Model Output

56
Categories of Analysis Techniques
• Classification
• Regression
• Clustering
• Association Analysis
• Graph Analytics

57
Classification
Goal: Predict category

Sunny
Windy
Rainy
Cloudy

58
Regression
Goal: Predict numeric value
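
As one possible illustration of predicting a numeric value, the sketch below fits a straight line by ordinary least squares. The slides do not name a specific regression method, so the choice of a linear fit, the function name `fit_line`, and the data points are all assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 4.0 + intercept  # predict the value for an unseen x = 4
print(slope, intercept, predicted)
```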

59
Clustering

Goal: Organize similar items into groups


Seniors

Adults

Teenagers
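
A tiny one-dimensional k-means is one way to organize similar items into groups such as the age groups above. This is a sketch under assumptions: the slides do not say which clustering algorithm is meant, and the ages, the starting centers, and the function name `kmeans_1d` are invented.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest center,
    then move each center to the mean of its assigned values."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Made-up ages; the three starting centers are arbitrary guesses.
ages = [14, 15, 17, 35, 40, 42, 70, 72, 80]
centers, groups = kmeans_1d(ages, centers=[10, 40, 90])
print(groups)
```

Note that the algorithm only produces groups; naming them "Teenagers", "Adults", and "Seniors" is an interpretation step left to the analyst.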

60
Association Analysis
Goal: Find rules to capture
associations between items
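
Association rules are commonly scored by support and confidence. The sketch below computes both for one candidate rule over a handful of made-up shopping baskets; the basket contents and the helper names `support` and `confidence` are assumptions for illustration, not something the slides specify.

```python
# Made-up shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter", "jam"},
]

def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    return support(antecedent | consequent) / support(antecedent)

# Score the candidate rule {bread} -> {butter}.
rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(rule_support, rule_confidence)
```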

61
Graph Analytics
Goal: Use graph structures to
find connections between entities
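
One simple example of finding connections between entities is a shortest-path search. The sketch below runs a breadth-first search on a small, invented "who knows whom" graph; the names and the function `shortest_path` are hypothetical.

```python
from collections import deque

# Made-up undirected graph stored as an adjacency list.
graph = {
    "Ann": ["Bob", "Cat"],
    "Bob": ["Ann", "Dan"],
    "Cat": ["Ann", "Dan"],
    "Dan": ["Bob", "Cat", "Eve"],
    "Eve": ["Dan"],
}

def shortest_path(start, goal):
    """Breadth-first search; returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path("Ann", "Eve")
print(path)
```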

62
Modeling

Select technique

Build model

Validate model

63
Evaluation of Results

64
Classification and Regression

Predicted Value vs. Correct Value
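
Comparing predicted values with correct values can be made concrete with two common measures: accuracy for classification and root-mean-square error for regression. The slides do not name specific metrics, so these two, and all example values below, are assumptions.

```python
import math

def accuracy(predicted, correct):
    """Share of predictions equal to the correct label (classification)."""
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)

def rmse(predicted, correct):
    """Root-mean-square error between predicted and correct values (regression)."""
    squared = [(p - c) ** 2 for p, c in zip(predicted, correct)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up predictions and ground truth.
acc = accuracy(
    ["sunny", "rainy", "windy", "sunny"],
    ["sunny", "rainy", "cloudy", "sunny"],
)
err = rmse([2.0, 4.0], [1.0, 5.0])
print(acc, err)
```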

65
Clustering

66
Association Analysis and Graph Analytics

Investigate Validate

67
Determine Next Steps

Repeat analysis?

Take deeper dive?

Act on results?

68
Select technique Build model Evaluate

• Classification
• Regression
• Clustering
• Association Analysis
• Graph Analytics

69
Step 4: Reporting Insights

70

Step 4: Communicate Results

71
What to Present

72
What to Present

73
How to Present

74
Visualization Tools

75
Present

with

using

76
Step 5: Turning Insights into Action

77

Step 5: Apply Results

78
Database NoSQL Files
GIS

Social Sensor

Action

79
Implementation
Process

Action

Automation Stakeholders

80
Assess Impact
Monitor

Action

Measure

Evaluation

81
Determine Next Steps

Action

Evaluation

Favorable Results?    Further Opportunities?    Revisit?

82
Action

Evaluation
Real-time Action?

Favorable Results?    Further Opportunities?    Revisit?

83
Data Engineering Computational Data Science

ACQUIRE PREPARE ANALYZE REPORT ACT

Scale Scale Scale

84
Data

Data Science
Insight

Action

85
Insight Data Product
Analysis
Data Insight
Question

86
Insight Data Product
Analysis
Data Insight
Question

87
Book Recommendations
Data: customer demographic, previous purchases, book reviews
Question: What kind of books does this customer like?
Data product: book recommendations

88
Find Potential Audience for a Book
Model of customer’s
book preferences
Who is likely to like
this book?

New book
information

89
Market a New Book
Who is likely to like this book?
Action to market the book to the right audience

90
Market a New Book

Insight: Who is likely to like this book?
Action: market the book to the right audience

91
Actionable Information

Historical data Near real-time data

Prediction

92
Prediction

Action

93
Data

94
A recent growth
Data
► Growth in the amount of available data
► Need to extract value from this resource

Infra
► Increasing computational power (GPUs)
► Parallel computing
► Low-cost data centers

95
A recent growth

Data + Infra
► Academic maturity (statistics, computer science, machine learning)
► Industrial maturity (GAFAM, Silicon Valley)

⇒ Value added
► Create knowledge from data
► Exploit this knowledge: from products to clients

96
How Much Data Is Big Data?


97
The Data - A variety of data

► Text
► Audio
► Videos
► Graphs . . .

Data Deluge
“We are drowning in information and starving for knowledge”
– John Naisbitt
Source: Megatrends, 1982

99
Complex data

► Bioinformatics
  ► biochips, DNA sequencing, ...
► Medical science
  ► MRI, medical cases, drug tests, ...
► Astrophysics
► Geophysics
► Education
► ...

100
Different kinds of Data

Sensors → Quantitative and qualitative features
Text → Strings (Twitter, blogs, Facebook, ...)
Speech → Time series
Images → 2D data
Videos → 2D + time
Networks → Graphs
Streams → Logs
Labels → Qualitative and ordinal data

101
Terms to Describe Data

102
Other Names for ‘Sample’
• instance
• row
• observation
• record
• example

103
Other Names for ‘Variable’
• dimension
• feature
• column
• attribute
• field

104
Data Types
• Most common

Numeric Categorical

• Others
String Date …

105
Numeric Variables
• Values are numbers
• Also called ‘quantitative’

7 × 10^5
1
-0.4902
163.92

106
Categorical Variables
• Values are labels, names, or categories
• Also called ‘qualitative’ or ‘nominal’

Categorical variable: Color
Values are labels: Red, Silver, Blue, White, Black

107
Variable
• Feature
• Field
• Column
• …

Sample
• Instance
• Record
• Row
• Observation
• …

Categorical
• Qualitative
• Nominal

Numeric
• Quantitative

108
Dataset
Data
► Data is a set of samples encoded by d features
► We generally consider N samples

Features
► Feature: an elementary descriptor of an entity
► Similar to attribute, characteristic, label, variable, . . .

Sample
► A sample is an entity encoding an object
► It’s composed of features
► Similar to point, instance, vector, . . .
► Usually encoded as a vector in $\mathbb{R}^d$

109
Data : an Example
     citric acid  residual sugar  chlorides  sulfur dioxide
1    0            1.9             0.076      11
2    0            2.6             0.098      25
3    0.04         2.3             0.092      15
4    0.56         1.9             0.075      17
5    0            1.9             0.076      11
6    0            1.8             0.075      13
7    0.06         1.6             0.069      15
8    0.02         2               0.073      9
9    0.36         2.8             0.071      17
10   0.08         1.8             0.097      15

[Figure: plot of the samples; one axis is labeled “Variable 4 : Sulfur”]
110
Our Notation

Dataset
Consider a dataset composed of $N$ samples, each sample being encoded by $d$ features.
► Data: $X \in \mathbb{R}^{N \times d}$
► Sample: $x \in \mathbb{R}^d$
► $x_j$: feature $j$ of $x$
► ⇒ $X(i,:) = x_i^T$, $X(i,j) = x_i(j)$

Labels
Property associated with each $x_i$
► Domain: $\mathcal{Y} = \{0, 1\}$ for binary classification
► $y_i \in \mathcal{Y}$
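
To make the notation concrete, the sketch below stores the first three rows of the example table (citric acid, residual sugar, chlorides, sulfur dioxide) as a plain list of lists. The label values in `y` are invented purely to show the shape, since the example table carries no labels.

```python
# First three samples of the example table:
# citric acid, residual sugar, chlorides, sulfur dioxide
X = [
    [0.0, 1.9, 0.076, 11.0],
    [0.0, 2.6, 0.098, 25.0],
    [0.04, 2.3, 0.092, 15.0],
]

N = len(X)      # number of samples
d = len(X[0])   # number of features

x_2 = X[1]          # sample 2, i.e. row X(2, :); Python indices start at 0
feature = X[1][3]   # X(2, 4): sulfur dioxide of sample 2

# One label in {0, 1} per sample (hypothetical values).
y = [0, 1, 0]

print(N, d, feature)
```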

111
Pre-Processing

An important step
► Cleaning and filtering data
► Managing incomplete data
► Detect and process outliers
► ...

Encoding
► From raw data to vectors
► Require data embedding
► Extract relevant features for each
sample
112
Statistical Analysis

Univariate Descriptive Statistics

► Mean, standard deviation, median, ...
► Boxplots
► Histograms
► ...

What for?
► Detect irrelevant data
► Identify bias between different subsets
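
A short sketch of these univariate statistics, applied to the sulfur dioxide column of the example table using Python's standard `statistics` module. The "more than two standard deviations from the mean" rule for flagging suspicious values is an assumption chosen for the example, not a rule stated in the slides.

```python
import statistics

# Sulfur dioxide column of the example table.
sulfur_dioxide = [11, 25, 15, 17, 11, 13, 15, 9, 17, 15]

mean = statistics.mean(sulfur_dioxide)
median = statistics.median(sulfur_dioxide)
std = statistics.stdev(sulfur_dioxide)  # sample standard deviation

# Assumed rule of thumb: flag values more than 2 standard deviations from the mean.
suspicious = [v for v in sulfur_dioxide if abs(v - mean) > 2 * std]

print(mean, median, std, suspicious)
```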

113
Statistical Analysis

Multivariate Statistics
► Bivariate analysis
► Correlation coefficient
► ...

What for?
► Identify links between features
► Detect useless features
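
The correlation coefficient mentioned above can be computed directly from its definition. The sketch below implements the Pearson coefficient and applies it to the residual sugar and chlorides columns of the example table; the function name `pearson` is just a label chosen here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Two columns of the example table.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 2.0, 2.8, 1.8]
chlorides = [0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069, 0.073, 0.071, 0.097]

r = pearson(residual_sugar, chlorides)
print(r)
```

A value of r close to +1 or -1 suggests that one of the two features adds little information beyond the other; a value near 0 suggests no linear link.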

114
Data scaling
Standardisation
► Different methods:
► Classic: subtract the mean and divide by the standard deviation

► Usual: reduce the range to an interval

$$x(j) \leftarrow \frac{x(j) - \min(X(:, j))}{\max(X(:, j)) - \min(X(:, j))}$$

► Non-linear (log, sigmoid, ...)

What for?
► Same order of magnitude for each feature
► “Symmetrization” of the feature distribution, ...

115
Data Standardisation: Illustration
Before Scaling

After Scaling
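
The "classic" and "usual" scaling methods can be sketched in a few lines of Python. The function names and the example heights are made up, and using the population standard deviation for the classic method is an assumption, since the slides do not say which estimator is meant.

```python
import statistics

def standardize(column):
    """Classic: subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation (assumed)
    return [(v - mean) / std for v in column]

def min_max(column):
    """Usual: reduce the range to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Made-up feature column.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z_scores = standardize(heights_cm)
scaled = min_max(heights_cm)
print(z_scores)
print(scaled)
```

Both functions assume the column is not constant; a constant column would cause a division by zero and needs separate handling.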
Modern Data Science Skills
• Programming in Python
• Statistics
• Machine Learning
• Scalable Big Data Analysis

117
Why Python?
• Easy to read and learn
• Vibrant community
• Growing and evolving set of libraries
• Data management
• Analytical processing
• Visualization
• Applicable to each step in the data science process
• Notebooks

118
What to look forward to!
• Jupyter notebooks
• NumPy
• Pandas
• Matplotlib
• Scikit-Learn
• BeautifulSoup

119
Case Study 1: Introduction

Notebook on moodle

120
Case Study 2: Soccer Data Analysis

Dataset location: https://www.kaggle.com/hugomathien/soccer

• Form meaningful player groups
• Discover other players that are similar to your favorite athlete
• Form strong teams by using analytics

121
But first, let’s have a look at Structured Query Language

122
Working With Databases:
Relational Data Model

123
A Collection of Tables
ID   FName   LName     Department  Title                Salary
202  John    Gonzales  IT          DB Specialist        104750
203  Mary    Roberts   Research    Director             175400
204  Janaki  Rao       HR          Financial Analyst    63850
205  Alex    Knight    IT          Security Specialist  123500
206  Pamela  Ziegler   IT          Programmer           85600
207  Harry   Dawson    HR          Director             115450

124
No Duplicates

ID   FName   LName     Department  Title                Salary
202  John    Gonzales  IT          DB Specialist        104750
203  Mary    Roberts   Research    Director             175400
204  Janaki  Rao       HR          Financial Analyst    63850
205  Alex    Knight    IT          Security Specialist  123500
206  Pamela  Ziegler   IT          Programmer           85600
207  Harry   Dawson    HR          Director             115450
207  Harry   Dawson    HR          Director             115450

125
Dissimilar Tuples Disallowed
ID   FName   LName     Department  Title                Salary
202  John    Gonzales  IT          DB Specialist        104750
203  Mary    Roberts   Research    Director             175400
204  Janaki  Rao       HR          Financial Analyst    63850
205  Alex    Knight    IT          Security Specialist  123500
206  Pamela  Ziegler   IT          Programmer           85600
207  Harry   Dawson    HR          Director             115450
Jane  Doe  208  Res. Associate  65800  Research

126
Foreign Keys
EmpSalaries.EmpID references Employees.ID

EmpSalaries
EmpID  Date       Salary
202    1/1/2016   104750
203    2/15/1016  175400
204    6/1/2015   63850
205    9/15/2015  123500
206    10/1/2015  85600
207    4/15/2015  115450
202    9/15/2014  101250
204    3/1/2015   48000
207    9/15/2013  106900
205    10/1/2014  113400

127
Joining Relations

Employees
ID   FName   LName
202  John    Gonzales
203  Mary    Roberts
204  Janaki  Rao
205  Alex    Knight
206  Pamela  Ziegler
207  Harry   Dawson

EmpSalaries
EmpID  Date       Salary
202    1/1/2016   104750
203    2/15/1016  175400
204    6/1/2015   63850
205    9/15/2015  123500
206    10/1/2015  85600
207    4/15/2015  115450
202    9/15/2014  101250
204    3/1/2015   48000
207    9/15/2013  106900
205    10/1/2014  113400

Joined relation
ID   FName   LName     Date       Salary
202  John    Gonzales  1/1/2016   104750
202  John    Gonzales  9/15/2014  101250
203  Mary    Roberts   2/15/1016  175400
204  Janaki  Rao       6/1/2015   63850
204  Janaki  Rao       3/1/2015   48000
205  Alex    Knight    9/15/2015  123500
205  Alex    Knight    10/1/2014  113400
206  Pamela  Ziegler   10/1/2015  85600
207  Harry   Dawson    4/15/2015  115450
207  Harry   Dawson    9/15/2013  106900
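
The join of the two relations can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch, not the slides' own code: it loads only the rows for employees 202 and 204, uses an in-memory database, and the exact `CREATE TABLE` statements are an assumption about how the schema might be declared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema. SQLite enforces the foreign key only when
# PRAGMA foreign_keys = ON is set; it is declared here for documentation.
conn.executescript("""
CREATE TABLE Employees (ID INTEGER PRIMARY KEY, FName TEXT, LName TEXT);
CREATE TABLE EmpSalaries (
    EmpID INTEGER REFERENCES Employees(ID),
    Date TEXT,
    Salary INTEGER
);
""")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [(202, "John", "Gonzales"), (204, "Janaki", "Rao")],
)
conn.executemany(
    "INSERT INTO EmpSalaries VALUES (?, ?, ?)",
    [
        (202, "1/1/2016", 104750),
        (202, "9/15/2014", 101250),
        (204, "6/1/2015", 63850),
        (204, "3/1/2015", 48000),
    ],
)

# Join the relations on the foreign key EmpSalaries.EmpID = Employees.ID.
joined = conn.execute("""
    SELECT e.ID, e.FName, e.LName, s.Date, s.Salary
    FROM Employees AS e
    JOIN EmpSalaries AS s ON s.EmpID = e.ID
    ORDER BY e.ID, s.Salary DESC
""").fetchall()
conn.close()

for row in joined:
    print(row)
```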

128
Summary

• As a…

[Figure: the Employees, EmpSalaries, and joined tables, repeated from the Joining Relations slide]

129
Working With Databases:
Structured Query Language

130
What is Data Retrieval?

• Data retrieval
  • The way in which the desired data is specified and retrieved from a data store
• Our focus
  • How to specify a data request
  • The internal mechanism of data retrieval

131
Structured Query Language

• The standard for structured data
• Example Database Schema

Bars(name, addr, license)
Beers(name, manf)
Sells(bar, beer, price)

Bars
name                addr                             license
Great American Bar  363 Main St., SD, CA 92390       41-437844098
Beer Paradise       6450 Mango Drive, SD, CA 92130   41-973428319
Have a Good Time    8236 Adams Avenue, SD, CA 92116  32-032263401

132
SELECT-FROM-WHERE

• Which beers are made by Heineken?

SELECT name                 -- Output attribute(s)
FROM Beers                  -- Table(s) to use
WHERE manf = 'Heineken'     -- The condition(s) to satisfy

Relational algebra: Project(name) applied to Select manf='Heineken' (Beers)

Strings like 'Heineken' are case-sensitive and are put in quotes.

Result
name
Heineken Lager Beer
Amstel Lager
Amstel Light
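
The query can be tried out with Python's built-in `sqlite3` module. In this sketch the three Heineken rows come from the result shown on the slide, while the fourth row ("Example Ale") is a hypothetical entry added so that the `WHERE` clause has something to filter out. SQLite is only one possible engine; case sensitivity of string comparison can differ in other systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Beers (name TEXT, manf TEXT)")
conn.executemany(
    "INSERT INTO Beers VALUES (?, ?)",
    [
        ("Heineken Lager Beer", "Heineken"),
        ("Amstel Lager", "Heineken"),
        ("Amstel Light", "Heineken"),
        ("Example Ale", "Example Brewing"),  # hypothetical row, not from the slides
    ],
)

# SELECT-FROM-WHERE: which beers are made by Heineken?
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM Beers WHERE manf = 'Heineken' ORDER BY name"
    )
]

# The comparison is case-sensitive: 'heineken' matches nothing.
wrong_case = conn.execute(
    "SELECT name FROM Beers WHERE manf = 'heineken'"
).fetchall()
conn.close()

print(names)
print(wrong_case)
```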

133
More Example Queries

• Find expensive beer
  SELECT DISTINCT beer, price
  FROM Sells
  WHERE price > 15
• Which businesses have a Temporary License (starts with 32) in San Diego?
  SELECT name
  FROM Bars
  WHERE addr LIKE '%SD%' AND license LIKE '32%'
  LIMIT 5

Bars
name                addr                             license
Great American Bar  363 Main St., SD, CA 92390       41-437844098
Beer Paradise       6450 Mango Drive, SD, CA 92130   41-973428319
Have a Good Time    8236 Adams Avenue, SD, CA 92116  32-032263401
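
The temporary-license query can be run against the three Bars rows shown on the slide. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; declaring every column as `TEXT` is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bars (name TEXT, addr TEXT, license TEXT)")
conn.executemany(
    "INSERT INTO Bars VALUES (?, ?, ?)",
    [
        ("Great American Bar", "363 Main St., SD, CA 92390", "41-437844098"),
        ("Beer Paradise", "6450 Mango Drive, SD, CA 92130", "41-973428319"),
        ("Have a Good Time", "8236 Adams Avenue, SD, CA 92116", "32-032263401"),
    ],
)

# '%' matches any run of characters: the address must contain 'SD'
# and the license must start with '32'.
temporary = conn.execute(
    "SELECT name FROM Bars "
    "WHERE addr LIKE '%SD%' AND license LIKE '32%' "
    "LIMIT 5"
).fetchall()
conn.close()

print(temporary)
```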

134
Summary

• SQL is the standard query language for structured relational data
• Its operations resemble those on pandas data frames
• It allows for selection of data and more

135
Case Study 2: Soccer Data Analysis

Dataset location: https://www.kaggle.com/hugomathien/soccer

• Form meaningful player groups
• Discover other players that are similar to your favorite athlete
• Form strong teams by using analytics

136
Understanding the Benefits

Data → Insight → Action (Data Science)

Ask yourself:
“What insights do I expect to get?”

INSIGHTS
• Better understanding and insights on
  • player strengths
  • enhancing performance
  • critical attributes for a player’s performance

ACTIONS
• Coach can design programs that improve these areas in teams

137
Basic Steps in a Data Science Project
ACQUIRE • Import raw dataset into your analytics platform
PREPARE • Explore & Visualize
• Perform Data Cleaning
ANALYZE • Feature Selection
• Model Selection
• Analyze the results
REPORT • Present your findings
ACT • Use them

138
Data Preparation: Explore using Statistics

139
Data Cleaning
• Why do we need to clean data?
  • Missing entries
  • Garbage values
  • NULLs
• How do we clean data?
  • Remove the entries
  • Impute these entries with a substitute value
    • Ex. Average value of the column
    • Ex. Assign 0, -1, etc.
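
The two imputation options above can be sketched as follows. The column values and the function names `impute_mean` and `impute_constant` are hypothetical; `None` stands in for a missing entry or NULL.

```python
def impute_mean(column):
    """Replace missing entries (None) with the average of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

def impute_constant(column, fill=0):
    """Replace missing entries (None) with a fixed value such as 0 or -1."""
    return [fill if v is None else v for v in column]

# Made-up column with two missing entries.
ratings = [70, None, 80, 90, None]

by_mean = impute_mean(ratings)
by_constant = impute_constant(ratings, fill=-1)
print(by_mean)
print(by_constant)
```

Mean imputation keeps the column average unchanged, whereas a constant such as -1 keeps the imputed entries easy to recognize later.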

140
Visualization

• Notebooks
• Cholera Case Study
• The Russian Campaign of 1812

141
Visualization

• Cholera Case Study

142
Visualization

• Cholera Case Study

143
Visualization
• The Russian Campaign of 1812
• Charles Joseph Minard
• Successive Losses in Men for the French Army in the Russian Campaign, 1812 through 1813

144
