Beginners Guide To Data Science - A Twics Guide 1
by
Turkish Women in Computing
Latife Genc, Groupon
Gokcen Cilingir, Intel
Rabia Nuray-Turan, Moodwire Inc
Umit Yalcinalp, myappellation.com
Gulustan Dogan, Yildiz Technical University
Data Science is: Popular
Lots of Data => Lots of Analysis => Lots of Jobs
Data is: Big!
Source: IBM http://www-01.ibm.com/software/data/bigdata/
If I have data, I will know :)
Everyone wants better predictability, forecasting, customer satisfaction, market
differentiation, prevention, great user experience, ...
Data Science is: making sense of Data
Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
○ Data mining
○ Artificial Intelligence & Machine Learning
○ Systems Development and Integration
○ Database development
○ Analytics
● Domain Experts
○ Medical experts
○ Geneticists
○ Finance, Business, Economy experts
○ etc.
[Figure: data science workflow diagram. Labels: Start; Plan ("What is the question?", "What type of data is needed?"); Data Acquisition; Data Quality Analysis; Clean Data (Reformatting, Imputing, Scripts)]
Data Acquisition Stage
● Data may not exist
● Sources of data may be public or private
● Not all sources of data may be suitable for processing
● Data are often incomplete and dirty
● Data consolidation and cleanup are essential
○ Pieces of data may be in different sources
○ Formats may not match/may be incompatible
○ Unstructured data may need to be accounted for
Data Acquisition Stage -- Example
Example: Analyzing the online customer experience may require collecting lots of data, such as
● clicks
● conversions
● add-to-cart rate
● dwell time
● average order value
● foot traffic
● bounce rate
● exits and time to purchase
Data Acquisition: Type and Source of Data
● Time spent on a page, browsing and/or search history
○ Website Logs
● User and Inventory Data
○ Transaction databases
● Social Engagement
○ Social Networks (Yelp, Twitter, ...)
● Customer Support
○ Call Logs, Emails
● Gas prices, competitors, news, stock prices, etc.
○ RSS Feeds, News Sites, Wikipedia, ...
● Training Data?
○ CrowdFlower, Mechanical Turk
Data Acquisition: Storage and Access
● Where the data resides
○ Cloud or Computing Clusters
● Storage System
○ SQL, NoSQL, File System
○ SQL: MySQL, Oracle, MS SQL Server, ...
○ NoSQL: MongoDB, Cassandra, Couchbase, HBase, Hive, ...
○ Text Indexing: Solr, ElasticSearch, ...
● Data Processing Frameworks:
○ Hadoop, Spark, Storm, etc.
Data Acquisition: Data Integration
Data integration involves combining data residing in different sources and providing users with a
unified view of these data. (Wikipedia)
[Figure: diagram of data sources, including Data Source 1 and Data Source 4]
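A minimal sketch of the idea, assuming two hypothetical sources (a transaction log and a support-call log) that share a `customer_id` key. All field names and values are made up for illustration; real integration usually also has to reconcile formats and identifiers.

```python
# Hypothetical records from two separate sources, both keyed by customer_id.
transactions = [
    {"customer_id": 1, "order_total": 40.0},
    {"customer_id": 2, "order_total": 15.5},
    {"customer_id": 1, "order_total": 9.5},
]
support_calls = [
    {"customer_id": 1, "calls": 2},
    {"customer_id": 3, "calls": 1},
]


def integrate(transactions, support_calls):
    """Combine both sources into one record per customer (a full outer join)."""
    unified = {}
    for row in transactions:
        rec = unified.setdefault(row["customer_id"], {"total_spent": 0.0, "calls": 0})
        rec["total_spent"] += row["order_total"]
    for row in support_calls:
        rec = unified.setdefault(row["customer_id"], {"total_spent": 0.0, "calls": 0})
        rec["calls"] += row["calls"]
    return unified


unified_view = integrate(transactions, support_calls)
for customer_id in sorted(unified_view):
    print(customer_id, unified_view[customer_id])
```

Customers that appear in only one source are kept, with the missing side left at its default, so gaps between sources stay visible.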
Data Cleaning
● Data are often incomplete or incorrect.
○ Typos: e.g., text data in numeric fields
○ Missing values: some fields may not be collected for some of the examples
○ Impossible data combinations: e.g., gender = MALE, pregnant = TRUE
○ Out-of-range values: e.g., age = 1000
● Garbage in, garbage out
● Scripting, Visualization
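The checks listed above can be scripted. The sketch below assumes hypothetical field names (`age`, `gender`, `pregnant`) and an assumed plausible age range of 0 to 120; real rules depend on the dataset.

```python
def find_problems(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    age = record.get("age")
    if age is None or age == "":
        problems.append("missing age")
    else:
        try:
            age = float(age)
        except (TypeError, ValueError):
            # Typo case: text data in a numeric field.
            problems.append("age is not numeric")
        else:
            if not 0 <= age <= 120:  # assumed plausible range
                problems.append("age out of range")
    if record.get("gender") == "MALE" and record.get("pregnant") is True:
        problems.append("impossible combination: male and pregnant")
    return problems


records = [
    {"age": "34", "gender": "FEMALE", "pregnant": True},
    {"age": "abc", "gender": "MALE", "pregnant": False},
    {"age": None, "gender": "FEMALE", "pregnant": False},
    {"age": 1000, "gender": "MALE", "pregnant": True},
]
for record in records:
    print(find_problems(record))
```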
Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one
Analysis - Exploratory Analysis
Summary statistics for “Temperature”:
Min.    1st Qu.  Median  Mean   3rd Qu.  Max.    Std Dev.
-7.29   45.90    60.71   59.36  73.88    102.00  18.68
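A summary like this can be computed with Python's standard `statistics` module. The sketch below uses a small made-up list, not the data behind the table, and the `"inclusive"` quartile method is one convention among several, so other tools may report slightly different quartiles.

```python
import statistics


def summarize(values):
    """Compute the same summary statistics as in the table above."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
    }


temperatures = [30.0, 45.0, 60.0, 75.0, 90.0]  # made-up values
summary = summarize(temperatures)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```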
Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables
- Categorical to categorical variables -> crosstab table
  - Visualize: stacked bar charts
- Continuous to categorical variables:
  - Visualize: boxplots, histograms for each level (category)
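A crosstab counts how often each pair of category levels occurs together. Here is a minimal sketch with made-up visit records; the `device` and `purchased` fields are hypothetical.

```python
from collections import Counter


def crosstab(rows, row_key, col_key):
    """Count how often each pair of category levels occurs together."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_levels = sorted({r[row_key] for r in rows})
    col_levels = sorted({r[col_key] for r in rows})
    # A Counter returns 0 for pairs that never occur.
    return {rl: {cl: counts[(rl, cl)] for cl in col_levels} for rl in row_levels}


visits = [
    {"device": "mobile", "purchased": "no"},
    {"device": "mobile", "purchased": "yes"},
    {"device": "desktop", "purchased": "yes"},
    {"device": "mobile", "purchased": "no"},
    {"device": "desktop", "purchased": "yes"},
]
table = crosstab(visits, "device", "purchased")
for level, row in table.items():
    print(level, row)
```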
Analysis - Correlation vs Causation
Correlation ⇏ causation!
To prove causation:
Analysis - Feature Engineering
Create new features from existing raw features: discretize, bin
Transform variables
Create new categorical variables when a variable has too many levels, levels that rarely occur, or
one level that almost always occurs
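Two of these steps can be sketched as follows: binning a continuous variable, and collapsing rarely occurring levels into a single level. The cut points, the `min_count` threshold, and the sample values are assumptions for illustration.

```python
from collections import Counter


def bin_age(age):
    """Discretize a continuous age into a coarse bin (assumed cut points)."""
    if age < 18:
        return "under_18"
    if age < 65:
        return "18_to_64"
    return "65_plus"


def collapse_rare_levels(values, min_count=2, other="OTHER"):
    """Replace levels seen fewer than min_count times with one 'other' level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


ages = [12, 30, 70, 45]
cities = ["Ankara", "Izmir", "Ankara", "Bursa", "Ankara", "Izmir"]

binned = [bin_age(a) for a in ages]
collapsed = collapse_rare_levels(cities)
print(binned)
print(collapsed)
```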
Analysis - Missing Values
Missing values are unknown values of a feature.
They matter because they may lead to biased models or incorrect estimates and conclusions.
Some ML algorithms accept missing values: for example, some tree-based models treat
missing values as a separate branch, while many other algorithms require a complete
dataset. Therefore, we can
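One common way to obtain a complete dataset is imputation. The sketch below fills gaps with the median and records which values were filled. This is one option among several and is not necessarily the approach the original slide went on to list; the values are made up.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]
    # Keep an indicator so later analysis can tell which values were filled in.
    was_missing = [v is None for v in values]
    return imputed, was_missing


incomes = [40.0, None, 55.0, 70.0, None]
imputed, was_missing = impute_median(incomes)
print(imputed)
print(was_missing)
```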
Analysis - Outliers
Outliers are values distant from other observations, such as values more than about three
standard deviations away from the mean, values in the top or bottom 5
percentiles, or values more than 1.5 × IQR below the first quartile or above the third quartile.
Visualization methods like boxplots, histograms and scatterplots help identify them.
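Two of these rules can be sketched as below, using made-up ages with one obviously bad value. The thresholds (3 standard deviations, 1.5 × IQR) follow the rules of thumb above, and the `"inclusive"` quartile method is an assumed convention.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * std]


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


ages = list(range(22, 41)) + [1000]  # 19 plausible ages plus one bad entry
print(zscore_outliers(ages))
print(iqr_outliers(ages))
```

In samples of about ten points or fewer, no value can lie more than three sample standard deviations from the mean, so the standard-deviation rule needs a reasonably large sample; the IQR rule does not have this limitation.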
Analysis - Outliers
Some algorithms, such as regression, are sensitive to outliers, which can cause high error
variance and bias in the estimated values.
Examples:
● A wide variety of off-the-shelf algorithms is available today. Just pick a library
and go! (Is it really that easy?)
○ Short answer: no. Long answer: model selection and tuning require deeper understanding.
Machine learning - basics
Machine learning systems are made up of
3 major parts, which are:
Model selection and generalization
● Learning is an ill-posed problem; data is
not sufficient to find a unique solution
● There is a trade-off between three
factors:
○ Model complexity
○ Training set size
○ Generalization error (expected error
on new data)
● Overfitting and underfitting problems
Ref: http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
Generalization error and cross-validation
● Measuring the generalization error is a major
challenge in data mining and machine
learning
● To estimate generalization error, we need
data unseen during training. We could split
the data as
○ Training set (50%)
○ Validation set (25%) (optional, for selecting ML
algorithm parameters)
○ Test (publication) set (25%)
● How to avoid selection bias: k-fold cross-validation
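Both splitting schemes can be sketched in plain Python. The function names are hypothetical, the 50/25/25 proportions follow the bullets above, and a fixed seed is assumed so the shuffle is reproducible. For k-fold, data would normally be shuffled first as well.

```python
import random


def split_data(items, seed=0):
    """Shuffle, then split 50% / 25% / 25% into train / validation / test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) // 2
    n_val = len(items) // 4
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    indices = list(range(n))
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


train_set, val_set, test_set = split_data(range(20))
folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print("held out:", test_idx)
```

With k-fold, a model is trained k times, each time evaluated on the held-out fold, and the k error estimates are averaged, so the estimate does not depend on one particular split.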
Domain knowledge may make or break a system: if you do not realize that a type of
data is essential, the results will not be very useful.
Resources