Beginners Guide To Data Science - A Twics Guide 1

Are you a beginner in Computer Science and Data mining ? Learn about Data Science through this guide.

Uploaded by

Jeffin Varghese

Beginner’s Guide to Data Science

by
Turkish Women in Computing
Latife Genc, Groupon
Gokcen Cilingir, Intel
Rabia Nuray-Turan, Moodwire Inc
Umit Yalcinalp, myappellation.com
Gulustan Dogan, Yildiz Technical University
1
Data Science is: Popular
Lots of Data => Lots of Analysis => Lots of Jobs

Universities: Starting new multidisciplinary programs

Industry: Cottage industry evolving for online and training courses

Goal of this Talk:

● Hear it from people who do it, and learn what they do

● Use it for further learning and specialization

2
Data is: Big!
Lots of Data => Lots of Analysis => Lots of Jobs

● 2.5 quintillion (2.5 × 10^18) bytes of data are generated every day!

● Everything around you collects/generates data
● Social media sites
● Business transactions
● Location-based data
● Sensors
● Digital photos, videos
● Consumer behaviour (online and store transactions)
● More data is publicly available
● Database technology is advancing
● Cloud based & mobile applications are widespread

3
Source: IBM http://www-01.ibm.com/software/data/bigdata/
If I have data, I will know :)
Everyone wants better predictability, forecasting, customer satisfaction, market
differentiation, prevention, great user experience, ...

● How can I price a particular product?

● What can I recommend that online customers buy after they have bought X, Y or Z?
● How can we discover market segments and group customers into them?
● What will customers buy in the upcoming holiday season? (What to stock?)
● What is the price point for customer retention for subscriptions?

4
Data Science is: making sense of Data
Lots of Data => Lots of Analysis => Lots of Jobs

● Multidisciplinary study of data collections for analysis, prediction, learning and prevention.
● Utilized in a wide variety of industries.
● Involves both structured and unstructured data sources.

5
Data Science is: multidisciplinary
● Statisticians
● Mathematicians
● Computer Scientists in
○ Data mining
○ Artificial Intelligence & Machine Learning
○ Systems Development and Integration
○ Database development
○ Analytics
● Domain Experts
○ Medical experts
○ Geneticists
○ Finance, Business, Economy experts
○ etc.
6
[Figure: data science workflow. Start → Plan (What is the question? What type of data is needed?) → Data Acquisition → Data Quality Analysis → Clean Data (reformatting and imputing data) → Explore the Data → Feature Selection and Feature Engineering → Model Selection → Results Evaluation → Deployment → Maintenance and Optimization. Scripts support each stage. The stages are grouped into three phases: Data Analysis, Modeling, and Deployment and optimization.]
7
Data Acquisition Stage
● As soon as the data scientist has identified the problem she is trying to solve, she
must assess:
● What type of data is available
● What might be required and currently is not collected
● Is it available from other units of the company?
● Does she need to crawl/buy data from third parties?
● How much data is needed? (Data volume)
● How to access the data?
● Is the data private?
● Is it legally OK to use the data?

9
Data Acquisition Stage
● Data may not exist
● Sources of data may be public or private
● Not all sources of data may be suitable for processing
● Data are often incomplete and dirty
● Data consolidation and cleanup are essential
○ Pieces of data may be in different sources
○ Formats may not match/may be incompatible
○ Unstructured data may need to be accounted for

10
Data Acquisition Stage -- Example
Example: Measuring online customer experience may require collecting many kinds of data, such as

● clicks
● conversions
● add-to-cart rate
● dwell time
● average order value
● foot traffic
● bounce rate
● exits and time to purchase
11
Data Acquisition: Type and Source of Data
● Time spent on a page, browsing and/or
search history
○ Website Logs
● User and Inventory Data
○ Transaction databases
● Social Engagement
○ Social Networks (Yelp, Twitter,...)
● Customer Support
○ Call Logs, Emails
● Gas prices, competitors, news, stock prices, etc.
○ RSS Feeds, News Sites, Wikipedia,...
● Training Data?
○ CrowdFlower, Mechanical Turk

12
Data Acquisition : Storage and Access
● Where the data resides
○ Cloud or Computing Clusters
● Storage System
○ SQL, NoSQL, File System
○ SQL: MySQL, Oracle, MS SQL Server, ...
○ NoSQL: MongoDB, Cassandra, Couchbase, HBase, Hive, ...
○ Text Indexing: Solr, Elasticsearch, ...
● Data Processing Frameworks:
○ Hadoop, Spark, Storm, etc.

13
Data Acquisition: Data Integration
Data integration involves combining data residing in different sources and providing users with a
unified view of these data. (Wikipedia)

● Schema Mapping
● Record Matching
● Data Cleaning

[Figure: Data Source 1, Data Source 2, Data Source 3 and Data Source 4 → ETL → Data Warehouse]
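The three integration steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sources (`crm`, `shop`), their field names, the `SCHEMA_MAP` table and the choice of a normalized email address as the matching key are all hypothetical and do not come from the talk.

```python
# Hypothetical mapping from source-specific field names to unified names.
SCHEMA_MAP = {
    "crm": {"cust_name": "name", "mail": "email"},
    "shop": {"fullName": "name", "emailAddress": "email", "total": "order_total"},
}


def to_unified(source, row):
    """Schema mapping: rename source-specific fields to the unified names."""
    mapping = SCHEMA_MAP[source]
    return {mapping[key]: value for key, value in row.items() if key in mapping}


def match_key(row):
    """Record matching: rows with the same normalized email are one entity."""
    return row["email"].strip().lower()


def integrate(sources):
    """Merge rows from all sources into one unified view keyed by email."""
    unified = {}
    for source, rows in sources.items():
        for row in rows:
            clean = to_unified(source, row)
            key = match_key(clean)
            clean["email"] = key  # data cleaning: store the normalized value
            unified.setdefault(key, {}).update(clean)
    return unified


sources = {
    "crm": [{"cust_name": "Ada Lovelace", "mail": "Ada@Example.com "}],
    "shop": [{"fullName": "Ada Lovelace", "emailAddress": "ada@example.com", "total": 42.5}],
}
warehouse = integrate(sources)
print(warehouse)
```

Real record matching is usually fuzzier than an exact key comparison; the exact match here just keeps the sketch short.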

14
Data Cleaning
● Data are often incomplete or incorrect.
○ Typos: e.g., text data in numeric fields
○ Missing values: some fields may not be collected for some of the examples
○ Impossible data combinations: e.g., gender = MALE, pregnant = TRUE
○ Out-of-range values: e.g., age = 1000
○ Out-of-Range Values: e.g., age=1000
● Garbage In Garbage Out
● Scripting, Visualization
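
A minimal sketch of how a cleaning script might flag the four problem types listed above. The record layout (`age`, `gender`, `pregnant`), the 0 to 120 age range and the `find_problems` helper are assumptions made for illustration; a real project would derive its rules from the domain.

```python
def find_problems(record):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    age = record.get("age")
    if age is None or age == "":
        problems.append("missing value: age")
    else:
        try:
            age_value = float(age)
        except (TypeError, ValueError):
            # Text in a field that should be numeric.
            problems.append("typo: non-numeric age")
        else:
            if not 0 <= age_value <= 120:  # assumed plausible range
                problems.append("out-of-range value: age")
    if record.get("gender") == "MALE" and record.get("pregnant") is True:
        problems.append("impossible combination: gender=MALE, pregnant=TRUE")
    return problems


records = [
    {"age": "34", "gender": "FEMALE", "pregnant": True},
    {"age": "abc", "gender": "MALE", "pregnant": False},
    {"age": None, "gender": "FEMALE", "pregnant": False},
    {"age": "1000", "gender": "MALE", "pregnant": True},
]
report = [find_problems(r) for r in records]
for record, problems in zip(records, report):
    print(record, "->", problems or "ok")
```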

Figure ref: https://thedailyomnivore.net/2015/12/02/

15
Analysis - Data Preparation
● Univariate Analysis: Analyze/explore variables one by one
● Bivariate Analysis: Explore relationship between variables
● Coverage, missing values: treating unknown values
● Outliers: detect and treat values that are distant from other observations
● Feature Engineering: Variable transformations and creation of new better
variables from raw features

Commonly used tools:

● SQL
● R: plyr, reshape, ggplot2, data.table
● Python: NumPy, Pandas, SciPy, matplotlib

17
Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one

- Continuous variable: explore central tendency and spread of the values

- Summary statistics
- mean, median, min, max
- IQR, standard deviation, variance, quartile
- Visualize Histograms, Boxplots
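
The summary statistics listed above can be computed with Python's standard `statistics` module. The sample values below are made up for illustration, and `statistics.quantiles` uses its default (exclusive) method, so quartiles may differ slightly from those reported by other tools.

```python
import statistics

# Made-up sample of a continuous variable (e.g. temperatures).
temperatures = [45.0, 52.5, 60.0, 61.5, 73.0, 88.0, 30.5, 66.0]

# Quartiles: the three cut points that split the data into four parts.
q1, median, q3 = statistics.quantiles(temperatures, n=4)

summary = {
    "min": min(temperatures),
    "q1": q1,
    "median": median,
    "mean": statistics.mean(temperatures),
    "q3": q3,
    "max": max(temperatures),
    "iqr": q3 - q1,                                 # spread of the middle 50%
    "std_dev": statistics.stdev(temperatures),      # sample standard deviation
    "variance": statistics.variance(temperatures),  # sample variance
}
for name, value in summary.items():
    print(f"{name:>8}: {value:.2f}")
```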

18
Analysis - Exploratory Analysis
Summary statistics for “Temperature”:

Min.     1st Qu.   Median   Mean    3rd Qu.   Max.     Std Dev.
-7.29    45.90     60.71    59.36   73.88     102.00   18.68

Walmart Store Sales Forecasting Data, Kaggle

19
Analysis - Exploratory Analysis
Univariate Analysis: Analyze/explore variables one by one

- Categorical Variable: frequency tables

- Count and count %
- Visualize Bar charts
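
A frequency table with counts and percentages can be built with `collections.Counter`. The `store_types` values are invented for this sketch, and the row of `#` characters is only a stand-in for a bar chart.

```python
from collections import Counter

# Made-up categorical variable with three levels.
store_types = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]

counts = Counter(store_types)
total = len(store_types)

# Map each level to (count, percentage of all observations).
frequency_table = {
    level: (count, 100.0 * count / total)
    for level, count in counts.most_common()
}

for level, (count, percent) in frequency_table.items():
    print(f"{level}: count={count}, percent={percent:.1f}%  {'#' * count}")
```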

20
Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables

- Continuous to continuous variables: Correlation measures the strength and direction of a linear relationship
- Visualize Scatterplots -> relationship may not be linear
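
The sketch below computes the Pearson correlation coefficient by hand and shows why a scatterplot is still needed: a perfectly linear relationship gives a coefficient of 1, while a perfectly quadratic one, on inputs symmetric around zero, gives 0 even though the variables are fully dependent. The data and the `pearson` helper are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # y depends linearly on x
quadratic = [x * x for x in xs]    # y depends on x, but not linearly

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)
print(f"linear: r = {r_linear:.2f}, quadratic: r = {r_quadratic:.2f}")
```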

21
Analysis - Exploratory Analysis
Bivariate Analysis: Explore relationship between variables
- Categorical to categorical variables: crosstab table
- Visualize Stacked bar charts
- Continuous to categorical variables: compare the distribution of the continuous variable across levels
- Visualize Boxplots, Histograms for each level (category)

22
Analysis - Correlation vs Causation
Correlation ⇏ causation!

To prove causation:

● Randomized controlled experiments

● Hypothesis testing, A/B testing
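
One common way to analyze an A/B test is a two-proportion z-test, sketched below with the standard library only. The choice of this particular test, the `two_proportion_z_test` helper and the conversion numbers are assumptions for illustration; the talk does not prescribe a specific test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert equally."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Made-up experiment: users randomly assigned to variant A or B.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Random assignment of users to variants is what lets a small p-value support a causal reading; the same arithmetic on observational data would only show correlation.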

24
Analysis - Feature Engineering
Create new features from existing raw features: discretize, bin

Transform Variables

Create new categorical variables when a variable has too many levels, levels that rarely occur, or one
level that almost always occurs

Extremely skewed data - outliers

Imputation: Filling in missing data
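
Two of the ideas above, binning a continuous variable and merging rarely occurring levels, can be sketched as follows. The age boundaries, the `min_count` threshold, the `OTHER` label and the sample data are all assumptions chosen for the example.

```python
from collections import Counter


def bin_age(age):
    """Discretize a continuous age into a coarse categorical feature."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


def collapse_rare_levels(values, min_count=2, other="OTHER"):
    """Replace levels seen fewer than min_count times with a single level."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


ages = [4, 17, 18, 42, 64, 65, 90]
age_bins = [bin_age(a) for a in ages]

cities = ["Istanbul", "Ankara", "Istanbul", "Izmir", "Ankara", "Bursa"]
city_feature = collapse_rare_levels(cities)

print(age_bins)
print(city_feature)
```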

25
Analysis - Missing Values
Missing values are unknown values of a feature.

Important as they may lead to biased models or incorrect estimations and conclusions.

Some ML algorithms accept missing values: for example, some tree-based models treat
missing values as a separate branch, while many other algorithms require a complete
dataset. Therefore, we can

● omit: remove missing values and use available data

● impute: replace missing values with an estimate, such as the mean/median/mode of the
existing data, the values of the most similar data points (KNN), or the output of more
complex algorithms like Random Forest
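
A minimal sketch of the two options above, with missing values represented as `None`. Only the simple mean/median/mode strategies are shown; KNN or Random Forest imputation would need a modeling library. The helper names and the sample data are assumptions.

```python
import statistics


def omit_missing(values):
    """Omit: keep only the observed values."""
    return [v for v in values if v is not None]


def impute(values, strategy="mean"):
    """Impute: fill each missing value with a statistic of the observed ones."""
    estimators = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = estimators[strategy](omit_missing(values))
    return [fill if v is None else v for v in values]


incomes = [30, None, 50, 40, None, 200]  # made-up feature with two gaps

print(omit_missing(incomes))
print(impute(incomes, "mean"))
print(impute(incomes, "median"))
```

In this sample the mean fill value is pulled upward by the single large income, while the median is not, which is one reason the median is often preferred for skewed data.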

26
Analysis - Outliers
Outliers are values distant from other observations, for example values more than
about three standard deviations away from the mean, values in the top or bottom 5
percentiles, or values more than 1.5 × IQR below the first quartile or above the third quartile.
Visualization methods like boxplots, histograms and scatterplots help detect them.
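
Two of the detection rules above, sketched with the standard library. The sample data and helper names are assumptions. Note that the three-standard-deviation rule needs a reasonably large sample, because a single extreme value inflates the standard deviation it is measured against.

```python
import statistics


def z_score_outliers(values, threshold=3.0):
    """Values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * std]


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up ages: twenty plausible values plus one data-entry error.
ages = list(range(20, 40)) + [1000]

print(z_score_outliers(ages))
print(iqr_outliers(ages))
```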

27
Analysis - Outliers
Some algorithms, such as regression, are sensitive to outliers, which can cause high error
variance and bias in the estimated values.

Treatment options: delete, cap, transform, or impute them as you would missing values.
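
Three of the treatments named above can be sketched as follows. The cut-off values (10 and 20) are arbitrary assumptions for the made-up `sales` data; in practice they would come from a detection rule such as the IQR fences.

```python
import math


def cap(values, low, high):
    """Cap: clip every value into the range [low, high]."""
    return [min(max(v, low), high) for v in values]


def log_transform(values):
    """Transform: log(1 + v) compresses large values far more than small ones."""
    return [math.log1p(v) for v in values]


sales = [10, 12, 15, 11, 14, 500]  # made-up data with one extreme value

deleted = [v for v in sales if v <= 20]  # delete: drop the extreme value
capped = cap(sales, low=10, high=20)
logged = log_transform(sales)

print(deleted)
print(capped)
print([round(v, 2) for v in logged])
```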

28
Predictive data modeling
Prediction is the end goal of many data science adventures!

Data on consumer behaviour is collected:

● to predict future consumer behaviour and to take action accordingly

Examples:

● Recommendation systems (Netflix, Pandora, Amazon, etc.)

● Online user behaviour is used to predict the best targeted ads
● Customer purchase histories are used to determine how to price, stock,
market and display future products.
30
Machine learning
● Machine Learning is the study of algorithms that improve their performance at
some task with example data or past experience
○ The foundations of many ML algorithms lie in statistics and optimization theory
○ Role of computer science: efficient algorithms to
■ Solve the optimization problem
■ Represent and evaluate data models for inference

● A wide variety of off-the-shelf algorithms is available today. Just pick a library
and go! (Is it really that easy?)
○ Short answer: no. Long answer: model selection and tuning require a deeper understanding.

31
Machine learning - basics
Machine learning systems are made up of 3 major parts:

● Model: the system that makes predictions.
● Parameters: the signals or factors used by the model to form its decisions.
● Learner: the system that adjusts the parameters, and in turn the model, by looking at
differences in predictions versus actual outcomes.

Ref: http://marketingland.com/how-machine-learning-works-150366
32
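
The model, parameters and learner described on the slide can be illustrated with a tiny linear model trained by stochastic gradient descent. This is a sketch under stated assumptions: the linear form, the learning rate, the epoch count and the noise-free data generated from y = 3x + 2 are all chosen for the example.

```python
def model(x, params):
    """Model: makes a prediction from input x using the current parameters."""
    return params["weight"] * x + params["bias"]


def learner(data, params, learning_rate=0.01, epochs=2000):
    """Learner: adjusts parameters using prediction-versus-actual differences."""
    for _ in range(epochs):
        for x, actual in data:
            error = model(x, params) - actual
            # Gradient step on the squared error for this one example.
            params["weight"] -= learning_rate * error * x
            params["bias"] -= learning_rate * error
    return params


# Made-up training data: actual outcomes follow y = 3x + 2 exactly.
data = [(x, 3.0 * x + 2.0) for x in range(-5, 6)]

# Parameters: start from zero and let the learner adjust them.
params = learner(data, {"weight": 0.0, "bias": 0.0})
print(params)
```

Because the data contain no noise, the learned parameters approach the generating values of 3 and 2.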
Machine learning application examples
● Association Analysis
○ Basket analysis: Find the probability that somebody
who buys X also buys Y
● Supervised Learning
○ Classification: Spam filter, language prediction,
customer/visit type prediction
○ Regression: Pricing
○ Recommendation
● Unsupervised Learning
○ Given a database of customer data, automatically
discover market segments and group customers into
different market segments
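
For the basket analysis item above, the probability that somebody who buys X also buys Y can be estimated from transaction data as the share of X-baskets that also contain Y (often called the confidence of the rule X → Y). The transactions and the `confidence` helper below are made up for illustration.

```python
def confidence(transactions, x, y):
    """Estimate P(buys y | buys x) as count(x and y) / count(x)."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return None  # x was never bought, so the estimate is undefined
    return sum(1 for t in with_x if y in t) / len(with_x)


# Made-up shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

print(confidence(transactions, "bread", "butter"))
print(confidence(transactions, "butter", "bread"))
```

The two directions differ: three of the four bread baskets contain butter, while every butter basket contains bread.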

33
Model selection and generalization
● Learning is an ill-posed problem; data is
not sufficient to find a unique solution
● There is a trade-off between three
factors:
○ Model complexity
○ Training set size
○ Generalization error (expected error
on new data)
● Overfitting and underfitting problems

Ref: http://www.inf.ed.ac.uk/teaching/courses/iaml/slides/eval-2x2.pdf
34
Generalization error and cross-validation
● Measuring the generalization error is a major
challenge in data mining and machine
learning
● To estimate generalization error, we need
data unseen during training. We could split
the data as
○ Training set (50%)
○ Validation set (25%) (optional, for selecting ML
algorithm parameters)
○ Test (publication) set (25%)
● How to avoid selection bias: k-fold cross-validation
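
Both ideas in the list above, the 50% / 25% / 25% split and k-fold cross-validation, can be sketched with the standard library. The helper names, the fixed seed and the interleaved fold assignment are assumptions; the k-fold helper also skips the shuffling that is usually applied first.

```python
import random


def train_validation_test_split(items, seed=0):
    """Shuffle, then split 50% / 25% / 25% as in the list above."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) // 2
    n_val = len(shuffled) // 4
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def k_fold_indices(n_items, k):
    """Yield (train, test) index lists; each item is tested exactly once."""
    indices = list(range(n_items))
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th item, starting at `fold`
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


train_set, validation_set, test_set = train_validation_test_split(list(range(20)))
folds = list(k_fold_indices(n_items=10, k=5))

print(len(train_set), len(validation_set), len(test_set))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Averaging a model's error over the k test folds gives an estimate of generalization error that depends less on one particular split.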

Figure ref: https://www.quora.com/I-train-my-system-based-on-the-10-fold-cross-validation-framework-Now-it-gives-me-10-different-models-Which-model-to-select-as-a-representative

35
Deep Learning
● Neural networks (NNs) have been around for decades, but they just weren’t “deep” enough. NNs with
several hidden layers are called deep neural networks (DNNs).
● Unlike many ML approaches, deep learning attempts to model high-level abstractions in data.
● Deep learning is best suited to input spaces that are locally structured (spatially or temporally), as
opposed to arbitrary input features.

36
Deployment, maintenance and optimization
● Deployed solutions might include:
○ A trained data model (model + parameters)
○ Routines for inputting and prediction
○ (Optional) Routines for model improvement (through feedback, deployed system can improve
itself)
○ (Optional) Routines for training
● Once the model has been deployed in production, it is time for regular
maintenance and operations.

● The optimization phase could be triggered by failing performance, by the need to
add new data sources and retrain the model, or by the deployment of improved
versions of the model based on better algorithms.
Ref: http://www.datasciencecentral.com/m/blogpost?id=6448529%3ABlogPost%3A234092
38
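
A deployed solution of the kind listed on the slide, a trained model (model + parameters) plus routines for input and prediction, might look roughly like this. The JSON format, the linear model type and every function name are assumptions made for the sketch, not a description of any particular deployment stack.

```python
import json


def save_model(params):
    """Serialize a trained model (model type + parameters) for deployment."""
    return json.dumps({"model": "linear", "params": params})


def load_model(blob):
    """Load the serialized model and check that its type is supported."""
    spec = json.loads(blob)
    if spec["model"] != "linear":
        raise ValueError("unsupported model type")
    return spec["params"]


def parse_input(raw):
    """Input routine: validate and convert a raw request value."""
    return float(raw)


def predict(params, x):
    """Prediction routine: apply the deployed parameters to one input."""
    return params["weight"] * x + params["bias"]


# Parameters as they might come out of an earlier training step (made up).
blob = save_model({"weight": 3.0, "bias": 2.0})

deployed = load_model(blob)
prediction = predict(deployed, parse_input("4"))
print(prediction)
```

Keeping the parameters in a separate serialized artifact is what makes later maintenance simple: retraining produces a new blob, and the input and prediction routines stay unchanged.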
Recap - Software Toolbox of Data Scientists:
● Database
○ SQL
○ NoSQL languages for target databases
● Programming Languages and Libraries
○ Python (due to availability of libraries for data management): scikit-learn, pyML, pandas
○ R
○ General programming languages such as Java for gluing different systems
○ C/C++: mlpack, dlib

● Tools: Orange, Weka, Matlab

● Vendor-specific platforms for data analytics (such as Adobe Marketing Cloud, etc.)
● Hive
● Spark
39
Conclusion: It takes a team
Must haves:

- Programming and Scripting skills

- Statistics and data analysis skills
- Machine learning skills

Necessary but not sufficient:

- Database management skills

- Distributed computing skills

Domain knowledge may make or break a system: If you do not realize a type of
data is essential, the results will not be very useful

40
Resources

● [DDS] Doing Data Science (O’Neil, Schutt), O’Reilly Media

● [CACM Blog] Data Science Workflow: Overview and Challenges
http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext
