Foundations of Data Science - Unit 2
Data Science
Unit 2
Acknowledgement
▪ Most of the slides in this presentation are taken from material provided by
▪ Han and Kamber (Data Mining: Concepts and Techniques) and
▪ Tan, Steinbach and Kumar (Introduction to Data Mining)
Zarmeen
Spring 2021 2
Nasim
A simplified Data Science Taxonomy
Data Science
▪ Data Acquisition
▪ Statistics
▪ Data Analytics
▪ Data Mining
▪ Data Visualization
Data Analytics
Descriptive vs. Predictive Analytics
▪ Descriptive Analytics
▪ What happened, and why did it happen?
▪ Referred to as “unsupervised learning” in machine learning
▪ Predictive Analytics
▪ What will happen?
▪ Referred to as “supervised learning” in machine learning
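The contrast above can be sketched on a toy series. This is a minimal illustration, assuming made-up monthly sales figures: a plain summary stands in for the descriptive side (full unsupervised methods such as clustering are more involved), and a least-squares trend line stands in for the predictive side. The data and the names `fit_line` and `forecast_month_7` are hypothetical.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data, not from the slides).
months = [1, 2, 3, 4, 5, 6]
sales = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

# Descriptive: summarize what already happened.
average_sales = mean(sales)
best_month = months[sales.index(max(sales))]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Predictive: learn from labeled history (month -> sales), then estimate month 7.
slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept

print(average_sales, best_month, forecast_month_7)
```

The descriptive quantities only restate the recorded data, while the forecast extrapolates beyond it and can therefore be wrong.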
Predictive Analytics
Classification Techniques
▪ Classification Trees
▪ Naïve Bayes
▪ Random Forest
▪ Neural Networks
▪ Support Vector Machine
Prediction
▪ Regression Analysis
▪ Time Series Analysis
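As a rough illustration of the classification-tree idea listed above, the sketch below trains a one-level tree (a "decision stump") on a single numeric feature. It is a simplification under stated assumptions: real tree learners choose among many features, use impurity measures, and grow several levels. The data and the names `train_stump` and `predict` are hypothetical.

```python
def train_stump(values, labels):
    """Pick the threshold on one numeric feature that minimizes training errors.

    The resulting rule predicts label 1 when value > threshold, else 0.
    Returns (None, n + 1) if there are fewer than two distinct values.
    """
    candidates = sorted(set(values))
    best_threshold, best_errors = None, len(values) + 1
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def predict(threshold, value):
    """Apply the learned one-split rule to a new value."""
    return 1 if value > threshold else 0


# Hypothetical training data: transaction amount -> flagged (1) or not (0).
amounts = [20, 35, 50, 400, 650, 900]
flagged = [0, 0, 0, 1, 1, 1]

threshold, training_errors = train_stump(amounts, flagged)
print(threshold, training_errors, predict(threshold, 100), predict(threshold, 500))
```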
Descriptive Analytics
▪ Clustering
▪ Market Basket Analysis
Fig. Clustering
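Market basket analysis, listed above, is commonly described in terms of two measures: the support of an itemset (how often it appears) and the confidence of a rule (how often the consequent appears given the antecedent). The sketch below computes both on a handful of invented baskets; the data and function names are assumptions for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions.

    Assumes the antecedent appears in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

# Bread and milk appear together in 3 of the 5 baskets.
bread_milk_support = support(baskets, {"bread", "milk"})
# Of the 4 baskets containing bread, 3 also contain milk.
bread_to_milk_confidence = confidence(baskets, {"bread"}, {"milk"})
print(bread_milk_support, bread_to_milk_confidence)
```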
What Is Data Mining?
▪ Data mining (knowledge discovery from data)
▪ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data
▪ Alternative names
▪ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting etc.
Data Visualization
▪ Representation of data using visual forms such as charts, graphs and maps
▪ Goal: communicate information clearly and effectively to users
▪ Data visualization helps data scientists gain better insights
▪ Tools: Plotly, DataHero, Tableau, Dygraphs, QlikView, ZingChart, etc.
Data Science vs. Allied Fields
Data Science vs. Statistics
▪ Statistics is a mathematically based field that seeks to collect and interpret
quantitative data. In contrast, data science is a multidisciplinary field which uses
scientific methods, processes, and systems to extract knowledge from data in a range
of forms.
▪ Data scientists use methods from many disciplines, including statistics. However, the
fields differ in their processes, the types of problems studied, and several other
factors.
▪ Statistics has its roots in mathematics, and therefore, there has been an emphasis on
mathematical rigor, a desire to establish that something is sensible on theoretical
grounds before testing it in practice.
▪ In contrast, the data science community has its origin very much in computer practice.
This has led to a practical orientation, a willingness to test something out to see how
well it performs, without waiting for a formal proof of effectiveness.
Data Science vs. Machine Learning
▪ Data science is a broad term for multiple disciplines; machine learning fits within data science.
▪ Machine Learning (ML): develop new (individual) models
▪ Data Science: explore many models, build and tune hybrids
Data Science vs. Business Intelligence
Features | Business Intelligence (BI) | Data Science
Data Sources | Structured (usually SQL, often a data warehouse) | Both structured and unstructured data (logs, cloud data, SQL, NoSQL, text)
BI Answers for Fraud Detection
▪ How many cases were investigated last month?
▪ What was the success rate in collecting debts?
▪ How much revenue was recovered through collections?
▪ What was the close rate of cases in the past month? Past quarter? Past year?
▪ For debts that were closed out, how many days did it take on average to close them
out?
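BI questions like those above are answered by aggregating records of what has already happened. A minimal sketch, assuming a hypothetical list of closed cases with opening date, closing date and amount recovered (all values invented for illustration):

```python
from datetime import date

# Hypothetical closed-case records: (opened, closed, amount_recovered).
cases = [
    (date(2021, 1, 4), date(2021, 1, 14), 500.0),
    (date(2021, 1, 10), date(2021, 2, 9), 0.0),
    (date(2021, 2, 1), date(2021, 2, 21), 1200.0),
]

# How many days did it take on average to close a case?
days_to_close = [(closed - opened).days for opened, closed, _ in cases]
average_days_to_close = sum(days_to_close) / len(days_to_close)

# How much revenue was recovered, and in what share of cases was anything recovered?
total_recovered = sum(amount for _, _, amount in cases)
success_rate = sum(amount > 0 for _, _, amount in cases) / len(cases)

print(average_days_to_close, total_recovered, success_rate)
```

Every figure here is a count, sum or average over history; nothing is estimated about future cases, which is where predictive analytics takes over.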
Predictive Analytics for Fraud Detection
▪ What is the likelihood that the transaction is fraudulent?
▪ What is the likelihood the invoice is fraudulent or warrants further investigation?
▪ Which characteristics of the transaction are most related to or most predictive of
fraud?
▪ What is the expected amount of fraud?
▪ Historically, which demographic and historic purchase patterns were most related
to fraud?
Predictive Analytics for Customer Analytics
▪ What is the likelihood an e-mail will be opened?
▪ What is the likelihood a customer will click through a link in an e-mail?
▪ Which product is a customer more likely to purchase if given the choice?
▪ How many e-mails should the customer receive to maximize the likelihood of a
purchase?
▪ What is the likelihood that a product will sell out if it is put on sale?
BI Answers for Customer Analytics
▪ Which regions/states/ZIPs had the highest response rates?
▪ Which products had the highest/lowest click-through rates?
▪ How many repeat purchasers were there last month?
▪ How many new subscriptions to the loyalty program were there?
▪ How many visits to the store/website did a person have?
Structured vs. Non-Structured Data
▪ Most business databases contain structured data consisting of well-defined fields
with numeric or alpha-numeric values.
▪ An example of unstructured data is a video recorded by a surveillance camera in
a department store. This form of data generally requires extensive processing to
extract and structure the information contained in it.
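Video is hard to show in a few lines, so the sketch below uses free-text log lines as a much simpler stand-in for the same idea: processing is needed before the content becomes records with well-defined fields. The log format, the pattern and the name `to_records` are assumptions made for illustration.

```python
import re

# Hypothetical free-text log lines; the format is an assumption for illustration.
raw_lines = [
    "2021-03-01 09:15:02 user=alice action=login status=ok",
    "2021-03-01 09:17:40 user=bob action=purchase status=failed",
    "malformed entry without fields",
]

LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\w+) action=(?P<action>\w+) status=(?P<status>\w+)"
)


def to_records(lines):
    """Turn matching lines into dicts with well-defined fields; skip the rest."""
    records = []
    for line in lines:
        match = LINE_PATTERN.fullmatch(line)
        if match:
            records.append(match.groupdict())
    return records


records = to_records(raw_lines)
for record in records:
    print(record)
```

Once extracted, the records can be loaded into a table and handled like any other structured data.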
Structured vs. Non-Structured Data (Cont’d)
▪ Structured data is often referred to as traditional data, while the semi-structured
and unstructured data are lumped together as non-traditional data.
▪ Most of the current data mining methods and commercial tools are applied to
traditional data.
Data Science Process
Data Science - Process (CRISP-DM)
CRISP-DM (Business Understanding)
▪ Understand the project objectives and requirements
▪ Can it be converted into a data mining problem definition?
▪ Was any effort made in the past? If yes, what were the findings? Why are we
doing it again? What has changed?
▪ Assess availability of time, technology and human resources. Do we have enough
time and resources to execute the analytics project?
▪ Identify the success criteria, key risks and major stakeholders
CRISP-DM (Data Understanding)
▪ Get familiar with the data. Is it enough to solve the stated business problem? If
not, do we need to redesign the data collection process?
▪ What’s needed vs. what’s available
▪ Identify data quality problems
▪ Determine the structures and tools needed
▪ Discover first insights into the data
CRISP-DM (Data Preparation)
▪ Construct the final dataset
▪ Process likely to be repeated multiple times, and not in any prescribed order
▪ Tasks include attribute selection as well as transformation and cleaning of data
▪ Understand what to keep and what to discard
▪ Extensive use of exploratory data analysis and visualization
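The preparation tasks named above (attribute selection, cleaning, transformation) can be sketched on a few invented records. This is one possible arrangement, not a prescribed order; the field names and helper functions are hypothetical, and dropping incomplete rows is only one of several ways to handle missing values.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": 25, "income": 30000, "notes": "n/a"},
    {"id": 2, "age": None, "income": 52000, "notes": ""},
    {"id": 3, "age": 40, "income": 90000, "notes": "vip"},
    {"id": 4, "age": 55, "income": 60000, "notes": ""},
]


def select_attributes(rows, keep):
    """Attribute selection: keep only the listed fields."""
    return [{key: row[key] for key in keep} for row in rows]


def drop_incomplete(rows):
    """Cleaning: discard rows with any missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def min_max_scale(rows, field):
    """Transformation: rescale one numeric field to the range [0, 1]."""
    values = [row[field] for row in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**row, field: (row[field] - lo) / span} for row in rows]


prepared = select_attributes(raw, ["age", "income"])
prepared = drop_incomplete(prepared)
prepared = min_max_scale(prepared, "age")
prepared = min_max_scale(prepared, "income")
for row in prepared:
    print(row)
```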
CRISP-DM (Modeling)
▪ Application of various modeling techniques and calibration of their parameters to
optimal values
▪ Documenting assumptions behind each modeling technique to get feedback from
stakeholders and domain experts
▪ Typically requires stepping back to the data preparation phase
CRISP-DM (Evaluation)
▪ Test robustness of the models under consideration by gauging their performance
against hold-out data
▪ Analyze whether the models achieve the business objectives
▪ Finalize a data mining model
▪ Quantify business value and identify key findings
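Gauging performance against hold-out data means scoring the model on rows it never saw during fitting. A minimal sketch, assuming a synthetic labeled dataset and a deliberately simple threshold "model"; the names `holdout_split`, `fit_threshold` and `accuracy` are hypothetical, and accuracy is only one of many possible measures.

```python
import random


def holdout_split(rows, holdout_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and reserve a fraction as hold-out data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def fit_threshold(rows):
    """Toy fit: midpoint between the largest negative and smallest positive example.

    Assumes the training rows contain both classes.
    """
    negatives = [x for x, y in rows if y == 0]
    positives = [x for x, y in rows if y == 1]
    return (max(negatives) + min(positives)) / 2


def accuracy(threshold, rows):
    """Share of (feature, label) rows that the threshold rule labels correctly."""
    return sum(int(x > threshold) == y for x, y in rows) / len(rows)


# Synthetic labeled data: (transaction amount, is_fraud); invented for illustration.
data = [(amount, int(amount > 300)) for amount in range(0, 1000, 50)]

train, holdout = holdout_split(data)
threshold = fit_threshold(train)            # fitted on training rows only
holdout_accuracy = accuracy(threshold, holdout)  # scored on unseen rows
print(len(train), len(holdout), threshold, holdout_accuracy)
```

Fixing the seed keeps the split reproducible, so different candidate models can be compared on the same hold-out rows.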
CRISP-DM (Deployment)
▪ Typically a customer-driven stage rather than an analyst-driven one.
▪ Important for the customer to understand up front the actions needed to actually
make use of the created models.
▪ Define a process to update and retrain the model, as needed.