Lesson 1
Introduction
TEACHER
Mirko Mazzoleni
PLACE
University of Bergamo
Who I am
• Name: Mirko Mazzoleni
• Contact details:
✓ [email protected]
✓ http://cal.unibg.it/ (CAL research laboratory)
✓ https://mirkomazzoleni.github.io/
✓ https://www.facebook.com/calunibg/
Course content
6. Decision trees
Course content
Part II: Automation
Evaluation
• Written exam (2 hours): up to 25 points
• Theoretical open questions and exercises
+
• [OPTIONAL] Small data analysis project (groups of max 3 people): up to 6 points
Data science projects in the CAL research group
1. Forecasting of sales volume (for food industry)
2. Image processing classification
• Plant disease
• People identification and classification
• Blimp
3. Fault diagnosis
• Bearing inner race fault
• Ballscrew jam in EMA
4. Industrial automation
• ICT for remote maintenance
• Automatic transplant machine
Outline
1. Introduction to data science
Why
Business value created by AI up to 2030 [1]: $13 trillion in total
• Retail: $0.8T
• Travel: $480B
• Logistics: $475B
• Automotive & assembly: $405B
• Materials: $300B
• Advanced electronics & semiconductors: $291B
• Healthcare systems & services: $267B
• High tech: $267B
• Telecom: $174B
• Oil & gas: $173B
• Agriculture: $164B
• It is difficult to find an industrial sector that will not benefit from AI in the near future
We will use the terms “machine learning”, “data mining”, and “data science” quite interchangeably
Data science has been called the sexiest job of the 21st century
Learning examples
Recent years: stunning breakthroughs in computer vision applications
What learning is about
Applying machine learning and data science makes sense if:
1. A pattern exists
2. We cannot pin it down mathematically (an analytical solution does not exist)
3. We have data on it
Data types
The data can have different formats. The most typical is that of a table.

House area (feet²)   # bedrooms   Price (1000$)
523                  1            115
645                  1            150
708                  2            210
1034                 3            280
2290                 4            355
2545                 4            440

• AIM: predict house prices (regression)
• The data can come from a database or from .csv, Excel files…

Picture   Label
[image]   Cat

• AIM: recognize if there is a cat in the image
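As a rough illustration of the regression aim on the house table, the sketch below fits a straight line (price against area) with ordinary least squares. Using only the area, and a linear model, are assumptions made to keep the example short; the slide does not prescribe a specific method. The 1500 square feet house is a hypothetical query.

```python
# Toy data copied from the house table above.
areas = [523, 645, 708, 1034, 2290, 2545]   # house area (square feet)
prices = [115, 150, 210, 280, 355, 440]     # price (1000$)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(area, intercept, slope):
    """Predicted price (1000$) for a house of the given area."""
    return intercept + slope * area


intercept, slope = fit_line(areas, prices)
print(f"price = {intercept:.1f} + {slope:.3f} * area")
# 1500 square feet is a hypothetical house that is not in the table.
print(f"predicted price: {predict(1500, intercept, slope):.0f} (1000$)")
```

With only six rows the fitted line is purely illustrative; a real project would use many more examples and more than one input column.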
Data are dirty
Garbage IN, garbage OUT
Data problems:
• Missing values
• Incorrect values

House area (feet²)   # bedrooms   Price (1000$)
523                  1            115
645                  1            0.001
708                  unknown      210
1034                 3            unknown
unknown              4            355
2545                 unknown      440
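A minimal sketch of how the two data problems in the dirty table could be flagged automatically. The rows are copied from the table; the plausibility threshold `MIN_PLAUSIBLE_PRICE` is an assumption chosen only for illustration, and real checks depend on domain knowledge.

```python
# Rows copied from the dirty table above: (area, bedrooms, price in 1000$).
# None stands for an "unknown" entry.
rows = [
    (523, 1, 115),
    (645, 1, 0.001),
    (708, None, 210),
    (1034, 3, None),
    (None, 4, 355),
    (2545, None, 440),
]

MIN_PLAUSIBLE_PRICE = 10  # assumed threshold (1000$), for illustration only


def find_problems(row):
    """Return a list of human-readable problems found in one row."""
    area, bedrooms, price = row
    problems = []
    for name, value in (("area", area), ("bedrooms", bedrooms), ("price", price)):
        if value is None:
            problems.append(f"missing {name}")
    if price is not None and price < MIN_PLAUSIBLE_PRICE:
        problems.append(f"implausible price {price}")
    return problems


# Map each row index to its list of problems (empty list means the row is clean).
report = {i: find_problems(row) for i, row in enumerate(rows)}
clean_rows = [row for i, row in enumerate(rows) if not report[i]]

for i, problems in report.items():
    print(i, problems or "ok")
```

Dropping every flagged row is the bluntest option: here it leaves a single clean row, which is why data preparation usually takes a large share of a project.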
Machine learning vs. data science
House area (feet²)   # bedrooms   # bathrooms   Recently renovated   Price (1000$)
523                  1            2             No                   115
645                  1            3             No                   150
708                  2            1             No                   210
1034                 3            3             Yes                  280
2290                 4            4             No                   355
2545                 4            5             Yes                  440

A: house area, # bedrooms, # bathrooms, recently renovated
B: price

Machine learning
• Predict B given A
• Output: code and running software program (web site / mobile app)

Data science
• Houses with 3 bathrooms are more expensive than those with 2 bathrooms of the same size
• Recently renovated houses cost 15% more
• Output: slide deck
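The contrast between the two outputs can be sketched on the same table. In this hedged example, `predict_price` plays the machine learning role (a piece of software that maps A to B) and `mean_price` plays the data science role (a summary that supports a conclusion). The 1-nearest-neighbour rule and the use of area alone are assumptions made for brevity.

```python
# Rows copied from the table above:
# (area in square feet, bedrooms, bathrooms, recently renovated, price in 1000$)
houses = [
    (523, 1, 2, False, 115),
    (645, 1, 3, False, 150),
    (708, 2, 1, False, 210),
    (1034, 3, 3, True, 280),
    (2290, 4, 4, False, 355),
    (2545, 4, 5, True, 440),
]


def predict_price(area):
    """Machine learning flavour: map input A (here only the area) to output B.

    Returns the price of the house with the closest area (1-nearest neighbour),
    a rule picked here only because it fits in a few lines.
    """
    nearest = min(houses, key=lambda h: abs(h[0] - area))
    return nearest[4]


def mean_price(renovated):
    """Data science flavour: summarise the data to support a conclusion."""
    selected = [h[4] for h in houses if h[3] == renovated]
    return sum(selected) / len(selected)


print("predicted price for 1000 square feet:", predict_price(1000))
print("mean price, renovated:", mean_price(True))
print("mean price, not renovated:", mean_price(False))
```

The raw gap between the two means mixes the effect of renovation with the effect of size, since the renovated houses in this table are also among the larger ones. A statement such as "recently renovated houses cost 15% more" needs a comparison between houses of similar size.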
Machine learning vs. data science
[Figure: diagram relating AI, ML, deep learning, other tools, and data science]
Data-analytic thinking
Picture taken from [2]: Provost, Foster, and Tom Fawcett. “Data Science for Business: What you need to know about data mining and data-analytic thinking”. O'Reilly Media, Inc., 2013
Approaching a data mining problem
Cross Industry Standard Process for Data Mining (CRISP-DM)
https://mineracaodedados.files.wordpress.com/2012/04/the-crisp-dm-model-the-new-blueprint-for-data-mining-shearer-colin.pdf
CRISP-DM: Business understanding
Cast the business problem as one or more data science problems
CRISP-DM: Data understanding
Identify the available and needed data
CRISP-DM: Data preparation
Clean and prepare data for use with algorithms
• Be careful not to use historical data that will not be available when the model is used
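One way to act on this warning is to record, for every column, whether its value is already known at prediction time, and to train only on those columns. The sketch below assumes a hypothetical churn dataset; all column names and values are invented for illustration.

```python
# Hypothetical churn-prediction columns. For each one we record whether its
# value is already known at the moment the model has to make a prediction.
columns = {
    "contract_length_months": True,
    "monthly_fee": True,
    "support_calls_last_90_days": True,
    "cancellation_reason": False,    # only recorded after the customer left
    "final_invoice_amount": False,   # only known once the contract is closed
}


def usable_features(availability):
    """Keep only the columns that will be available when the model is used."""
    return sorted(name for name, known in availability.items() if known)


def drop_leaky_columns(record, availability):
    """Return a copy of one historical record restricted to usable features."""
    keep = set(usable_features(availability))
    return {name: value for name, value in record.items() if name in keep}


# One invented historical record, including two columns that leak the outcome.
historical_record = {
    "contract_length_months": 24,
    "monthly_fee": 29.9,
    "support_calls_last_90_days": 3,
    "cancellation_reason": "price",
    "final_invoice_amount": 14.95,
}

print(usable_features(columns))
print(drop_leaky_columns(historical_record, columns))
```

A model trained with `cancellation_reason` would look excellent on historical data and be useless in production, because that column only exists for customers who have already left.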
CRISP-DM: Modeling
Estimate a mathematical model to extract patterns from data
CRISP-DM: Evaluation
Assess the validity of the results
• The devised solution and the model’s decisions should be comprehensible by the stakeholders
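A common way to assess validity is to measure the error on data that the model has not seen during fitting. The sketch below is one possible illustration, not the course's prescribed procedure: it reuses the six houses from the data types slide, holds out two of them, fits a line on the rest, and reports the mean absolute error on the held-out rows. The choice of held-out rows and of the error measure are assumptions.

```python
areas = [523, 645, 708, 1034, 2290, 2545]   # house area (square feet)
prices = [115, 150, 210, 280, 355, 440]     # price (1000$)


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Assumed split, fixed by hand so the example is reproducible.
test_idx = [1, 4]
train_idx = [i for i in range(len(areas)) if i not in test_idx]

# Fit only on the training rows.
intercept, slope = fit_line(
    [areas[i] for i in train_idx], [prices[i] for i in train_idx]
)

# Evaluate only on the held-out rows.
test_pred = [intercept + slope * areas[i] for i in test_idx]
test_mae = mean_absolute_error([prices[i] for i in test_idx], test_pred)
print(f"held-out mean absolute error: {test_mae:.1f} (1000$)")
```

Reporting the error in thousands of dollars, rather than as an abstract score, also helps keep the result comprehensible to stakeholders.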
CRISP-DM: Deployment
Put the model (or the data mining steps) into production
• For this reason, it is advisable to involve a member of the development team in the early phases of the data science project
• Deployment can involve not only the final model, but also previous phases (data
collection, model building, evaluation)
From business problems to data mining tasks
Each data science project is unique. The aim is to decompose the business problem
into subtasks for which a common approach exists.
There are many machine learning algorithms. However, they address a handful of tasks:
Supervised vs unsupervised methods
A specific data science task can be tackled via a supervised or unsupervised approach
Unsupervised (A only, no target B)
“Do our customers naturally fall into different groups?”
There is no specific target (or purpose) for the grouping. The aim is only to find similarities between individuals.
Supervised (A → B)
“Can we find groups of customers who have particularly high likelihoods of canceling their service soon after their contracts expire?”
There is a specific target: find people who will leave when their contract expires. In this case, there must be data on the target. The value of the target for an individual is called label or class. We need a dataset of people whom we know left (labeled dataset).
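The difference can be sketched on a tiny, invented customer dataset. The unsupervised function only sees the feature (monthly usage) and looks for two natural groups; the supervised functions also use the label (whether the customer left) to learn a prediction rule. The data, the 1-D k-means with two clusters, and the nearest-class-mean rule are all assumptions chosen to keep the example short.

```python
# Hypothetical customers: monthly service usage in hours (feature A).
usage = [2.0, 3.0, 2.5, 20.0, 22.0, 19.0]
# Hypothetical labels (target B): True if the customer left after contract expiry.
left = [True, True, True, False, False, False]


def two_means(values, iterations=10):
    """Unsupervised: split values into two groups with 1-D k-means (k = 2).

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)   # initial cluster centres
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to the closest centre.
            groups[0 if abs(v - low) <= abs(v - high) else 1].append(v)
        # Move each centre to the mean of its group.
        low = sum(groups[0]) / len(groups[0])
        high = sum(groups[1]) / len(groups[1])
    return groups


def fit_centroids(values, labels):
    """Supervised: learn the mean usage of each labelled class."""
    leavers = [v for v, flag in zip(values, labels) if flag]
    stayers = [v for v, flag in zip(values, labels) if not flag]
    return sum(leavers) / len(leavers), sum(stayers) / len(stayers)


def predict_leaves(value, centroids):
    """Predict the label of a new customer from the nearest class mean."""
    leaver_mean, stayer_mean = centroids
    return abs(value - leaver_mean) < abs(value - stayer_mean)


groups = two_means(usage)                  # no labels used
centroids = fit_centroids(usage, left)     # labels required
print(groups)
print(predict_leaves(4.0, centroids))
```

The clustering result says only that two groups exist; it takes the labeled dataset to say which group is likely to leave.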
Supervised vs unsupervised methods
Supervised
• Classification and class probability estimation
• Regression
• Causal modeling
Supervised or unsupervised
• Similarity matching
• Link prediction
• Data reduction
Unsupervised
• Clustering
• Co-occurrence grouping
• Profiling
Business problems as data science examples
[Figure: example business problems marked as supervised, unsupervised, or supervised or unsupervised]
Additional resources
MOOC
• Learning from data (Yaser S. Abu-Mostafa - edX)
• The analytics edge (Dimitris Bertsimas - edX)
• Statistical learning (Trevor Hastie and Robert Tibshirani - Stanford Lagunita)
Books
• Data science for business (Foster Provost, Tom Fawcett)
• Neural Networks and Deep Learning (Michael Nielsen)
• Pattern Recognition and Machine Learning (Christopher Bishop)
References
1. Notes from the AI frontier: Modeling the impact of AI on the world economy, 2018.
2. Provost, Foster, and Tom Fawcett. “Data Science for Business: What you need to know about data mining and
data-analytic thinking”. O'Reilly Media, Inc., 2013.
3. Brynjolfsson, E., Hitt, L. M., and Kim, H. H. “Strength in numbers: How does data driven decision making affect firm performance?” Tech. rep., available at SSRN: http://ssrn.com/abstract=1819486, 2011.
4. Pyle, D. “Data Preparation for Data Mining”. Morgan Kaufmann, 1999.
5. Kohavi, R., and Longbotham, R. “Online experiments: Lessons learned”. Computer, 40 (9), 103–105, 2007.
6. Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. “Learning from data”. AMLBook, 2012.
7. Andrew Ng. “Machine learning”. Coursera MOOC. (https://www.coursera.org/learn/machine-learning)