Lecture02
Data Mining
S2
NOVA-IMS 2024/2025
Fernando Lucas Bação
[email protected]
http://www.isegi.unl.pt/fbacao
Agenda
• Data Science
• Different roles
• Relevance of data
• Building features
• Statistics and data science
17/09/24
Data Science
• Key Qualifications
• Job requirements:
• Programming skills related to data using technologies like Python, Perl, C#, etc.
• Stats and data analysis experience working with advanced tools like R, SAS and advanced Excel
• Experience using data to build ML and statistical models that impact a product or business
• Ability to access, reduce, and join large data sets programmatically
• Natural curiosity that extends beyond your daily activities to help stitch together different areas of the company, market and technology in new ways
• Ability to think algorithmically about product and business issues
• Willingness to work in a start-up organization with evolving responsibilities and a wide variety of work
• Understanding that getting value from imperfect data and systems is a core virtue for a data scientist
• Creativity to find pragmatic paths through ambiguity
• Attention to detail and accuracy
• Ability to collaborate with people representing diverse points of view
• Solid writing, presentation and data visualization skills
• Other desirable skills and experiences:
• Engineering experience building large data systems based on SQL, Hadoop, etc.
• Ability to bring broad product, customer, and market context into data analyses
• Ability to interact with senior leaders to drive product and business impact
• Degree in computer science, machine learning, statistics, math, economics, business or other scientific or quant-focused field; MS or PhD preferred.
Different Roles
• Business Analyst:
• Business analysts’ strengths lie in their business acumen.
• Data Scientist:
• Data science is largely rooted in statistics, data modeling, analytics and algorithms.
• Data Engineer:
• While data scientists dig into the research and visualization of data, data engineers ensure the data is powered and flows correctly through the pipeline.
Source: https://towardsdatascience.com/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc
Instituto Superior de Estatística e Gestão de Informação
Universidade Nova de Lisboa
Find/Build Attributes
• Features/Attributes
• Features are fundamental to training an ML system
Weight: 340 g
Colour: Orange
[Figure: scatter plot with axes Weight and Colour; oranges shown as orange dots]
• With these features the ML system can learn to split the data with a line that separates oranges from apples
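As a rough illustration of learning a separating line, the sketch below trains a perceptron on the fruit examples that appear on the Supervised Learning slides. The perceptron is swapped in as one simple way to learn a line; the slides do not name an algorithm. The numeric encoding (weight in hundreds of grams, colour as 1.0 for orange and 0.0 otherwise) and the names `to_features`, `train_perceptron` and `predict` are assumptions made for this example.

```python
# (weight, colour code, label) as listed on the Supervised Learning slides.
# Weights are assumed to be grams; G, R, O are read as green, red, orange.
EXAMPLES = [
    (342, "O", "Orange"), (123, "G", "Apple"), (39, "G", "Apple"),
    (404, "O", "Orange"), (234, "R", "Apple"), (257, "R", "Apple"),
    (368, "O", "Orange"), (120, "R", "Apple"), (300, "R", "Apple"),
]


def to_features(weight, colour):
    # Assumed encoding: rescaled weight, and 1.0 if the colour code is "O".
    return (weight / 100.0, 1.0 if colour == "O" else 0.0)


def train_perceptron(data, epochs=1000):
    """Learn w and b so that w[0]*x0 + w[1]*x1 + b > 0 means Orange."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for weight, colour, label in data:
            x = to_features(weight, colour)
            target = 1 if label == "Orange" else -1
            score = w[0] * x[0] + w[1] * x[1] + b
            if target * score <= 0:  # wrong side of the line: move the line
                w[0] += target * x[0]
                w[1] += target * x[1]
                b += target
                mistakes += 1
        if mistakes == 0:  # every training example is on the correct side
            break
    return w, b


def predict(w, b, weight, colour):
    x = to_features(weight, colour)
    return "Orange" if w[0] * x[0] + w[1] * x[1] + b > 0 else "Apple"


w, b = train_perceptron(EXAMPLES)
predictions = [predict(w, b, weight, colour) for weight, colour, _ in EXAMPLES]
print("line: %.2f*weight + %.2f*colour + %.2f = 0" % (w[0], w[1], b))
print(predictions)
```

Because these examples can be separated by a straight line under the assumed encoding, the loop stops once every training example falls on the correct side.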
• Classify a papaya
• The system still answers apple or orange, because it only knows about apples and oranges and picks the closest match
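The closest-match behaviour can be sketched with a nearest-neighbour lookup, used here only as one possible reading of "closest match". The training rows are the fruit examples from the Supervised Learning slides; the papaya's weight and colour code, the feature encoding and the function names are hypothetical values chosen for illustration.

```python
import math

# (weight, colour code, label) as listed on the Supervised Learning slides.
EXAMPLES = [
    (342, "O", "Orange"), (123, "G", "Apple"), (39, "G", "Apple"),
    (404, "O", "Orange"), (234, "R", "Apple"), (257, "R", "Apple"),
    (368, "O", "Orange"), (120, "R", "Apple"), (300, "R", "Apple"),
]


def to_features(weight, colour):
    # Assumed encoding: weight in hundreds of grams, 1.0 if colour code is "O".
    return (weight / 100.0, 1.0 if colour == "O" else 0.0)


def nearest_label(weight, colour):
    """Return the label of the training example closest to the query."""
    query = to_features(weight, colour)
    closest = min(
        EXAMPLES,
        key=lambda ex: math.dist(query, to_features(ex[0], ex[1])),
    )
    return closest[2]


# Hypothetical papaya: 450 g with an orange-like colour code.
papaya_guess = nearest_label(450, "O")
print(papaya_guess)
```

Whatever values are chosen for the papaya, the answer can only be one of the labels present in the training set, which is the limitation the slide points out.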
[Diagram: Learning: Examples (training set) → Algorithm → Knowledge. Classification: Examples (new) → Classifier → Classification]
[Figure: a further feature, number of seeds]
• Choosing the appropriate features has a major impact on the performance of any ML system
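One way to see the effect of feature choice is to score each feature on its own. The sketch below is a minimal illustration under stated assumptions: weight, colour and label follow the fruit examples in the slides, while the seed counts are invented for this example and say nothing about real fruit. The scoring rule (best single-threshold rule, measured on the training rows themselves) and all names are also assumptions.

```python
# Rows: (weight, colour code, seeds, label).
# The seed counts are INVENTED for illustration only.
EXAMPLES = [
    (342, "O", 8, "Orange"), (123, "G", 5, "Apple"), (39, "G", 8, "Apple"),
    (404, "O", 5, "Orange"), (234, "R", 6, "Apple"), (257, "R", 10, "Apple"),
    (368, "O", 6, "Orange"), (120, "R", 7, "Apple"), (300, "R", 9, "Apple"),
]
LABELS = [row[3] for row in EXAMPLES]

# One numeric column per candidate feature (the colour encoding is assumed).
FEATURES = {
    "weight": [row[0] for row in EXAMPLES],
    "colour": [1 if row[1] == "O" else 0 for row in EXAMPLES],
    "seeds": [row[2] for row in EXAMPLES],
}


def best_threshold_accuracy(values, labels):
    """Best training accuracy of any rule 'value <= t is one class, else the other'."""
    low_class, high_class = sorted(set(labels))
    best = 0.0
    for threshold in sorted(set(values)):
        for low, high in ((low_class, high_class), (high_class, low_class)):
            hits = sum(
                (low if value <= threshold else high) == label
                for value, label in zip(values, labels)
            )
            best = max(best, hits / len(labels))
    return best


accuracy = {
    name: best_threshold_accuracy(column, LABELS)
    for name, column in FEATURES.items()
}
for name, value in accuracy.items():
    print(f"{name}: {value:.2f}")
```

With these made-up seed counts, apples and oranges share the same values, so no threshold on seeds can separate them, whereas the colour feature separates them completely.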
[Figure: plots with axes Weight and Colour]
Data-Driven Thinking
[Diagram: Learning: Examples (training set) → Algorithm → Knowledge. Classification: Examples (new) → Classifier → Classification]
Supervised Learning
Examples (training set):
Weight  Color  Label
123     G      Apple
39      G      Apple
404     O      Orange
234     R      Apple
257     R      Apple
368     O      Orange
120     R      Apple
300     R      Apple
[Diagram: Examples (training set) → Classifier → Classification]
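The flow from labelled examples to a classifier can be sketched as follows. This is a minimal illustration, not the method used in the course: the "knowledge" is assumed to be a simple map from colour code to the most frequent label, and the function names and the two new examples are hypothetical.

```python
from collections import Counter, defaultdict

# Labelled training set from the slide: (weight, colour code, label).
TRAINING_SET = [
    (123, "G", "Apple"), (39, "G", "Apple"), (404, "O", "Orange"),
    (234, "R", "Apple"), (257, "R", "Apple"), (368, "O", "Orange"),
    (120, "R", "Apple"), (300, "R", "Apple"),
]


def learn(training_set):
    """Learning step: build the 'knowledge', a colour -> majority label map."""
    votes = defaultdict(Counter)
    for _weight, colour, label in training_set:
        votes[colour][label] += 1
    return {
        colour: counts.most_common(1)[0][0]
        for colour, counts in votes.items()
    }


def classify(knowledge, example):
    """Classification step: look up the label for a new, unlabelled example."""
    _weight, colour = example
    return knowledge.get(colour)  # None if the colour was never seen


knowledge = learn(TRAINING_SET)
new_examples = [(350, "O"), (150, "R")]  # hypothetical unlabelled fruit
predictions = [classify(knowledge, example) for example in new_examples]
print(knowledge)
print(predictions)
```

The split into `learn` and `classify` mirrors the two rows of the diagram: the label column is needed only while learning, and new examples arrive without it.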
Features of the examples:
Weight  Color
342     O
123     G
39      G
404     O
234     R
257     R
368     O
120     R
300     R
Labels of the examples:
LABEL
ORANGE
APPLE
APPLE
ORANGE
APPLE
APPLE
ORANGE
APPLE
APPLE
Building Features
• Transform
• Select variables, create new variables, merge, etc.
• Load
• Load data, periodicity, replacement, historical.
1.60 79 M 41 3000 S N
1.72 82 M 32 4000 S N
1.66 65 F 28 2500 N N
1.82 87 M 35 2000 N S
1.71 66 F 42 3500 N S
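The Transform step (select variables, create new variables) can be sketched on the rows above. The slide shows no header, so the column names below are assumptions: the first two columns are read as height in metres and weight in kilograms, the third and fourth as sex and age, and the remaining three are left unnamed. The derived variable (body mass index) is likewise only an illustrative choice.

```python
# Rows exactly as shown on the slide.
ROWS = [
    (1.60, 79, "M", 41, 3000, "S", "N"),
    (1.72, 82, "M", 32, 4000, "S", "N"),
    (1.66, 65, "F", 28, 2500, "N", "N"),
    (1.82, 87, "M", 35, 2000, "N", "S"),
    (1.71, 66, "F", 42, 3500, "N", "S"),
]

# ASSUMED column names; the original table has no header.
COLUMNS = ("height_m", "weight_kg", "sex", "age", "col5", "col6", "col7")


def transform(row):
    """Select a few variables and create one new variable from two others."""
    record = dict(zip(COLUMNS, row))
    # New variable: body mass index = weight / height squared.
    bmi = record["weight_kg"] / record["height_m"] ** 2
    # Selection: keep only the variables wanted downstream.
    return {"sex": record["sex"], "age": record["age"], "bmi": round(bmi, 1)}


features = [transform(row) for row in ROWS]
for feature_row in features:
    print(feature_row)
```

If the first two columns mean something else, the derived variable is meaningless; the point is only the shape of the step, where raw columns go in and a smaller set of selected and constructed variables comes out.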