Lecture02


17/09/24

Data Mining
S2

NOVA-IMS 2024/2025
Fernando Lucas Bação
[email protected]
http://www.isegi.unl.pt/fbacao

Instituto Superior de Estatística e Gestão de Informação


Universidade Nova de Lisboa

Agenda

• Data Science
• Different roles

• How to build models

• Relevance of data
• Building features
• Statistics and data science

• The canonical tasks in data mining


• Supervised learning
• Unsupervised learning


Data Science


Data Science

• Data science is the study of where information comes from, what it represents and how it can be turned into a valuable resource in the creation of business and IT strategies.

• “Data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.” (Mike Loukides)


Data Science vs Data Mining

• Data Science
• Data science is a set of fundamental principles that support and guide the principled extraction of information and knowledge from data.
• Data science involves more than just data-mining algorithms. Successful data scientists must be able to view business problems from a data perspective.

• Data mining
• The actual extraction of knowledge from data via technologies that incorporate the data science principles.
• There are hundreds of different data-mining algorithms, and a great deal of detail to the methods of the field.


Data Science

• iPhone Operations Data Scientist, Apple - Santa Clara Valley - California - US

• Key Qualifications

• The position requires a software programming skill set, utilization of statistical
techniques, experience managing data integrity, and implementing automated
solutions. The Data Scientist will need to have an understanding of relational database
management systems, design, and structured query language.
• Excellent analytical skills, high level of statistics with the ability to identify and predict
trends and anomalies.
• Experience in data mining extremely large data sets, high proficiency in SQL (Teradata,
Oracle, or MySQL), relational database management systems and design.
• Strong programming, with excellent object-oriented and dynamic scripting language
skills. Python, Java/C#/Objective-C, HTML5, CSS3, JavaScript, and Unix shell
scripting are strongly desired.
• Experience with statistical tools like JMP, Minitab, and Stata
• Strong ability to manage multiple tasks concurrently and in a timely manner, including
large, complex projects.
• Effective presentation skills and be able to explain complex data and charts in a clear
manner to large audiences. Outstanding communication skills, both verbal and written.


Data Science

• Data & Applied Scientist, Microsoft - Redmond, WA, US

• Job requirements:
• Programming skills related to data using technologies like Python, PERL, C#, etc.
• Stats and data analysis experience working with advanced tools like R, SAS and advanced
Excel
• Experience using data to build ML and statistical models that impact a product or business
• Ability to access, reduce, and join large data sets programmatically
• Natural curiosity that extends beyond your daily activities to help stitch together different
areas of the company, market and technology in new ways
• Ability to think algorithmically about product and business issues
• Willingness to work in a start-up organization with evolving responsibilities and a wide variety
of work
• Understanding that getting value from imperfect data and systems is a core virtue for a data
scientist
• Creativity to find pragmatic paths through ambiguity
• Attention to detail and accuracy
• Ability to collaborate with people representing diverse points of view
• Solid writing, presentation and data visualization skills
• Other desirable skills and experiences:
• Engineering experience building large data systems based on SQL, Hadoop, etc.
• Experience with product and service telemetry systems
• Experience with Feedback/Text analysis
• Experience conducting multivariate experiments such as A/B testing
• Ability to bring broad product, customer, and market context into data analyses
• Ability to interact with senior leaders to drive product and business impact
• 2 years' experience in data exploration, analysis, programming, and modeling
• Degree in computer science, machine learning, statistics, math, economics, business or other scientific or quant-focused field, MS or PHD preferred.


Different Roles


Data Science – Different Roles

• Business Analyst:
• Business analysts’ strengths lie in their business acumen.
• They can communicate well with both the data scientist and C-suite to help drive data-driven decisions faster.
• The best business analysts also have skills in statistics to be able to glean interesting insights from past behavior.


Data Science – Different Roles

• Data Scientist:
• Data science is largely rooted in statistics, data modeling, analytics and algorithms.
• Data scientists focus on conducting research and optimizing data to help companies get better at what they do.
• They are the minds behind recommended products on Amazon.


Data Science – Different Roles

• Data Engineer:
• While data scientists dig into the research and visualization of data, data engineers ensure the data is powered and flows correctly through the pipeline.
• They’re typically software engineers who can engineer a strong foundation for data scientists or analysts to think critically about the data.


Data Science – Different Roles

Source: https://towardsdatascience.com/data-engineer-vs-data-scientist-vs-business-analyst-b68d201364bc

How to Build Models


Find/Build Attributes

• Features/Attributes
• Features are fundamental to training an ML system
• They are the properties of the things you’re trying to learn about

Weight: 340g
Colour: Orange


Find/Build Attributes

• Features/Attributes
• Features of a fruit might be weight and colour. 2 features would mean 2 dimensions
• 2 dimensions can be plotted in a graphic provided they are expressed numerically

[Figure: Plotting apples (red and green dots) and oranges (orange dots) weight and colour. Axes: Weight, Colour]
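As a minimal sketch of what "expressed numerically" could mean, the snippet below turns a fruit into a 2-dimensional point. The `COLOUR_CODE` mapping (a position on a red-to-green scale) and the `to_point` helper are assumptions made for illustration; the slides do not say how colour is encoded.

```python
# Assumed encoding, not from the slides: colour as a position on a red-to-green scale.
COLOUR_CODE = {"R": 0.0, "O": 0.5, "G": 1.0}


def to_point(weight_g, colour):
    """Turn one fruit into a 2-dimensional point (colour, weight)."""
    return (COLOUR_CODE[colour], float(weight_g))


# A few (weight in grams, colour letter) pairs in the style of the lecture's fruit examples.
fruits = [(342, "O"), (123, "G"), (234, "R")]
points = [to_point(weight, colour) for weight, colour in fruits]
print(points)  # [(0.5, 342.0), (1.0, 123.0), (0.0, 234.0)]
```

Each tuple is one dot on a colour/weight scatter plot. Any other numeric encoding would work as long as it is applied consistently.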


Find/Build Attributes

• Features/Attributes
• With these features the ML system can learn to split data up with a line to separate oranges from apples
• This can now be used to make future classifications when we plot new points the system has not seen

[Figure: Plotting apples (red and green dots) and oranges (orange dots) weight and colour. Axes: Weight, Colour]
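One way to "learn a line" is the perceptron rule, sketched below. The slides do not name an algorithm, so this is just one possible choice. The training rows reuse the weight/colour/fruit values from the lecture's supervised-learning table; the colour encoding, the scaling of weight to hundreds of grams, and all function names are assumptions.

```python
# (weight in grams, colour letter, label), as in the lecture's fruit table.
TRAINING = [
    (342, "O", "orange"), (123, "G", "apple"), (39, "G", "apple"),
    (404, "O", "orange"), (234, "R", "apple"), (257, "R", "apple"),
    (368, "O", "orange"), (120, "R", "apple"), (300, "R", "apple"),
]
COLOUR_CODE = {"R": 0.0, "O": 0.5, "G": 1.0}  # assumed numeric encoding


def features(weight_g, colour):
    # Assumed scaling: weight in hundreds of grams, so both features have similar ranges.
    return (weight_g / 100.0, COLOUR_CODE[colour])


def train_perceptron(rows, max_epochs=10000):
    """Learn w and b so that the line w.x + b = 0 separates oranges (+1) from apples (-1)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for weight_g, colour, label in rows:
            x = features(weight_g, colour)
            y = 1 if label == "orange" else -1
            score = w[0] * x[0] + w[1] * x[1] + b
            if y * score <= 0:  # wrong side of the line: nudge the line towards the point
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training example is on the correct side
            break
    return w, b


def classify(model, weight_g, colour):
    w, b = model
    x = features(weight_g, colour)
    return "orange" if w[0] * x[0] + w[1] * x[1] + b > 0 else "apple"


model = train_perceptron(TRAINING)
print(classify(model, 390, "O"))  # label for a point the system has not seen
```

The loop stops as soon as one full pass over the training set produces no mistakes, which is guaranteed to happen here because the two classes can be separated by a line.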


Find/Build Attributes

• Features/Attributes
• An ML system cannot make predictions about things it does not know about
• Classify a papaya: the system will still label it as an apple or an orange
• This is because it only knows about apples and oranges, and that label was the closest match

[Figure: Plotting apples (red and green dots) and oranges (orange dots) weight and colour. Axes: Weight, Colour]
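The "closest match" behaviour can be illustrated with a nearest-neighbour lookup, sketched below. The training rows come from the lecture's fruit table; the colour encoding, the weight scaling and the papaya's measurements (900 g, orange-coloured) are made-up assumptions.

```python
import math

COLOUR_CODE = {"R": 0.0, "O": 0.5, "G": 1.0}  # assumed numeric encoding

# (weight in grams, colour letter, label), as in the lecture's fruit table.
TRAINING = [
    (342, "O", "orange"), (123, "G", "apple"), (39, "G", "apple"),
    (404, "O", "orange"), (234, "R", "apple"), (257, "R", "apple"),
    (368, "O", "orange"), (120, "R", "apple"), (300, "R", "apple"),
]


def to_point(weight_g, colour):
    # Assumed scaling: weight in hundreds of grams.
    return (weight_g / 100.0, COLOUR_CODE[colour])


def nearest_label(weight_g, colour):
    """Return the label of the closest known fruit; no other answer is possible."""
    query = to_point(weight_g, colour)
    closest = min(TRAINING, key=lambda row: math.dist(query, to_point(row[0], row[1])))
    return closest[2]


papaya_guess = nearest_label(900, "O")  # hypothetical papaya
print(papaya_guess)  # orange
```

The classifier can only answer with a label it has seen, so the papaya comes back as "orange", the nearest known example.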


Find/Build Attributes

[Diagram. Learning: Examples (training set), Algorithm, Knowledge. Classification: Examples (new), Classifier, Classification.]


Let’s build a model!!!


The Relevance of Data


Find/Build Attributes

• Features/Attributes
• Choosing the appropriate features has a major impact on the performance of any ML system
• Some features will never allow the system to produce good results
• How to choose? Practice and knowledge about the problem

[Figure: Plotting apples and oranges ripeness and number of seeds. Axes: Ripeness, Number of Seeds]
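A rough way to compare candidate features is to measure how far apart the class means are relative to the spread within each class. The sketch below uses that ratio as a hypothetical `separation` score. The weights and labels come from the lecture's fruit table, while the seed counts are made-up values chosen only to show a feature that does not discriminate.

```python
from statistics import mean, pvariance

LABELS = ["orange", "apple", "apple", "orange", "apple", "apple", "orange", "apple", "apple"]
WEIGHT = [342, 123, 39, 404, 234, 257, 368, 120, 300]  # from the lecture's fruit table
SEEDS = [5, 6, 4, 7, 5, 6, 6, 4, 7]                    # made-up values, for illustration only


def separation(values, labels):
    """Squared gap between class means divided by the summed within-class variance."""
    oranges = [v for v, label in zip(values, labels) if label == "orange"]
    apples = [v for v, label in zip(values, labels) if label == "apple"]
    gap = (mean(oranges) - mean(apples)) ** 2
    spread = pvariance(oranges) + pvariance(apples)
    return gap / spread


weight_score = separation(WEIGHT, LABELS)
seeds_score = separation(SEEDS, LABELS)
print(weight_score > seeds_score)  # True
```

A score well above 1 means the classes sit far apart on that feature; a score near 0 means they overlap. Scores like this are only a first check and do not replace knowledge about the problem.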


Find/Build Attributes

[Figure: two panels. Axes: Colour; Weight, Colour]

More features = higher probability of discriminating (up to a point)


Find/Build Attributes

• Do you know which examples correspond to apples and which correspond to oranges?
• We need the labels (e.g. fraud)

• Do you have enough labeled examples?
• We need experience, which is often scarce

• Do you know what an orange is?
• We need clear-cut definitions (e.g. churn)


Data-Driven Thinking

[Diagram. Learning: Examples (training set), Algorithm, Knowledge. Classification: Examples (new), Classifier, Classification.]


Supervised Learning

Examples (training set):

Weight  Color  Fruit
342     O      Orange
123     G      Apple
39      G      Apple
404     O      Orange
234     R      Apple
257     R      Apple
368     O      Orange
120     R      Apple
300     R      Apple

[Diagram. Learning: Examples (training set), Knowledge. Classification: Classifier, Classification.]


Supervised Learning

[Diagram. Learning: Examples (training set), Algorithm, Knowledge. Classification: Examples (new), Classifier, Classification.]

Examples (new):

Weight  Color
342     O
123     G
39      G
404     O
234     R
257     R
368     O
120     R
300     R


Supervised Learning

[Diagram. Learning: Examples (training set), Algorithm, Knowledge. Classification: Examples (new), Classifier, Classification.]

Classification of the new examples (LABEL):

ORANGE
APPLE
APPLE
ORANGE
APPLE
APPLE
ORANGE
APPLE
APPLE
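The two phases in the diagram (learning from labelled examples, then classifying new ones) can be sketched with a very simple learner. The rule below, a single weight threshold halfway between the heaviest apple and the lightest orange, is an assumption chosen for brevity and only works because oranges are heavier than apples in this toy table. It is not the algorithm used in the lecture, which the slides do not specify.

```python
# Labelled training examples (weight in grams, fruit) from the lecture's table.
TRAINING = [
    (342, "ORANGE"), (123, "APPLE"), (39, "APPLE"),
    (404, "ORANGE"), (234, "APPLE"), (257, "APPLE"),
    (368, "ORANGE"), (120, "APPLE"), (300, "APPLE"),
]


def learn_threshold(examples):
    """Learning phase: turn labelled examples into knowledge (here, one number)."""
    heaviest_apple = max(w for w, label in examples if label == "APPLE")
    lightest_orange = min(w for w, label in examples if label == "ORANGE")
    return (heaviest_apple + lightest_orange) / 2


def classify(threshold, weight):
    """Classification phase: apply the learned knowledge to an unlabelled example."""
    return "ORANGE" if weight > threshold else "APPLE"


threshold = learn_threshold(TRAINING)  # 321.0
new_weights = [342, 123, 39, 404, 234, 257, 368, 120, 300]
predicted = [classify(threshold, w) for w in new_weights]
print(predicted)
```

On the new examples listed in the slides this reproduces the label column ORANGE, APPLE, APPLE, ORANGE, APPLE, APPLE, ORANGE, APPLE, APPLE.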

Building Features


Building Features

• ETL (extract, transform and load)

• Extract
• Extract and consolidate data from different sources.

• Transform
• Select variables, create new variables, merge, etc.

• Load
• Load the data, deciding on periodicity, replacement of existing records and how history is kept.
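The three ETL steps can be sketched end to end with the Python standard library. Everything concrete here is hypothetical: the two CSV sources, the column names, the `customer_summary` table and the choice of an in-memory SQLite database as the load target.

```python
import csv
import io
import sqlite3

# Two hypothetical sources, inlined as CSV text so the sketch is self-contained.
CUSTOMERS_CSV = "customer_id,name\n1,Ana\n2,Rui\n"
SALES_CSV = "customer_id,amount\n1,20.0\n1,35.5\n2,12.0\n"


def extract(text):
    """Extract: read one source into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(customers, sales):
    """Transform: merge the sources and create a new variable (total spend)."""
    totals = {}
    for row in sales:
        customer_id = int(row["customer_id"])
        totals[customer_id] = totals.get(customer_id, 0.0) + float(row["amount"])
    return [
        (int(c["customer_id"]), c["name"], totals.get(int(c["customer_id"]), 0.0))
        for c in customers
    ]


def load(rows, conn):
    """Load: write the result, replacing any existing row for the same customer."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_summary "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, total_spend REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_summary VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(CUSTOMERS_CSV), extract(SALES_CSV)), conn)
result = conn.execute(
    "SELECT customer_id, name, total_spend FROM customer_summary ORDER BY customer_id"
).fetchall()
print(result)  # [(1, 'Ana', 55.5), (2, 'Rui', 12.0)]
```

`INSERT OR REPLACE` is one possible replacement policy. Keeping history would instead mean appending rows with a load date.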


Building Features

ABT (Analytic Base Table)

Height  Weight  Sex  Age  Inc.  PA  Insurance Cost
1.60    79      M    41   3000  S   N
1.72    82      M    32   4000  S   N
1.66    65      F    28   2500  N   N
1.82    87      M    35   2000  N   S
1.71    66      F    42   3500  N   S


Let’s build features!!!


Find/Build Attributes

• Some relevant variables

• Recency: days since last visit/purchase
• Frequency: number of transactions per customer
• Monetary Value: total value of sales (different from profit)
• Average Purchase: average purchase value per visit
• Most Frequent Store
• Average Time Between Transactions (Transaction Interval)
• Standard Deviation of Transaction Interval
• Customer Stability Index: Standard Deviation of Transaction Interval / Average Time Between Transactions
• Relative Spend on Each Product
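Several of these variables can be derived from a plain list of transactions, as sketched below for a single customer. The transactions, the reference date and the function and key names are hypothetical. The use of the population standard deviation is also an assumption, since the slides do not say which form is intended.

```python
from datetime import date
from statistics import mean, pstdev

# Hypothetical transactions for one customer: (purchase date, amount).
TRANSACTIONS = [
    (date(2024, 1, 1), 20.0),
    (date(2024, 1, 11), 30.0),
    (date(2024, 1, 31), 10.0),
    (date(2024, 2, 10), 40.0),
]
TODAY = date(2024, 2, 20)  # hypothetical reference date


def customer_features(transactions, today):
    """Build per-customer variables; needs at least two transactions for the intervals."""
    dates = sorted(d for d, _ in transactions)
    amounts = [amount for _, amount in transactions]
    # Days between consecutive purchases.
    intervals = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    avg_interval = mean(intervals)
    std_interval = pstdev(intervals)
    return {
        "recency": (today - dates[-1]).days,
        "frequency": len(transactions),
        "monetary_value": sum(amounts),
        "average_purchase": sum(amounts) / len(amounts),
        "avg_time_between_transactions": avg_interval,
        "std_transaction_interval": std_interval,
        "customer_stability_index": std_interval / avg_interval,
    }


features = customer_features(TRANSACTIONS, TODAY)
print(features["recency"], features["frequency"], features["monetary_value"])  # 10 4 100.0
```

A stability index near 0 means the customer buys at regular intervals; larger values mean irregular visits. Running the function once per customer yields one row per customer, each with these columns.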
