
01 - ML - Introduction (1)


Ho Chi Minh University of Banking

Department of Economic Mathematics

Machine Learning
Lecture 1. Overview of Machine Learning

Vuong Trong Nhan


Outline

1. Introduction to Machine Learning
2. What is Machine Learning?
3. Why use Machine Learning?
4. Machine Learning Applications
5. How does Machine Learning work?
6. Types of Machine Learning
7. Main Challenges of Machine Learning
8. Ethical and Social Considerations
9. Trends and Challenges

Introduction

[Figures: example uses of machine learning: wine quality prediction, autonomous cars, healthcare. Source: Internet]

AI

General AI vs Narrow AI

What is Machine Learning?

Machine Learning (ML) is an active subfield of Artificial Intelligence.

What is Machine Learning?

Arthur Samuel (1959). Machine Learning: "Field of study that gives computers the ability to learn without being explicitly programmed."
Source: Wikipedia

What is Machine Learning?

Tom Mitchell (1998). Well-posed Learning Problem: "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E."
Source: Tom Mitchell’s homepage

A learning problem can be described as a triple (T, P, E):
T: Task
P: Performance measure
E: Experience

Example 1
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam.

What is the task T in this setting?

A. Classifying emails as spam or not spam. (T)
B. Watching you label emails as spam or not spam. (E)
C. The number (or fraction) of emails correctly classified as spam/not spam. (P)
D. None of the above

Example 1
In spam email detection:
Task, T: To classify emails as Spam or Not Spam.
Performance measure, P: The percentage of emails correctly classified as “Spam” or “Not Spam”.
Experience, E: A set of emails labeled “Spam” or “Not Spam”.
[Figure: an incoming email is checked: Spam? Yes / No]

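The triple (T, P, E) for the spam example can be sketched in a few lines of Python. This is a toy illustration, not a real spam filter: the word-count scoring rule, the function names `train`, `classify`, and `accuracy`, and the tiny email lists are all assumptions made up for this example.

```python
from collections import Counter

def train(labeled_emails):
    """Experience E: count how often each word appears in spam and in ham."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in labeled_emails:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Task T: label an email as 'spam' or 'ham' by comparing word counts."""
    words = text.lower().split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

def accuracy(counts, labeled_emails):
    """Performance P: fraction of emails classified correctly."""
    correct = sum(classify(counts, t) == y for t, y in labeled_emails)
    return correct / len(labeled_emails)

# Hypothetical labeled emails (the experience) and a small held-out test set.
experience = [
    ("win a free prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("free money claim your prize", "spam"),
    ("lunch on monday with the team", "ham"),
]
test_set = [
    ("claim your free prize", "spam"),
    ("agenda for the team meeting", "ham"),
]

no_experience = train([])
with_experience = train(experience)
print(accuracy(no_experience, test_set))    # performance before any experience
print(accuracy(with_experience, test_set))  # performance after seeing labeled emails
```

In this sketch, performance P on task T is higher after experience E than before it, which is what Mitchell's definition calls learning.
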
Why Use Machine Learning?
Consider how you would write a spam filter using:
Traditional programming techniques

Figure 1-1. The traditional approach

Why Use Machine Learning?
Consider how you would write a spam filter using:
Machine learning techniques

Figure 1-2. The Machine Learning approach

Figure 1-3. Automatically adapting to change

Figure 1-4. Machine learning can help humans learn

Examples of Applications

Applications in banking / finance

Examples of Applications
• Analyzing images of products on a production line to automatically classify them
• Detecting tumors in brain scans
• Automatically classifying news articles
• Automatically flagging offensive comments on discussion forums
• Summarizing long documents automatically
• Creating a chatbot or a personal assistant
• Forecasting your company’s revenue next year, based on many performance metrics
• Making your app react to voice commands
• Detecting credit card fraud
• Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment
• Representing a complex, high-dimensional dataset in a clear and insightful diagram
• Recommending a product that a client may be interested in, based on past purchases
• Building an intelligent bot for a game

How does Machine Learning work?

Machine Learning Process
Source: Ng, Frederick, Runqing Jiang, and James CL Chow. "Predicting radiation treatment planning evaluation parameter using artificial intelligence and machine learning." IOP SciNotes 1.1 (2020): 014003.

CRISP-DM Methodology

Figure: A diagram of the CRISP-DM process, which shows the six key phases and indicates the important relationships between them [Wirth and Hipp, 2000].

Types of Machine Learning

Machine learning systems can be categorized based on:
Supervision during training:
o Supervised: Learn from labeled data with input-output pairs.
o Unsupervised: Discover patterns and structures without labeled data.
o Semi-supervised: A mix of labeled and unlabeled data.
o Self-supervised: Generate labels from the data itself.
o Reinforcement learning: Learn through interaction with an environment, receiving rewards or penalties.
Learning approach:
o Online learning: Incremental learning on new data as it arrives.
o Batch learning: Train on a fixed dataset.
Learning strategy:
o Instance-based (lazy): Compare new data directly to known data points.
o Model-based (eager): Detect patterns in training data to build predictive models.

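The difference between batch and online learning can be sketched with a one-parameter model y ≈ w·x. This is only an illustration under assumptions: the data, the learning rate `lr`, and the function names are invented, and stochastic gradient descent is just one possible way to learn incrementally.

```python
def batch_fit(xs, ys):
    """Batch learning: use the whole fixed dataset at once (least squares, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def online_update(w, x, y, lr=0.05):
    """Online learning: adjust the weight using a single new example (one SGD step)."""
    error = w * x - y
    return w - lr * error * x

# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_batch = batch_fit(xs, ys)

w_online = 0.0
for _ in range(50):                  # replay the stream to simulate data arriving over time
    for x, y in zip(xs, ys):
        w_online = online_update(w_online, x, y)

print(w_batch, w_online)
```

Both approaches end up near w = 2 here. The online version never needs the full dataset in memory, which is the point of incremental learning.
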
Types of Machine Learning

Main ML algorithms:
Supervised
Unsupervised
Semi-supervised
Reinforcement Learning

SUPERVISED LEARNING

Supervised learning

The training set you feed to the algorithm includes the desired solutions, called labels.

Typical tasks:
Classification
Regression

Supervised learning: classification
The spam filter:
Train: sample emails with their class / target (spam or ham)
Goal: learn how to classify new emails.

Figure 1-5. A labeled training set for spam classification


Supervised learning: regression
Predicting the price of a car based on its features
like mileage, age, and brand.
Train: examples of cars including (features, price)

Figure 1-6. A regression problem: predict a value, given an input feature (usually multiple input features, and sometimes multiple output values)
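
As a minimal sketch of the car-price regression task, the code below fits a straight line to (mileage, price) pairs with ordinary least squares. The numbers are hypothetical, only one feature (mileage) is used, and `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training examples: mileage in thousands of km, price in thousands of USD.
mileage = [20.0, 40.0, 60.0, 80.0]
price = [18.0, 16.0, 14.0, 12.0]

slope, intercept = fit_line(mileage, price)
predicted = slope * 50.0 + intercept   # predicted price for a car with 50,000 km
print(slope, intercept, predicted)
```

The model learns that each additional thousand kilometres lowers the predicted price, and it outputs a continuous value, which is what distinguishes regression from classification.
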
Supervised learning
Data: D = {D1, D2, …, Dn}, a set of n samples
where 𝐷𝑖 = ⟨𝐗𝐢, 𝑦𝑖⟩
𝐗𝐢 is an input matrix
𝑦𝑖 is the desired output

Objective:
learn the mapping 𝑓: 𝑿 → 𝒚
such that 𝑦𝑖 ≈ 𝑓(𝐗𝐢) for all i = 1, …, n

Classification: 𝑦 is discrete
Regression: 𝑦 is continuous

Classification

Types of classification problems:
Binary classification
o Only two classes; each sample has exactly one label
Multi-class classification
o More than two classes; each sample has exactly one label
Multi-label classification
o One sample can have multiple class labels
Image segmentation
o Traditionally treated as a clustering problem
o More recently treated as a pixel-wise classification problem

Supervised learning: algorithms

k-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines (SVMs)
Decision Trees and Random Forests
Neural networks
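
To give a feel for the first algorithm in the list, here is a minimal k-Nearest Neighbors classifier in plain Python. It is a sketch under assumptions: the 2-D points and the class names "A" and "B" are invented, and Euclidean distance with majority voting is one common choice among several.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training set: (features, label).
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.2)))
```

Note that there is no training step: the algorithm simply stores the examples and compares new points to them, which is why k-NN is called instance-based (lazy) learning.
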

Scenario: You’re running a company, and you want to develop learning algorithms to address each of two problems.

Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.

Problem 2: You’d like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised.

Should you treat these as classification or as regression problems?

A. Treat both as classification problems.
B. Treat problem 1 as a classification problem, problem 2 as a regression problem.
C. Treat problem 1 as a regression problem, problem 2 as a classification problem.
D. Treat both as regression problems.

UNSUPERVISED LEARNING

Unsupervised Learning
❖ Dataset: unlabeled examples 𝑥(1), …, 𝑥(𝑛)
❖ Goal: to find interesting structures in the data
❖ Typical tasks:
   ❖ Clustering
   ❖ Dimensionality reduction
   ❖ Anomaly detection

Unsupervised Learning: Clustering

Figure 1-7. An unlabeled training set for unsupervised learning

Figure 1-8. Clustering

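One way to see how clustering works without labels is a small K-Means sketch (Lloyd's algorithm). The points, the two starting centroids, and the fixed iteration count are assumptions chosen for illustration; real implementations usually pick starting centroids randomly and stop when assignments no longer change.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled data forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

No labels are given to the algorithm; the two groups emerge only from the distances between points.
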
Unsupervised Learning: Dimensionality reduction

Figure 1-9. Example of a t-SNE visualization highlighting semantic clusters


Unsupervised Learning: Anomaly detection

Figure 1-10. Anomaly detection
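
A very simple form of anomaly detection flags values that lie far from the mean. The sketch below uses a z-score rule with a threshold of 2 standard deviations; the transaction amounts and the threshold are assumptions for illustration, and this rule is a stand-in for the dedicated algorithms listed later (One-class SVM, Isolation Forest).

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Hypothetical transaction amounts; one is clearly unusual.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 500]
print(find_anomalies(amounts))
```

The outlier itself inflates the mean and standard deviation, which is one reason more robust methods are preferred on real data.
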


Unsupervised Learning: Algorithms
Clustering
o K-Means
o DBSCAN
o Hierarchical Cluster Analysis (HCA)
Anomaly detection and novelty detection
o One-class SVM
o Isolation Forest
Dimensionality reduction
o Principal Component Analysis (PCA)
o Kernel PCA
o Locally Linear Embedding (LLE)
Association rule learning
o Apriori
o Eclat

SEMI-SUPERVISED LEARNING

Semi-supervised learning

Figure 1-11. Semisupervised learning with two classes (triangles and squares): the unlabeled examples (circles) help classify a new instance (the cross) into the triangle class rather than the square class, even though it is closer to the labeled squares.

SELF-SUPERVISED LEARNING
Goal: generate a fully labeled dataset from a fully unlabeled one.

Self-supervised learning

Figure 1-12. Self-supervised learning example: input (left) and target (right)
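
The idea of generating labels from the data itself can be sketched with text instead of images: hide one word of a sentence and use the hidden word as the target. The `[MASK]` token, the function name, and the example sentence are assumptions for illustration only.

```python
def make_masked_pairs(sentences, mask_token="[MASK]"):
    """Turn unlabeled sentences into (input, target) pairs by hiding one word at a time."""
    pairs = []
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            masked = words[:i] + [mask_token] + words[i + 1:]
            pairs.append((" ".join(masked), word))
    return pairs

# A hypothetical unlabeled sentence becomes several labeled training examples.
pairs = make_masked_pairs(["machine learning is fun"])
for masked_input, target in pairs:
    print(masked_input, "->", target)
```

No human labeling is needed: each target comes from the original data, which mirrors the input (left) and target (right) pair in the figure.
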
REINFORCEMENT LEARNING
• Reinforcement learning is a very different beast.
• The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards).
• It learns by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

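The agent, action, reward, and policy vocabulary can be made concrete with a tiny example. The sketch below uses tabular Q-learning, one possible reinforcement learning algorithm that the slides do not name, on a made-up corridor of five cells where only reaching the last cell gives a reward. The environment, the constants, and the purely random exploration are all assumptions for illustration.

```python
import random

N_STATES = 5               # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)         # move left or move right
ALPHA, GAMMA = 0.5, 0.9    # learning rate and discount factor (illustrative values)

def step(state, action):
    """Environment: move within the corridor; reward 1 only when the goal is reached."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def q_learning(episodes=500, seed=0):
    """Agent: learn action values Q[state][action] from interaction alone."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = rng.randrange(2)                     # explore with random actions
            next_state, reward = step(state, ACTIONS[a])
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q = q_learning()
# Policy: in each non-terminal cell, choose the action with the higher learned value.
policy = ["left" if left > right else "right" for left, right in q[:-1]]
print(policy)
```

The agent is never told that moving right is correct; the policy emerges from the rewards it collects over many episodes.
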
Reinforcement learning

Figure 1-13. Reinforcement learning


Main Challenges of Machine Learning

1. Insufficient Quantity of Training Data
2. Nonrepresentative Training Data
3. Poor-Quality Data
4. Irrelevant Features
5. Overfitting or Underfitting the Training Data

Main Challenges of Machine Learning

1. Insufficient Quantity of Training Data

Figure 1-21. The importance of data versus algorithms


Main Challenges of Machine Learning

2. Nonrepresentative Training Data


For example, the set of countries you used earlier for training the linear model was not perfectly representative; it did not contain any country with a GDP per capita lower than $23,500 or higher than $62,500.

Figure 1-22. A more representative training sample

Main Challenges of Machine Learning

3. Poor-Quality Data
If the training data is full of errors, outliers, and noise, your system is less likely to perform well.
It is often worth spending a lot of time cleaning up your training data.

Solutions:
• If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.
• If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it.

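Filling in missing values with the median, one of the options above, can be sketched as follows. The customer records, the use of `None` to mark a missing age, and the helper name are assumptions for illustration.

```python
import statistics

def fill_missing_with_median(rows, key):
    """Return copies of the rows with missing `key` values (None) replaced by the median."""
    known = [row[key] for row in rows if row[key] is not None]
    median = statistics.median(known)
    return [{**row, key: median if row[key] is None else row[key]} for row in rows]

# Hypothetical customer records; customer 2 did not specify an age.
customers = [
    {"id": 1, "age": 25},
    {"id": 2, "age": None},
    {"id": 3, "age": 40},
    {"id": 4, "age": 31},
]

filled = fill_missing_with_median(customers, "age")
print(filled)
```

The median is computed from the known ages only, and the original records are left unchanged so the alternative strategies can still be tried.
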
Main Challenges of Machine Learning

4. Irrelevant Features
Your system will only be capable of learning if the training data contains enough
relevant features and not too many irrelevant ones.

Solution:
Feature engineering:
• Feature selection
• Feature extraction
• Creating new features by gathering new data

-> spend a lot of time cleaning up your training data

Main Challenges of Machine Learning

5. Overfitting or Underfitting

[Figure: training error and test error plotted against model complexity, from too simple (underfitting), through a good model, to too complex (overfitting)]

Main Challenges of Machine Learning

Underfitting
Some solutions:
• Select a more powerful model, with more parameters.
• Feed better features to the learning algorithm (feature engineering).
• Reduce the constraints on the model (for example, by reducing the regularization hyperparameter).

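The effect of the regularization hyperparameter can be shown with a one-parameter model y ≈ w·x. The sketch assumes an L2 (ridge) penalty, which is one common form of regularization; the name `alpha` and the data are illustrative.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)^2) + alpha * w^2 (closed-form solution)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for alpha in (0.0, 14.0, 126.0):
    print(alpha, ridge_slope(xs, ys, alpha))
```

With `alpha = 0` the model recovers the true slope; larger values constrain the model and pull the slope toward zero. Too much constraint leads to underfitting, so reducing `alpha` is one way to fix it.
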
Main Challenges of Machine Learning
Overfitting
Some solutions:
• Simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model.
• Gather more training data.
• Reduce the noise in the training data (e.g., fix data errors and remove outliers).
• Use a validation set (or dev set) to detect overfitting.

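A validation set reveals overfitting by comparing the error on the training data with the error on held-out data. The sketch below contrasts a simple model (a fitted line) with a very flexible one that memorizes the training points. All data is hypothetical, and the validation targets are kept noise-free only to make the toy example easy to read.

```python
def fit_line(xs, ys):
    """Simple model: least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(xs, ys):
    """Very flexible model: return the target of the closest training point."""
    return lambda x: min(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the trend is y = x; training targets carry noise of +/-2.
train_x, train_y = [0, 2, 4, 6, 8, 10], [2, 0, 6, 4, 10, 8]
# Held-out validation set that follows the underlying trend.
val_x, val_y = [0.5, 2.5, 4.5, 6.5, 8.5], [0.5, 2.5, 4.5, 6.5, 8.5]

for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train_x, train_y)
    print(name, mse(model, train_x, train_y), mse(model, val_x, val_y))
```

The memorizer has zero training error but a much larger validation error than the line: it has fit the noise. Only the validation error exposes this.
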
Main Challenges of Machine Learning

Learning algorithm
Under what conditions will the chosen algorithm converge?
For a given application/domain and a given objective function, which algorithm performs best?
No-free-lunch theorem [Wolpert and Macready, 1997]: if an algorithm performs well on a certain class of problems, then it necessarily pays for that with degraded performance on the set of all remaining problems.

There is no one model that works best for every situation.
In other words: no algorithm is the “best” algorithm for every problem.


Main Challenges of Machine Learning

Training data
How many observations are enough for learning?
Does the size of the training set affect the performance of an ML system?
What is the effect of corrupted or noisy observations?

Main Challenges of Machine Learning

Learnability:
How good is the learning algorithm, and what are its limits?
How well does the system generalize?
o Predict new observations well, not only the training data.
o Avoid overfitting.

Overfitting: example
Increasing the size of a decision tree can degrade prediction on unseen data, even though it increases accuracy on the training data.

[Mitchell, 1997]

Ethical and Social Considerations

• Bias and Fairness
• Privacy Concerns
• Accountability and Transparency

Examples: Amazon’s hiring algorithm, COMPAS Recidivism Algorithm


Summary
Definition of machine learning
Types of machine learning
Supervised Learning
o Data: labeled
o Task: regression, classification, …
Unsupervised Learning
o Data: unlabeled
o Task: clustering, dimensionality reduction, …
Reinforcement Learning
Machine learning process
Ethical and Social Considerations
Emerging Trends and Challenges

Review Questions

Scenario 1: Credit Card Fraud Detection

You work for a financial institution and are tasked with developing a credit card fraud detection system. The dataset you have includes past credit card transactions, each labeled as either "fraudulent" or "legitimate." Your goal is to build a machine learning model that can accurately predict whether a new transaction is fraudulent or not.

Which type of machine learning approach should you use for this task?
A) Supervised Learning
B) Unsupervised Learning
C) Reinforcement Learning
D) Semi-Supervised Learning

Scenario 2: Customer Segmentation
You are a marketing analyst working for a retail company. Your team wants to group customers into different segments based on their purchasing behavior. You have a large dataset containing customer purchase history, but it does not have any pre-defined labels for segments.

Which type of machine learning approach should you use to perform customer segmentation?

A) Supervised Learning
B) Unsupervised Learning
C) Reinforcement Learning
D) Semi-Supervised Learning

Scenario 3: Anomaly Detection
As a cybersecurity analyst, your job is to detect network intrusions and malicious activities. You have access to log data from various network devices and systems. Your objective is to identify abnormal patterns that might indicate a potential security breach.

What type of machine learning approach should you use to detect anomalies in the network data?

A) Supervised Learning
B) Unsupervised Learning
C) Reinforcement Learning
D) Semi-Supervised Learning

Thank you!
