MLA TAB Lecture2
Numerical data
Tabular Raw Data
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
# LabelEncoder: encode a single categorical column (df is a pandas DataFrame)
le = LabelEncoder()
df['color'] = le.fit_transform(df['color'])
# OrdinalEncoder: encode several categorical columns at once
oe = OrdinalEncoder()
df[['color','size','classlabel']] = oe.fit_transform(df[['color','size','classlabel']])
Example: The following two sentences have similar meaning but may
seem quite different to a text classifier (e.g. a sentiment detector):
• “The countess (Rebecca) considers the boy to be quite naïve.”
• “countess rebecca considers boy naive”
Tokenization
• Tokenization: Splits text into small parts by white space and
punctuation.
Example:
Sentence: “I don’t like eggs.”
Tokens: “I”, “do”, “n’t”, “like”, “eggs”, “.”
Tokens will be used for further cleaning and vectorization.
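As a quick illustration, a minimal sketch using NLTK's word tokenizer (NLTK is one common choice, not the only one; assumes the nltk package is installed):

import nltk
nltk.download('punkt')   # tokenizer models (recent NLTK versions may also need 'punkt_tab')
from nltk.tokenize import word_tokenize

print(word_tokenize("I don't like eggs."))
# ['I', 'do', "n't", 'like', 'eggs', '.']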
Stop Words Removal
• Stop Words: Words that appear frequently in texts but contribute little to the overall meaning.
Common stop words: “a”, “the”, “so”, “is”, “it”, “at”, “in”, “this”, “there”, “that”,
“my”, “by”, “nor”
Example:
Is this a good list of stop words for a binary text classification of product reviews (positive or negative review)?
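A minimal sketch of stop word removal over a token list, using just the stop word list from this slide:

stop_words = {"a", "the", "so", "is", "it", "at", "in", "this", "there",
              "that", "my", "by", "nor"}
tokens = ["this", "is", "a", "great", "product", "."]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # ['great', 'product', '.']

For sentiment tasks, words like “nor” (and negations in general) can carry useful signal, so a list like this one may be too aggressive — which is the point of the question above.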
Stemming
• Stemming: A set of rules to slice a string down to a substring that usually carries a more general meaning.
The goal is to remove word affixes (particularly suffixes) such as “s”, “es”, “ing”, “ed”, etc.
o “playing” → “play”
o “played” → “play”
o “plays” → “play”
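A minimal sketch with NLTK's PorterStemmer (one common stemmer; assumes nltk is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["playing", "played", "plays"]])
# ['play', 'play', 'play']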
Bag of Words in sklearn
[Bag-of-Words example: the sentence “It is a dog.” is encoded as a count vector over the vocabulary, e.g. 1 0 1 1 1 0 0 0 0]
CountVectorizer: sklearn text vectorizer, converts a collection of text documents to a matrix of token counts - .fit(), .transform()
from sklearn.feature_extraction.text import CountVectorizer
countVectorizer = CountVectorizer()
X = countVectorizer.fit_transform(sentences)  # sentences: a list of text documents
print(X.toarray())
TfidfVectorizer: sklearn text vectorizer, converts a collection of text
documents to a matrix of TF-IDF features - .fit(), .transform()
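As a quick illustration, a minimal TfidfVectorizer sketch (the sentences below are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["It is a dog.", "It is not a dog.", "I don't like eggs."]
tfidfVectorizer = TfidfVectorizer()
X = tfidfVectorizer.fit_transform(sentences)      # sparse matrix of TF-IDF features
print(tfidfVectorizer.get_feature_names_out())    # learned vocabulary (sklearn >= 1.0)
print(X.toarray())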
MLA-TAB-Lecture2-Text-Processing.ipynb
Tree-based Models
Problem: Package Delivery Prediction
• Given this dataset, let’s predict whether the package arrives on time (‘ontime’).

Weather   Demand   Address      ontime
Sunny     High     Correct      No
Sunny     High     Misspelled   No
How to Learn a Decision Tree?
• Root Node: the start node of the tree (here ‘Weather’, branching into Sunny / Rainy).
• Top-down approach: Grow the tree from root node to leaf nodes.
*ID3 (Iterative Dichotomiser 3)
Decision Trees: Numerical Example
• Given this dataset, let’s predict the y class (1 vs. 2) using a Decision Tree.
• Iteratively split the dataset into subsets from a root node, such that the leaf nodes contain mostly one class (as pure as possible).

x1    x2    y
3.5   2     1
5     2.5   2
1     3     1
2     4     1
4     2     1
6     6     2
2     9     2
4     9     2
5     4     1
3     8     2
Decision Trees: Numerical Example
[Scatter plot of the dataset: x1 on the horizontal axis, x2 on the vertical axis; Class 1 and Class 2 points (same table as above).]
Decision Trees: Numerical Example
[Scatter plot of the dataset. Before any split, the root node contains all samples: Class: 1, 2.]
Decision Trees: Numerical Example
[Scatter plot with the first split drawn at x2 = 5.]
Tree so far:
x2 ≤ 5?
  Yes → subset still mixed (Class: 1, 2)
  No → leaf: Class = 2
What feature (x1 or x2) to use to split this subset, to best separate class 1 from class 2?
Decision Trees: Numerical Example
[Scatter plot with splits drawn at x2 = 5 and x1 = 4.5.]
Tree so far:
x2 ≤ 5?
  Yes → x1 ≤ 4.5?
    Yes → leaf: Class = 1
    No → subset still mixed (Class: 1, 2)
  No → leaf: Class = 2
What feature (x1 or x2) to use to split this subset, to best separate class 1 from class 2?
Decision Trees: Numerical Example
[Scatter plot with splits drawn at x2 = 5, x1 = 4.5 and x2 = 3.]
Final tree:
x2 ≤ 5?
  Yes → x1 ≤ 4.5?
    Yes → leaf: Class = 1
    No → x2 ≤ 3?
      Yes → leaf: Class = 2
      No → leaf: Class = 1
  No → leaf: Class = 2
Decision Trees: Example
Class: Yes, No — the full dataset contains [9+, 5-] (9 on-time, 5 not on-time). Sample rows:

Weather    Demand   Address      ontime
Sunny      High     Correct      No
Sunny      High     Misspelled   No
Overcast   High     Correct      Yes
Rainy      High     Correct      Yes
Overcast   High     Misspelled   Yes

What feature (‘Weather’, ‘Demand’ or ‘Address’) to use to split the dataset at the root node?

Candidate splits and the resulting subsets:
• Weather: Sunny [2+, 3-] (not too sure), Overcast [4+, 0-] (absolutely sure), Rainy [3+, 2-] (not too sure)
• Demand: High [3+, 4-] (not too sure), Normal [6+, 1-] (somewhat sure)
• Address: Correct [6+, 2-] (somewhat sure), Misspelled [3+, 3-] (not too sure)
How to Measure Uncertainty
[Figure: the ‘Weather’ split — Sunny [2+, 3-] (not too sure), Overcast [4+, 0-] (absolutely sure), Rainy [3+, 2-] (not too sure).]
Calculating Gini Impurity
[Figure: the same ‘Weather’ split — Sunny [2+, 3-], Overcast [4+, 0-], Rainy [3+, 2-] — used to compute the Gini impurity of each subset.]
Demand: High [3+, 4-], Normal [6+, 1-]; Address: Correct [6+, 2-], Misspelled [3+, 3-].
“Weather” has the highest gain of all, so we start the tree with the “Weather” feature as the root node!
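As a concrete check, a small sketch that computes the Gini impurity of the parent node [9+, 5-] and the Gini-based gain of the ‘Weather’ split, using the subset counts above (the ID3 Information Gain uses entropy in the same way):

def gini(pos, neg):
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

parent = gini(9, 5)                                              # ≈ 0.459
children = [(2, 3), (4, 0), (3, 2)]                              # Sunny, Overcast, Rainy
weighted = sum((p + n) / 14 * gini(p, n) for p, n in children)   # ≈ 0.343
print(parent - weighted)                                         # gain of the 'Weather' split ≈ 0.116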
Recap ID3 Algorithm
ID3 Algorithm*: Repeat
1. Select “the best feature” to split using Information Gain.
2. Separate the training samples according to the selected feature.
3. Stop if the samples are from a single class or if all features have been used, and note a leaf node.
4. Assign the leaf node the majority class of the samples in it.
[Diagram: the root node splits on the feature with the highest Information Gain (branches A, B, C); branch B ends in a Class 1 leaf; the nodes under A and C split again on the feature with the highest Information Gain given A / given C (branches D, E and F, G), each branch ending in a class leaf.]
*To build a Decision Tree Regressor: in step 1, replace Information Gain with Standard Deviation Reduction; in step 3, stop when the numerical values are homogeneous (standard deviation is zero) or all features have been used, and note a leaf node; in step 4, assign the leaf node the average value of its samples.
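A minimal sketch of the Standard Deviation Reduction criterion mentioned in the footnote (the target values below are made up for illustration):

import numpy as np

def std_reduction(parent_values, subsets):
    # parent standard deviation minus the size-weighted standard deviation of the subsets
    n = len(parent_values)
    weighted = sum(len(s) / n * np.std(s) for s in subsets)
    return np.std(parent_values) - weighted

y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
print(std_reduction(y, [y[:3], y[3:]]))   # large reduction: each subset is nearly homogeneous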
Decision Trees in sklearn
DecisionTreeClassifier: sklearn Decision Tree classifier (there is also a
Regressor version) - .fit(), .predict()
DecisionTreeClassifier(criterion='gini',
                       max_depth=None, min_samples_split=2,
                       min_samples_leaf=1, class_weight=None)
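A minimal sketch fitting a DecisionTreeClassifier on the small x1/x2 dataset from the numerical example above (the thresholds sklearn learns may differ slightly from the hand-built tree):

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[3.5, 2], [5, 2.5], [1, 3], [2, 4], [4, 2],
     [6, 6], [2, 9], [4, 9], [5, 4], [3, 8]]
y = [1, 2, 1, 1, 1, 2, 2, 2, 1, 2]

tree = DecisionTreeClassifier(criterion='gini', max_depth=3)
tree.fit(X, y)
print(export_text(tree, feature_names=['x1', 'x2']))   # text rendering of the learned splits
print(tree.predict([[5, 2.5]]))                        # predict the class of a single point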
Bagging and Random Forests in sklearn
[Figure: an ensemble of weak models (Weak Model 1, Weak Model 2, …) trained on bootstrap samples and combined by bagging.]
RandomForestClassifier(n_estimators=100,
                       max_samples=None, max_features='auto',
                       criterion='gini', max_depth=None, min_samples_split=2,
                       min_samples_leaf=1, class_weight=None)
BaggingClassifier(base_estimator=None, n_estimators=10,
max_samples=1.0, bootstrap=True)
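A minimal sketch of both ensembles on the same toy data (non-default hyperparameters are omitted; BaggingClassifier uses a Decision Tree as its default base estimator):

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

X = [[3.5, 2], [5, 2.5], [1, 3], [2, 4], [4, 2],
     [6, 6], [2, 9], [4, 9], [5, 4], [3, 8]]
y = [1, 2, 1, 1, 1, 2, 2, 2, 1, 2]

rf = RandomForestClassifier(n_estimators=100, criterion='gini')
rf.fit(X, y)
print(rf.predict([[5, 2.5]]))

bag = BaggingClassifier(n_estimators=10, max_samples=1.0, bootstrap=True)
bag.fit(X, y)
print(bag.predict([[5, 2.5]]))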
Grid Search in sklearn
[Figure: a grid over Hyperparameter 1 × Hyperparameter 2 values; every combination is evaluated.]
GridSearchCV(estimator, param_grid, scoring=None)
param_grid = {'max_depth': [5, 10, 50, 100, 250],
              'min_samples_leaf': [15, 20, 25, 30, 35]}
Total hyperparameter combinations: 5 x 5 = 25
[5, 15], [5, 20], [5, 25], [10, 15], …
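A minimal sketch wiring this param_grid into GridSearchCV around a Decision Tree (scoring, cv and the X_train/y_train names are illustrative assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [5, 10, 50, 100, 250],
              'min_samples_leaf': [15, 20, 25, 30, 35]}

grid = GridSearchCV(DecisionTreeClassifier(), param_grid,
                    scoring='accuracy', cv=5)   # 25 combinations, each cross-validated
grid.fit(X_train, y_train)                      # X_train, y_train assumed to be defined
print(grid.best_params_, grid.best_score_)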
Randomized Search in sklearn
RandomizedSearchCV: randomized search on hyperparameters
Chooses a fixed number (given by parameter n_iter) of random combinations of
hyperparameter values and only tries those.
Can sample from distributions (sampling with replacement is used), if at least one
parameter is given as a distribution.
[Figure: random combinations sampled over Hyperparameter 1 × Hyperparameter 2 values.]
RandomizedSearchCV(estimator, param_distributions,
n_iter=10, scoring=None)
# pipeline: an sklearn Pipeline (e.g. text vectorizer + classifier) defined earlier
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
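Putting the search and the pipeline together, a minimal sketch of RandomizedSearchCV drawing values from distributions over a text-vectorizer + tree pipeline (the pipeline steps, ranges and X_train/X_test names are illustrative assumptions):

from scipy.stats import randint
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('tree', DecisionTreeClassifier())])
param_distributions = {'tree__max_depth': randint(5, 250),
                       'tree__min_samples_leaf': randint(15, 35)}

search = RandomizedSearchCV(pipeline, param_distributions, n_iter=10, scoring='accuracy')
search.fit(X_train, y_train)            # X_train: raw text documents, y_train: labels (assumed)
predictions = search.predict(X_test)    # uses the best found pipeline to predict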
Putting it all together
• In this notebook, we continue to work with our review dataset to predict
the target field
• The notebook covers the following tasks:
o Exploratory Data Analysis
o Splitting the dataset into training and test sets
o Categorical encoding and text vectorization
o Training a Decision Tree Classifier and hyperparameter tuning
o Checking the performance metrics on the test set
MLA-TAB-Lecture2-Trees.ipynb
AWS SageMaker
AWS SageMaker: Train and Deploy
SageMaker is an AWS service to easily build, train, tune and deploy ML
models: https://aws.amazon.com/sagemaker/
MLA-TAB-Lecture2-SageMaker.ipynb
AWS SageMaker GroundTruth
SageMaker GroundTruth: Data Labeling
• Machine learning can be applied in many different areas, so we usually deal with many different types of labels.
• We will use the SageMaker GroundTruth tool to label some sample data.
• GroundTruth allows users to create labeling tasks and assign them to
internal team members or outsource them.
SageMaker GroundTruth: Text Tasks
SageMaker GroundTruth: Image Tasks
SageMaker GroundTruth: Demo
Assume we will label these 5 images from our final project.
There are two classes: Software and Video game