C2 - W2 MLOps
DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.
Welcome
Feature Engineering
Introduction to Preprocessing
“Coming up with features is difficult,
time-consuming, and requires expert knowledge.
Applied machine learning often requires careful
engineering of the features and dataset.”
— Andrew Ng
Outline
[Diagram: iterative process: Feature Engineering, Make New Features, Launch and Tune Objective Function, Reiterate]
Typical ML pipeline
[Diagram: Feature Engineering is applied via batch processing during training and real-time processing during serving]
Key points
● Feature engineering can be difficult and time-consuming, but is also very
important to success
● Squeezing the most out of data through feature engineering enables
models to learn better
● Concentrating predictive information in fewer features enables more
efficient use of compute resources
● Feature engineering during training must also be applied correctly
during serving
Feature Engineering
Preprocessing Operations
Outline
Feature engineering is the process of creating features from raw data; raw data doesn't come to us as feature vectors.

Raw data:
0: {
    house_info: {
        num_rooms: 6
        num_bedrooms: 3
        street_name: "Shorebird Way"
        num_basement_rooms: -1
        ...
    }
}

→ Feature Engineering →

Feature vector:
[6.0, 1.0, 0.0, 0.0, 9.321, -2.20, 1.01, 0.0, ...]
Mapping categorical values
Street names (raw data): {'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'} → feature vector

vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key=feature_name,
    vocabulary_list=['Charleston Road', 'North Shoreline Boulevard',
                     'Shorebird Way', 'Rengstorff Avenue'])

vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_file(
    key=feature_name,
    vocabulary_file="product_class.txt",
    vocabulary_size=3)
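A sketch (not from the slide) of how such a categorical column is typically fed to a model, either as a one-hot indicator column or, for large vocabularies, as a learned embedding column:

# One-hot encode the categorical column for use in a model
one_hot_street = tf.feature_column.indicator_column(vocabulary_feature_column)

# Or learn a dense embedding (the dimension of 8 is an illustrative choice)
street_embedding = tf.feature_column.embedding_column(vocabulary_feature_column, dimension=8)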
Empirical knowledge of data
Feature Engineering Techniques
Outline
● Feature Scaling
● Bucketizing / Binning
● Other techniques
Feature engineering techniques
● Numerical range: scaling, normalizing, standardizing
● Grouping: bucketizing, bag of words
Scaling
● Converts values from their natural range into a prescribed range
  ○ E.g. grayscale image pixel intensities are in [0, 255] and are usually rescaled to [-1, 1]:
    image = (image - 127.5) / 127.5
● Benefits
  ○ Helps neural nets converge faster
  ○ Helps avoid NaN errors during training
  ○ Helps the model learn appropriate weights for each feature
Normalization
X_norm = (X - X_min) / (X_max - X_min)
Maps values from their original range (e.g. [10, 10000]) into [0, 1].
[Plots: original vs. normalized distributions]
Standardization (z-score)
X_std = (X - μ) / σ
Centers values around 0 with unit variance; most values fall within about ±3σ.
[Plots: original (e.g. range [10, 10000]) vs. standardized distributions]
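A minimal NumPy sketch of both rescalings; the sample values are illustrative:

import numpy as np

x = np.array([10., 250., 4000., 10000.])

# Normalization (min-max): maps the original range into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance
x_std = (x - x.mean()) / x.std()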
Bucketizing / Binning
Feature crossing
TensorFlow embedding projector
● Intuitive exploration of
high-dimensional data
● Techniques
○ PCA
○ t-SNE
○ UMAP
● Ready to play
@ projector.tensorflow.org
Key points
● Feature engineering:
○ Prepares, tunes, transforms, extracts and constructs features.
Feature Crosses
Outline
● Feature crosses
● Encoding features
Feature crosses
[Scatter plot: healthy trees vs. sick trees with a classification boundary]
Need for encoding non-linearity
[Scatter plot: healthy trees vs. sick trees where a single linear classification boundary cannot separate the classes]
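As a sketch of one way to encode a feature cross with TensorFlow feature columns (the feature names, bucket boundaries, and hash size are illustrative, not from the slide):

import tensorflow as tf

# Bucketize two numeric features, then cross the buckets so the model can
# learn weights for combinations of the two features (a non-linear boundary).
latitude = tf.feature_column.numeric_column('latitude')
longitude = tf.feature_column.numeric_column('longitude')
lat_buckets = tf.feature_column.bucketized_column(latitude, boundaries=[-1.0, 0.0, 1.0])
lon_buckets = tf.feature_column.bucketized_column(longitude, boundaries=[-1.0, 0.0, 1.0])

lat_x_lon = tf.feature_column.crossed_column([lat_buckets, lon_buckets], hash_bucket_size=100)
cross_feature = tf.feature_column.indicator_column(lat_x_lon)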
Census dataset
Key points
Preprocessing Data at Scale
Probably not ideal: feature engineering implemented in Python for training but re-implemented in Java for serving
ML Pipeline: TensorFlow Extended (TFX)
[Diagram: training & eval data → ExampleGen → StatisticsGen → SchemaGen → ExampleValidator → Transform → Trainer → Evaluator → Pusher]
● Consistent transforms between training & serving
● Preprocessing granularity
● Real-world models: terabytes of data, requiring large-scale data processing frameworks
Inconsistencies in feature engineering
Transformations
● Instance-level: clipping, etc.
● Full-pass: min-max scaling, etc.
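A sketch of the distinction inside a tf.Transform preprocessing_fn (tf.Transform is introduced later in this module; the feature name and clip range are illustrative):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    return {
        # Instance-level: needs only the current example
        'x_clipped': tf.clip_by_value(inputs['x'], 0.0, 10.0),
        # Full-pass: needs min/max statistics computed over the entire dataset
        'x_scaled': tft.scale_by_min_max(inputs['x']),
    }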
When do you transform?
● Pre-processing the dataset before training: Pros: transformations run once. Cons: the same transformations must be reproduced at serving.
TensorFlow Transform
Outline
● Going deeper
● tf.Transform Analyzers
Enter tf.Transform
[Diagram: Input Data → Transform → Transformed Data → Trainer → Trained Models]
[Diagram: the TFX pipeline (ExampleGen → StatisticsGen → SchemaGen → ExampleValidator → Transform → Trainer → Evaluator → Pusher) operating on training & eval data]
● User-provided transform code (tf.Transform)
● Schema for parsing
● Applied during training
● Embedded during serving
● Performance optimizations
[Diagram: ExampleGen supplies Data and SchemaGen supplies the Schema; the Transform component combines Data, Schema, and the transform Code to produce a Transform Graph and Transformed Data, which feed the Trainer]
tf.Transform: Going deeper
[Diagram. Training: Raw Data → tf.Transform preprocessing (run on Beam) → Processed Data → Model Training → Trained Model; the tf.Transform TensorFlow graph is attached to the model's TensorFlow graph. Serving: a raw inference request goes to the SavedModel API, where the same tf.Transform TensorFlow graph runs ahead of the model graph to produce the Prediction.]
Benefits of using tf.Transform
[Diagram: the same transform graph is applied during both training and serving]

tf.Transform Analyzers
● Bucketizing: quantiles, apply_buckets, bucketize
● Vocabulary: bag_of_words, tfidf, ngrams
● Dimensionality reduction: pca
tf.Transform preprocessing_fn
def preprocessing_fn(inputs):
...
Hello World with tf.Transform

import tempfile, pprint
import tensorflow as tf
import apache_beam as beam
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema
Analyze
[Diagram: the tf.Transform Analyze pass over the dataset]

raw_data = [
    {'x': 1, 'y': 1, 's': 'hello'},
    {'x': 2, 'y': 2, 's': 'world'},
    {'x': 3, 'y': 3, 's': 'hello'}
]
Inspect data and prepare metadata (Data)
raw_data_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        'x': tf.io.FixedLenFeature([], tf.float32), 'y': tf.io.FixedLenFeature([], tf.float32),
        's': tf.io.FixedLenFeature([], tf.string)}))
The preprocessing transformations applied to each feature:
● x: [1, 2, 3] → x_centered = x - tft.mean(x)
● y: [1, 2, 3] → y_normalized = tft.scale_to_0_1(y) = [0.0, 0.5, 1.0]
● s: ['hello', 'world', 'hello'] → s_integerized = tft.compute_and_apply_vocabulary(s) = [0, 1, 0]
● x_centered * y_normalized = [-0.0, 0.0, 1.0]
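Putting the transformations above into code, the preprocessing_fn for this example looks like the following (this mirrors the standard tf.Transform "simple example"):

def preprocessing_fn(inputs):
    x, y, s = inputs['x'], inputs['y'], inputs['s']
    x_centered = x - tft.mean(x)                          # full-pass: mean over the dataset
    y_normalized = tft.scale_to_0_1(y)                    # full-pass: min/max over the dataset
    s_integerized = tft.compute_and_apply_vocabulary(s)   # full-pass: vocabulary over the dataset
    x_centered_times_y_normalized = x_centered * y_normalized  # instance-level
    return {
        'x_centered': x_centered,
        'y_normalized': y_normalized,
        's_integerized': s_integerized,
        'x_centered_times_y_normalized': x_centered_times_y_normalized,
    }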
Running the pipeline
def main():
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        transformed_dataset, transform_fn = (
            (raw_data, raw_data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
Running the pipeline
    transformed_data, transformed_metadata = transformed_dataset

    print('\nRaw data:\n{}\n'.format(pprint.pformat(raw_data)))
    print('Transformed data:\n{}'.format(pprint.pformat(transformed_data)))

if __name__ == '__main__':
    main()
Before transforming with tf.Transform
# Raw data:
[{'s': 'hello', 'x': 1, 'y': 1},
{'s': 'world', 'x': 2, 'y': 2},
{'s': 'hello', 'x': 3, 'y': 3}]
After transforming with tf.Transform
# After transform
[{'s_integerized': 0,
'x_centered': -1.0,
'x_centered_times_y_normalized': -0.0,
'y_normalized': 0.0},
{'s_integerized': 1,
'x_centered': 0.0,
'x_centered_times_y_normalized': 0.0,
'y_normalized': 0.5},
{'s_integerized': 0,
'x_centered': 1.0,
'x_centered_times_y_normalized': 1.0,
'y_normalized': 1.0}]
Key points
Feature Spaces
Outline
[Diagram: a feature vector (X0, X1, X2) plotted as a point in a 3D feature space, alongside a 2D scatter plot over X0 and X1]
Feature space
Example of a 3D feature space: No. of Rooms (X0), Area (X1), Locality (X2); target: Price (Y)
[Scatter plots over the x0 and x1 axes]
Feature space coverage
Feature Selection
Feature selection
● Identify features that best represent the relationship
● Remove features that don't influence the outcome
● Reduce the size of the feature space
● Reduce the resource requirements and model complexity
[Diagram: All Features → feature selection → Useful Features]
Why is feature selection needed?
[Diagram: feature selection methods split into Unsupervised and Supervised]
Unsupervised feature selection
1. Unsupervised
● Features-target variable relationship not considered
● Removes redundant features (correlation)
Supervised feature selection
2. Supervised
● Uses features-target variable relationship
● Selects those contributing the most
Supervised methods
● Filter methods
● Wrapper methods
● Embedded methods
Practical example
Metrics (sklearn.metrics):
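The notebook's own evaluation helper isn't shown on these slides; below is a hypothetical sketch of what evaluate_model_on_features (used later) might look like, with an illustrative classifier and metric set from sklearn.metrics:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_model_on_features(X, Y):
    # Train a classifier on the given feature subset and report metrics as one row
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, Y_tr)
    preds = model.predict(X_te)
    return pd.DataFrame([{
        'accuracy': accuracy_score(Y_te, preds),
        'precision': precision_score(Y_te, preds),
        'recall': recall_score(Y_te, preds),
        'roc_auc': roc_auc_score(Y_te, preds),
    }])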
Filter Methods
Filter methods
● Correlation
● Univariate feature selection
Filter methods
● Correlation of features:
  ○ With each other (bad)
  ○ With the target variable (good)
● Falls in the range [-1, 1]:
  ○ 1: high positive correlation
  ○ -1: high negative correlation
[Heatmap: correlation of features + target]
Feature comparison statistical tests
Other methods:
● Mutual information
● F-Test
● Chi-Squared test
Determine correlation

import matplotlib.pyplot as plt
import seaborn as sns

# Pearson's correlation by default
cor = df.corr()

# Plot the correlation matrix as a heatmap
plt.figure(figsize=(20, 20))
sns.heatmap(cor, annot=True, cmap=plt.cm.PuBu)
plt.show()

[Heatmap: correlation of features + target]
Selecting features
cor_target = abs(cor["diagnosis_int"])
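A possible continuation (the 0.2 threshold is an illustrative choice): keep the features whose absolute correlation with the target exceeds the threshold, excluding the target column itself.

relevant_features = [f for f in cor_target[cor_target > 0.2].index
                     if f != "diagnosis_int"]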
Univariate feature selection in SKLearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Fit the scaler on the training data only, then apply the same scaler to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

min_max_scaler = MinMaxScaler()
Scaled_X = min_max_scaler.fit_transform(X_train_scaled)
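The univariate selection step itself isn't shown above; a sketch using SelectKBest (Y_train, and the choices of k and scoring function, are assumptions for illustration):

from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature against the target and keep the 20 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=20)
X_train_selected = selector.fit_transform(X_train_scaled, Y_train)
X_test_selected = selector.transform(X_test_scaled)
univariate_mask = selector.get_support()   # boolean mask of the selected columns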
Feature Selection
Wrapper Methods
Wrapper methods
● Forward elimination
● Backward elimination
● Recursive feature elimination
Wrapper methods
# Scale features (fit on the training data only, apply to the test data)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# run_rfe() and evaluate_model_on_features() are helpers defined elsewhere in the notebook
rfe_feature_names = run_rfe()
rfe_eval_df = evaluate_model_on_features(df[rfe_feature_names], Y)
rfe_eval_df.head()
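A hypothetical sketch of run_rfe() using sklearn's recursive feature elimination (the estimator and number of features are illustrative; assumes X_train is a pandas DataFrame and Y_train is defined):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def run_rfe(n_features=20):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    rfe = RFE(estimator=model, n_features_to_select=n_features)
    rfe.fit(X_train_scaled, Y_train)
    # Return the names of the columns RFE kept
    return X_train.columns[rfe.get_support()]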
[Performance table with the best result highlighted]
Feature Selection
Embedded Methods
Embedded methods
● L1 regularization
● Feature importance
Feature importance
return model
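The slide's code fragment above ends with "return model"; the full helper isn't shown. A hypothetical sketch of such a helper: fit a tree-based classifier whose feature_importances_ can then be inspected.

from sklearn.ensemble import RandomForestClassifier

def fit_model(X, Y):
    # Any estimator exposing feature_importances_ (or coef_) works here
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, Y)
    return model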
Feature importance plot
def select_features_from_model(model):
    # model is a fitted selector (e.g. SelectFromModel) exposing get_support()
    feature_idx = model.get_support()
    feature_names = df.drop("diagnosis_int", axis=1).columns[feature_idx]
    return feature_names
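A hypothetical usage sketch: SelectFromModel provides the get_support() mask consumed by select_features_from_model above (the estimator and data names are assumptions for illustration):

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
selector.fit(X_train_scaled, Y_train)
embedded_feature_names = select_features_from_model(selector)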
Tying together and evaluation
Review
● Intro to Preprocessing
● Feature Engineering
● Preprocessing Data at Scale
○ TensorFlow Transform
● Feature Spaces
● Feature Selection
○ Filter Methods
○ Wrapper Methods
○ Embedded Methods