HCA2 (1)
ANALYTICS ON MACHINE
LEARNING
SYNOPSIS
• Machine Learning Pipeline – Pre-processing – Visualization –
Feature Selection – Training model parameters – Model evaluation:
Sensitivity, Specificity, PPV, NPV, FPR, Accuracy, ROC,
Precision-Recall Curves, Valued target variables – Python:
Variables and types, Data structures and containers, Pandas
DataFrame: Operations – Scikit-Learn: Pre-processing, Feature
Selection
Machine Learning Pipeline
• A machine learning pipeline is a sequence of data processing steps,
where each step feeds into the next one, ultimately leading to the
creation of a machine learning model.
• A well-organized pipeline helps automate and streamline the end-to-end
process of building, training, and deploying machine learning models.
Key components of a typical machine learning pipeline:
1. Data Collection: Gather relevant data from various sources. This may
involve accessing databases, APIs, or other data repositories.
2. Data Cleaning and Preprocessing: Handle missing data, outliers, and
inconsistencies.
• Transform and normalize data to make it suitable for machine learning
algorithms.
• Perform feature engineering to create new features or modify existing
ones.
3. Exploratory Data Analysis (EDA): Understand the characteristics of
the data through statistical analysis and visualization.
• Identify patterns, trends, and relationships that may inform model
selection and feature engineering.
4. Feature Selection: Choose the most relevant features to be used in
the model.
• This step is essential for improving model efficiency and reducing
overfitting.
5. Model Selection: Choose the appropriate machine learning
algorithm based on the nature of the problem (e.g., classification,
regression, clustering).
• Split the data into training and testing sets for model evaluation.
6. Model Training: Train the selected model on the training data.
Adjust hyperparameters to optimize model performance.
7. Model Evaluation: Assess the model's performance on the testing data
using relevant evaluation metrics.
• Use techniques like cross-validation to ensure robustness.
8. Model Tuning: Fine-tune the model based on performance metrics. Adjust
hyperparameters or try different algorithms to improve results.
9. Model Deployment: Integrate the trained model into the production
environment where it can make predictions on new, unseen data.
• Set up monitoring for the deployed model's performance.
10. Feedback Loop: Collect feedback on model performance from real-
world usage.
• Iterate on the model or pipeline based on user feedback or changes in the
data distribution.
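The steps above can be sketched end to end with scikit-learn. This is a minimal illustration, not a production pipeline: the bundled breast-cancer toy dataset stands in for real data collection, and the choice of LogisticRegression, the 75/25 split, and 5-fold cross-validation are assumptions made only for the example. Deployment and the feedback loop (steps 9 and 10) are outside its scope.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: data collection (a bundled toy dataset stands in for a real source)
X, y = load_breast_cancer(return_X_y=True)

# Step 5: hold out a test set before any fitting happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Steps 2 and 6: preprocessing and model chained so the scaler is
# fitted on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 7: cross-validation on the training data, then a final test-set check
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print("CV accuracy per fold:", cv_scores.round(3))
print("Test accuracy:", round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline means each cross-validation fold re-fits it on that fold's training portion, which avoids leaking test statistics into training.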
PREPROCESSING
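1. Scaling and Normalization:
• Standardization: Rescales each feature to zero mean and unit variance.
• from sklearn.preprocessing import StandardScaler
• scaler = StandardScaler()
• X_scaled = scaler.fit_transform(X)
• Normalization: Scales features to lie within a specific range,
usually [0, 1].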
• from sklearn.preprocessing import MinMaxScaler
• scaler = MinMaxScaler()
• X_normalized = scaler.fit_transform(X)
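To see what the two scalers above actually do, here is a small sketch on a hand-made array. The values in X are made up for illustration; the two columns deliberately have very different ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Standardization: each column ends up with mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)

# Normalization: each column is mapped linearly onto [0, 1]
X_normalized = MinMaxScaler().fit_transform(X)

print("Column means after standardization:", X_scaled.mean(axis=0))
print("Column std after standardization:  ", X_scaled.std(axis=0))
print("Min-max normalized data:\n", X_normalized)
```

After either transform the two columns become directly comparable, which matters for distance-based and gradient-based models.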
2. Encoding Categorical Variables:
• One-Hot Encoding: Converts categorical variables into a format
that can be provided to ML algorithms to do a better job in prediction.
• from sklearn.preprocessing import OneHotEncoder
• encoder = OneHotEncoder(sparse_output=False)  # named sparse=False in scikit-learn < 1.2
• X_encoded = encoder.fit_transform(X_categorical)
• Label Encoding: Converts categorical labels into numeric values.
• from sklearn.preprocessing import LabelEncoder
• encoder = LabelEncoder()
• y_encoded = encoder.fit_transform(y_categorical)
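A small sketch of both encoders on made-up data may help. The colour and animal values are hypothetical, and `.toarray()` is used on the default sparse output so the example does not depend on which scikit-learn version names the dense-output parameter `sparse` or `sparse_output`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical categorical feature (2-D: one column) and label vector (1-D)
X_categorical = np.array([["red"], ["green"], ["blue"], ["green"]])
y_categorical = ["cat", "dog", "cat", "bird"]

# One-hot encoding: one binary column per category, in sorted category order
onehot = OneHotEncoder()
X_encoded = onehot.fit_transform(X_categorical).toarray()
print(onehot.categories_[0])   # category order used for the columns
print(X_encoded)

# Label encoding: each class is mapped to an integer, in sorted class order
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_categorical)
print(label_encoder.classes_)
print(y_encoded)
```

LabelEncoder is intended for the target variable; applying it to input features imposes an arbitrary ordering that many models will treat as meaningful.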
3. Handling Missing Values:
• Imputation: Filling missing values using the mean, median, or mode.
• from sklearn.impute import SimpleImputer
• imputer = SimpleImputer(strategy='mean')
• X_imputed = imputer.fit_transform(X)
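As a quick illustration of mean imputation, the sketch below fills the gaps in a tiny made-up array. The values are assumptions chosen so the column means are easy to check by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative data with one missing value in each column
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# Each NaN is replaced by the mean of the observed values in its column
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

print("Learned column means:", imputer.statistics_)
print(X_imputed)
```

Passing strategy="median" or strategy="most_frequent" switches to median or mode imputation with the same interface.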
4. Feature Engineering:
• sklearn.preprocessing.PolynomialFeatures: Generates polynomial and
interaction features from the original features, which can be useful for
models that benefit from non-linear relationships.
5. Dimensionality Reduction:
• sklearn.decomposition.PCA: Reduces the dimensionality of data by projecting
it onto a lower-dimensional space while preserving as much variance as possible.
• sklearn.decomposition.NMF: Factorizes the data into non-negative matrices,
useful for dimensionality reduction and feature extraction.
6. Pipeline Integration:
• sklearn.pipeline.Pipeline: Combines multiple preprocessing steps and modeling
into a single object, allowing for streamlined and reproducible workflows.
For example, you can chain feature selection, preprocessing, and model
training in one pipeline.
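The three ideas above (polynomial features, dimensionality reduction, and pipeline integration) can be combined in one object. The following is a sketch under stated assumptions: the iris toy dataset, degree-2 features, 5 principal components, and a LogisticRegression classifier are illustrative choices, and the step names "poly", "scale", "pca", and "clf" are arbitrary labels.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

pipe = Pipeline(steps=[
    # 4 original features -> 14 (4 linear + 4 squares + 6 pairwise products)
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    # Project the 14 expanded features onto 5 principal components
    ("pca", PCA(n_components=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# One call fits every step in order, using training data only
pipe.fit(X_train, y_train)

n_poly_features = pipe.named_steps["poly"].n_output_features_
X_test_reduced = pipe[:-1].transform(X_test)  # all steps except the classifier
test_accuracy = pipe.score(X_test, y_test)

print("Features after polynomial expansion:", n_poly_features)
print("Shape after PCA:", X_test_reduced.shape)
print("Test accuracy:", round(test_accuracy, 3))
```

Because the whole chain is a single estimator, it can be passed directly to cross-validation or grid search utilities.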
Feature Selection
• Feature selection involves choosing the most relevant features for your
model.
• It can help in reducing overfitting, improving model performance, and
speeding up the training process.
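1. Filter Methods:
• Univariate Selection: Selects features based on univariate statistical tests
(here the chi-squared test, which requires non-negative features).
• from sklearn.feature_selection import SelectKBest, chi2
• selector = SelectKBest(score_func=chi2, k=5)  # k=5 is an illustrative choice
• X_selected = selector.fit_transform(X, y)
• Variance Threshold: Removes features whose variance falls below a threshold.
• from sklearn.feature_selection import VarianceThreshold
• selector = VarianceThreshold(threshold=0.01)
• X_reduced = selector.fit_transform(X)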
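Here is a small runnable sketch of two filter methods on the iris toy dataset. The choice of k=2 and the extra constant column are assumptions added only to make the effect visible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

X, y = load_iris(return_X_y=True)  # 150 samples, 4 non-negative features

# Univariate selection: keep the 2 features with the highest chi-squared score
kbest = SelectKBest(score_func=chi2, k=2)
X_best = kbest.fit_transform(X, y)
print("chi2 scores:", kbest.scores_.round(1))
print("Kept feature mask:", kbest.get_support())

# Variance threshold: append a constant column, then watch it get dropped
X_with_constant = np.hstack([X, np.ones((X.shape[0], 1))])
vt = VarianceThreshold(threshold=0.01)
X_reduced = vt.fit_transform(X_with_constant)
print("Shape before/after:", X_with_constant.shape, X_reduced.shape)
```

Note that VarianceThreshold never looks at y, so it can only remove uninformative columns; SelectKBest scores each feature against the target.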
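2. Wrapper Methods:
• Recursive Feature Elimination (RFE): Repeatedly fits a model and removes
the weakest features until the requested number remains.
• from sklearn.feature_selection import RFE
• from sklearn.linear_model import LogisticRegression
• model = LogisticRegression()
• rfe = RFE(model, n_features_to_select=5)
• X_rfe = rfe.fit_transform(X, y)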
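The RFE snippet above can be tried on synthetic data. In this sketch the dataset comes from make_classification with 10 features, of which 5 are informative; those sizes and the random seed are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 5 of which carry signal
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_redundant=0, random_state=0,
)

model = LogisticRegression(max_iter=1000)

# RFE refits the model repeatedly, dropping the weakest feature each round
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)

print("Selected mask:", rfe.support_)
print("Ranking (1 = kept):", rfe.ranking_)
print("Reduced shape:", X_rfe.shape)
```

Wrapper methods are usually more expensive than filter methods because the model is refitted once per elimination round.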
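3. Embedded Methods:
• Tree-Based Feature Importance: Tree ensembles such as random forests score
each feature while training.
• from sklearn.ensemble import RandomForestClassifier
• model = RandomForestClassifier()
• model.fit(X, y)
• importances = model.feature_importances_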
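As a sketch of how the importances above might be read, the example below fits a random forest on the iris toy dataset and prints features from most to least important. The dataset, n_estimators=100, and the fixed random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1
importances = model.feature_importances_

# Indices of features sorted from most to least important
order = np.argsort(importances)[::-1]
for idx in order:
    print(f"{data.feature_names[idx]:<20} {importances[idx]:.3f}")
```

A threshold on these scores, or a fixed top-k cut, then decides which features to keep.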
• L1 Regularization (Lasso): Can be used for feature selection because
penalizing the absolute size of the coefficients drives some of them
exactly to zero.
• from sklearn.linear_model import Lasso
• model = Lasso(alpha=0.1)
• model.fit(X, y)
• selected_features = model.coef_ != 0
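The Lasso snippet above can be exercised on synthetic regression data. This is a sketch under assumptions: make_regression with 10 features (3 informative), standardized inputs, and alpha=1.0 are all illustrative choices, and the number of surviving features depends on alpha.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which influence the target
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=0,
)

# Put features on a common scale so the L1 penalty treats them evenly
X = StandardScaler().fit_transform(X)

model = Lasso(alpha=1.0)
model.fit(X, y)

# Boolean mask: True where the coefficient survived the L1 penalty
selected_features = model.coef_ != 0
X_selected = X[:, selected_features]

print("Coefficients:", model.coef_.round(2))
print("Number of features kept:", int(selected_features.sum()))
```

Larger alpha values zero out more coefficients; in practice alpha is usually tuned by cross-validation rather than fixed by hand.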