Unit-2

ANALYTICS ON MACHINE
LEARNING
SYNOPSIS
• Machine Learning Pipeline – Pre-processing – Visualization –
Feature Selection – Training model parameters – Evaluation model:
Sensitivity, Specificity, PPV, NPV, FPR, Accuracy, ROC,
Precision-Recall Curves, Valued target variables – Python:
Variables and types, Data Structures and containers, Pandas
DataFrame: Operations – Scikit-Learn: Pre-processing, Feature
Selection
Machine Learning Pipeline
• A machine learning pipeline is a sequence of data processing steps,
where each step in the pipeline feeds into the next one, ultimately
leading to the creation of a machine learning model.
• A well-organized pipeline helps automate and streamline the
end-to-end process of building, training, and deploying machine
learning models.
• The key components of a typical machine learning pipeline are:
Key components
1.Data Collection: Gather relevant data from various sources. This may
involve accessing databases, APIs, or other data repositories.
2. Data Cleaning and Preprocessing: Handle missing data, outliers, and
inconsistencies.
• Transform and normalize data to make it suitable for machine learning
algorithms.
• Perform feature engineering to create new features or modify existing
ones.
3. Exploratory Data Analysis (EDA): Understand the characteristics of
the data through statistical analysis and visualization.
• Identify patterns, trends, and relationships that may inform model
selection and feature engineering.
CONT..
4.Feature Selection: Choose the most relevant features to be used in
the model.
• This step is essential for improving model efficiency and reducing
overfitting.
5. Model Selection: Choose the appropriate machine learning
algorithm based on the nature of the problem (e.g., classification,
regression, clustering).
• Split the data into training and testing sets for model evaluation.
6. Model Training: Train the selected model on the training data.
Adjust hyperparameters to optimize model performance.
CONT…
7. Model Evaluation: Assess the model's performance on the testing data
using relevant evaluation metrics.
• Use techniques like cross-validation to ensure robustness.
8. Model Tuning: Fine-tune the model based on performance metrics. Adjust
hyperparameters or try different algorithms to improve results.
9. Model Deployment: Integrate the trained model into the production
environment where it can make predictions on new, unseen data.
• Set up monitoring for the deployed model's performance.
10. Feedback Loop: Collect feedback on model performance from real-
world usage.
• Iterate on the model or pipeline based on user feedback or changes in the
data distribution
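The core steps above (split, preprocess, train, evaluate) can be sketched in a few lines of scikit-learn. This is a minimal illustration, not a production pipeline: the data is synthetic (generated with make_classification as a stand-in for data collection), and the choice of StandardScaler and LogisticRegression is an assumption made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for data collection: synthetic data, not a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and model training so each step feeds the next.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# Evaluate on data the model has not seen during training.
test_accuracy = accuracy_score(y_test, pipe.predict(X_test))
print(f"Test accuracy: {test_accuracy:.2f}")
```

Deployment, monitoring, and the feedback loop are outside the scope of this sketch.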
PREPROCESSING

• Preprocessing is a crucial step in the machine learning pipeline where
raw data is transformed, cleaned, and organized to make it suitable
for training machine learning models.
• The goal of preprocessing is to enhance the quality of the data and
improve the performance and interpretability of the models. Common
preprocessing steps are:
1. Data Cleaning:
• Handling Missing Data: Healthcare datasets may have missing
values due to incomplete records or data entry errors. Techniques
include imputation using mean, median, or predictive models, or
excluding incomplete records if they are minimal.
• Removing Duplicates: Ensuring that patient records or other data
entries are unique to avoid redundancy.
• Correcting Errors: Identifying and rectifying inaccuracies in
medical data, such as incorrect diagnosis codes or mismatched patient
identifiers.
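A small pandas sketch of the first two cleaning tasks (removing duplicates and imputing missing values). The patient records and column names below are made up for illustration, and median imputation is just one of the options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records; all values are invented for this example.
records = pd.DataFrame({
    "patient_id": [101, 102, 102, 103, 104],
    "age": [54, 61, 61, np.nan, 47],
    "glucose": [np.nan, 6.3, 6.3, 7.1, 6.0],
})

# Remove exact duplicate rows (patient 102 appears twice).
deduped = records.drop_duplicates()

# Impute missing numeric values with the median of each column.
cleaned = deduped.fillna({
    "age": deduped["age"].median(),
    "glucose": deduped["glucose"].median(),
})

print(cleaned)
```

Whether to impute or to exclude a record depends on how much data is missing and why; the median is used here only because it is robust to extreme values.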
2. Data Transformation:
• Normalization/Standardization: Scaling continuous variables like
lab results or vital signs to ensure consistency and comparability.
• This can be important for algorithms sensitive to feature scales.
• Encoding Categorical Variables: Converting categorical data such as
diagnosis codes or treatment types into numerical formats.
• Methods include one-hot encoding for unordered categories like
medication types, or label (ordinal) encoding for categories that
have a natural order.
3. Feature Engineering:
• Creating New Features: Generate additional features that may
capture more information about the problem.
• This can include interaction terms, polynomial features, or
domain-specific features.
• Dimensionality Reduction: Reduce the number of features while
preserving important information.
• Techniques include Principal Component Analysis (PCA) or feature
selection methods.
4. Data Integration:
• Combining Data Sources: Merging data from various sources such
as electronic health records (EHRs), lab results, imaging systems, and
patient surveys to create a comprehensive dataset.
• Harmonizing Data Formats: Standardizing data formats and
terminologies across different sources to ensure consistency.
• For instance, converting different coding systems like ICD-9 to
ICD-10.
Cont.…
5. Handling Imbalanced Data:
• Resampling Techniques (observed data): Addressing class
imbalances in medical conditions, such as rare diseases, by
oversampling underrepresented classes or undersampling
overrepresented classes.
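Random oversampling of the minority class can be sketched with sklearn.utils.resample. The data below is synthetic (95 negative and 5 positive cases, chosen only to mimic a rare condition), and simple random oversampling is one assumed choice among several resampling strategies.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Made-up data: 95 negative cases and 5 positive (rare) cases.
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

X_minority, X_majority = X[y == 1], X[y == 0]

# Sample the minority class with replacement up to the majority size.
X_minority_up = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=0
)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.array([0] * len(X_majority) + [1] * len(X_minority_up))
print(np.bincount(y_balanced))
```

Resampling is normally applied to the training set only, so that the test set still reflects the real class distribution.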
6. Data Reduction:
• Dimensionality Reduction: Applying techniques such as Principal
Component Analysis (PCA) to reduce the number of features while
retaining important information, which can be useful in handling
high-dimensional data.
Cont.…

7.Outlier Detection and Handling:

• Identifying Outliers: Detecting anomalies in patient data, such as
unusually high or low lab results, which may indicate errors or rare
conditions.
• Handling Outliers: Deciding whether to adjust, exclude, or retain
outliers based on their impact on the analysis.
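One common way to flag outliers is the interquartile range (IQR) rule, sketched below. The lab values are made up, and the 1.5 × IQR cut-off is a conventional assumption rather than something prescribed by these notes.

```python
import pandas as pd

# Made-up lab values; 19.8 is deliberately extreme.
glucose = pd.Series([5.1, 5.4, 5.0, 5.6, 5.3, 19.8, 5.2, 4.9])

# Compute the quartiles and the interquartile range.
q1, q3 = glucose.quantile(0.25), glucose.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values that fall outside the [lower, upper] fences.
is_outlier = (glucose < lower) | (glucose > upper)
print(glucose[is_outlier])
```

A flagged value is only a candidate: in clinical data it may be a data entry error or a genuine rare condition, so the decision to adjust, exclude, or retain it still needs domain judgment.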
8. Data Augmentation:
• Enhancing Data Quality: For certain types of data, such as medical
images, augmentation techniques like rotation or flipping can help
improve the model's robustness and generalization.
Cont.…

9.Data Privacy and Security:

• Ensuring Compliance: Implementing measures to anonymize or de-
identify sensitive patient information to comply with regulations like
HIPAA (Health Insurance Portability and Accountability Act) and
GDPR (General Data Protection Regulation).
10. Data Validation:
• Ensuring Data Accuracy: Performing checks to validate data
accuracy and consistency before analysis. This can involve cross-
referencing data with clinical guidelines or external sources.
Challenges in Healthcare Data
Preprocessing:
• Data Quality Issues: Healthcare data can be noisy, incomplete, or
inaccurate, which requires careful cleaning and validation.
• Complex Data Types: Handling diverse data types such as structured
EHRs, unstructured clinical notes, and images requires different
preprocessing techniques.
• Regulatory Compliance: Ensuring that preprocessing activities
adhere to data protection laws and ethical guidelines.
VISUALIZATION
• Visualization is a powerful tool in the field of machine learning, aiding
in the exploration, analysis, and communication of patterns and
insights within data.
• Some key aspects of visualization in the context of machine learning:
Cont.…

1. Exploratory Data Analysis (EDA): Understand the structure,
distribution, and relationships within the data before applying machine
learning algorithms.
• EDA helps identify data patterns, anomalies, and feature relationships
that inform feature engineering and model selection.
• Univariate Plots: Histograms, box plots, and kernel density plots help
understand the distribution of individual features.
• Histograms: Display the distribution of a single feature.
• Box Plots: Show the distribution, median, and outliers of a feature.
• Pair Plots: Visualize pairwise relationships between features.
• Correlation Matrices: Show the correlation coefficients between pairs
of features.
Cont..
• Bivariate Plots: Scatter plots, pair plots, and heatmaps reveal
relationships between pairs of features.
• Scatter Plots: Visualize the relationship between two numerical
variables. They help identify correlations, trends, and patterns in data.
• Heatmaps: Display data in a matrix format, where individual values
are represented by colors. They are particularly useful for visualizing
correlations or other metrics between pairs of variables.
Cont.…
2. Feature Distribution: Visualize the distribution of individual
features to understand their characteristics and identify outliers or
anomalies.
3. Correlation Analysis: Heat maps or correlation matrices help
visualize the correlation between different features, assisting in
feature selection and understanding relationships.
4. Data Summary: Use summary statistics and visualizations to
provide an overall picture of the dataset, including mean, median,
and standard deviation.
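The numbers behind a data summary and a correlation heatmap can be computed with pandas before any plotting. The sketch below uses synthetic data with hypothetical column names (age, systolic_bp, heart_rate); the relationship between age and blood pressure is invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Made-up data: blood pressure rises with age, heart rate is unrelated.
age = rng.integers(20, 80, size=n)
systolic_bp = 100 + 0.5 * age + rng.normal(0, 5, size=n)
heart_rate = rng.normal(72, 8, size=n)

df = pd.DataFrame(
    {"age": age, "systolic_bp": systolic_bp, "heart_rate": heart_rate}
)

# Summary statistics: mean, median (the 50% row), and standard deviation.
summary = df.describe()
print(summary.loc[["mean", "50%", "std"]])

# Pairwise Pearson correlations: the matrix a heatmap would display.
corr = df.corr()
print(corr.round(2))
```

The resulting corr table is what a plotting library would color-code as a heatmap.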
Cont.…
5.Model Performance: Visualize model performance metrics such as
accuracy, precision, recall, and F1 score using bar charts, line graphs,
or confusion matrices.
6. Learning Curves: Plot learning curves to visualize how the
performance of a machine learning model changes as it is trained on
more data or for more training iterations.
7. ROC [Receiver Operating Characteristic] Curves and Precision-
Recall Curves: These curves visualize the trade-off between true
positive rate and false positive rate, or precision and recall, providing
insights into the model's performance across different thresholds.
Cont.…
8.Feature Importance: Bar charts or horizontal bar plots can
be used to display the importance of different features in a
model, helping with feature selection.
9. Decision Boundaries: Visualize decision boundaries for
classification models to understand how the model separates
different classes in the feature space.
10. Error Analysis: Visualize misclassified instances or
prediction errors to understand where the model is struggling
and identify potential areas for improvement.
FEATURE SELECTION
• Feature selection is the process of choosing a subset of relevant
features from a larger set of features to build more efficient and
accurate machine learning models.
• By selecting the most informative features, you can improve model
performance, reduce overfitting, and enhance interpretability. These
are some common techniques for feature selection:
1. Filter Methods:
• Correlation-based Methods: Remove features that are highly
correlated with each other, since they may provide redundant
information.
• The Pearson correlation coefficient or other correlation measures can
be used.
• Variance Thresholding: Eliminate features with low variance.
Features with little variation are less informative.
• Statistical Tests: Use statistical tests (e.g., t-tests, chi-square tests) to
assess the relevance of each feature to the target variable.
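A correlation-based filter can be sketched with pandas alone. The data is synthetic, the column names are hypothetical, and the 0.9 threshold is an assumed cut-off chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# "a_copy" is nearly identical to "a"; "b" is independent noise.
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

# Absolute Pearson correlations between every pair of features.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)
print(to_drop)
```

For each highly correlated pair, this sketch drops the feature that appears later in the column order; which of the two to keep is a judgment call.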
2.Wrapper Methods:
• Recursive Feature Elimination (RFE): Iteratively remove
the least important features and train the model until the
desired number of features is reached.
• Forward Selection: Start with an empty set of features and
add the most relevant feature in each iteration until a
stopping criterion is met.
• Backward Elimination: Start with all features and eliminate
the least important feature in each iteration until a stopping
criterion is met.
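Forward selection can be sketched with scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24). The synthetic data, the logistic regression estimator, and the target of three features are assumptions made for the example; setting direction="backward" would give backward elimination instead.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# Start from an empty set and add the feature that most improves the
# cross-validated score, until 3 features have been selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=3,
)
selector.fit(X, y)

selected_mask = selector.get_support()  # Boolean mask over the 8 features
X_selected = selector.transform(X)
print(selected_mask)
```

Wrapper methods retrain the model many times, so they are usually slower than filter methods on wide datasets.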
3.Embedded Methods:
• LASSO (Least Absolute Shrinkage and Selection Operator):
Introduce a penalty term based on the absolute magnitude of
coefficients during model training. This encourages sparsity in the
feature space, effectively performing feature selection.
• Tree-based Methods: Decision trees and ensemble methods like
Random Forests can provide feature importance. Features with
higher importance are more relevant.
• Regularization Techniques: Include regularization terms (e.g., L1
regularization) in the model training process to penalize the
magnitude of coefficients, leading to feature selection.
Cont..
4.Dimensionality Reduction:
• Principal Component Analysis (PCA): Transform the original
features into a new set of uncorrelated features (principal
components) that retain most of the variance in the data.
• Linear Discriminant Analysis (LDA): Similar to PCA, but LDA also
considers class labels and aims to maximize the separability between
classes.
5. Information Gain/Mutual Information: Calculate the information
gain or mutual information between each feature and the target variable.
Features with higher information gain are considered more
informative.
Cont..

6. Recursive Feature Addition (RFA): Similar to RFE but in
the opposite direction.
• Start with an empty set and add features in each iteration
based on their relevance.
7. SelectKBest and SelectPercentile:
• These scikit-learn classes select the top k features or the top
percentage of features, ranked by a univariate statistical test.
TRAINING MODEL PARAMETER
• Training a machine learning model means adjusting its parameters so
that the model fits the patterns in the training data.
• Parameters are the internal variables that the model adjusts during
the training process.
• The values of these parameters determine the performance and
behavior of the model.
• Here are some key concepts related to training model parameters:
1. Hyperparameters:
• Hyperparameters are external configuration settings for the model.
They are set before the training process begins and are not learned
from the data.
• Examples of hyperparameters include the learning rate,
regularization strength, the number of hidden layers in a neural
network, and the number of trees in a random forest.
• Tuning hyperparameters is a critical step in optimizing the
performance of a machine learning model
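Hyperparameter tuning is often done with a cross-validated grid search. The sketch below uses scikit-learn's GridSearchCV; the synthetic data and the small grid over n_estimators and max_depth are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data used only for illustration.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate hyperparameter values: set by us, not learned from the data.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}

# Try every combination with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The trees' internal split rules are parameters learned during fit, while n_estimators and max_depth are hyperparameters fixed before training starts.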
Cont.…
2. Learning Rate: The learning rate is a hyperparameter that controls the
step size during the optimization process.
• It determines how much the model's parameters are updated in each iteration.
• A learning rate that is too high can cause the model to overshoot the optimal
values, while one that is too low can lead to slow convergence.
3. Regularization:
• Regularization is a technique used to prevent overfitting by adding a
penalty term to the loss function based on the complexity of the model.
• Common regularization methods include L1 regularization (Lasso) and L2
regularization (Ridge).
• The strength of regularization is controlled by a hyperparameter.
Cont.…
4. Number of Hidden Layers and Neurons (for Neural Networks):
In neural networks, the architecture is defined by the number of hidden
layers and the number of neurons (nodes) in each layer.
The choice of architecture depends on the complexity of the problem
and the amount of available data.
These are hyperparameters that need to be tuned.
5. Batch Size: Batch size is the number of training examples used in
one iteration of gradient descent.
It is a hyperparameter that affects the convergence and computational
efficiency of the training process.
Cont.…
6.Number of Trees (for Tree-based Models): In ensemble models like
random forests or gradient boosting, the number of trees is a
hyperparameter.
• Increasing the number of trees can improve model performance, but it
also increases computational complexity.
7. Activation Functions (for Neural Networks): Activation functions
control the output of each neuron in a neural network.
• Common activation functions include ReLU, Sigmoid, and Tanh.
• Choosing the appropriate activation function is a hyperparameter
decision.
Cont.…
8.Loss Function: The loss function measures the difference between the
model's predictions and the actual target values.
Different models and tasks may require different loss functions (e.g., mean
squared error for regression, cross-entropy for classification).
9. Optimizer: The optimizer is the algorithm used to update the model's
parameters during training.
Examples include Stochastic Gradient Descent (SGD), Adam, and RMSprop.
The choice of optimizer is a hyperparameter.
10. Epochs: An epoch is one complete pass through the entire training
dataset.
The number of epochs is a hyperparameter that determines how many times the
model will see the entire dataset during training.
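Several of the concepts above (learning rate, batch size, epochs, loss function, optimizer) show up together in a plain mini-batch gradient descent loop. This is a minimal NumPy sketch for a one-feature linear model on made-up data; the hyperparameter values are assumptions, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up regression data: y = 3x + 2 plus a little noise.
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)

# Hyperparameters: fixed before training begins.
learning_rate = 0.1
batch_size = 20
n_epochs = 50

# Parameters: learned from the data during training.
w, b = 0.0, 0.0

for epoch in range(n_epochs):              # one epoch = one full pass
    order = rng.permutation(len(X))        # shuffle examples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        x_batch, y_batch = X[idx, 0], y[idx]
        error = (w * x_batch + b) - y_batch
        # Gradients of the mean squared error loss on this batch.
        grad_w = 2.0 * np.mean(error * x_batch)
        grad_b = 2.0 * np.mean(error)
        # Update step: the learning rate scales the size of each update.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

mse = np.mean((w * X[:, 0] + b - y) ** 2)
print(f"w={w:.2f}, b={b:.2f}, mse={mse:.4f}")
```

Here w and b are the model parameters, while learning_rate, batch_size, and n_epochs are hyperparameters; the update rule is plain stochastic gradient descent.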
EVALUATION

• When evaluating the performance of a classification model, several metrics
are used to assess its effectiveness in predicting the correct class labels.
Here are some commonly used metrics:
1.Sensitivity (True Positive Rate or Recall): Sensitivity measures the
proportion of actual positive instances that are correctly identified by the
model. Sensitivity = True Positives / (True Positives + False Negatives)
2. Specificity (True Negative Rate): Specificity measures the proportion of
actual negative instances that are correctly identified by the model.
Specificity = True Negatives / (True Negatives + False Positives)
Cont.…
3.Precision (Positive Predictive Value): Precision measures the
proportion of predicted positive instances that are actually positive.
Precision = True Positives / (True Positives + False Positives)
4. Negative Predictive Value (NPV): NPV measures the proportion of
predicted negative instances that are actually negative.
NPV = True Negatives / (True Negatives + False Negatives)
5. False Positive Rate (FPR): FPR measures the proportion of actual
negative instances that are incorrectly classified as positive by the
model. FPR = False Positives / (False Positives + True Negatives)
Cont.…

6. Accuracy: Accuracy measures the overall correctness of the model,
considering both true positive and true negative instances.
Accuracy = (True Positives + True Negatives) / Total Instances
7. Receiver Operating Characteristic (ROC) Curve: The ROC curve
is a graphical representation of the trade-off between sensitivity and
specificity at various thresholds.
It is created by plotting the true positive rate against the false positive
rate at different threshold values.
Cont.…
8.Area Under the ROC Curve (AUC-ROC): AUC-ROC quantifies
the overall performance of a classification model.
A higher AUC indicates better discrimination between positive and
negative instances.
9. Precision-Recall Curve: Similar to the ROC curve, the precision-
recall curve is a graphical representation of the trade-off between
precision and recall at different thresholds
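The points of both curves, and the AUC, can be computed with scikit-learn before any plotting. The sketch below assumes synthetic data and a logistic regression model purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The curves need scores (probabilities), not hard class labels.
scores = model.predict_proba(X_test)[:, 1]

# ROC curve: true positive rate vs false positive rate per threshold.
fpr, tpr, roc_thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

# Precision-recall curve: precision vs recall per threshold.
precision, recall, pr_thresholds = precision_recall_curve(y_test, scores)

print(f"AUC-ROC: {auc:.3f}")
```

Plotting tpr against fpr gives the ROC curve, and plotting precision against recall gives the precision-recall curve.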
Cont.…
10. F1 Score: The F1 score is the harmonic mean of precision and
recall, providing a balance between the two metrics.
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
11. Matthews Correlation Coefficient (MCC): MCC takes into
account true positives, true negatives, false positives, and false
negatives to provide a balanced measure of classification
performance.
Cont.…
12. Balanced Accuracy: Balanced accuracy is the average of sensitivity
and specificity (the mean of the per-class recall), so a model cannot
score well simply by predicting the majority class on imbalanced data.
13. Cohen's Kappa: Cohen's Kappa measures the agreement
between the predicted and actual labels, adjusted for the possibility of
random agreement.
14. Confusion Matrix: A confusion matrix provides a tabular
summary of the number of true positives, true negatives, false
positives, and false negatives.
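The formulas above can be checked by hand from a confusion matrix. The sketch below uses ten made-up labels; scikit-learn's confusion_matrix is used only to count the four cells, and everything else is the arithmetic from the definitions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # recall, true positive rate
specificity = tn / (tn + fp)          # true negative rate
ppv = tp / (tp + fp)                  # precision
npv = tn / (tn + fn)
fpr = fp / (fp + tn)                  # equals 1 - specificity
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
balanced_accuracy = (sensitivity + specificity) / 2

for name, value in [
    ("Sensitivity", sensitivity), ("Specificity", specificity),
    ("PPV", ppv), ("NPV", npv), ("FPR", fpr),
    ("Accuracy", accuracy), ("F1", f1),
    ("Balanced accuracy", balanced_accuracy),
]:
    print(f"{name}: {value:.3f}")
```

With these labels there are 3 true positives, 1 false negative, 4 true negatives, and 2 false positives.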
• Variables are named storage locations for data.
• Data Types include integers, floats, strings, and booleans, each
representing different kinds of data.
• Lists are mutable, ordered collections.
• Tuples are immutable, ordered collections.
• Dictionaries store key-value pairs and are mutable.
• Sets are unordered collections of unique items.
• Strings are immutable sequences of characters used for text.
VARIABLE AND TYPES
1. Variables: In Python, variables are used to store data. They are
created by assigning a value to a name using the '=' operator.
Python is dynamically typed, meaning you don't need to declare the
type of a variable explicitly.
The type is inferred from the value assigned to the variable.
x = 5  # 'x' is a variable storing an integer
name = "Alice"  # 'name' is a variable storing a string
is_active = True  # 'is_active' is a variable storing a Boolean
2. Data Types
• Integers (int): Whole numbers without a fractional part.
• Explanation: Integers can be positive, negative, or zero.
• Example: age = 25 # An integer value
• Floating-Point Numbers (float): Numbers that contain a decimal
point.
• Explanation: Floats represent real numbers and can be positive or
negative.
• Example: height = 5.9 # A floating-point value
Cont..
• Strings (str): Sequences of characters enclosed in quotes.
• Explanation: Strings are used to represent text and are immutable
(cannot be changed after creation).
• Example: greeting = "Hello, world!" # A string value
• Booleans (bool): A type with two possible values: True or False.
• Explanation: Booleans are often used in conditional statements and
loops.
• Example: is_active = True # A Boolean value
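The four basic types, and what dynamic typing means in practice, can be checked with the built-in type() function. The variable names reuse the examples above; type_after_rebinding is a hypothetical name introduced for this sketch.

```python
age = 25                    # int
height = 5.9                # float
greeting = "Hello, world!"  # str
is_active = True            # bool

# type() reports the type that Python attached to each value.
for value in (age, height, greeting, is_active):
    print(type(value).__name__, value)

# Dynamic typing: the same name can be rebound to a value of another type.
x = 5
x = "five"
type_after_rebinding = type(x).__name__
print(type_after_rebinding)
```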
DATA STRUCTURES AND CONTAINERS
1. Lists: Lists are ordered collections of items that are mutable (can be
changed). They can contain items of different types, including other
lists.
• Characteristics:
• Ordered: Items maintain the order in which they were added.
• Indexed: Items can be accessed using their index (position).
• Mutable: You can change, add, or remove items after creation.
2. Tuples
Definition: Tuples are ordered collections of items that are immutable
(cannot be changed once created). They can contain items of different
types.
Characteristics:
• Ordered: Items maintain the order in which they were added.
• Indexed: Items can be accessed using their index.
• Immutable: Once created, the items in a tuple cannot be modified
3. Dictionaries
• Definition: Dictionaries are collections of key-value pairs.
Each key must be unique, and values are accessed via their keys.
• Characteristics:
• Insertion-ordered: Since Python 3.7, dictionaries preserve the order in
which keys were inserted; earlier versions did not guarantee any order.
• Key-Value Pairs: Data is stored in pairs, where each key maps to a
specific value.
• Mutable: You can add, change, or remove key-value pairs.
4. Sets
Definition: Sets are unordered collections of unique items. They are
useful for membership tests, removing duplicates, and set operations
(like union, intersection).
• Characteristics:
• Unordered: The order of items is not guaranteed.
• Unique Elements: Sets do not allow duplicate values.
• Mutable: You can add or remove items, but the items themselves must
be immutable.
5. Strings : Strings are immutable sequences of characters. They are
used for storing and manipulating text data.
Characteristics:
• Immutable: Once created, the contents of a string cannot be changed.
• Indexed: Characters in a string can be accessed using their index.
• Support Various Methods: Strings have a variety of methods for
manipulation (e.g., .upper(), .replace()).
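A short sketch that exercises the characteristics listed for each container. All values are made up, and names such as vitals, patient, and codes are hypothetical.

```python
# List: ordered, indexed, mutable, and may mix types.
vitals = [98.6, 72, "stable"]
vitals.append(120)
first_vital = vitals[0]

# Tuple: ordered and indexed, but immutable.
point = (3, 4)
try:
    point[0] = 10
except TypeError as exc:
    tuple_error = str(exc)  # item assignment is not supported

# Dictionary: key-value pairs, mutable, keeps insertion order (3.7+).
patient = {"name": "Alice", "age": 25}
patient["ward"] = "A"

# Set: unique items only; duplicates are discarded.
codes = {"A10", "B20", "A10"}

# String: immutable; methods return a new string.
text = "hello"
upper_text = text.upper()

print(vitals, point, patient, sorted(codes), text, upper_text)
```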
PANDAS DATA FRAME: OPERATIONS
• Creating a DataFrame: From various sources (e.g., dictionary, CSV).
• Viewing Data: Using .head(), .tail(), .info(), and .describe().
• Selecting Data: Accessing rows and columns with .loc[], .iloc[], and
conditions.
• Modifying Data: Adding, updating, or deleting columns and values.
• Aggregating Data: Calculating statistics, grouping, and pivot tables.
• Sorting and Ordering: Sorting by values or index.
• Handling Missing Data: Detecting, filling, or dropping missing values.
• Merging and Joining: Combining DataFrames based on common
columns or indices.
Cont…

Pandas is a powerful data manipulation library in Python. DataFrames are
two-dimensional labeled data structures.
import pandas as pd
# Creating a DataFrame
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [28, 25, 32],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
# Displaying DataFrame
print(df)
Operations on Pandas DataFrames:
# Accessing columns
print(df['Name'])
print(df.Age)
# Descriptive statistics
print(df.describe())
# Filtering data
filtered_df = df[df['Age'] > 25]
# Adding a new column
df['Salary'] = [50000, 60000, 75000]
# Grouping data (average only the numeric columns; 'Name' is text)
grouped_df = df.groupby('City')[['Age', 'Salary']].mean()
# Merging Data Frames
other_data = {'City': ['New York', 'San Francisco', 'Los Angeles'],
'Population': [8500000, 870887, 3980400]}
other_df = pd.DataFrame(other_data)
merged_df = pd.merge(df, other_df, on='City')
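The operations listed earlier but not shown above (selecting with .loc[] and .iloc[], handling missing data, sorting) can be sketched as follows. The DataFrame is a made-up variant of the one above with one missing age and one extra row.

```python
import numpy as np
import pandas as pd

# Made-up data: same columns as before, plus a missing value.
df = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Dana"],
    "Age": [28, 25, np.nan, 32],
    "City": ["New York", "San Francisco", "Los Angeles", "New York"],
})

# Selecting data: .iloc[] is position-based, .loc[] is label/condition-based.
first_row = df.iloc[0]
ny_names = df.loc[df["City"] == "New York", "Name"]

# Handling missing data: detect, fill, or drop.
missing_per_column = df.isna().sum()
filled = df.fillna({"Age": df["Age"].mean()})
dropped = df.dropna()

# Sorting and aggregating.
by_age = filled.sort_values("Age", ascending=False)
mean_age_by_city = filled.groupby("City")["Age"].mean()

print(by_age)
print(mean_age_by_city)
```

None of these calls modify df in place; each returns a new object, which is why the results are assigned to new names.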
SCIKIT-LEARN:PREPROCESSING,FEATURE
SELECTION
• Scikit-learn is a popular machine learning library in Python that
provides tools for data preprocessing, feature selection, and various
machine learning algorithms.
Preprocessing:
• Preprocessing involves preparing your data to improve the
performance of your machine learning model.
• Common preprocessing tasks include scaling features, encoding
categorical variables, and handling missing values.
1. Scaling Features:
• Standardization: This process involves scaling features so they have
a mean of 0 and a standard deviation of 1.
• It’s useful when features have different units or scales.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # X is a numeric feature matrix
• Normalization: This scales features to lie within a specific range,
usually [0, 1].
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
2.Encoding Categorical Variables:
• One-Hot Encoding: Converts each category of a categorical variable
into its own binary (0/1) column, a format that ML algorithms can use
directly.
from sklearn.preprocessing import OneHotEncoder
# scikit-learn 1.2+ uses sparse_output; older versions used sparse=False
encoder = OneHotEncoder(sparse_output=False)
X_encoded = encoder.fit_transform(X_categorical)
• Label Encoding: Converts categorical labels into numeric values.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_categorical)
3. Handling Missing Values:
• Imputation: Filling missing values using the mean, median, or mode.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
Cont.…
4. Feature Engineering:
• sklearn.preprocessing.PolynomialFeatures: Generates polynomial and
interaction features from the original features, which can be useful for
models that benefit from non-linear relationships.
5. Dimensionality Reduction:
• sklearn.decomposition.PCA: Reduces the dimensionality of data by
projecting it onto a lower-dimensional space while preserving as much
variance as possible.
• sklearn.decomposition.NMF: Factorizes the data into non-negative
matrices, useful for dimensionality reduction and feature extraction.
6. Pipeline Integration:
• sklearn.pipeline.Pipeline: Combines multiple preprocessing steps and
modeling into a single object, allowing for streamlined and reproducible
workflows. For example, you can chain feature selection,
preprocessing, and model training in one pipeline.
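A sketch of how imputation, scaling, dimensionality reduction, and a model can be chained in one Pipeline and evaluated with cross-validation. The data is synthetic with missing values injected at random, and the choice of five principal components and logistic regression is an assumption made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with about 5% of the values knocked out at random.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill missing values
    ("scale", StandardScaler()),                  # mean 0, std 1
    ("pca", PCA(n_components=5)),                 # reduce 10 features to 5
    ("model", LogisticRegression(max_iter=1000)),
])

# Each fold refits every step on its own training portion only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```

Putting the preprocessing inside the pipeline means the imputer, scaler, and PCA never see the held-out fold during fitting, which avoids leaking test information into training.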
Feature Selection

• Feature selection involves choosing the most relevant features for your
model.
• It can help in reducing overfitting, improving model performance, and
speeding up the training process.
1. Filter Methods:
• Univariate Selection: Selects features based on univariate statistical tests.
from sklearn.feature_selection import SelectKBest, chi2
# chi2 requires non-negative feature values; k must not exceed the number of features
selector = SelectKBest(score_func=chi2, k=10)
X_new = selector.fit_transform(X, y)
• Variance Threshold: Removes features with low variance.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
2. Wrapper Methods:
• Recursive Feature Elimination (RFE): Recursively removes the
least important features.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)
3. Embedded Methods:
• Feature Importance from Trees: Tree-based methods like Random
Forest can be used to determine feature importance.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
importances = model.feature_importances_
Cont…
• L1 Regularization (Lasso): Can be used for feature selection by
penalizing the absolute size of the coefficients. Lasso is a linear
regression model, so y must be a numeric target here.
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1)
model.fit(X, y)
selected_features = model.coef_ != 0  # Boolean mask of features kept
No ratings yet
Unit-1 Introduction to Machine Learning [5hrs]
8 pages
Program Reasoning Lab Manual Part1
No ratings yet
Program Reasoning Lab Manual Part1
11 pages
List of Questions Mathematics ML DL
No ratings yet
List of Questions Mathematics ML DL
11 pages
View PDF
No ratings yet
View PDF
8 pages
Cauchy The Limit Concept and Calculus
No ratings yet
Cauchy The Limit Concept and Calculus
12 pages
4.Introductin to Machine Learning
No ratings yet
4.Introductin to Machine Learning
28 pages
Physics I Class 09: Work and Kinetic Energy
No ratings yet
Physics I Class 09: Work and Kinetic Energy
11 pages
What Is Machine Learning
No ratings yet
What Is Machine Learning
22 pages
Non-Newtonian Flow in Annuli: of of
No ratings yet
Non-Newtonian Flow in Annuli: of of
6 pages
The Friendly Data Science Handbook 2020
No ratings yet
The Friendly Data Science Handbook 2020
17 pages
REVIEWER
No ratings yet
REVIEWER
9 pages
Steps in the Implementation of Data Analysis
No ratings yet
Steps in the Implementation of Data Analysis
2 pages
Week 2
No ratings yet
Week 2
3 pages
HEAT TRANSFER - Chapter 2
No ratings yet
HEAT TRANSFER - Chapter 2
2 pages
AI Project Report: By: Neha Kalra (17csu122) and Prerna Pathak (17csu143)
No ratings yet
AI Project Report: By: Neha Kalra (17csu122) and Prerna Pathak (17csu143)
22 pages
Example: Fire Design of An Unprotected Beam Using Graphs
No ratings yet
Example: Fire Design of An Unprotected Beam Using Graphs
5 pages
Static Pushover Analysis For Seismic Design (Suharwardy, I. 2009)
100% (1)
Static Pushover Analysis For Seismic Design (Suharwardy, I. 2009)
58 pages
Home Asgn2 BEE 9AB
No ratings yet
Home Asgn2 BEE 9AB
2 pages
Questions + Solutions - (EPH105C) - 2023
No ratings yet
Questions + Solutions - (EPH105C) - 2023
6 pages
DSUR_EA2352001010391_W7
No ratings yet
DSUR_EA2352001010391_W7
3 pages
(A) What Is Machine Learning? Explain The Impact of Various Machine Learning Techniques in Today's World
No ratings yet
(A) What Is Machine Learning? Explain The Impact of Various Machine Learning Techniques in Today's World
6 pages
21cs54-Module 1
No ratings yet
21cs54-Module 1
15 pages
Syllabus For Electromagnetic Fields and Waves
No ratings yet
Syllabus For Electromagnetic Fields and Waves
10 pages
Romberg
No ratings yet
Romberg
4 pages
Module_-1
No ratings yet
Module_-1
9 pages
ML Checklist PDF
No ratings yet
ML Checklist PDF
4 pages
Fejer Kernel PDF
No ratings yet
Fejer Kernel PDF
2 pages
Chicken Zombie Apocolypse
No ratings yet
Chicken Zombie Apocolypse
4 pages
10th Class Math Test Set Chapter Aimers' Academy
No ratings yet
10th Class Math Test Set Chapter Aimers' Academy
1 page
DSA FAT Model Question Paper
No ratings yet
DSA FAT Model Question Paper
2 pages
Applied Machine Learning with Scikit-learn: Definitive Reference for Developers and Engineers
From Everand
Applied Machine Learning with Scikit-learn: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Data Analytics with Generative AI
From Everand
Data Analytics with Generative AI
Younish P
No ratings yet
Applied Statistical Analysis with SPSS: Definitive Reference for Developers and Engineers
From Everand
Applied Statistical Analysis with SPSS: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Técnicas Estadísticas para la Ciencia de Datos a través de R. Aprendizaje Supervisado: Análisis Discriminante, Árboles de Decisión, Redes Neuronales y Modelos Lineales Generalizados
From Everand
Técnicas Estadísticas para la Ciencia de Datos a través de R. Aprendizaje Supervisado: Análisis Discriminante, Árboles de Decisión, Redes Neuronales y Modelos Lineales Generalizados
César Pérez López
No ratings yet


