UNIT-I
Introduction to machine learning, Machine learning life cycle, Types of Machine Learning Systems (supervised and unsupervised learning, Batch and online learning, Instance-Based and Model-Based Learning), scope and limitations, Challenges of Machine learning, data visualization, hypothesis function and testing, data pre-processing, data augmentation, normalizing data sets, Bias-Variance tradeoff, Relation between AI (Artificial Intelligence), ML (Machine Learning), DL (Deep Learning) and DS (Data Science).
UNIT-II
Clustering in Machine Learning: Types of Clustering Methods: Partitioning Clustering, Distribution Model-Based Clustering, Hierarchical Clustering, Fuzzy Clustering. BIRCH Algorithm, CURE Algorithm. Gaussian Mixture Models and Expectation Maximization. Parameter estimation: MLE, MAP. Applications of Clustering.
UNIT-III
Classification algorithms: Logistic Regression, Decision Tree Classification, Neural Networks, K-Nearest Neighbors (K-NN), Support Vector Machines, Naive Bayes (Gaussian, Multinomial, Bernoulli). Performance Measures: Confusion Matrix, Classification Accuracy, Classification Report: Precision, Recall, F1 Score and Support.
UNIT-IV
Ensemble Learning and Random Forests: Introduction to Ensemble Learning, Basic Ensemble Techniques (Max Voting, Averaging, Weighted Average), Voting Classifiers, Bagging and Pasting, Out-of-Bag Evaluation, Random Patches and Random Subspaces, Random Forests (Extra-Trees, Feature Importance), Boosting (AdaBoost, Gradient Boosting),
Stacking.
UNIT-V
Dimensionality Reduction: The Curse of Dimensionality, Main Approaches for Dimensionality Reduction (Projection, Manifold Learning). PCA: Preserving the Variance, Principal Components, Projecting Down to d Dimensions, Explained Variance Ratio, Choosing the Right Number of Dimensions, PCA for Compression, Randomized PCA, Incremental PCA. Kernel PCA: Selecting a Kernel and Tuning Hyperparameters. Learning Theory: PAC and VC models.
What is Machine Learning?
Machine learning (ML) is a type of Artificial Intelligence (AI) that allows computers to learn without being explicitly programmed. It involves feeding data into algorithms that can then identify patterns and make predictions on new data. Machine learning is used in a wide variety of applications, including image and speech recognition, natural language processing, and recommender systems.
Definition of Learning
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Examples
Handwriting recognition learning problem
o Task T: Recognizing and classifying handwritten words within images
o Performance P: Percentage of words correctly classified
o Training experience E: A dataset of handwritten words with given classifications
A robot driving learning problem
o Task T: Driving on highways using vision sensors
o Performance P: Average distance traveled before an error
o Training experience E: A sequence of images and steering commands recorded while observing a human driver
Machine Learning Life Cycle
1. Problem Definition: Clearly define the problem you want to solve. For example, predicting customer churn in a telecom company: the objective could be to identify customers likely to leave based on usage patterns.
2. Data Collection: Gather data relevant to the problem. For example, a bank collects data from past loan applicants, including income, employment type, loan amount, and repayment history. Additional data, like credit scores from third-party agencies, could also be sourced to enrich the dataset.
3. Data Cleaning and Preprocessing: Handle missing values, remove duplicates, and standardize data formats. For instance, if the dataset has missing values for income, replace them with the median value. Convert categorical features like "employment type" into numerical codes for model compatibility.
4. Exploratory Data Analysis (EDA): Analyze the dataset to find patterns and relationships. For example, a scatter plot might reveal that higher credit scores correlate with a lower chance of default. Use heatmaps to identify feature correlations or box plots to spot outliers in income data.
5. Feature Engineering and Selection: Create new features or select the most relevant ones. For instance, derive a "debt-to-income ratio" feature from the existing "total debt" and "income" columns. Remove less useful features like "customer ID" that do not contribute to the prediction.
6. Model Selection: Choose an appropriate algorithm based on the problem. For example, use logistic regression for its simplicity and interpretability in predicting loan defaults, or try advanced models like Gradient Boosting Machines for better accuracy.
7. Model Training: Split the dataset into training and testing subsets (e.g., 80%-20%) and train the model on the training set. For instance, use the scikit-learn library in Python to train a logistic regression model with applicant data (a minimal sketch follows this list).
8. Model Evaluation and Tuning: Evaluate model performance using metrics like accuracy, precision, recall, or AUC-ROC. For example, tune the hyperparameters of a Random Forest model (e.g., the number of trees) to improve accuracy from 85% to 90%.
9. Model Deployment: Deploy the trained model in a production environment to make predictions. For example, integrate the model into the bank's loan application system to assess default risks in real time during the approval process.
10. Model Monitoring and Maintenance: Continuously monitor the model's performance post-deployment. For instance, track accuracy over time, and if it declines due to changes in applicant behavior or economic conditions, retrain the model with updated data.
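As a minimal end-to-end sketch of steps 3, 7, and 8 with pandas and scikit-learn: the file name and column names ("loan_applicants.csv", income, employment_type, defaulted) are hypothetical placeholders, not taken from these notes.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset of past loan applicants
df = pd.read_csv("loan_applicants.csv")

# Step 3: clean and preprocess
df["income"] = df["income"].fillna(df["income"].median())  # impute missing income with the median
df = df.drop_duplicates()
df = pd.get_dummies(df, columns=["employment_type"])       # encode the categorical feature

# Step 7: 80%-20% split, then train a logistic regression model
X, y = df.drop(columns=["defaulted"]), df["defaulted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Step 8: evaluate on the held-out test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```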
Supervised learning is a type of machine learning algorithm that learns from labeled data. Labeled data is data that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, involves a supervisor acting as a teacher: we teach or train the machine using data that is well labeled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data), and the supervised learning algorithm analyzes the training data (the set of training examples) and produces a correct outcome from the labeled data.
For example, a labeled dataset of images of Elephant, Camel and Cow would have each image tagged with "Elephant", "Camel", or "Cow".
Example: Fruit Classification
1. Training Phase:
o The machine is trained using labeled data:
Apple: Rounded shape, depression at the top, red color.
Banana: Long, curved cylinder, green-yellow color.
2. Testing Phase:
o A new fruit is presented (e.g., a banana).
o The machine extracts features (shape and color).
o It compares these features to the training data and identifies the fruit as a Banana.
Key Components:
Input: Features of the fruit (shape, color, texture).
Output: Predicted label (Apple or Banana).
Algorithm: A classification model that learns from labeled data.
In short, supervised learning deals with "labeled" data: some data is already tagged with the correct answer.
1- Regression
Regression is a type of supervised learning that is used to predict continuous values, such as house prices or stock prices. Regression algorithms learn a function that maps from the input features to the output value (a minimal example follows the list below).
Some common regression algorithms include:
Linear Regression
Polynomial Regression
Support Vector Machine Regression
Decision Tree Regression
Random Forest Regression
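As a quick illustration, the sketch below fits a linear regression with scikit-learn on synthetic house-size/price data (the numbers are invented for demonstration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
size = rng.uniform(50, 200, size=(100, 1))                  # house size in square meters
price = 3000 * size[:, 0] + rng.normal(0, 20000, size=100)  # noisy linear relationship

model = LinearRegression().fit(size, price)
print("Predicted price for 120 m^2:", model.predict([[120.0]])[0])
```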
2- Classification
Classification is a type of supervised learning that is used to predict categorical values, such as whether a customer will churn or not, whether an email is spam or not, or whether a medical image shows a tumor or not. Classification algorithms learn a function that maps from the input features to a probability distribution over the output classes (a minimal example follows the list below).
Some common classification algorithms include:
Logistic Regression
Support Vector Machines
Decision Trees
Random Forests
Naive Bayes
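The sketch below trains a logistic regression classifier on scikit-learn's built-in iris dataset; classification_report prints the precision, recall, F1 score, and support covered in the syllabus:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))  # precision, recall, F1, support
```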
Unsupervised Learning
Unsupervised learning is a type of machine learning that learns from unlabeled data. This means that the data does not have any pre-existing labels or categories. The goal of unsupervised learning is to discover patterns and relationships in the data without any explicit guidance.
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself.
For example, you can use unsupervised learning to examine collected animal data and distinguish several groups according to the traits and actions of the animals. These groupings might correspond to various animal species, allowing you to categorize the creatures without depending on labels that already exist.
Key Points
Unsupervised learning allows the model to discover patterns and relationships in unlabeled data.
Clustering algorithms group similar data points together based on their inherent characteristics.
Feature extraction captures essential information from the data, enabling the model to make meaningful distinctions.
Label association assigns categories to the clusters based on the extracted patterns and characteristics.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled images containing both dogs and cats. The model has never seen an image of a dog or cat before, and it has no pre-existing labels or categories for these animals. Your task is to use unsupervised learning to identify the dogs and cats in a new, unseen image.
For instance, suppose the model is given an image containing both dogs and cats that it has never seen. The machine has no idea about the features of dogs and cats, so it cannot label them as "dogs" and "cats". But it can categorize them according to their similarities, patterns, and differences: it can easily split the images into two groups, the first containing all pictures with dogs and the second containing all pictures with cats, without any prior training data or examples.
Unsupervised learning allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabeled data.
Types of Unsupervised Learning
Unsupervised learning is classified into two categories of algorithms:
Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.
Clustering
Clustering is a type of unsupervised learning that is used to group similar data points together. Clustering algorithms work by iteratively moving data points closer to their cluster centers and further away from data points in other clusters (a minimal k-means sketch follows the lists below). Common clustering categories are:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
6. Gaussian Mixture Models (GMMs)
7. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
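A minimal k-means sketch with scikit-learn, clustering two synthetic 2-D blobs (the data is invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # blob of points around (0, 0)
               rng.normal(5, 1, (50, 2))])  # blob of points around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster centers:\n", kmeans.cluster_centers_)
print("First five labels:", kmeans.labels_[:5])
```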
Association rule learning
Association rule learning is a type of unsupervised learning that is used to identify patterns in data. Association rule learning algorithms work by finding relationships between different items in a dataset (a toy support-counting sketch follows the list below).
Some common association rule learning algorithms include:
Apriori Algorithm
Eclat Algorithm
FP-Growth Algorithm
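As a toy illustration of the idea behind Apriori-style algorithms, the pure-Python sketch below counts the support of every item pair in a handful of invented market-basket transactions:

```python
from collections import Counter
from itertools import combinations

# Invented market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

# Count how often each item pair occurs together (the core
# support-counting step of association rule mining)
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

n = len(transactions)
for pair, count in pair_counts.most_common():
    print(pair, "support =", count / n)
```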
Evaluating Unsupervised Learning Models
Evaluating unsupervised learning models is an important step in ensuring that the model is effective and useful. However, it can be more challenging than evaluating supervised learning models, as there is no ground-truth data to compare the model's predictions to.
There are a number of different metrics that can be used to evaluate unsupervised learning models; some of the most common ones include:
Silhouette score: The silhouette score measures how well each data point is clustered with its own cluster members and separated from other clusters. It ranges from -1 to 1, with higher scores indicating better clustering.
Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio between the variance between clusters and the variance within clusters. It ranges from 0 to infinity, with higher scores indicating better clustering.
Adjusted Rand index: The adjusted Rand index measures the similarity between two clusterings. It ranges from -1 to 1, with higher scores indicating more similar clusterings.
Davies-Bouldin index: The Davies-Bouldin index measures the average similarity between clusters. It ranges from 0 to infinity, with lower scores indicating better clustering.
F1 score: The F1 score is a weighted average of precision and recall, two metrics commonly used in supervised learning to evaluate classification models. However, the F1 score can also be used to evaluate unsupervised learning models, such as clustering models, when reference labels are available.
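Three of these metrics are available directly in scikit-learn; the sketch below scores a k-means clustering of synthetic data (invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))         # in [-1, 1], higher is better
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))     # lower is better
```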
Applications of Unsupervised Learning
Unsupervised learning can be used to solve a wide variety of problems, including:
Anomaly detection: Unsupervised learning can identify unusual patterns or deviations from normal behavior in data, enabling the detection of fraud, intrusion, or system failures.
Scientific discovery: Unsupervised learning can uncover hidden relationships and patterns in scientific data, leading to new hypotheses and insights in various scientific fields.
Recommendation systems: Unsupervised learning can identify patterns and similarities in user behavior and preferences to recommend products, movies, or music that align with their interests.
Customer segmentation: Unsupervised learning can identify groups of customers with similar characteristics, allowing businesses to target marketing campaigns and improve customer service more effectively.
Image analysis: Unsupervised learning can group images based on their content, facilitating tasks such as image classification, object detection, and image retrieval.
Advantages of Unsupervised Learning
It does not require training data to be labeled.
Dimensionality reduction can be easily accomplished using unsupervised learning.
It is capable of finding previously unknown patterns in data.
Unsupervised learning can help you gain insights from unlabeled data that you might not have been able to get otherwise.
Unsupervised learning is good at finding patterns and relationships in data without being told what to look for. This can help you learn new things about your data.
Disadvantages of Unsupervised Learning
It is difficult to measure accuracy or effectiveness due to the lack of predefined answers during training.
The results often have lower accuracy.
The user needs to spend time interpreting and labeling the classes that result from the clustering.
Unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy data.
Without labeled data, it can be difficult to evaluate the performance of unsupervised learning models, making it challenging to assess their effectiveness.
Supervised vs. unsupervised learning:
Training data: Supervised learning uses training data to infer the model; unsupervised learning uses no training data.
Model: A supervised model can be tested against known labels; an unsupervised model cannot be tested in this way.
2. Online Learning
Scope:
Dynamic environments: Perfect for applications where data arrives incrementally or changes over time (e.g., stock market prediction, recommendation systems).
Resource-efficient: Processes small chunks of data, requiring less memory and computational power.
Real-time updates: Enables models to adapt to new data immediately (a minimal incremental-training sketch follows the lists below).
Continuous learning: Suited for systems that must evolve with time (e.g., adaptive personalization).
Limitations:
Susceptible to noise: Incremental updates can make the model vulnerable to outliers or noisy data, affecting performance.
Lower initial accuracy: Models may take longer to converge to optimal accuracy compared to batch learning.
Limited complexity: Computational constraints might limit the complexity of models in real-time scenarios.
Catastrophic forgetting: Without careful design, the model may "forget" earlier data trends while adapting to new data.
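A minimal online-learning sketch, assuming a recent scikit-learn where SGDClassifier accepts loss="log_loss": partial_fit updates the model one mini-batch at a time, and the streaming data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")  # logistic regression trained by SGD

classes = np.array([0, 1])              # must be declared before the first partial_fit
for step in range(10):                  # simulate ten arriving mini-batches
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)  # incremental update

print("Coefficients after 10 mini-batches:", model.coef_)
```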
Challenges of Machine Learning
2. Model Challenges:
Overfitting: Models that perform well on training data but fail to generalize to new data.
Underfitting: Models that fail to capture patterns in the data, leading to poor performance.
Hyperparameter Tuning: Choosing the right set of hyperparameters is time-consuming and computationally expensive.
Model Complexity: Complex models (e.g., deep learning) require significant expertise to design and train effectively.
5. Application Challenges:
Domain Expertise: Building domain-specific ML models often requires close collaboration with domain experts.
Changing Data Distributions: Models can fail when the distribution of data changes over time (concept drift).
Ethical and Social Impacts: Ensuring ML applications are fair, unbiased, and aligned with societal values is challenging.
6. Security Challenges:
Adversarial Attacks: Models can be fooled by maliciously crafted inputs designed to produce incorrect predictions.
Model Stealing: Trained models can be reverse-engineered or exploited by attackers.
Data Poisoning: Inserting malicious data into training datasets to corrupt the model.
Addressing These Challenges
Data Preprocessing: Use techniques like normalization, augmentation, and imputation to improve data quality.
Regularization and Cross-Validation: Reduce overfitting and ensure better generalization.
AutoML and Hyperparameter Optimization: Simplify model and hyperparameter tuning with automated tools.
Ethical Frameworks: Incorporate fairness and bias detection into ML pipelines.
Model Monitoring: Continuously monitor deployed models for performance drift or security vulnerabilities.
By recognizing and addressing these challenges, organizations can build more robust, ethical, and effective ML solutions.
Data Visualization in Machine Learning
Data visualization helps machine learning analysts to better understand and analyze complex data sets by presenting them in an easily understandable format. Data visualization is an essential step in data preparation and analysis, as it helps to identify outliers, trends, and patterns in the data that may be missed by other forms of analysis.
With the increasing availability of big data, it has become more important than ever to use data visualization techniques to explore and understand the data. Machine learning algorithms work best when they have high-quality and clean data, and data visualization can help to identify and remove any inconsistencies or anomalies in the data.
Machine learning may make use of a wide variety of data visualization approaches (a short matplotlib sketch follows the list below). These include:
1. Line Charts: In a line chart, each data point is represented by a point on the graph, and these points are connected by a line. We can find patterns and trends in the data across time by using line charts. Time-series data is frequently displayed using line charts.
2. Scatter Plots: A quick and efficient method of displaying the relationship between two variables is to use scatter plots. With one variable plotted on the x-axis and the other variable drawn on the y-axis, each data point in a scatter plot is represented by a point on the graph. We may use scatter plots to visualize data to find patterns, clusters, and outliers.
3. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar chart, each category is represented by a bar, with the height of the bar indicating the frequency or proportion of that category in the data. Bar charts are useful for comparing several categories and seeing patterns over time.
4. Heat Maps: Heat maps are a type of graphical representation that displays data in a matrix format. The value of the data point that each matrix cell represents determines its hue. Heat maps are often used to visualize correlations between variables or to identify patterns in time-series data.
5. Tree Maps: Tree maps are used to display hierarchical data in a compact format and are useful in showing the relationship between different levels of a hierarchy.
6. Box Plots: Box plots are a graphical representation of the distribution of a set of data. In a box plot, the median is shown by a line inside the box, while the central box depicts the interquartile range of the data. The whiskers extend from the box to the highest and lowest values in the data, excluding outliers. Box plots can help us to identify the spread and skewness of the data.
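A short matplotlib sketch showing two of these chart types on synthetic data (invented for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x, y)      # scatter plot: relationship between two variables
axes[0].set_title("Scatter plot")
axes[1].boxplot(y)         # box plot: median, spread, and outliers
axes[1].set_title("Box plot")
plt.tight_layout()
plt.show()
```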
Data visualization has several uses in machine learning. It can be used to:
o Identify trends and patterns in data: It may be challenging to spot trends and patterns in data using conventional approaches, but data visualization tools make this much easier.
o Communicate insights to stakeholders: Data visualization can be used to communicate insights to stakeholders in a format that is easily understandable and can help to support decision-making processes.
o Monitor machine learning models: Data visualization can be used to monitor machine learning models in real time and to identify any issues or anomalies in the data.
o Improve data quality: Data visualization can be used to identify outliers and inconsistencies in the data and to improve data quality by removing them.
Data Pre-Processing in Machine Learning
Data pre-processing matters for several reasons:
1. Improves Model Accuracy: Clean and standardized data helps models better understand relationships and patterns.
2. Ensures Data Compatibility: ML algorithms have specific requirements for data types, scales, and distributions.
3. Enhances Efficiency: Properly structured data reduces computational costs and speeds up training.
Steps in Data Pre-Processing
1. Data Cleaning
Handling Missing Values:
o Missing values can result from data entry errors, equipment malfunctions, or other issues.
o Imputation: Replace missing values with the mean, median, or mode.
Removing Noise: Smooth out random errors in the data, for example by binning or averaging.
Handling Outliers: Detect and treat extreme values that can distort the model.
Deduplication: Remove duplicate records.
2. Data Transformation
Normalization: Rescale features to a fixed range, typically [0, 1] (a short scaling sketch follows this list).
o Suitable for algorithms like KNN or neural networks that are sensitive to feature magnitudes.
Standardization: Rescale features to zero mean and unit variance.
o Commonly used in algorithms that assume Gaussian-like distributions (e.g., SVM, Logistic Regression).
3. Feature Engineering
Feature Selection:
o Choose the most relevant features to improve performance and reduce overfitting.
Feature Creation:
o Derive new features that better capture the data's essence (e.g., text embeddings).
Dimensionality Reduction:
o Reduce the feature count using techniques like Principal Component Analysis (PCA) or t-SNE.
4. Data Splitting
Divide data into subsets for training, validation, and testing:
o Training Set: Used to fit the model (e.g., 70-80% of data).
o Validation Set: Used to tune hyperparameters and prevent overfitting (e.g., 10-15% of data).
o Test Set: Used to evaluate the model's performance (e.g., 10-15% of data).
5. Data Integration
Combine data from multiple sources (e.g., databases, APIs) into a unified format.
6. Data Reduction
Reduce data size or complexity without losing critical information:
o Aggregation: Summarize data (e.g., monthly averages instead of daily data).
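A short scaling sketch with scikit-learn, contrasting normalization and standardization on an invented toy feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50.0, 1200.0],
              [80.0, 3000.0],
              [65.0, 2100.0]])  # toy feature matrix (invented values)

print(MinMaxScaler().fit_transform(X))    # normalization: each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: zero mean, unit variance per column
```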
Tools for Data Pre-Processing
1. Python Libraries:
o Libraries such as pandas, NumPy, and scikit-learn cover most cleaning, transformation, and splitting tasks.
2. ETL Tools:
o Tools like Talend, Apache NiFi, or Alteryx help in data integration and cleaning.
Example Workflow
1. Data Cleaning:
o Handle missing values in columns like "Lot Size" by imputing the mean.
2. Transformation:
o Scale numerical features such as "Lot Size" so that they share a comparable range.
3. Feature Engineering:
o Combine "Year Built" and "Year Renovated" into a single feature: "Age".
4. Splitting:
o Split data into 70% training, 15% validation, and 15% test sets.
Challenges in Data Pre-Processing
1. Time-Consuming: Cleaning and organizing data can take up a significant portion of project time.
2. Data Imbalance: Uneven class distributions require techniques like SMOTE or stratified sampling.
3. Scalability: Processing massive datasets efficiently demands distributed computing tools like Apache Spark.
Impact of Pre-Processing on ML
Effective data pre-processing can significantly enhance model accuracy, efficiency, and reliability. Neglecting it often results in poor model performance, unreliable insights, and biased predictions. By investing time and effort in this step, practitioners set the foundation for successful machine learning projects.
Data Augmentation in Machine Learning
In machine learning, data augmentation is a common method for manipulating existing data to artificially increase the size of a training dataset. Data augmentation aims to increase the variety and variability of the training data in order to improve the efficiency and flexibility of machine learning models.
Data augmentation can be especially beneficial when the original set of data is small, as it enables the system to learn from a larger and more varied group of samples.
By applying random transformations to the data, the expanded dataset can capture many variations of the original examples, such as different viewpoints, scales, rotations, translations, and distortions. As a result, the model can better adapt to unknown data and become more resilient to such variations.
Techniques for data augmentation can be used with a variety of data kinds, including time series, text, images, and audio. Here are a few frequently used methods of data augmentation for image data (a composed-pipeline sketch follows this list):
1. Rotation and flipping: Images can be rotated at different angles and flipped horizontally or vertically to create alternative points of view.
2. Random cropping and padding: By applying random cropping or padding to the images, various scales and translations can be simulated.
3. Scaling and zooming: The model can handle various object sizes and resolutions by rescaling the images to different sizes or zooming in and out.
4. Shearing and perspective transform: Changing an image's shape or perspective can imitate various viewing angles while also introducing deformations.
5. Color jittering: By adjusting the color characteristics of the images, including their brightness, contrast, saturation, and hue, the model can be made more resilient to variations in illumination.
6. Gaussian noise: By introducing random Gaussian noise to the images, the model's resistance to noisy inputs can be strengthened.
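As a sketch, assuming torchvision is available, several of these techniques can be composed into one augmentation pipeline (the parameter values are illustrative, not prescriptive):

```python
from torchvision import transforms

# A composed image-augmentation pipeline covering several techniques above
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # flipping
    transforms.RandomRotation(degrees=15),                     # rotation
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # cropping + scaling
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),          # color jittering
    transforms.ToTensor(),
])

# Usage: augmented = augment(pil_image)  # apply to a PIL image during training
```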
Types of Data Augmentation
Real Data Augmentation
Real data augmentation involves modifying real-world data to improve model training. This method aims to simulate real-world variations like environmental changes or noise.
Sensor Noise: Adding noise to sensor data (e.g., Gaussian noise) to simulate real-world measurement errors.
Occlusion: Partially blocking areas of an image to simulate obstacles or objects hiding parts of the scene.
Weather: Simulating weather conditions (e.g., rain, snow) to make models robust to environmental changes.
Time Series Perturbations: Altering time series data (e.g., shifts, scaling) to mimic temporal changes.
Label Smoothing: Adding noise to labels to prevent overfitting and improve prediction reliability.
Synthetic Data Augmentation
Synthetic data augmentation creates artificial data samples to increase the dataset size and variety.
Image Synthesis: Using models like GANs or VAEs to generate new images from existing ones.
Text Generation: Creating new text samples using language models or sequence-to-sequence models to diversify language patterns.
Oversampling and Undersampling: Balancing class distributions by creating synthetic examples of the minority class or reducing the majority class.
Data Interpolation/Extrapolation: Generating new samples by interpolating between or extrapolating beyond existing data points.
Feature Perturbation: Modifying input features (e.g., adding noise) to make the model more robust to input fluctuations.
Challenges Faced by Data Augmentation
Maintaining Label Integrity: Ensuring that labels remain accurate after transformations, such as when images are flipped or altered.
Overfitting from Excessive Augmentation: Over-augmentation can lead to models learning patterns specific to the augmented data, causing poor generalization on real data.
Increased Computational Cost: Augmentation can significantly increase the dataset size, demanding more storage and processing power, especially for deep learning models.
Data Privacy and Security: Generating augmented data from sensitive information can pose privacy risks and violate ethical standards.
Interpretability and Explainability: Augmented data can complicate the model's decision-making, making it harder to explain predictions and affecting transparency in critical applications.
Addressing these challenges requires careful design, validation, and balancing of augmentation techniques to ensure they improve model performance without introducing biases or complications.
What is Bias?
Bias is the difference between the average prediction of the Machine Learning model and the correct value. High bias gives a large error on training as well as testing data. It is recommended that an algorithm should always be low-biased to avoid the problem of underfitting. With high bias, the predicted values lie along a straight line that does not fit the data set accurately. Such fitting is known as underfitting of the data. This happens when the hypothesis is too simple or linear in nature. Refer to the graph given below for an example of such a situation.
High Bias in the Model
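For reference (a standard decomposition, stated here as background rather than taken from these notes), bias and its role in the expected squared error of a model \(\hat{f}\) at a point \(x\) can be written as:

```latex
\mathrm{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x),
\qquad
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2 .
```

Here \(\sigma^2\) is the irreducible noise; the bias term dominates this error when the hypothesis is too simple, which is exactly the underfitting case described above.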