Unit-1 ML notes
Unit-1
Introduction to Machine Learning
Evolution of Machine Learning (or) History of Machine Learning:
Machine learning has evolved from simple algorithms to sophisticated models, driven by
the availability of vast datasets and advancements in computing power, with key
milestones including the development of neural networks and deep learning.
Early Roots (Mid-20th Century):
Machine learning emerged from the field of artificial intelligence, with early focus on
simple algorithms and rule-based systems.
Arthur Samuel is credited with creating one of the first self-learning programs, a
checkers player, in 1952; it improved its play with experience, using search techniques
such as alpha-beta pruning. Samuel later coined the term "machine learning" in 1959.
Rise of Statistical Methods:
Machine learning algorithms began to incorporate statistical methods, leading to the
development of techniques like linear regression, logistic regression, and support vector
machines.
The Dawn of Neural Networks:
Geoffrey Hinton, often called a "godfather of deep learning", did pioneering work on
artificial neural networks, revolutionizing how machines learn from large datasets.
Deep Learning Revolution:
Deep learning, a subfield of machine learning, uses artificial neural networks with
multiple layers to analyze data, leading to breakthroughs in areas like image recognition
and natural language processing.
Current Trends:
Machine learning is now a ubiquitous technology, used in a wide range of applications,
from self-driving cars to medical diagnosis.
Types of Machine Learning:
The three primary machine learning paradigms are supervised learning, unsupervised
learning, and reinforcement learning, each differing in how they learn from data and the
type of problems they address.
Supervised Learning:
Supervised learning is a type of machine learning where a model learns from labeled
data, meaning the input data comes with corresponding correct output or "label" values.
The model is trained to map inputs to these labels, allowing it to predict outcomes on
new, unseen data.
Supervised learning is a process of providing input data as well as correct output data to
the machine learning model.
The aim of a supervised learning algorithm is to find a mapping function to map the
input variable(x) with the output variable(y).
The foundation of supervised learning is the availability of labeled data, where each
input example has a corresponding correct output.
The model learns the relationship between inputs and outputs by analyzing the labeled
data.
Types of Supervised Learning:
Regression:
Regression algorithms are used when the output variable is a real or continuous value
and there is a relationship between the input and output variables. They are used for the
prediction of continuous quantities, such as weather forecasting, market trends, etc.
Classification:
Classification algorithms are used when the output variable is categorical, meaning it
takes one of a limited set of classes, such as Yes-No, True-False, or Spam-Not Spam
(binary classification), or more than two classes (multi-class classification).
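As a hedged illustration, the following minimal sketch uses scikit-learn (an assumed
library choice, not part of these notes) with tiny invented data to show both supervised
tasks: a regressor predicting a continuous value and a classifier predicting a category.

from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: learn a mapping from input variable (x) to a continuous output (y).
X = [[1], [2], [3], [4]]
y = [1.9, 4.1, 6.0, 8.1]
reg = LinearRegression().fit(X, y)
print(reg.predict([[5]]))   # roughly 10 -- a prediction on unseen data

# Classification: learn a mapping to a categorical label (0 = No, 1 = Yes).
labels = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, labels)
print(clf.predict([[5]]))   # predicts class 1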
Examples of Supervised Learning Applications:
Image Recognition: Training a model to identify objects in images.
Spam Detection: Classifying emails as spam or not spam.
Fraud Detection: Identifying suspicious transactions.
Advantages:
Predictive Power: Supervised learning models can make accurate predictions on new
data.
Versatility: Applicable to various problems, including regression and classification.
Relatively Easy to Implement: Requires labeled data and standard algorithms.
Disadvantages:
Requires Labeled Data: Can be time-consuming and expensive to label large datasets.
Prone to Overfitting: May perform well on training data but poorly on new data if not
properly regularized.
Unsupervised Learning:
Unsupervised learning is a machine learning technique where models learn from
unlabeled data, identifying hidden patterns and structures without explicit guidance.
Unlike supervised learning, which relies on labeled data for training, unsupervised
learning focuses on discovering relationships and groupings within the data itself.
This allows the models to autonomously explore the data and extract valuable insights
without prior knowledge of the desired outcomes.
Types of Unsupervised Learning Algorithm:
Clustering:
Clustering is a method of grouping objects into clusters such that objects with the most
similarities remain in the same group and have few or no similarities with the objects of
another group.
Cluster analysis finds the commonalities between the data objects and categorizes them
as per the presence and absence of those commonalities.
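A minimal clustering sketch, assuming scikit-learn's KMeans and six invented 2-D
points; the algorithm groups the points into two clusters without any labels being
provided.

from sklearn.cluster import KMeans

# Six unlabeled 2-D points that form two visibly separate groups.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # e.g. [0 0 0 1 1 1] -- the discovered grouping
print(kmeans.cluster_centers_)   # the centre of each cluster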
Association:
An association rule is an unsupervised learning method used for finding relationships
between variables in a large database.
It identifies sets of items that frequently occur together in the dataset. Association rules
can make marketing strategies, such as product placement and cross-selling, more
effective.
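To make the idea concrete, here is a small pure-Python sketch of support and confidence
over an invented set of market-basket transactions; the rule {bread} -> {butter} is
hypothetical.

# Each transaction is the set of items bought together (invented data).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {butter}: confidence = support(both) / support(antecedent).
sup = support({"bread", "butter"})   # 3/5 = 0.6
conf = sup / support({"bread"})      # 0.6 / 0.8 = 0.75
print(f"support={sup:.2f}, confidence={conf:.2f}")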
Applications:
Customer Segmentation: Grouping customers based on their purchasing behavior or
demographics.
Fraud Detection: Identifying unusual transaction patterns that might indicate fraudulent
activity.
Image Recognition: Discovering features in images for tasks like image segmentation
or object detection.
Reinforcement Learning:
Reinforcement learning (RL) is a subfield of machine learning where an agent learns to
make decisions by interacting with an environment to maximize a reward.
It differs from supervised and unsupervised learning because it doesn't rely on pre-
labeled data or an explicit training set; instead, the agent learns through trial and error
and feedback from the environment.
Agent: The entity that interacts with the environment and makes decisions.
Environment: The context in which the agent operates, providing feedback and rewards.
Action: The choices the agent makes to interact with the environment.
State: The current situation of the environment that the agent is in.
Reward: A signal that indicates the desirability of a particular action or state.
Policy: The strategy the agent uses to select actions in different states.
How Reinforcement Learning Works:
1. Interaction: The agent interacts with the environment, taking actions and observing the
resulting state and reward.
2. Learning: The agent uses the feedback (reward) to update its policy, learning which
actions lead to higher rewards.
3. Optimization: The goal is to find the optimal policy that maximizes the cumulative reward
over time.
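As a hedged sketch of this loop, the following Q-learning example uses an invented
five-state corridor environment (states 0-4, reward 1 for reaching state 4); Q-learning is
one standard RL algorithm, not the only one.

import random

# Invented environment: a corridor of states 0..4; reaching state 4 pays reward 1.
n_states, actions = 5, [-1, +1]   # the agent can move left (-1) or right (+1)
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

for episode in range(500):
    s = 0
    while s != 4:
        # Interaction: pick an action (mostly greedy, occasionally random).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), n_states - 1)   # observe the resulting state
        r = 1.0 if s_next == 4 else 0.0             # observe the reward
        # Learning: move Q toward reward + discounted best future value.
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# Optimization: the learned policy moves right (+1) in every state.
print([max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)])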
Examples of Reinforcement Learning Applications:
Robotics: Training robots to perform tasks like navigation or manipulation.
Game Playing: Developing AI agents that can play games like chess or video games.
Natural Language Processing: Optimizing text generation models or chatbots.
Finance: Developing algorithms for trading or portfolio management.
Healthcare: Designing personalized treatment plans or drug discovery.
---------------------------------------------------------------------------------------------------------------
Learning by Rote:
In the context of machine learning, "rote learning" refers to a model's ability to
memorize data and patterns without truly understanding the underlying concepts or
context, leading to limited generalization and application of knowledge.
Memorization over Understanding:
Rote learning in AI is characterized by a model simply storing and reproducing data or
patterns without the ability to generalize or apply knowledge effectively.
Lack of Generalization:
A model that relies on rote learning might perform well on the specific data it was
trained on, but struggle when presented with new, unseen data or situations.
Simple Learning Pattern:
Rote learning can be viewed as a simple learning pattern where a machine is
programmed to keep a history of calculations and compare new input against its history of
inputs and outputs, retrieving the stored output if present.
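This "history of inputs and outputs" is essentially memoization; a minimal Python sketch
(with an invented squaring function) makes the limitation visible: stored inputs are
answered instantly, but nothing is generalized to unseen inputs.

# Rote learning as a lookup table: store every (input, output) pair seen so far.
history = {}

def rote_learn(x, compute):
    if x in history:          # input seen before: retrieve the stored output
        return history[x]
    y = compute(x)            # otherwise compute once and memorize the result
    history[x] = y
    return y

print(rote_learn(3, lambda v: v * v))   # computed fresh: 9
print(rote_learn(3, lambda v: v * v))   # retrieved from history: 9
# An input not in the history (e.g. 4) gives the model nothing to retrieve:
# there is no understanding to generalize from, only stored pairs.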
Contrast with Meaningful Learning:
Rote learning is often contrasted with "meaningful learning," where a model not only
memorizes information but also understands the underlying concepts and relationships.
------------------------------------------------------------------------------------------------------------------
Learning by Induction:
In machine learning, learning by induction, or inductive learning, involves inferring
general rules or patterns from specific instances or examples to make broader
generalizations or predictions.
It's a fundamental approach where algorithms learn from data to identify underlying
structures and relationships, enabling them to make predictions or classifications on
new, unseen data.
Generalization from Specifics:
Inductive learning focuses on identifying common patterns or features within a dataset
and using these patterns to create a model that can accurately predict or classify new,
unseen data points.
Inductive Bias:
Inductive learning algorithms often incorporate inductive biases, which are assumptions
or constraints that guide the learning process and influence the types of models they can
learn.
Relationship to Other Machine Learning Paradigms:
Supervised Learning: Inductive learning is often used in supervised learning, where the
algorithm learns from labeled data to predict target values.
Unsupervised Learning: While less common, inductive learning can also be applied in
unsupervised learning to discover patterns and structures in unlabeled data.
Examples of Inductive Learning Algorithms:
Decision Trees: Algorithms that create tree-like structures to represent decisions and
predictions based on data features.
Support Vector Machines (SVMs): Algorithms that find the optimal hyperplane to
separate data points into different classes.
Neural Networks: Algorithms inspired by the structure and function of the human
brain, capable of learning complex patterns.
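As one concrete instance, the sketch below fits a scikit-learn decision tree (an assumed
library choice) to invented study-hours examples; the tree induces a general rule from the
specific instances and applies it to an unseen one.

from sklearn.tree import DecisionTreeClassifier, export_text

# Specific instances: [hours_studied] -> pass (1) or fail (0).
X = [[1], [2], [3], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["hours_studied"]))  # the induced general rule
print(tree.predict([[5]]))   # the rule generalizes to an unseen instance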
------------------------------------------------------------------------------------------------------------------
Reinforcement Learning:
Reinforcement learning (RL) is a machine learning paradigm where an agent learns to
make decisions in an environment through trial and error, aiming to maximize a
cumulative reward.
It differs from supervised learning by not requiring explicit training data, instead relying
on feedback from the environment.
Reinforcement learning is a type of machine learning where an agent learns to take
actions in an environment to achieve a specific goal.
The agent interacts with the environment, taking actions and receiving feedback in the
form of rewards or penalties.
Agent: The entity that interacts with the environment and makes decisions.
Environment: The context in which the agent operates, providing feedback and states.
Actions: The choices the agent can make in the environment.
Rewards/Penalties: Feedback received by the agent based on its actions, used to guide
learning.
Policy: The strategy the agent uses to make decisions.
Examples:
Training a robot to navigate an environment.
Developing an AI to play games like Go or chess.
Optimizing resource allocation in a business.
Advantages:
Adaptability: RL agents can learn to adapt to changing environments.
Flexibility: RL can be used for a wide range of tasks.
Automation: RL can be used to automate tasks that would otherwise require human
intervention.
Disadvantages:
Exploration vs. Exploitation: Finding the right balance between exploring new actions
and exploiting known good actions can be challenging.
Computational Cost: Training RL models can be computationally expensive.
Reward Design: Designing effective reward functions can be difficult.
Types of Reinforcement Learning:
Model-based: The agent learns a model of the environment to predict the outcomes of
actions.
Model-free: The agent learns directly from its interactions with the environment without
needing a model.
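The exploration-exploitation trade-off noted above can be sketched with an invented
3-armed bandit: each action pays out with a different hidden probability, and an
epsilon-greedy agent balances trying new actions against repeating the best-known one.

import random

true_p = [0.2, 0.5, 0.8]    # hidden payout probability of each action (invented)
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]    # running estimate of each action's average reward
epsilon = 0.1

for step in range(2000):
    if random.random() < epsilon:
        a = random.randrange(3)          # explore: try a random action
    else:
        a = values.index(max(values))    # exploit: repeat the best-known action
    r = 1.0 if random.random() < true_p[a] else 0.0
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # incremental average update

print(values)   # estimates roughly approach [0.2, 0.5, 0.8]; action 2 dominates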
Types of Data:
Data used in machine learning is broadly divided into quantitative (numerical) data, such
as discrete counts and continuous measurements, and qualitative (categorical) data, such
as nominal and ordinal values.
------------------------------------------------------------------------------------------------------------------
Matching:
In machine learning, "matching" or "data matching" refers to the process of comparing
datasets to identify similarities or relationships between them, often to determine
whether two records refer to the same entity, as in merging customer profiles or
healthcare records.
Data matching, also known as record linkage or entity resolution, is the process of
comparing data from different sources to find commonalities, overlaps, or connections.
The goal is to link records that belong to the same entity or individual, even if they
appear in different datasets with variations in data format or structure.
1. Data Matching: Consolidating, reconciling, or linking disparate datasets to reveal
patterns, associations, or duplicates.
Methods:
Deterministic Matching: Identifies exact matches based on predefined criteria.
Probabilistic Matching: Assesses the likelihood of matches based on various factors.
Machine Learning-based Matching: Uses algorithms to learn from data and identify
matches based on similarity.
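A small sketch of the first two methods, using only the Python standard library and two
invented customer records: deterministic matching demands exact equality, while a
similarity score with a threshold behaves like a simple probabilistic matcher.

from difflib import SequenceMatcher

a = {"name": "Jon Smith",  "dob": "1990-04-12"}
b = {"name": "John Smith", "dob": "1990-04-12"}

# Deterministic matching: exact equality on predefined fields.
print(a["name"] == b["name"] and a["dob"] == b["dob"])   # False: names differ

# Probabilistic-style matching: score the similarity and apply a threshold.
score = SequenceMatcher(None, a["name"], b["name"]).ratio()
print(score)                                   # about 0.95
print(score > 0.85 and a["dob"] == b["dob"])   # True: treated as the same entity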
2. Sentence/Text Matching: Determining if two or more sentences or pieces of text are
semantically similar.
Methods:
Bi-Encoders: Use a language model to encode each sentence into a vector and then
compare the vectors for similarity.
Cross-Encoders: Simultaneously analyze both sentences and predict a similarity score.
3. Matching in Graph Theory: Solving graph matching problems, where the goal is to find a
set of edges in a graph that do not share any vertices.
4. Matching in Natural Language Processing (NLP): Finding relationships between words,
phrases, or sentences based on their semantic meaning.
Methods:
Cosine Similarity: Measures the similarity between two vectors representing the
meaning of text.
Word Embeddings: Represent words as vectors in a high-dimensional space, allowing
for similarity comparisons.
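A minimal sketch of cosine similarity using NumPy, with three invented 3-dimensional
"embeddings"; real word embeddings have hundreds of dimensions, but the computation
is identical.

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Invented toy embeddings; real ones are learned from text.
king  = np.array([0.9, 0.1, 0.4])
queen = np.array([0.8, 0.2, 0.5])
apple = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(king, queen))   # near 1: semantically close
print(cosine_similarity(king, apple))   # much lower: semantically distant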
5. Matching in Other Applications:
Product Matching: Connecting customer search queries to the most relevant products.
Item Matching: Ensuring consistency and avoiding duplication of product information
on online platforms.
Image Matching: Finding similar images in a dataset.
-----------------------------------------------------------------------------------------
Stages in Machine Learning:
The machine learning process typically involves several key stages: data collection and
preparation, model selection and training, evaluation, and deployment.
1. Data Collection and Preparation:
Data Collection: Gather relevant data from various sources.
Data Cleaning: Address missing values, inconsistencies, and errors.
Data Preprocessing: Transform and format data for model training, including scaling,
encoding, and feature engineering.
Data Splitting: Divide the data into training, validation, and testing sets.
2. Model Selection and Training:
Model Selection: Choose an appropriate machine learning algorithm based on the
problem type and data characteristics.
Model Training: Train the selected model using the training data to learn patterns and
relationships.
Hyperparameter Tuning: Optimize the model's performance by adjusting its
parameters.
3. Evaluation:
Model Evaluation: Assess the model's performance using the validation or testing set.
Performance Metrics: Use relevant metrics (e.g., accuracy, recall) to quantify model
performance.
Iterative Improvement: Refine the model based on evaluation results and repeat
training and evaluation as needed.
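For illustration, assuming scikit-learn's metric functions and an invented set of true
labels and predictions, the two metrics named above are computed as follows.

from sklearn.metrics import accuracy_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # labels of the held-out evaluation set (invented)
y_pred = [1, 0, 0, 1, 0, 1]   # the model's predictions on that set

print(accuracy_score(y_true, y_pred))   # 5/6: share of all predictions correct
print(recall_score(y_true, y_pred))     # 3/4: share of actual positives found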
4. Deployment:
Model Deployment: Integrate the trained model into a production environment for
real-time predictions.
Monitoring and Maintenance: Continuously monitor the model's performance and retrain
it as needed to maintain accuracy.
-----------------------------------------------------------------------------------------
Data Acquisition:
Data acquisition refers to the process of collecting, gathering, and preparing data from
various sources to build and train a machine learning model, ensuring the data is in a
usable format.
Data acquisition is the initial and crucial step in any machine learning project, focusing
on identifying, accessing, and preparing the data that will be used to train and evaluate
models.
The quality and relevance of the acquired data directly impact the performance and
accuracy of the resulting machine learning model.
Data Sources:
Data can be sourced from various places, including internal databases, external APIs,
public datasets, sensor readings, and more.
Data Preparation:
Once acquired, the data often needs cleaning, transformation, and formatting to make it
suitable for machine learning algorithms.
Data Acquisition Strategies:
Public Databases: Utilizing publicly available datasets for training machine learning
models.
Manual Data Collection: Manually acquiring data from real-world settings.
Data Collection Services: Utilizing specialized services for data acquisition.
------------------------------------------------------------------------------------------------------------------
Feature Engineering:
Feature engineering in machine learning is the process of transforming raw data into
more useful and meaningful features that improve model performance, accuracy, and
speed.
It involves selecting, modifying, or creating new variables to better represent the
underlying problem and enhance the predictive capabilities of machine learning models.
Improved Model Performance:
Well-engineered features can significantly boost the accuracy and predictive power of
machine learning models.
Enhanced Generalization:
Feature engineering helps models generalize better to unseen data, reducing overfitting.
Faster Training:
By providing relevant and concise features, feature engineering can speed up the training
process.
Better Interpretability:
Well-chosen features can make models easier to understand and interpret.
Data Preparation:
Feature engineering is a crucial preprocessing step, ensuring that the data is in a suitable
format for machine learning algorithms.
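A small sketch of three common feature-engineering operations, assuming pandas and an
invented three-row dataset: creating a new feature, scaling a numeric one, and one-hot
encoding a categorical one.

import pandas as pd

df = pd.DataFrame({
    "height_cm": [170, 185, 160],
    "weight_kg": [70, 95, 55],
    "city": ["Delhi", "Mumbai", "Delhi"],
})

# Create a new, more informative feature from raw ones (BMI = kg / m^2).
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# Scale a numeric feature to zero mean and unit variance.
df["height_scaled"] = (df["height_cm"] - df["height_cm"].mean()) / df["height_cm"].std()

# Encode the categorical feature as one-hot (0/1) columns.
df = pd.get_dummies(df, columns=["city"])
print(df)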
------------------------------------------------------------------------------------------------------------------
Data Representation:
Data representation refers to how raw data is encoded, typically as numeric vectors,
matrices, or tensors, so that machine learning algorithms can process it; a good
representation preserves the information that is relevant to the learning task.
------------------------------------------------------------------------------------------------------------------
Model Selection:
Model selection in machine learning is the process of choosing the best algorithm and
model architecture for a specific task by evaluating various options based on their
performance and alignment with the problem's requirements.
Factors to Consider During Model Selection:
Problem Type: Is it a classification, regression, or clustering problem?
Data Characteristics: What are the data types, features, and size?
Model Complexity: Consider the trade-off between model complexity and
performance.
Computational Resources: Some models require more computational power than
others.
Common Model Selection Techniques:
Cross-Validation: Splitting the data into multiple folds and training/testing the model
on different combinations to assess its performance.
Train-Test Split: Dividing the data into training and testing sets to evaluate the model's
ability to generalize.
Model Comparison: Comparing the performance of different models using appropriate
metrics.
Ensemble Methods: Combining the predictions of multiple models to improve
performance.
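As a hedged sketch of cross-validation-based model comparison, assuming scikit-learn
and its bundled iris dataset, two candidate models are scored on the same five folds.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each model is trained and tested on five different splits.
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())   # compare mean accuracy across folds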
-------------------------------------------------------------------------
Model Learning:
"Model learning" refers to the process where algorithms, or models, are trained on data
to identify patterns and make predictions or decisions, improving their performance over
time.
How Model Learning Works:
Training Data: Machine learning models are trained on a dataset that contains both
input features and corresponding target values or unlabeled data.
Learning Algorithm: The learning algorithm is the core of the model, and it uses the
training data to learn patterns and relationships.
Iterative Process: The learning process is often iterative, meaning the model
continuously adjusts its internal parameters to improve its performance on the training
data.
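The iterative adjustment of parameters can be sketched in pure Python with an invented
one-parameter model y = w * x, trained by gradient descent on four example points.

# Training data: the true relationship is y = 2x (invented for illustration).
X = [1, 2, 3, 4]
y = [2, 4, 6, 8]
w, lr = 0.0, 0.01   # the model's single parameter and the learning rate

for step in range(200):   # iterative process: repeatedly nudge w to reduce error
    grad = sum(2 * (w * xi - yi) * xi for xi, yi in zip(X, y)) / len(X)
    w -= lr * grad        # adjust the parameter against the error gradient
print(w)                  # converges toward 2.0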
Concepts:
Algorithm: The specific method used to train the model.
Parameters: The internal settings of the model that are adjusted during training.
Features: The input variables used to train the model.
Target Variable: The output variable that the model is trying to predict.
Accuracy: A measure of how well the model performs.
Generalization: The ability of the model to perform well on new, unseen data.
Examples of Model Learning:
Image Recognition
Spam Detection
Model Prediction:
Model prediction involves using a trained model to make predictions on new, unseen
data, based on patterns learned from historical data, enabling organizations to forecast
future outcomes and make data-driven decisions.
How Model Prediction Works:
Training: The model is first trained on a dataset, allowing it to identify patterns and
relationships between input variables and the target variable.
Prediction: Once trained, the model can then be used to predict the value of the target
variable for new, unseen data points.
Examples: This can include predicting customer churn, identifying fraudulent
transactions, or forecasting stock prices.
Types of Models:
Classification: Predicts the category or class to which an input belongs.
Regression: Predicts a continuous numerical value.
Clustering: Groups similar data points together.
Time Series: Analyzes data over time to identify trends and patterns.
Applications:
Business: Predicting customer behavior, sales forecasting, and identifying marketing
trends.
Finance: Fraud detection, risk assessment, and algorithmic trading.
Healthcare: Disease diagnosis, drug discovery, and personalized medicine.
Engineering: Predictive maintenance, process optimization, and quality control.
------------------------------------------------------------------------------------------------------------------
Search and Learning:
"Search" refers to algorithms that find optimal solutions or relevant information, while
"learning" focuses on algorithms that improve performance through data and experience.
They often work together, with learning used to refine search processes and improve
relevance.
Search Algorithms in Machine Learning: Search algorithms are used to explore a space of
possible solutions to find the best or most relevant one.
Datasets:
A dataset is essentially a collection of data points that a machine learning algorithm can
analyze to learn patterns and make predictions.
Datasets are crucial for building accurate and reliable machine learning models.
Training Datasets
The training dataset is the largest subset and forms the foundation of model
development. The model uses this data to identify patterns, relationships, and trends.
Characteristics:
Large size for better learning opportunities.
Well-labeled for supervised learning tasks.
Validation Datasets
Purpose: To fine-tune the model.
Validation datasets evaluate the model during training, helping you adjust parameters
like learning rates or weights to prevent overfitting.
Characteristics:
Separate from the training data.
Small but representative of the problem space.
Testing Datasets
Purpose: To evaluate the model.
The testing dataset provides an unbiased assessment of the model’s performance on
unseen data.
Characteristics:
Exclusively used after training and validation.
Should remain untouched during the training process.
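A minimal sketch of producing all three subsets, assuming scikit-learn's train_test_split
and invented data of 100 rows: one split carves off the testing set, and a second splits the
remainder into training and validation sets.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(100, 1), np.arange(100)

# First hold out 20% as the testing set, untouched during training...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remainder into training (75%) and validation (25%) sets.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20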