AI - Learning
Chapter 1: Learning
1. Supervised Machine Learning Algorithms:
• The primary purpose of supervised learning is to generalize from labeled
sample data in order to make predictions about unavailable, future, or
unseen data.
• In supervised learning, there are input variables (x) and an output
variable (Y), and an algorithm is used to learn the mapping function
from the input to the output, Y = f(x).
• The goal is to approximate the mapping function so well that when new
input data (x) arrives, the machine can predict the output variable (Y)
for that data (see the sketch below).
• Supervised machine learning includes two major
processes: classification and regression.
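A minimal sketch of learning the mapping Y = f(x) from labeled samples; scikit-learn's LinearRegression and the toy data are illustrative choices, not prescribed by the slides:

    # A minimal sketch: learn Y = f(x) from labeled samples, then predict unseen x.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Labeled sample data: x -> Y (here roughly Y = 2x + 1, with a little noise)
    x = np.array([[1.0], [2.0], [3.0], [4.0]])
    Y = np.array([3.1, 4.9, 7.2, 9.0])

    model = LinearRegression()
    model.fit(x, Y)                    # approximate the mapping function f

    print(model.predict([[5.0]]))      # predict Y for new, unseen input x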
❑ Classification is the process of categorizing a set of data into classes
(e.g., yes/no, true/false, 0/1, yes/no/maybe). There are various types of
classification problems, such as binary classification, multi-class
classification, and multi-label classification. Examples of classification
problems are spam filtering, image classification, sentiment analysis,
classifying cancerous and non-cancerous tumors, customer churn
prediction, etc.
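A minimal binary-classification sketch, assuming scikit-learn; the spam features and labels below are invented for illustration:

    # Toy binary classification: spam (1) vs. not spam (0).
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical features per email: [number of links, number of ALL-CAPS words]
    X = [[0, 1], [8, 6], [1, 0], [7, 9], [0, 0], [5, 7]]
    y = [0, 1, 0, 1, 0, 1]              # labels from the labeled sample data

    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[6, 5]]))        # classify a new, unseen email -> [1] (spam)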
❑ Dimensionality reduction: Most of the time, there is a lot of noise in the incoming data.
Machine learning algorithms use dimensionality reduction to remove this noise while
distilling the relevant information. Examples: image compression, reducing a large set of
email features to a smaller set before classifying emails as “not spam” or “spam”.
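A minimal sketch of dimensionality reduction with scikit-learn's PCA; the noisy 3-D data is generated for illustration:

    # Project noisy data onto its main directions of variation.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    signal = rng.normal(size=(100, 2))                # 2 informative dimensions
    noise = 0.05 * rng.normal(size=(100, 1))          # 1 near-noise dimension
    X = np.hstack([signal, noise])                    # 100 samples, 3 features

    X_reduced = PCA(n_components=2).fit_transform(X)  # keep the 2 relevant ones
    print(X_reduced.shape)                            # (100, 2)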
• For each algorithm, the bag generator specifies how many instances should
be in each bag.
• A bag can be an image, a text document, a set of records for stocks, and so
on.
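A minimal sketch of a bag generator in this multi-instance setting; the fixed bag size and the patch instances are assumptions for illustration:

    # Group individual instances into fixed-size bags (multi-instance learning).
    def bag_generator(instances, bag_size):
        """Yield consecutive bags of `bag_size` instances each."""
        for i in range(0, len(instances) - bag_size + 1, bag_size):
            yield instances[i:i + bag_size]

    # E.g., image patches as instances; each bag could represent one image.
    patches = [f"patch_{k}" for k in range(9)]
    for bag in bag_generator(patches, bag_size=3):
        print(bag)    # ['patch_0', 'patch_1', 'patch_2'], ...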
c) Inaccurate Supervision
• In this type, there are wrong or
low-quality labels.
• Inaccurate labels usually come from collecting public or
crowdsourced data sets.
• The idea is to identify the potential mislabeled instances and to
correct or remove them.
• One of the practical techniques to achieve this idea is the data-editing
approach.
• The data-editing approach constructs a graph of relative neighborhoods,
where each node is an instance, and an edge connects two nodes of
different labels. An instance (node) is considered suspicious if it is
connected to many edges. This suspicious instance is then removed or
re-labeled according to the majority.
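A minimal sketch of the data-editing idea described above; the neighborhood graph, labels, and suspicion threshold are invented for illustration:

    # Data editing sketch: flag instances with many edges to differently
    # labeled neighbors, then relabel them by majority vote of their neighbors.
    from collections import Counter

    def edit_labels(neighbors, labels, threshold=2):
        """neighbors: {node: [neighbor nodes]} from a neighborhood graph.
        labels: {node: label}. Returns a corrected copy of labels."""
        corrected = dict(labels)
        for node, nbrs in neighbors.items():
            # Count edges whose endpoints carry different labels.
            cut_edges = sum(1 for n in nbrs if labels[n] != labels[node])
            if cut_edges >= threshold:                     # suspicious node
                majority = Counter(labels[n] for n in nbrs).most_common(1)[0][0]
                corrected[node] = majority                 # relabel by majority
        return corrected

    nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    labs = {0: "A", 1: "A", 2: "B", 3: "A"}                # node 2 looks mislabeled
    print(edit_labels(nbrs, labs))                         # {0:'A', 1:'A', 2:'A', 3:'A'}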
• For crowdsourcing data sets, the labels are obtained according to the
majority vote of the annotators.
6. Self-Supervised Machine Learning Algorithms:
• Self-supervised learning (SSL) is an evolving machine learning
technique poised to solve the challenges posed by
over-dependence on labeled data.
• Here the model trains itself to learn one part of the input from another
part of the input. It is also known as predictive or pretext learning.
• In this process, the unsupervised problem is transformed into a
supervised problem by auto-generating the labels.
• To make use of the huge quantity of unlabeled data, it is crucial to set
the right learning objectives to get supervision from the data itself.
• The core of the self-supervised learning method is to predict any
hidden part of the input from any unhidden part of the input.
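A minimal sketch of auto-generating labels from unlabeled text with a masked-word pretext task; the masking scheme is an illustrative choice:

    # Turn unlabeled sentences into (input, label) pairs by hiding one word:
    # the hidden word becomes the supervision signal (pretext task).
    def make_pretext_pairs(sentence):
        words = sentence.split()
        for i, word in enumerate(words):
            masked = words[:i] + ["[MASK]"] + words[i + 1:]
            yield " ".join(masked), word      # (input with mask, auto-generated label)

    for inp, label in make_pretext_pairs("the moon joins venus at dawn"):
        print(inp, "->", label)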
Contents
• Types of Data: Structured and Unstructured Data, Quantitative and
Qualitative Data.
• Four Levels of Data (Nominal, Ordinal, Interval, Ratio).
1. Structured vs Unstructured
▪ Structured (Organized) Data: Data stored in a
row/column structure.
• Every row represents a single observation, and the columns
represent the characteristics of that observation.
• Unstructured (Unorganized) Data: Type of data that is
in free form and does not follow any standard
format/hierarchy.
• E.g., text or raw audio signals that must be parsed
further to become organized.
Pros of Structured Data
▪ Structured data is generally thought of as being much
easier to work with and analyze.
▪ Most statistical and machine learning models were
built with structured data in mind and cannot work
with loosely formatted unstructured data.
▪ The natural row and column structure is easy to digest
for human and machine eyes.
Example of Data Pre-processing for Text Data
• Text data is generally unstructured, and hence there is
a need to transform it into structured form.
• A few characteristics that describe the data to assist
transformation are:
❑ Word/phrase count
❑ The existence of certain special characters
❑ The relative length of text
❑ Picking out topics
Example: A Tweet
• This Wednesday morn, are you early to rise? Then look East.
The Crescent Moon joins Venus & Saturn. Afloat in the dawn
skies.
Topic: Astronomy
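A minimal sketch computing the characteristics listed above for the example tweet (topic identification is omitted, since it requires a trained model):

    # Extract simple structured features from unstructured text.
    tweet = ("This Wednesday morn, are you early to rise? Then look East. "
             "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

    features = {
        "word_count": len(tweet.split()),        # word/phrase count
        "has_question_mark": "?" in tweet,       # existence of special characters
        "has_ampersand": "&" in tweet,
        "char_length": len(tweet),               # relative length of text
    }
    print(features)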
2. Qualitative/Quantitative
1. Quantitative data: Data that can be described using
numbers, and on which basic mathematical procedures,
including addition, subtraction, etc., can be performed.
Feature Engineering
• Feature engineering is the pre-processing step of machine
learning, which is used to transform raw data into features that
can be used for creating a predictive model using Machine learning.
• All machine learning algorithms take input data to generate the
output.
• The input data is in a tabular form consisting of rows
(instances or observations) and columns (variables or attributes),
and these attributes are often known as features.
• Example: An image is an instance in computer vision, but a line in
the image could be the feature. In NLP, a document can be an
observation, and the word count could be the feature.
• A feature is an attribute or individual measurable property or
characteristic of a phenomenon.
• The feature engineering process selects the most useful predictor
variables for the model.
2. Transformation:
• It involves adjusting the predictor variables to improve the accuracy and
performance of the model.
• It ensures that all the variables are on the same scale, making the
model easier to understand.
• It ensures that all the features are within the acceptable range to avoid
any computational error.
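A minimal sketch of a transformation step, assuming min-max scaling with scikit-learn (one common way to put all variables on the same scale):

    # Rescale predictor variables to a common [0, 1] range.
    from sklearn.preprocessing import MinMaxScaler

    X = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]   # features on very different scales
    X_scaled = MinMaxScaler().fit_transform(X)
    print(X_scaled)                                  # both columns now lie in [0, 1]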
3. Feature Extraction:
• Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data.
• The aim is to reduce the volume of data so that it can be easily
used and managed for data modelling.
• Feature extraction methods include cluster analysis, text
analytics, edge detection algorithms, and principal components
analysis (PCA).
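A minimal sketch of feature extraction via text analytics, assuming scikit-learn's TfidfVectorizer; the documents are invented for illustration:

    # Extract numeric features from raw text documents.
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the moon joins venus", "venus rises at dawn", "look east at dawn"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)   # new variables extracted from raw data
    print(X.shape)                       # (3 documents, vocabulary-size features)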
4. Feature Selection:
• Feature selection is a way of selecting the subset of the most
relevant features from the original feature set by removing
redundant, irrelevant, or noisy features.
• This is done in order to reduce overfitting in the model and improve
its performance.
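A minimal feature-selection sketch, assuming scikit-learn's SelectKBest with the chi-squared score (one of several possible selection criteria):

    # Keep only the k most relevant features; drop redundant/irrelevant ones.
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)          # 4 original features
    X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
    print(X.shape, "->", X_new.shape)          # (150, 4) -> (150, 2)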
Feature Engineering Techniques
1. Imputation:
• Imputation deals with handling missing values in data.
• Deleting records with missing values is one way of dealing with the
missing data issue. But it could lead to losing a chunk of valuable
data. This is where imputation can help.
• Data imputation can be classified into two types:
❑ Categorical Imputation: Missing categorical values are generally
replaced by the most commonly occurring value (mode) of the
feature.
❑ Numerical Imputation: Missing numerical values are generally
replaced by the mean or median of the corresponding feature.
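A minimal sketch of both imputation types, assuming pandas; the toy table is invented for illustration:

    # Fill missing values: mode for categorical, median for numerical features.
    import pandas as pd

    df = pd.DataFrame({
        "color": ["red", "blue", None, "red"],     # categorical feature
        "price": [10.0, None, 30.0, 20.0],         # numerical feature
    })
    df["color"] = df["color"].fillna(df["color"].mode()[0])   # mode -> "red"
    df["price"] = df["price"].fillna(df["price"].median())    # median -> 20.0
    print(df)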
• Example 2: A timestamp is split into 6 different attributes, as sketched below.
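A minimal sketch of that split, assuming pandas and that the six attributes are year, month, day, hour, minute, and second (the slides do not say which six are meant):

    # Decompose one timestamp column into 6 separate attributes.
    import pandas as pd

    df = pd.DataFrame({"timestamp": pd.to_datetime(["2023-08-08 14:30:15"])})
    ts = df["timestamp"].dt
    df["year"], df["month"], df["day"] = ts.year, ts.month, ts.day
    df["hour"], df["minute"], df["second"] = ts.hour, ts.minute, ts.second
    print(df.drop(columns="timestamp"))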