Machine Learning Unit 2
Data set:
A dataset is a collection of related information or records. The information may be about some entity or some subject area. Each row of a dataset is called a record. Each dataset also has multiple attributes, each of which gives information on a specific characteristic. For example, in a dataset on students, there are four attributes, namely Roll Number, Name, Gender, and Age, each of which understandably describes a specific characteristic of the student entity. An attribute can also be termed a feature, variable, dimension, or field.
Types of DATA:
Data can broadly be divided into following two types:
1. Qualitative data
2. Quantitative data
1. Qualitative data: Qualitative data provides information about the quality of an object, or information which cannot be measured. For example, the quality of performance of students may be described as ‘Good’, ‘Average’, or ‘Poor’. Similarly, the name or roll number of a student is information that cannot be measured using any scale of measurement. Qualitative data is also called categorical data.
Qualitative data can be further subdivided into two types as follows:
Nominal data: has named values
Ordinal data: has named values which can be naturally ordered
Nominal data is one which has no numeric value, but a named value. It is used for assigning named
values to attributes. Nominal values cannot be quantified. Examples of nominal data are
Blood group: A, B, O, AB, etc.
Nationality: Indian, American, British, etc.
Gender: Male, Female, Other
Ordinal data, in addition to possessing the properties of nominal data, can also be naturally ordered. This means ordinal data also assigns named values to attributes but, unlike nominal data, these values can be arranged in a sequence of increasing or decreasing value, so that we can say whether a value is better than or greater than another. Examples of ordinal data are
Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
Grades: A, B, C, etc.
Hardness of metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.
Like nominal data, basic counting is possible for ordinal data. Hence, the mode can be identified. Since ordering is possible in the case of ordinal data, the median can be identified in addition. The mean still cannot be calculated.
2.Quantitative data: Quantitative data relates to information about the quantity of an object – hence it can be
measured. There are two types of quantitative data:
Interval data
Ratio data
Interval data: numeric data for which the exact difference between values is known. An ideal example of interval data is Celsius temperature. For example, the difference between 12°C and 18°C is measurable and is 6°C. However, such data do not have a ‘true zero’ value; for example, there is nothing called ‘0 temperature’ or ‘no temperature’. Hence, only addition and subtraction apply to interval data.
For interval data, mathematical operations such as addition and subtraction are possible. For that
reason, for interval data, the central tendency can be measured by mean, median, or mode. Standard
deviation can also be calculated.
Ratio data: numeric data for which exact value can be measured and absolute zero is available.
Examples of ratio data include height, weight, age, salary, etc.
For ratio data, all mathematical operations, including addition, subtraction, multiplication, and division, are possible because an absolute zero exists. As with interval data, the central tendency can be measured by mean, median, or mode, and standard deviation can also be calculated.
Note: Measures of central tendency help to understand the central point of a set of data. Standard
measures of central tendency of data are mean, median, and mode.
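For instance, using Python’s built-in statistics module with a small hypothetical list of student ages, these measures can be computed as follows (a minimal sketch):

from statistics import mean, median, mode

# Hypothetical ratio data: ages of a small group of students
ages = [18, 19, 19, 20, 21, 22, 19]

print("Mean:", mean(ages))      # arithmetic average
print("Median:", median(ages))  # middle value after sorting
print("Mode:", mode(ages))      # most frequently occurring value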
Structure of data:
By now, we understand that in machine learning we have two basic data types: numeric and categorical. So, we need to understand which attributes in a dataset are numeric and which are categorical in nature. This is because the approach for exploring numerical data is different from the approach for exploring categorical data.
Histogram (for exploring numerical data):
A histogram is a type of bar chart that represents the distribution of a dataset. It displays the frequency of
data points within specific intervals, called bins. Each bar in a histogram represents the count (or
frequency) of data points that fall within a certain range.
Key Features of a Histogram:
1. Bins (Intervals): The data is divided into intervals, also called bins. The bins represent a range of
values, and the width of each bin corresponds to the interval size.
2. Bars: Each bar's height represents the number of data points that fall within the corresponding bin's
range. Taller bars indicate higher frequencies, and shorter bars indicate lower frequencies.
3. X-Axis: Represents the values of the data, divided into bins.
4. Y-Axis: Represents the frequency or count of data points in each bin.
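As a quick sketch of how such a chart can be produced (assuming numpy and matplotlib are available, with synthetic scores as hypothetical data):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic numeric attribute: exam scores of 200 students (hypothetical)
scores = np.random.normal(loc=65, scale=12, size=200)

# 10 bins: the x-axis shows score ranges, the y-axis shows counts per bin
plt.hist(scores, bins=10, edgecolor="black")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.title("Histogram of exam scores")
plt.show()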
Exploring Categorical data: Most popularly, we use the scatter plot technique for categorical data.
Scatter plot:
A scatter plot is a type of data visualization that displays the relationship between two continuous
variables. Each point on the plot represents a pair of values from the dataset, with the horizontal axis (x-
axis) representing one variable and the vertical axis (y-axis) representing the other.
Key Features of a Scatter Plot:
1. Data Points: Each point represents an observation in the dataset, plotted based on its values for
the two variables.
2. Axes: The x-axis and y-axis represent the two variables you're comparing.
o X-Axis: One variable, typically independent (e.g., time, age, etc.).
o Y-Axis: The other variable, typically dependent (e.g., height, sales, etc.).
3. Trend or Correlation: A scatter plot can help identify the correlation between the two variables:
o Positive correlation: Points tend to rise from left to right (upward slope).
o Negative correlation: Points tend to fall from left to right (downward slope).
o No correlation: Points are scattered randomly with no discernible pattern.
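A minimal sketch of a scatter plot (again assuming matplotlib, with synthetic, hypothetical data for hours studied and marks obtained):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical pair of variables: hours studied (x) and marks obtained (y)
hours = np.random.uniform(0, 10, size=100)
marks = 35 + 5 * hours + np.random.normal(0, 5, size=100)  # roughly positive correlation

plt.scatter(hours, marks)
plt.xlabel("Hours studied")   # independent variable on the x-axis
plt.ylabel("Marks obtained")  # dependent variable on the y-axis
plt.title("Scatter plot showing a positive correlation")
plt.show()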
Data quality:
Data quality refers to the condition or fitness of data for its intended use. High-quality data is accurate, complete, consistent, timely, and relevant, while poor-quality data can lead to errors, misleading conclusions, and poor decision-making.
The success of machine learning depends largely on the quality of data. Data of the right quality helps achieve better prediction accuracy in the case of supervised learning. However, it is not realistic to expect that the data will be flawless.
Data remediation involves the process of identifying, correcting, and improving data quality issues within a
dataset. It ensures that the data is suitable for analysis, reporting, and decision-making.
Common Data Quality Issues and Remediations:
1. Missing Data: When values for certain fields are missing (e.g., missing customer email addresses).
o Remediation: You can either fill in missing data (imputation), remove incomplete records, or
flag them as "unknown."
2. Duplicate Records: Identical or very similar records appearing multiple times in the dataset.
o Remediation: Use deduplication techniques to identify and merge duplicates or remove
redundant records.
3. Inconsistent Formatting: Data represented in multiple formats, such as date formats
(MM/DD/YYYY vs. DD/MM/YYYY) or inconsistent naming conventions.
o Remediation: Standardize data formats using automated rules, ensuring consistency across
records.
4. Outliers and Invalid Data: Extreme values or data entries that fall outside the expected range or
are clearly erroneous (e.g., a negative value for age).
o Remediation: Identify and investigate outliers to determine whether they are errors. You
may replace, remove, or correct such data points.
5. Inaccurate Data: Data entries that are incorrect or not up-to-date.
o Remediation: Validate data accuracy by cross-referencing with trusted sources or through
manual review, and update as necessary.
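As a small sketch of how some of these remediations can be applied in practice (assuming pandas and a hypothetical customer table):

import pandas as pd

# Hypothetical customer records containing typical quality issues
df = pd.DataFrame({
    "name": ["asha ", "Ravi", "Ravi", "MEENA"],
    "email": ["asha@x.com", None, None, "meena@x.com"],
    "age": [25, 32, 32, -4],   # -4 is clearly an invalid value
})

df = df.drop_duplicates()                         # duplicate records: remove redundant rows
df["email"] = df["email"].fillna("unknown")       # missing data: flag as "unknown"
df["name"] = df["name"].str.strip().str.title()   # inconsistent formatting: standardize names
df["age"] = df["age"].where(df["age"] >= 0)       # invalid data: treat negative ages as missing

print(df)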
Data modeling:
To increase the level of accuracy of the machine learning model, human participation should be added to the machine learning process. For this, we mainly follow four steps:
Step 1 - DATA PRE-PROCESSING:
a. Dimensionality reduction:
Dimensionality reduction in machine learning is the process of reducing the number of input variables
(features) in a dataset while preserving as much of the relevant information as possible. This helps
improve model performance, reduce overfitting, and decrease computational cost.
High-dimensional datasets need a high amount of computational space and time. At the same time, not all features are useful; irrelevant and redundant features degrade the performance of machine learning algorithms. Most machine learning algorithms perform better if the dimensionality of the dataset, i.e. the number of features in the dataset, is reduced. Dimensionality reduction helps in reducing irrelevance and redundancy in features. Also, it is easier to understand a model if the number of features involved in the learning activity is smaller.
Common techniques include:
Principal Component Analysis (PCA): A linear method that transforms the data into a smaller set of
orthogonal components (principal components), capturing the most variance in the data.
Singular Value Decomposition (SVD): A matrix factorization method that decomposes a matrix into
singular values, used in tasks like topic modeling and recommendation systems.
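As a brief sketch of dimensionality reduction with PCA (assuming scikit-learn and its bundled Iris dataset):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 150 samples, 4 features

# Project the 4 original features onto 2 orthogonal principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (150, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component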
It may seem that a feature subset may lead to loss of useful information, as certain features are going to be excluded from the final set of features used for learning. However, only features which are irrelevant or redundant are selected for elimination; all such features are removed while selecting the final feature subset.
There are three methods to perform feature subset selection, which can be categorized as:
1. Filter Methods
2. Wrapper Methods
3. Embedded Methods
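For instance, a filter method scores each feature independently of any learning model. A minimal sketch using scikit-learn's SelectKBest with the ANOVA F-statistic (one common filter criterion) could look like this:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep only the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)          # (150, 2)
print(selector.get_support())    # boolean mask showing which features were kept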
Step 2 - Learning of the Data Model:
1. Selecting a Model in Machine Learning
Selecting the right model for your machine learning task is a crucial step in building an effective solution.
The model you choose depends on the nature of the data, the problem you're trying to solve, and the
computational resources available.
Choosing the right model is critical for the learning process. Different models have different strengths and
weaknesses, and their suitability depends on the type of problem you're trying to solve.
Types of Machine Learning Models:
Linear Models
Decision Trees
Support Vector Machines (SVM)
K-Nearest Neighbors (KNN)
1. Classification: The classification algorithm is a supervised learning technique that is used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels, or categories.
The main goal of a classification algorithm is to identify the category of a given data point, and these algorithms are mainly used to predict the output for categorical data.
Classification algorithms:
1. Decision tree algorithm
2. Random forest algorithm
3. Support vector machine algorithm
(Figure: classification of Class A and Class B)
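As a small sketch of a classification workflow (assuming scikit-learn and its bundled Iris dataset, with a decision tree as the model):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Learn class boundaries from the labelled training data
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Classify new (test) observations and measure accuracy
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))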
2. Regression:
Regression is a supervised machine learning technique used to predict the value of the dependent variable for new, unseen data. It models the relationship between the input features and the target variable, allowing for the estimation or prediction of numerical values.
A regression problem arises when the output variable is a real or continuous value, such as ‘salary’ or ‘weight’. Many different models can be used; the simplest is linear regression, which tries to fit the data with the best hyperplane that goes through the points.
Regression algorithms:
1. Simple linear regression
2. Multiple linear regression
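A minimal sketch of simple linear regression (assuming scikit-learn, with a small hypothetical experience-versus-salary dataset):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary (in thousands)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30, 35, 42, 48, 55])

model = LinearRegression()
model.fit(X, y)

# Predict a continuous value (salary) for 6 years of experience
print(model.predict([[6]]))
print(model.coef_, model.intercept_)   # slope and intercept of the fitted line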
Clustering:
The task of grouping data points based on their similarity with each other is called Clustering or Cluster
Analysis. This method is defined under the branch of Unsupervised Learning, which aims at gaining
insights from unlabelled data points, that is, unlike supervised learning we don’t have a target variable.
Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It evaluates the similarity based on a metric like Euclidean distance, cosine similarity, Manhattan distance, etc., and then groups the points with the highest similarity scores together.
Clustering algorithms:
1. K-means algorithm
2. Hierarchical clustering
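As a brief sketch of k-means clustering (assuming scikit-learn, with synthetic unlabelled 2-D points as hypothetical data):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabelled 2-D points forming two loose groups (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Group the points into 2 clusters based on Euclidean distance
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])               # cluster assignments of the first 10 points
print(kmeans.cluster_centers_)   # coordinates of the 2 cluster centres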
Model parameter tuning is the process of adjusting the model fitting options. For example, in the popular
classification model k-Nearest Neighbour (kNN), using different values of ‘k’ or the number of nearest
neighbours to be considered, the model can be tuned.
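A quick sketch of tuning k for kNN (assuming scikit-learn's GridSearchCV and the Iris dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of k and keep the one with the best cross-validated accuracy
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # the best value of k found
print(grid.best_score_)    # its mean cross-validation accuracy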
The approach of combining different models with diverse strengths is known as an ensemble. Ensemble methods combine weaker learners to create stronger ones.
One of the earliest and most popular ensemble models is bootstrap aggregating or bagging. Bagging uses
bootstrapping to generate multiple training data sets. These training data sets are used to generate a set of
models using the same learning algorithm.
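A minimal sketch of bagging (assuming a recent scikit-learn, with decision trees as the base learner):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 20 decision trees, each trained on a bootstrap sample of the training data
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=20, random_state=0)

print(cross_val_score(bagging, X, y, cv=5).mean())   # averaged cross-validation accuracy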
Just like bagging, boosting is another key ensemble-based technique. In boosting, weaker learning models
are trained on resampled data and the outcomes are combined using a weighted voting approach based
on the performance of different models.
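Similarly, a sketch of boosting using AdaBoost (again assuming scikit-learn):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Weak learners are trained in sequence and combined by performance-weighted voting
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print(cross_val_score(boosting, X, y, cv=5).mean())   # averaged cross-validation accuracy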