
Process of Machine Learning

Data set:
A dataset is a collection of related information or records. The information may be about some entity or some subject
area. Each row of a dataset is called a record. Each dataset also has multiple attributes, each of which gives
information on a specific characteristic. For example, a dataset on students may have four attributes, namely Roll
Number, Name, Gender, and Age, each of which describes a specific characteristic of the student entity.
An attribute can also be termed a feature, variable, dimension, or field.

Types of DATA:
Data can broadly be divided into the following two types:
1. Qualitative data
2. Quantitative data

1.Qualitative data: Qualitative data provides information about the quality of an object, i.e. information which cannot
be measured. For example, the quality of performance of students may be described as ‘Good’, ‘Average’, or
‘Poor’. Similarly, the name or roll number of a student is information that cannot be measured using any scale of
measurement. Qualitative data is also called categorical data.
Qualitative data can be further subdivided into two types as follows:
 Nominal data: has named values
 Ordinal data: has named values which can be naturally ordered

 Nominal data is one which has no numeric value, but a named value. It is used for assigning named
values to attributes. Nominal values cannot be quantified. Examples of nominal data are:

 Blood group: A, B, O, AB, etc.
 Nationality: Indian, American, British, etc.
 Gender: Male, Female, Other

It is obvious that mathematical operations such as addition, subtraction, multiplication, etc. cannot be
performed on nominal data. For that reason, statistical functions such as mean, variance, etc. also cannot
be applied to nominal data.
However, a basic count is possible. So the mode, i.e. the most frequently occurring value, can be identified for
nominal data.

 Ordinal data, in addition to possessing the properties of nominal data, can also be naturally ordered. This
means ordinal data also assigns named values to attributes, but unlike nominal data, the values can be arranged
in a sequence of increasing or decreasing order, so that we can say whether one value is better than or
greater than another. Examples of ordinal data are:
 Customer satisfaction: ‘Very Happy’, ‘Happy’, ‘Unhappy’, etc.
 Grades: A, B, C, etc.
 Hardness of metal: ‘Very Hard’, ‘Hard’, ‘Soft’, etc.

Like nominal data, basic counting is possible for ordinal data, so the mode can be identified. Since
ordering is possible in the case of ordinal data, the median can be identified in addition. The mean still cannot be
calculated.

2.Quantitative data: Quantitative data relates to information about the quantity of an object and hence can be
measured. There are two types of quantitative data:
 Interval data
 Ratio data
 Interval data: numeric data for which the exact difference between values is known. An ideal
example of interval data is Celsius temperature. For example, the difference between 12°C and 18°C
is measurable and is 6°C. However, such data does not have a ‘true zero’ value; there is nothing
called ‘0 temperature’ or ‘no temperature’. Hence, only addition and subtraction apply to interval data.
Since addition and subtraction are possible for interval data, the central tendency can be measured by
mean, median, or mode. Standard deviation can also be calculated.

 Ratio data: numeric data for which the exact value can be measured and an absolute zero is available.
Examples of ratio data include height, weight, age, salary, etc.
For ratio data, mathematical operations such as addition and subtraction are possible; because of the
absolute zero, multiplication and division are also meaningful. The central tendency can be measured
by mean, median, or mode, and standard deviation can also be calculated.

Note: Measures of central tendency help to understand the central point of a set of data. Standard
measures of central tendency of data are mean, median, and mode.
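
A minimal illustration of these measures in Python (the ages list below is made-up ratio data, used only for this example):

import statistics

ages = [21, 22, 22, 23, 25, 29, 35]   # hypothetical ratio data (age in years)

print(statistics.mean(ages))     # arithmetic mean, about 25.29
print(statistics.median(ages))   # middle value: 23
print(statistics.mode(ages))     # most frequent value: 22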

Structure of data:
By now, we understand that in machine learning we have two basic data types – numeric and categorical.
So, we need to understand which attributes in a data set are numeric and which are categorical
in nature. This is because the approach to exploring numerical data is different from the approach to
exploring categorical data.

Exploring numerical data:


The two most effective plots for exploring numerical data are the box plot and the histogram.
1.Box Plot:
A boxplot (also known as a box-and-whisker plot) is a graphical representation of a data distribution.

It provides a summary of the data through its five-number summary:


1. Minimum: The smallest data point in the dataset (excluding outliers).
2. First Quartile (Q1): The median of the lower half of the data (25th percentile).
3. Median (Q2): The middle value of the dataset (50th percentile).
4. Third Quartile (Q3): The median of the upper half of the data (75th percentile).
5. Maximum: The largest data point (excluding outliers).

Key Features of a Boxplot:


 Box: The central box is drawn between the first quartile (Q1) and third quartile (Q3), showing the
interquartile range (IQR).
 Whiskers: Lines extending from the box to the smallest and largest data points within 1.5 times the
IQR from Q1 and Q3.
 Outliers: Data points outside the whiskers (more than 1.5 times the IQR from Q1 or Q3).

A boxplot is useful for:


 Visualizing the spread of the data.
 Identifying the median, quartiles, and potential outliers.
 Comparing distributions across different categories or groups.
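
A minimal sketch of drawing a boxplot with matplotlib (the marks below are made up purely for illustration):

import matplotlib.pyplot as plt

marks = [45, 52, 56, 60, 61, 63, 67, 70, 72, 75, 78, 95]   # hypothetical student marks

plt.boxplot(marks)                  # box from Q1 to Q3, median line, whiskers, outliers
plt.ylabel("Marks")
plt.title("Boxplot of student marks")
plt.show()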

2.Histogram:
A histogram is a type of bar chart that represents the distribution of a dataset. It displays the frequency of
data points within specific intervals, called bins. Each bar in a histogram represents the count (or
frequency) of data points that fall within a certain range.
Key Features of a Histogram:
1. Bins (Intervals): The data is divided into intervals, also called bins. The bins represent a range of
values, and the width of each bin corresponds to the interval size.
2. Bars: Each bar's height represents the number of data points that fall within the corresponding bin's
range. Taller bars indicate higher frequencies, and shorter bars indicate lower frequencies.
3. X-Axis: Represents the values of the data, divided into bins.
4. Y-Axis: Represents the frequency or count of data points in each bin.

How Histograms Are Used:


 Visualizing Distribution: Histograms provide insight into the distribution of the data (e.g., whether
it's normally distributed, skewed, or has multiple peaks).
 Identifying Patterns: They can highlight patterns, trends, and potential outliers.
 Comparing Data: Multiple histograms can be compared to visualize the distribution of different
datasets.
Example of Data Representation in a Histogram:
 Data: Test scores from 0 to 100.
 Bins: You could create bins like 0-10, 10-20, 20-30, etc.
 Bars: Each bar would show how many test scores fall within each bin range.
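
A minimal sketch of the test-score example above with matplotlib (the scores are made up):

import matplotlib.pyplot as plt

scores = [12, 35, 47, 55, 58, 62, 64, 68, 71, 74, 78, 81, 85, 88, 93]   # hypothetical test scores

plt.hist(scores, bins=range(0, 101, 10), edgecolor="black")   # bins: 0-10, 10-20, ..., 90-100
plt.xlabel("Score range (bin)")
plt.ylabel("Frequency")
plt.title("Histogram of test scores")
plt.show()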

Exploring Categorical data: Most popularly, we use the scatter plot technique for categorical data.

Scatter plot:
A scatter plot is a type of data visualization that displays the relationship between two continuous
variables. Each point on the plot represents a pair of values from the dataset, with the horizontal axis
(x-axis) representing one variable and the vertical axis (y-axis) representing the other.
Key Features of a Scatter Plot:

1. Data Points: Each point represents an observation in the dataset, plotted based on its values for
the two variables.
2. Axes: The x-axis and y-axis represent the two variables you're comparing.
o X-Axis: One variable, typically independent (e.g., time, age, etc.).
o Y-Axis: The other variable, typically dependent (e.g., height, sales, etc.).
3. Trend or Correlation: A scatter plot can help identify the correlation between the two variables:
o Positive correlation: Points tend to rise from left to right (upward slope).
o Negative correlation: Points tend to fall from left to right (downward slope).
o No correlation: Points are scattered randomly with no discernible pattern.

Uses of a Scatter Plot:


 Identifying Relationships: It helps in visually assessing whether two variables have a linear or
non-linear relationship.
 Detecting Outliers: Outliers appear as points that deviate significantly from the general trend.
 Correlations: It is especially useful for identifying whether a positive, negative, or no correlation
exists between the variables.
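
A minimal sketch of a scatter plot with matplotlib (the age and height values are made up):

import matplotlib.pyplot as plt

age = [5, 7, 9, 11, 13, 15, 17]                  # independent variable (years)
height = [108, 120, 132, 142, 152, 163, 170]     # dependent variable (cm)

plt.scatter(age, height)     # each point is one (age, height) observation
plt.xlabel("Age (years)")
plt.ylabel("Height (cm)")
plt.title("Age vs. height")
plt.show()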

Data Quality and Remediation:

Data quality refers to the condition or fitness of data for its intended use. High-quality data is accurate,
complete, consistent, timely, and relevant, while poor-quality data can lead to errors, misleading
conclusions, and poor decision-making.

The success of machine learning depends largely on the quality of data. Data which has the right quality
helps to achieve better prediction accuracy in the case of supervised learning. However, it is not realistic to
expect that the data will be flawless.

Data remediation involves the process of identifying, correcting, and improving data quality issues within a
dataset. It ensures that the data is suitable for analysis, reporting, and decision-making.
Common Data Quality Issues and Remediations:

1. Missing Data: When values for certain fields are missing (e.g., missing customer email addresses).
o Remediation: You can either fill in missing data (imputation), remove incomplete records, or
flag them as "unknown" (a short pandas sketch follows this list).
2. Duplicate Records: Identical or very similar records appearing multiple times in the dataset.
o Remediation: Use deduplication techniques to identify and merge duplicates or remove
redundant records.
3. Inconsistent Formatting: Data represented in multiple formats, such as date formats
(MM/DD/YYYY vs. DD/MM/YYYY) or inconsistent naming conventions.
o Remediation: Standardize data formats using automated rules, ensuring consistency across
records.
4. Outliers and Invalid Data: Extreme values or data entries that fall outside the expected range or
are clearly erroneous (e.g., a negative value for age).
o Remediation: Identify and investigate outliers to determine whether they are errors. You
may replace, remove, or correct such data points.
5. Inaccurate Data: Data entries that are incorrect or not up-to-date.
o Remediation: Validate data accuracy by cross-referencing with trusted sources or through
manual review, and update as necessary.
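
A minimal pandas sketch of a few of the remediation steps above (the file name, column names, and fill values are assumptions made only for illustration):

import pandas as pd

df = pd.read_csv("customers.csv")                      # hypothetical input file

df["email"] = df["email"].fillna("unknown")            # 1. flag missing values as "unknown"
df = df.drop_duplicates()                              # 2. remove duplicate records
df["signup_date"] = pd.to_datetime(df["signup_date"],  # 3. standardize the date format
                                   dayfirst=True)
df = df[df["age"] >= 0]                                # 4. drop clearly invalid (negative) ages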

Data modeling:
To increase the level of accuracy of the machine, human participation should be added to the machine learning
process. For this, we mainly follow four steps:
Step 1: Data Pre-processing
a.Dimensionality reduction:
Dimensionality reduction in machine learning is the process of reducing the number of input variables
(features) in a dataset while preserving as much of the relevant information as possible. This helps
improve model performance, reduce overfitting, and decrease computational cost.

High-dimensional data sets need a large amount of computational space and time. At the same time, not
all features are useful – irrelevant or redundant features can degrade the performance of machine learning
algorithms. Most machine learning algorithms perform better if the dimensionality of the dataset, i.e. the number
of features in the data set, is reduced. Dimensionality reduction helps in reducing irrelevance and redundancy in
features. It is also easier to understand a model if the number of features involved in the learning activity
is smaller.
Common techniques include:
 Principal Component Analysis (PCA): A linear method that transforms the data into a smaller set of
orthogonal components (principal components), capturing the most variance in the data (see the sketch after this list).
 Singular Value Decomposition (SVD): A matrix factorization method that decomposes a matrix into
singular values, used in tasks like topic modeling and recommendation systems.
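
A minimal sketch of PCA-based dimensionality reduction with scikit-learn (scikit-learn is assumed to be available; the random matrix stands in for a real high-dimensional dataset):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)              # 100 records, 10 features (made-up data)

pca = PCA(n_components=2)                # keep the 2 components with the most variance
X_reduced = pca.fit_transform(X)         # shape becomes (100, 2)

print(pca.explained_variance_ratio_)     # share of variance captured by each component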

b.Feature Subset Selection:


Feature subset selection is a technique in machine learning used to select a subset of relevant
features (or variables) from the original set of features in a dataset. The goal is to improve model
performance by reducing overfitting, decreasing computational complexity, and enhancing model
interpretability, all while retaining the most important information.

It may seem that a feature subset may lead to loss of useful information, as certain features are excluded
from the final set of features used for learning. However, only features which are irrelevant or redundant
are selected for elimination, so no useful information is lost while selecting the final feature subset.

There are three methods to perform feature subset selection, which can be categorized as:
1. Filter Methods (illustrated in the sketch below)
2. Wrapper Methods
3. Embedded Methods
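
A minimal sketch of a filter method using scikit-learn's SelectKBest (scikit-learn is assumed; the built-in Iris dataset is used only as a convenient example):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                    # 4 original features

selector = SelectKBest(score_func=f_classif, k=2)    # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)            # shape becomes (150, 2)

print(selector.get_support())                        # boolean mask of the retained features
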
Step 2: Learning of the Data Model:
1.Selecting a Model in Machine Learning
Selecting the right model for your machine learning task is a crucial step in building an effective solution.
The model you choose depends on the nature of the data, the problem you're trying to solve, and the
computational resources available.
Choosing the right model is critical for the learning process. Different models have different strengths and
weaknesses, and their suitability depends on the type of problem you're trying to solve.
Types of Machine Learning Models:
 Linear Models
 Decision Trees
 Support Vector Machines (SVM)
 K-Nearest Neighbors (KNN)

2.Training the Model


Once you've selected the model, the next step is to train it on the training data. This involves fitting the
model to the data by adjusting its internal parameters (e.g., coefficients, weights) based on the input-output
relationships in the training set.
Training a machine learning model is the process of teaching the model to make predictions based on data.
This involves feeding the model with a training dataset, allowing it to learn the patterns and relationships in
the data, and then refining its internal parameters (e.g., weights, coefficients) to minimize error and improve
performance.
Steps:
 Initialize the Model: Create an instance of the model class.
 Train the Model: Use the fit() function to train the model on the training dataset, as in the sketch below.
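
A minimal sketch of these two steps with scikit-learn (an assumed library choice; the tiny dataset is made up):

from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 2], [2, 3], [8, 9], [9, 10]]     # made-up feature rows
y_train = [0, 0, 1, 1]                          # made-up class labels

model = KNeighborsClassifier(n_neighbors=3)     # Step 1: initialize the model
model.fit(X_train, y_train)                     # Step 2: train it with fit()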

3.Model Representation And Interpretability:


Model Representation:
A machine learning model cannot understand the given data directly; we have to create a representation of
the data to provide to the model. Model representation refers to how the machine learning model
captures, encodes, and transforms data into a structure that can be used to make predictions or
classifications.
Interpretability:
Interpretability refers to the extent to which a human can understand the reasons behind a model's
predictions. It's the ability to explain, in human terms, why a model made a certain decision or prediction.
This is especially important for complex models (like deep learning) and in high-stakes applications (e.g.,
healthcare, finance, criminal justice).

Step 3: Analysing the Performance of the Model (Evaluation):

The way a model's performance is evaluated depends on the kind of learning problem. There are mainly three kinds:
1. Classification
2. Regression
3. Clustering

1. Classification: The classification algorithm is a supervised learning technique that is used to identify
the category of new observations on the basis of training data. In classification, a program learns from the
given dataset or observations and then classifies new observations into a number of classes or groups,
such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels,
or categories.

The main goal of a classification algorithm is to identify the category of a given observation; these
algorithms are mainly used to predict the output for categorical data.
Classification algorithms:
1. Decision tree algorithm (see the sketch below)
2. Random forest algorithm
3. Support vector machine algorithm
(Figure: classification of data into Class A and Class B)
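
A minimal sketch of classification with a decision tree (scikit-learn assumed; the Iris flower dataset is used only as a convenient labelled example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)               # learn from labelled observations
y_pred = clf.predict(X_test)            # assign each new observation to a class

print(accuracy_score(y_test, y_pred))   # fraction of correctly classified observations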

2. Regression:
Regression is a supervised machine learning technique used to predict the value of the dependent variable for
new, unseen data. It models the relationship between the input features and the target variable, allowing
for the estimation or prediction of numerical values.

A regression analysis problem arises when the output variable is a real or continuous value, such as “salary”
or “weight”. Many different models can be used; the simplest is linear regression, which tries to fit the data
with the best hyperplane that goes through the points.

Regression algorithms:
1. Simple linear regression (see the sketch below)
2. Multiple linear regression
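
A minimal sketch of simple linear regression (scikit-learn assumed; the experience/salary numbers are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

experience = np.array([[1], [2], [3], [4], [5]])    # input feature (years of experience)
salary = np.array([30, 35, 41, 46, 50])             # continuous target (in thousands)

reg = LinearRegression().fit(experience, salary)    # fit the best straight line
print(reg.predict([[6]]))                           # predicted salary for 6 years of experience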

Clustering:
The task of grouping data points based on their similarity with each other is called clustering or cluster
analysis. This method falls under the branch of unsupervised learning, which aims at gaining
insights from unlabelled data points; that is, unlike supervised learning, we do not have a target variable.

Clustering aims at forming groups of homogeneous data points from a heterogeneous dataset. It
evaluates similarity based on a metric such as Euclidean distance, cosine similarity, or Manhattan distance,
and then groups the points with the highest similarity scores together.

Clustering algorithms:
1. k-means algorithm (see the sketch below)
2. Hierarchical clustering
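
A minimal sketch of k-means clustering (scikit-learn assumed; the 2-D points are made up and unlabelled):

import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 2], [1, 4], [2, 3],      # unlabelled data points
                   [8, 8], [9, 10], [10, 9]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)             # cluster index assigned to each point (two groups)
print(kmeans.cluster_centers_)    # centre of each group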

Step 4: Performance Improvement of the Model:


Improving the performance of a machine learning model is a critical task, whether you're working on a
classification, regression, or any other type of machine learning problem. Model performance can be
enhanced through a variety of techniques, including data preprocessing, feature engineering, algorithm
tuning, and evaluation strategies.

Model parameter tuning is the process of adjusting the model fitting options. For example, the popular
classification model k-Nearest Neighbour (kNN) can be tuned by using different values of ‘k’, the number of
nearest neighbours to be considered.
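
A minimal sketch of tuning ‘k’ for kNN with cross-validation (scikit-learn assumed; the Iris dataset is only a stand-in for real data):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in [1, 3, 5, 7, 9]:                               # candidate values of k
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()     # average accuracy over 5 folds
    print(k, round(score, 3))                           # pick the k with the best score
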
The approach of combining different models with diverse strengths is known as ensemble learning. Ensemble
methods combine weaker learners to create stronger ones.
One of the earliest and most popular ensemble models is bootstrap aggregating or bagging. Bagging uses
bootstrapping to generate multiple training data sets. These training data sets are used to generate a set of
models using the same learning algorithm.
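
A minimal sketch of bagging with scikit-learn's BaggingClassifier (an assumed library choice; by default each base learner is a decision tree trained on a bootstrap sample of the data):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(n_estimators=25, random_state=0)   # 25 bootstrapped models
print(cross_val_score(bagging, X, y, cv=5).mean())             # combined (ensemble) accuracy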

Just like bagging, boosting is another key ensemble-based technique. In boosting, weaker learning models
are trained on resampled data and the outcomes are combined using a weighted voting approach based
on the performance of different models.
