
MACHINE LEARNING

Machine Learning is a branch of artificial intelligence. It gives systems the capability to learn without being explicitly programmed. It provides techniques to extract information from data, learn from the collected data, and then, with the help of well-defined algorithms, predict future trends from that data.

Ex: Google search, Amazon, Netflix

Arthur Samuel first used the term "machine learning" in 1959.

Features of Machine Learning:

 Machine learning uses data to detect various patterns in a given dataset.
 It can learn from past data and improve automatically.
 It is a data-driven technology.
 Machine learning is similar to data mining, as both deal with huge amounts of data.
Categories of Machine Learning
At a broad level, machine learning can be classified as:
1. Supervised Learning
Supervised learning is a type of machine learning in which the algorithm is trained on a labeled dataset.

In supervised learning, the algorithm is provided with input features and corresponding output labels, and it learns to generalize from this data to make predictions on new, unseen data.

There are two main types of supervised learning:

 Regression:

In a regression task, the algorithm learns to predict continuous values based on input features.

Common regression algorithms in machine learning are: Linear Regression, Polynomial Regression, Ridge Regression, Decision Tree Regression, Random Forest Regression, Support Vector Regression, etc.

 Classification:

In a classification task, the algorithm learns to assign input data to a specific category or class based on input features.

The output labels in classification are discrete values.

Classification algorithms can be binary, where the output is one of two possible classes, or multiclass, where the output can be one of several classes.

Common classification algorithms in machine learning are: Logistic Regression, Naive Bayes, Decision Tree, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), etc.

2. Unsupervised Machine Learning

Unsupervised learning is a type of machine learning where the algorithm learns to recognize patterns in data without being explicitly trained using labeled examples. The goal of unsupervised learning is to discover the underlying structure or distribution in the data.

There are two main types of unsupervised learning:

 Clustering:

Clustering algorithms group similar data points together based on their characteristics. Some popular clustering algorithms include K-means and Hierarchical clustering.

 Dimensionality Reduction:

Dimensionality reduction algorithms reduce the number of input variables in a dataset while preserving as much of the original information as possible.

This is useful for reducing the complexity of a dataset and making it easier to visualize and analyze.

Some popular dimensionality reduction algorithms include Principal Component Analysis (PCA), t-SNE, and Autoencoders.

Ex: [figure omitted: applying a classification model to new data]

---
INTRODUCING SCIKIT-LEARN

Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python.

Data Representation in Scikit-Learn

Data as table

The best way to represent data in Scikit-learn is in the form of tables. A table represents a 2-D grid of data where the rows represent the individual elements of the dataset and the columns represent the quantities related to those individual elements.

Ex:

import seaborn as sns
# Load the Iris dataset as a Pandas DataFrame: rows are samples, columns are features
iris = sns.load_dataset('iris')
iris.head()

In general, we will refer to the rows of the matrix as samples, the number of rows as
n_samples, columns of the matrix as features, and the number of columns as n_features.

Data as Feature Matrix

The features matrix may be defined as the table layout where the information can be thought of as a 2-D matrix. It is stored in a variable named X and assumed to be two-dimensional with shape [n_samples, n_features]. Mostly, it is contained in a NumPy array or a Pandas DataFrame.

The samples (rows) always refer to the individual objects and the features (columns)
always refer to the distinct observations that describe each sample in a quantitative manner.
Data as Target array

Along with the features matrix, denoted by X, we also have a target array, also called the label, denoted by y. The label or target array is usually one-dimensional, with length n_samples. The target array may contain either continuous numerical values or discrete class labels.
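A minimal sketch, continuing the Iris example above and assuming 'species' as the label column:

import seaborn as sns

iris = sns.load_dataset('iris')
X = iris.drop('species', axis=1)   # features matrix, shape [n_samples, n_features] = (150, 4)
y = iris['species']                # target array (labels), length n_samples = 150
print(X.shape, y.shape)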

---

SCIKIT-LEARN’S ESTIMATOR API

It is one of the main APIs implemented by Scikit-learn.

It provides a consistent interface for a wide range of ML applications.

That’s why all machine learning algorithms in Scikit-Learn are implemented via the Estimator API.
The Scikit-Learn API is designed with the following guiding principles

Consistency

All objects share a common interface drawn from a limited set of methods, with consistent documentation.

Inspection

All specified parameter values are exposed as public attributes.

Limited object hierarchy

Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.

Composition

Many ML algorithms can be expressed as a sequence of more fundamental algorithms. Scikit-learn makes use of these fundamental algorithms whenever needed.

Sensible defaults

When models require user-specified parameters, the library defines an appropriate default value.

Steps in using Estimator API

Step 1: Choose a class of model

In this first step, we need to choose a class of model. It can be done by importing the
appropriate Estimator class from Scikit-learn.

Step 2: Choose model hyperparameters

In this step, we need to choose the model hyperparameters. This is done by instantiating the class with the desired values.
Step 3: Arranging the data

Next, we need to arrange the data into a features matrix (X) and a target vector (y).

Step 4: Model Fitting

Now, we need to fit the model to the data. This is done by calling the fit() method of the model instance.

Step 5: Applying the model

After fitting the model, we can apply it to new data. For supervised learning, use the predict() method to predict the labels for unknown data; for unsupervised learning, use predict() or transform() to infer properties of the data.

Ex: Supervised learning example: Simple linear regression
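A minimal sketch of these five steps for simple linear regression, using made-up synthetic data (the LINEAR REGRESSION section below shows the fully plotted version):

import numpy as np
from sklearn.linear_model import LinearRegression     # Step 1: choose a class of model

# Synthetic data scattered around the line y = 2x - 1 (assumed example data)
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

model = LinearRegression(fit_intercept=True)          # Step 2: choose model hyperparameters
X = x[:, np.newaxis]                                  # Step 3: arrange data into features matrix X and target vector y
model.fit(X, y)                                       # Step 4: fit the model to the data
xnew = np.linspace(0, 10, 5)
print(model.predict(xnew[:, np.newaxis]))             # Step 5: apply the model to new data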

---

FEATURE ENGINEERING
Feature engineering is the process of transforming raw data into features that are
suitable for machine learning models. In other words, it is the process of selecting, extracting,
and transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.

The success of machine learning models heavily depends on the quality of the features
used to train them. Feature engineering involves a set of techniques that enable us to create
new features by combining or transforming the existing ones.
Categorical Features
Categorical-feature encoding transforms each categorical attribute into a numeric representation. Transforming categorical data into numeric data is often called “categorical-column encoding”.

One-hot encoding is the simplest and most basic categorical-column encoding method.
The idea is to have a unique binary number of multiple digits for each category. Hence, the
number of digits is the number of categories. The binary number has one digit as 1 and the rest
zeros, hence the name ‘one-hot.’
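As a minimal sketch (with a made-up housing example), scikit-learn's DictVectorizer can perform this one-hot encoding, assuming a reasonably recent scikit-learn version:

from sklearn.feature_extraction import DictVectorizer

# Hypothetical data: 'city' is a categorical attribute, the others are numeric
data = [{'price': 850000, 'rooms': 4, 'city': 'Hyderabad'},
        {'price': 700000, 'rooms': 3, 'city': 'Chennai'},
        {'price': 650000, 'rooms': 3, 'city': 'Delhi'}]
vec = DictVectorizer(sparse=False, dtype=int)   # expands 'city' into one binary column per category
X = vec.fit_transform(data)
print(vec.get_feature_names_out())              # e.g. ['city=Chennai', 'city=Delhi', 'city=Hyderabad', 'price', 'rooms']
print(X)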

Text Features
Another common need in feature engineering is to convert text to a set of representative
numerical values. One of the simplest methods of encoding data is by word counts: you take
each snippet of text, count the occurrences of each word within it, and put the results in a table.

Ex:
sample = ['problem of evil', 'evil queen', 'horizon problem']
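A minimal sketch of this word-count encoding for the sample above, using scikit-learn's CountVectorizer (pandas is assumed only for a readable display):

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

sample = ['problem of evil', 'evil queen', 'horizon problem']
vec = CountVectorizer()                   # builds the vocabulary and counts word occurrences
X = vec.fit_transform(sample)             # sparse matrix of shape (3 snippets, vocabulary size)
print(pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out()))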

Image Features
Another common need is to suitably encode images for machine learning analysis.
The simplest approach is to use the pixel values.
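A minimal sketch using the digits dataset bundled with scikit-learn (an assumed example): each 8x8 image is flattened into 64 pixel-value features.

from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)                          # (1797, 8, 8): 1797 images of 8x8 pixels
X = digits.images.reshape(len(digits.images), -1)   # flatten each image into 64 pixel features
print(X.shape)                                      # (1797, 64)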
Derived Features
The goal of creating derived features is to improve the performance of machine learning
models by providing additional information or reducing noise in the data. Derived features are
useful in many machine-learning applications, including image recognition, natural language
processing, and financial analysis.
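For instance, polynomial terms of an existing feature are a common kind of derived feature. A minimal sketch with scikit-learn's PolynomialFeatures on made-up data:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5])
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(x[:, np.newaxis])   # columns: x, x^2, x^3
print(X2)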

---

NAIVE BAYES CLASSIFICATION

 Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
 It is mainly used in text classification that includes a high-dimensional training dataset.
 It is one of the simplest and most effective classification algorithms and helps in building fast machine learning models that can make quick predictions.
 It predicts on the basis of the probability of an object.
 Some popular applications of the Naive Bayes algorithm are spam filtering and classifying articles.

The name Naive Bayes is composed of two words, Naive and Bayes:

Naive:

It is called Naive because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.

Ex: If a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.

Bayes:

It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes' Theorem:

Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional probability.

The formula for Bayes' theorem is given as:

P(A|B)=\frac{P(B|A)\,P(A)}{P(B)}
Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.

Types of Naïve Bayes Model:

Gaussian:

The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.

Multinomial:

The Multinomial Naive Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., deciding which category a particular document belongs to, such as Sports, Politics, Education, etc.

Steps to implement:

 Data Pre-processing step
 Fitting Naive Bayes to the Training set
 Predicting the test result
 Test accuracy of the result (creation of confusion matrix)
 Visualizing the test set result.
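A minimal sketch of these steps, assuming the Iris dataset as example data and a Gaussian Naive Bayes model:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_iris(return_X_y=True)                      # data pre-processing: load features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()                                   # fitting Naive Bayes to the training set
model.fit(X_train, y_train)

y_pred = model.predict(X_test)                         # predicting the test result
print(confusion_matrix(y_test, y_pred))                # test accuracy of the result (confusion matrix)
print(accuracy_score(y_test, y_pred))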

---
LINEAR REGRESSION

Linear regression is one of the easiest and most popular Machine Learning algorithms.

The linear regression algorithm models a linear relationship between a dependent variable and one or more independent variables.

Types of Linear Regression

Simple Linear Regression

This is the simplest form of linear regression, and it involves only one independent variable and one dependent variable. The equation for simple linear regression is:

y=\beta_{0}+\beta_{1}X

where:

y is the dependent variable
X is the independent variable
β0 is the intercept
β1 is the slope

Ex:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.linear_model import LinearRegression

# Generate noisy data scattered around the line y = 2x - 5
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

# Fit a simple linear regression model (x must be reshaped into a 2-D features matrix)
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

# Predict on a fine grid and plot the fitted line over the data
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);
Multiple Linear Regression

This involves more than one independent variable and one dependent variable. The
equation for multiple linear regression is:

y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots+\beta_{n}X_{n}

where:

y is the dependent variable

X1, X2, …, Xn are the independent variables

β0 is the intercept

β1, β2, …, βn are the slopes

Best Fit Line

Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a minimum.
The best-fit line equation provides a straight line that represents the relationship between the dependent and independent variables. The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable(s).

---
DECISION TREES AND RANDOM FORESTS

Decision Trees Classification

Decision Tree is a supervised learning technique. It is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.

A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node.
Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes
are the output of those decisions and do not contain any further branches.

Implementation of Decision Tree

 Data Pre-processing step
 Fitting a Decision-Tree algorithm to the Training set
 Predicting the test result
 Test accuracy of the result.
 Visualizing the test set result.
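A minimal sketch of these steps, assuming the Iris dataset as example data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                           # data pre-processing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion='entropy', random_state=0)
tree.fit(X_train, y_train)                                  # fitting the decision tree to the training set

y_pred = tree.predict(X_test)                               # predicting the test result
print(accuracy_score(y_test, y_pred))                       # test accuracy of the result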
Random Forest

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML.

Random Forest is a classifier that contains a number of decision trees built on various subsets of the given dataset and takes the average of their predictions to improve the predictive accuracy on that dataset. A greater number of trees in the forest leads to higher accuracy.

Implementation of Random Forest Algorithm

 Data Pre-processing step
 Fitting the Random forest algorithm to the Training set
 Predicting the test result
 Test accuracy of the result (Creation of Confusion matrix)
 Visualizing the test set result.
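A minimal sketch of these steps, again assuming the Iris dataset as example data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 decision trees on random subsets
forest.fit(X_train, y_train)                                       # fitting the random forest to the training set

y_pred = forest.predict(X_test)                                    # predicting the test result
print(confusion_matrix(y_test, y_pred))                            # test accuracy (confusion matrix)
print(accuracy_score(y_test, y_pred))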

---

PRINCIPAL COMPONENT ANALYSIS

Principal component analysis is a fast and flexible unsupervised method for dimensionality reduction in data. It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features. These new transformed features are called the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset.

Some properties of these principal components are given below:

 The principal components must be linear combinations of the original features.
 These components are orthogonal, i.e., the correlation between a pair of components is zero.
 The importance of each component decreases when going from 1 to n; that is, the 1st PC has the most importance and the nth PC has the least importance.

Steps for PCA algorithm

 Getting the dataset
 Representing data into a structure
 Standardizing the data
 Calculating the Covariance of Z
 Calculating the Eigen Values and Eigen Vectors
 Sorting the Eigen Vectors
 Calculating the new features or Principal Components
 Removing less important or unimportant features from the new dataset.
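A minimal sketch, assuming the Iris measurements as example data; scikit-learn's PCA class performs the covariance and eigen-decomposition steps internally:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)               # keep only the 2 most important principal components
X_reduced = pca.fit_transform(X)        # project the 4 original features onto those components
print(pca.explained_variance_ratio_)    # fraction of variance captured by each component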

PCA as Noise Filtering

When utilizing real-life data, several factors can impact the data. One significant element is noise. Data collection often presents opportunities for human error, and unreliable data collection tools can lead to inaccuracies commonly referred to as noise. This noise can present challenges in machine learning, as algorithms can misinterpret and generalize from it.

If a dataset has a high volume of noise, it can severely disrupt the whole data analysis. Data scientists often measure noise using a signal-to-noise ratio. Therefore, data scientists must address and manage noise in their data science algorithms.

PCA aims to eliminate noise from a signal or image while keeping the essential features. It is a geometric and statistical technique that lowers the dimensionality of the input signal data by projecting it along different axes. In simple terms, you can imagine projecting a point in the XY plane onto the X-axis and subsequently removing the noisy Y-axis. This process is known as "dimensionality reduction."
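A minimal sketch of PCA as a noise filter, assuming the scikit-learn digits dataset with artificially added Gaussian noise: components that explain little variance are discarded, and the reconstruction keeps mainly the signal.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
rng = np.random.RandomState(42)
noisy = digits.data + 4 * rng.normal(size=digits.data.shape)   # add Gaussian noise to the pixel features

pca = PCA(0.50).fit(noisy)                     # keep enough components to explain 50% of the variance
components = pca.transform(noisy)              # project onto the retained components
filtered = pca.inverse_transform(components)   # reconstruct: noise along the discarded axes is removed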

---

K-MEANS CLUSTERING

K-Means Clustering is an unsupervised learning algorithm, which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the cluster assignments no longer change. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for the K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. The data points that are nearest to a particular k-center form a cluster.

Hence each cluster has data points with some commonalities, and it is distinct from the other clusters.

The working of the K-Means algorithm is explained in the below steps:

 Step-1: Select the number K to decide the number of clusters.
 Step-2: Select K random points or centroids. (These may be points other than those from the input dataset.)
 Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
 Step-4: Calculate the variance and place a new centroid for each cluster.
 Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.
 Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
 Step-7: The model is ready.
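A minimal sketch using scikit-learn's KMeans on synthetic blob data (an assumed example):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # unlabeled-style input data
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)      # K must be chosen in advance
labels = kmeans.fit_predict(X)                                # iterative centroid update + assignment
print(kmeans.cluster_centers_)                                # final K center points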

---
