Machine Learning
MACHINE LEARNING
Regression:
In a regression task, the algorithm learns to predict continuous values based on input
features.
Classification:
In a classification task, the algorithm learns to assign input data to a specific category or
class based on input features.
Classification algorithms can be binary, where the output is one of two possible
classes, or multiclass, where the output can be one of several classes.
Clustering:
Dimensionality Reduction:
This is useful for reducing the complexity of a dataset and making it easier to
visualize and analyze.
Ex:
---
INTRODUCING SCIKIT-LEARN
Scikit-learn (sklearn) is one of the most widely used and robust libraries for machine learning
in Python. It provides a selection of efficient tools for machine learning and statistical
modeling, including classification, regression, clustering, and dimensionality reduction,
through a consistent interface in Python.
Data as table
Data in Scikit-learn is most naturally represented in the form of tables. A table is a 2-D
grid of data where the rows represent the individual elements of the dataset and the columns
represent the quantities related to those individual elements.
Ex:
In general, we will refer to the rows of the matrix as samples, the number of rows as
n_samples, columns of the matrix as features, and the number of columns as n_features.
The features matrix is the table layout in which the data can be thought of as a 2-D matrix.
By convention it is stored in a variable named X and is assumed to be two-dimensional with
shape [n_samples, n_features]. It is most often held in a NumPy array or a Pandas
DataFrame.
The samples (rows) always refer to the individual objects and the features (columns)
always refer to the distinct observations that describe each sample in a quantitative manner.
Data as Target array
Along with the features matrix, denoted by X, we also have a target array, also called the
label array and denoted by y. The target array is usually one-dimensional with length
n_samples. It may contain either continuous numerical values or discrete
values.
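The layout described above can be sketched with NumPy as follows. The numbers are a made-up toy dataset used only to show the shapes; only the names X, y, n_samples, and n_features come from the text.

```python
import numpy as np

# Hypothetical toy dataset: 4 samples (rows) described by 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])

# One target value per sample. These are discrete labels, but a target array
# of continuous values would have the same one-dimensional shape.
y = np.array([0, 0, 1, 1])

n_samples, n_features = X.shape
print(n_samples, n_features)  # 4 3
print(y.shape)                # (4,)
```

The only requirement connecting the two arrays is that y has one entry for each row of X.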
---
All machine learning algorithms in Scikit-Learn are implemented through the Estimator
API.
The Scikit-Learn API is designed with the following guiding principles:
Consistency
All objects share a common interface drawn from a limited set of methods, with
consistent documentation.
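Inspection
All specified parameter values are exposed as public attributes.
Limited object hierarchy
Only algorithms are represented by Python classes; datasets are represented in
standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and
parameter names use standard Python strings.
Composition
Many machine learning tasks can be expressed as sequences of more fundamental
algorithms, and Scikit-Learn makes use of this wherever possible.
Sensible defaults
When models require user-specified parameters, the library defines an appropriate
default value.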
Step 1: Choosing a class of model
In this first step, we choose a class of model by importing the appropriate Estimator class
from Scikit-learn.
Step 2: Choosing model hyperparameters
In this step, we choose the model hyperparameters by instantiating the class with the
desired values.
Step 3: Arranging the data
Next, we arrange the data into a features matrix (X) and a target vector (y).
Step 4: Fitting the model
Now, we fit the model to the data by calling the fit() method of the model instance.
Step 5: Applying the model to new data
After fitting the model, we can apply it to new data. For supervised learning, use the
predict() method to predict labels for unknown data. For unsupervised learning, use predict()
or transform() to infer properties of the data.
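The five steps can be sketched end to end as below. This is a minimal illustration, assuming LinearRegression as the chosen model and a synthetic, noiseless dataset (y = 3x + 1) invented for the example; any other estimator follows the same sequence of calls.

```python
import numpy as np

# Step 1: choose a class of model by importing the estimator class.
from sklearn.linear_model import LinearRegression

# Step 2: choose hyperparameters by instantiating the class.
model = LinearRegression(fit_intercept=True)

# Step 3: arrange the data into a features matrix X and a target vector y.
rng = np.random.RandomState(0)
x = 10 * rng.rand(40)
X = x[:, np.newaxis]   # shape (n_samples, n_features) = (40, 1)
y = 3 * x + 1          # assumed noiseless relationship, for illustration only

# Step 4: fit the model to the data.
model.fit(X, y)

# Step 5: apply the fitted model to new data.
X_new = np.array([[0.0], [2.0]])
y_new = model.predict(X_new)
print(y_new)
```

Because the synthetic data lie exactly on a line, the fitted slope and intercept recover the values used to generate it.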
---
FEATURE ENGINEERING
Feature engineering is the process of transforming raw data into features that are
suitable for machine learning models. In other words, it is the process of selecting, extracting,
and transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.
The success of machine learning models heavily depends on the quality of the features
used to train them. Feature engineering involves a set of techniques that enable us to create
new features by combining or transforming the existing ones.
Categorical Features
Categorical feature encoding transforms each categorical attribute into a numeric
representation. Transforming categorical data into numeric data is often called
“categorical-column encoding”.
One-hot encoding is one of the simplest categorical-column encoding methods. The idea is to
give each category a unique binary vector with one digit per category, so the number of
digits equals the number of categories. Exactly one digit is 1 and the rest are
zeros, hence the name ‘one-hot.’
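A minimal pure-Python sketch of one-hot encoding is shown below. The helper name one_hot_encode and the color values are hypothetical, chosen only for illustration; in practice a library encoder would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is the one-hot vector of one value."""
    categories = sorted(set(values))                 # one digit per distinct category
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                        # exactly one digit is "hot"
        rows.append(row)
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
for row in encoded:
    print(row)
```

Each row has as many digits as there are categories and sums to 1.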
Text Features
Another common need in feature engineering is to convert text to a set of representative
numerical values. One of the simplest methods of encoding data is by word counts: you take
each snippet of text, count the occurrences of each word within it, and put the results in a table.
Ex:
sample = ['problem of evil', 'evil queen', 'horizon problem']
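The word-count table for this sample can be built with the standard library alone, as in the sketch below. The variable names vocabulary and table are illustrative; Scikit-learn's CountVectorizer performs a comparable encoding.

```python
from collections import Counter

sample = ['problem of evil', 'evil queen', 'horizon problem']

# Count the occurrences of each word within each snippet.
counts = [Counter(text.split()) for text in sample]

# The columns of the table are all distinct words, in alphabetical order.
vocabulary = sorted({word for c in counts for word in c})

# One row per snippet, one column per word (a missing word counts as 0).
table = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)  # ['evil', 'horizon', 'of', 'problem', 'queen']
for row in table:
    print(row)
```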
Image Features
Another common need is to suitably encode images for machine learning analysis.
The simplest approach is to use the pixel values.
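As a rough sketch of the pixel-value approach, a small grayscale image can be flattened into a single feature vector. The 3x3 image below is made up for illustration, and scaling to the range [0, 1] is an optional choice, not something the text prescribes.

```python
import numpy as np

# Hypothetical 3x3 grayscale image with intensities from 0 to 255.
image = np.array([
    [0, 128, 255],
    [64, 192, 32],
    [255, 0, 128],
], dtype=np.uint8)

# Flatten row by row so each pixel becomes one feature, then scale to [0, 1].
features = image.reshape(-1) / 255.0

print(features.shape)  # (9,)
```

Stacking one such vector per image gives a features matrix with one row per image and one column per pixel.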
Derived Features
The goal of creating derived features is to improve the performance of machine learning
models by providing additional information or reducing noise in the data. Derived features are
useful in many machine-learning applications, including image recognition, natural language
processing, and financial analysis.
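One way to see the effect of a derived feature is the sketch below, which assumes a target that depends on x quadratically. A straight-line fit on the raw feature leaves a residual, while adding the derived feature x squared lets the same linear least-squares fit match the data. The helper fit_residual is hypothetical and written only for this example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2  # assumed quadratic relationship, for illustration only


def fit_residual(features, target):
    """Least-squares fit with an intercept; returns the sum of squared residuals."""
    A = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(np.sum((A @ coef - target) ** 2))


raw = x[:, np.newaxis]                  # original feature only
derived = np.column_stack([x, x ** 2])  # original feature plus derived x**2

res_raw = fit_residual(raw, y)
res_derived = fit_residual(derived, y)
print(res_raw, res_derived)
```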
---
The name Naive Bayes combines two words, Naive and Bayes:
Naive:
Ex: If a fruit is identified on the basis of color, shape, and taste, then a red,
spherical, and sweet fruit is recognized as an apple, with each feature contributing to
that identification independently of the others.
Bayes:
Bayes' Theorem:
Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the
probability of a hypothesis given prior knowledge. It depends on conditional probability:
P(A|B)=\frac{P(B|A)P(A)}{P(B)}
where:
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is
true.
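A small worked example may help. The sketch below applies the standard form of the theorem to made-up numbers: a hypothesis A with prior probability 0.01, evidence B that appears with probability 0.9 when A is true and 0.05 when A is false. The function name and all values are hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A), and P(B|not A)."""
    # Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    # Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
    return likelihood * prior / evidence


p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 4))  # 0.1538
```

Even with a high likelihood, the posterior stays modest here because the prior is small.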
Gaussian:
The Gaussian model assumes that features follow a normal distribution. This means if
predictors take continuous values instead of discrete, then the model assumes that these values
are sampled from the Gaussian distribution.
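A minimal sketch of the Gaussian model using Scikit-learn's GaussianNB is given below. The six data points are invented for illustration and form two clearly separated groups of continuous values.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous features: two well-separated groups.
X = np.array([
    [1.0, 2.1], [1.2, 1.9], [0.8, 2.0],   # class 0
    [4.0, 5.1], [4.2, 4.9], [3.8, 5.0],   # class 1
])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)

# theta_ holds the per-class mean of each feature estimated during fitting.
print(model.theta_)

pred = model.predict(np.array([[1.1, 2.0], [4.1, 5.0]]))
print(pred)
```

Each new point is assigned to the class whose fitted normal distributions make it most probable.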
Multinomial:
The Multinomial Naive Bayes classifier is used when the data follow a multinomial
distribution. It is primarily used for document classification, that is, deciding which
category (such as sports, politics, or education) a particular document belongs to.
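Document classification with this model can be sketched as below, assuming word counts as the features. The four training documents and their labels are invented for illustration; CountVectorizer and MultinomialNB are existing Scikit-learn classes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training documents with their categories.
docs = [
    "the team won the match",
    "the striker scored a goal",
    "parliament passed the bill",
    "the minister won the election",
]
labels = ["sports", "sports", "politics", "politics"]

# Turn each document into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

model = MultinomialNB()
model.fit(X, labels)

# New documents must be encoded with the same fitted vocabulary.
new_doc = ["the striker scored in the match"]
pred = model.predict(vectorizer.transform(new_doc))
print(pred[0])
```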
Steps to implement:
---
LINEAR REGRESSION
Linear regression is one of the simplest and most popular machine learning algorithms.
It models a linear relationship between a dependent variable and one or more
independent variables.
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent
variable and one dependent variable. The equation for simple linear regression is:
y=\beta_{0}+\beta_{1}X
where:
β0 is the intercept
β1 is the slope
Ex:
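import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression

sns.set_theme()  # apply seaborn's default plot styling

# Generate 50 noisy points scattered around the line y = 2x - 5
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

# Fit a straight line; fit() expects X with shape (n_samples, n_features)
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

# Predict on an evenly spaced grid and draw the fitted line over the data
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.show()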
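The fitted parameters can also be read directly from the model. The sketch below regenerates the same synthetic data and prints the estimated slope and intercept through the coef_ and intercept_ attributes; they should land close to the values 2 and -5 used to generate the data, though not exactly, because of the added noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same synthetic data as in the example: noisy points around y = 2x - 5.
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)

slope = model.coef_[0]        # estimate of beta_1
intercept = model.intercept_  # estimate of beta_0
print("slope:", slope)
print("intercept:", intercept)
```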
Multiple Linear Regression
This involves more than one independent variable and one dependent variable. The
equation for multiple linear regression is:
y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\cdots+\beta_{n}X_{n}
where:
β0 is the intercept
Our primary objective when using linear regression is to locate the best-fit line, which
means keeping the error between the predicted and actual values to a minimum.
The best-fit line equation gives a straight line that represents the relationship between the
dependent and independent variables. The slope of the line indicates how much the dependent
variable changes for a unit change in the independent variable(s).
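For several independent variables, the coefficients that minimize the squared error can be found by ordinary least squares. The sketch below uses NumPy's lstsq on synthetic, noiseless data; the coefficient values (0.5, 1.5, -2.0, 1.0) are assumptions made up for the example.

```python
import numpy as np

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)               # 100 samples, 3 independent variables

true_intercept = 0.5                    # assumed beta_0
true_slopes = np.array([1.5, -2.0, 1.0])  # assumed beta_1 .. beta_3
y = true_intercept + X @ true_slopes    # noiseless target, for illustration only

# Add a column of ones so the intercept is estimated with the slopes.
A = np.column_stack([np.ones(len(X)), X])

# Least squares: choose coefficients minimizing the sum of squared errors.
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slopes = coef[0], coef[1:]
print(intercept, slopes)
```

Each slope gives the change in y for a unit change in its variable with the other variables held fixed.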
---
DECISION TREES AND RANDOM FORESTS
A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make a decision and have multiple branches, whereas Leaf nodes
are the outputs of those decisions and do not contain any further branches.
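The two node types can be sketched with a tiny hand-built tree. This is only an illustration of the structure, not a learned model: the questions, the fruit labels, and the classify helper are all hypothetical.

```python
# Decision nodes are dicts holding a yes/no question and two branches;
# leaf nodes are plain strings holding the output.
tree = {
    "question": lambda fruit: fruit["color"] == "red",
    "yes": {
        "question": lambda fruit: fruit["shape"] == "spherical",
        "yes": "apple",
        "no": "strawberry",
    },
    "no": "banana",
}


def classify(node, sample):
    """Follow yes/no answers from the root until a leaf node is reached."""
    while isinstance(node, dict):          # still at a decision node
        branch = "yes" if node["question"](sample) else "no"
        node = node[branch]
    return node                            # leaf node: the decision output


print(classify(tree, {"color": "red", "shape": "spherical"}))  # apple
print(classify(tree, {"color": "yellow", "shape": "long"}))    # banana
```

A learning algorithm would choose the questions from training data rather than having them written by hand.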
Random Forest is a popular machine learning algorithm that belongs to the family of
supervised learning techniques. It can be used for both classification and regression
problems in ML.
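Both uses can be sketched with Scikit-learn's RandomForestClassifier and RandomForestRegressor. The data below are synthetic (two groups of points around (0, 0) and (5, 5)), and the hyperparameter values are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Hypothetical data: two groups of 50 points each.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
y_class = np.array([0] * 50 + [1] * 50)  # discrete labels for classification
y_reg = X[:, 0] + X[:, 1]                # continuous target for regression

X_new = np.array([[0.0, 0.0], [5.0, 5.0]])

# Classification: predict a class label for each new point.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y_class)
labels = clf.predict(X_new)

# Regression: predict a continuous value for each new point.
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X, y_reg)
values = reg.predict(X_new)

print(labels, values)
```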
---
Each principal component must be a linear combination of the original features.
The components are orthogonal, i.e., the correlation between any pair of components is zero.
The importance of each component decreases going from 1 to n: the first PC has the most
importance and the n-th PC has the least.
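These properties can be checked numerically. The sketch below computes principal components with a singular value decomposition in NumPy, which is one common way to do it; the synthetic 3-feature dataset is invented for the example.

```python
import numpy as np

# Hypothetical correlated data: 200 samples, 3 features.
rng = np.random.RandomState(0)
latent = rng.randn(200, 3) * np.array([3.0, 1.0, 0.2])
X = latent @ rng.randn(3, 3)

# Center the data, then take the SVD; the rows of Vt are the components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt                              # each row: weights on the original features
explained_variance = S ** 2 / (len(X) - 1)   # importance of each component
scores = X_centered @ components.T           # data expressed in component coordinates

print(np.round(components @ components.T, 6))  # identity matrix: orthogonal components
print(explained_variance)                      # sorted from largest to smallest
```

The covariance matrix of the scores is diagonal, which is the sense in which the components are uncorrelated.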
When working with real-life data, several factors can affect its quality. One significant
factor is noise. Data collection often leaves room for human error, and unreliable
collection tools can introduce inaccuracies, commonly referred to as noise. This
noise can present challenges in machine learning, as algorithms can misinterpret and
generalize from it.
If a dataset has a high volume of noise, it can severely disrupt the whole data analysis.
Data scientists often measure noise using a signal-to-noise ratio, and they
must address and manage noise in their algorithms.
PCA can be used to remove noise from a signal or image while keeping its essential
features. It is a geometric and statistical technique that lowers the dimensionality of the
input data by projecting it along different axes. In simple terms, you can
imagine projecting a point in the XY plane onto the X-axis, thereby discarding the
noisy Y-axis. This process is known as "dimensionality reduction."
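A rough sketch of this idea follows, assuming clean 2-D points that lie exactly on a line and Gaussian noise added on top. Projecting the noisy points onto the first principal component and mapping them back discards the variation along the second, noise-dominated axis. All data and parameter values are made up for the illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
t = rng.uniform(0, 10, 200)
clean = np.column_stack([t, 2 * t])              # points exactly on a line
noisy = clean + rng.normal(0, 0.5, clean.shape)  # add noise in both directions

# Find the first principal component of the noisy data.
mean = noisy.mean(axis=0)
centered = noisy - mean
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
pc1 = Vt[0]

# Dimensionality reduction: keep only the coordinate along the first component.
projected = centered @ pc1

# Map the 1-D coordinates back to 2-D to obtain the denoised points.
denoised = mean + np.outer(projected, pc1)

err_noisy = np.mean(np.sum((noisy - clean) ** 2, axis=1))
err_denoised = np.mean(np.sum((denoised - clean) ** 2, axis=1))
print(err_noisy, err_denoised)
```

The noise component along the line itself remains, so the error is reduced rather than eliminated.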
---
K-MEANS CLUSTERING
The algorithm takes an unlabeled dataset as input, divides it into k clusters, and repeats
the process until the clusters stop improving. The value of k must be chosen in advance.
It determines the best positions for the k center points, or centroids, through an
iterative process, and it assigns each data point to its closest centroid. The data points
nearest to a particular centroid form a cluster.
Hence each cluster contains data points with some commonalities and is well separated from
the other clusters.
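The iterative process can be sketched in a few lines of NumPy. This is a minimal version with random initial centroids, written for illustration only; the function name k_means, its parameters, and the synthetic two-group dataset are assumptions, and Scikit-learn's KMeans class would normally be used in practice.

```python
import numpy as np


def k_means(X, k, n_iter=100, seed=0):
    """Minimal k-means: returns (labels, centroids)."""
    rng = np.random.RandomState(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid.
        distances = np.linalg.norm(X[:, np.newaxis, :] - centroids[np.newaxis, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Hypothetical unlabeled data: two groups around (0, 0) and (10, 10).
rng = np.random.RandomState(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(10.0, 1.0, size=(50, 2)),
])

labels, centroids = k_means(X, k=2)
print(centroids)
```

The cluster numbering is arbitrary, so only the grouping of points is meaningful, not which group is labeled 0 or 1.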
---