DATA SCIENCE 1 (7th sem)

DATA SCIENCE(UNIT-1)

What is Data Science?


Data science is a multidisciplinary field that combines techniques from statistics and
computer science to extract valuable insights and knowledge from data. Additionally, it
involves collecting, cleaning, and analysing data to discover patterns, make predictions, and
enhance decision-making. Data scientists use a variety of tools and techniques, such as data
analysis, machine learning, data visualization, and data mining, to achieve these goals.
What is the Difference Between Data Science, ML and AI?

What is Google Colab?


Google Colaboratory is a free online cloud-based Jupyter notebook environment that allows
us to train our machine learning and deep learning models on CPUs, GPUs, and TPUs. GPUs
and especially TPUs are normally expensive to rent, yet Colab lets us use them for free.
Benefits of Google Colab
Here are some of the benefits of using Google Colab:
 Accessibility: Google Colab is accessible from any web browser, so you don’t need to
install any software on your computer.
 Power: Google Colab provides access to powerful computing resources, including
GPUs and TPUs. This means you can train and run complex machine-learning models
quickly and efficiently.
 Collaboration: Google Colab makes it easy to collaborate with others on projects. You
can share your notebooks with others, and they can edit and run your code in real-
time.
 Education: Google Colab Notebook is an excellent tool for learning about machine
learning and data science.
What is Python?
Python is a popular programming language. It was created by Guido van Rossum, and
released in 1991.
It is used for:
 web development (server-side),
 software development,
 mathematics,
 system scripting.
Features in Python
1. Free and Open Source
Python is freely available on the official website, and you can download and use it at no cost.
2. Easy to code
Python is a high-level programming language. Python is very easy to learn compared to
other languages like C, C#, JavaScript, Java, etc.
3. Object-Oriented Language
One of the key features of Python is object-oriented programming. Python supports
object-oriented concepts such as classes, objects, and encapsulation.

4. High-Level Language
Python is a high-level language. When we write programs in Python, we do not need to
remember the system architecture, nor do we need to manage the memory.

5. Large Community Support


Python has gained popularity over the years. Questions are constantly answered by the
enormous Stack Overflow community, which has already provided answers to many
questions about Python, so Python users can consult it as needed.
6. Easy to Debug
Python provides excellent information for error tracing. Once you understand how to
interpret Python's error traces, you can quickly identify and correct the majority of your
program's issues, and you can often tell what a piece of code is meant to do simply by
reading it.
7. Python is a Portable language
Python language is also a portable language. For example, if we have Python code for
Windows and if we want to run this code on other platforms such as Linux, Unix, and Mac
then we do not need to change it, we can run this code on any platform.

8. Python is an Integrated language


Python is also an Integrated language because we can easily integrate Python with other
languages like C, C++, etc.

9. Interpreted Language:
Python is an interpreted language: code is executed line by line. Unlike languages such as
C, C++, and Java, there is no separate compilation step, which makes Python code easier to
debug.

10. Large Standard Library


Python has a large standard library that provides a rich set of modules and functions so
you do not have to write your own code for every single thing.

11. Dynamically Typed Language


Python is a dynamically-typed language. That means the type of a variable (for example
int, double, long, etc.) is decided at run time rather than in advance, so we don't need to
declare the type of a variable.
What is Data Preprocessing
Real-world datasets are generally messy, raw, incomplete, inconsistent, and unusable. They
can contain manual entry errors, missing values, inconsistent schemas, etc.
Data Preprocessing is the process of converting raw data into a format that is
understandable and usable. It is a crucial step in any Data Science project to carry out an
efficient and accurate analysis. It ensures that data quality is consistent before applying
any Machine Learning or Data Mining techniques.
Why is Data Preprocessing Important
Data Preprocessing is an important step in the Data Preparation stage of a Data Science
development lifecycle that will ensure reliable, robust, and consistent results. The main
objective of this step is to ensure and check the quality of data before applying any Machine
Learning or Data Mining methods.
Accuracy - Data Preprocessing will ensure that input data is accurate and reliable by
ensuring there are no manual entry errors, no duplicates, etc.
Completeness - It ensures that missing values are handled, and data is complete for further
analysis.
Consistent - Data Preprocessing ensures that input data is consistent, i.e., the same data
kept in different places should match.
Timeliness - Whether data is updated regularly and on a timely basis.
Trustworthiness - Whether data is coming from trustworthy sources.
Interpretability - Raw data is generally unusable, and Data Preprocessing converts raw data
into an interpretable format.
Key Steps in Data Preprocessing
1. Data Cleaning
This is the first and most important step. It involves fixing or removing any problems in the
data.
 Handling Missing Data: Data might have gaps or missing values. To deal with this, we
can:
o Remove rows or columns that have missing values.
o Fill missing values with averages (mean), most frequent values (mode), or
other techniques like using predictions from similar data.
 Fixing Inconsistent Data: Sometimes data is recorded in different ways. For example,
dates may be in different formats (like 01/02/2024 or January 2, 2024). In this step,
we make sure the data is consistent.
2. Data Transformation
Here, we change the data into a better format or structure.
 Normalization: This means adjusting all the data to a common scale, like making all
numbers fall between 0 and 1. This helps to compare them easily, especially when
working with algorithms that expect data in a certain range.
 Standardization: This is another way to scale the data but instead of a specific range,
it adjusts the data so that it has a mean of 0 and a standard deviation of 1.
 Discretization: This involves converting continuous data (like height, weight, or
temperature) into categories or bins. For example, instead of having specific ages like
21, 22, 23, etc., we group them into age ranges like 20-30, 30-40, etc.
3. Data Reduction
In this step, we simplify the data to make it smaller and easier to work with, without losing
important information.
 Dimensionality Reduction: Sometimes, there are too many features (or columns) in
the data, and many of them may not be necessary. We can remove unnecessary
features while keeping only the most important ones. This makes the data smaller
and speeds up processing.
 Aggregation: We can combine smaller units of data into larger ones. For example,
instead of looking at daily sales numbers, we can combine them into monthly totals,
which are easier to analyze.
4. Data Integration
When data comes from multiple sources, it needs to be combined in a meaningful way.
 For example, data about customers may come from different departments (sales,
support, marketing) and need to be merged to create a complete view of each
customer.
 We also have to make sure that the same data from different sources is consistent
(e.g., using the same format for dates or phone numbers).
5. Data Encoding
This step is necessary when dealing with categorical data (data that has categories instead of
numbers, like color or brand names). Most machine learning algorithms can only work with
numbers, so we convert categories into numbers:
 Label Encoding: Each category is assigned a unique number. For example, "Red" can
be assigned 1, "Blue" as 2, and "Green" as 3.
 One-Hot Encoding: Instead of assigning numbers, we create separate columns for
each category with 1s and 0s. For example, for colors, we could have three columns:
"Red," "Blue," and "Green." If the color is Red, the Red column will have a 1, and the
other two will have 0.
6. Feature Selection
This step involves selecting the most important features (or columns) from the data. Not all
data is important for making predictions, so we choose only the useful ones to make the
model simpler and faster.
7. Outlier Detection and Removal
Outliers are extreme values that don’t fit well with the rest of the data. For example, if we
are analyzing income data and most values are around $40,000 to $60,000, but we find one
value of $1,000,000, it might be an outlier. We can choose to either remove or fix such
outliers to avoid them skewing our analysis.
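The preprocessing steps above can be illustrated with a short, hedged sketch using pandas and scikit-learn. The tiny dataset and its column names (age, income, city) are made up purely for demonstration.
CODE:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy dataset with a missing value, a categorical column, and an outlier (all invented)
df = pd.DataFrame({
    "age": [21, 22, None, 23, 24],
    "income": [40000, 52000, 61000, 1000000, 45000],
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
})

# Data cleaning: fill the missing age with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Outlier handling: clip income to the 1st-99th percentile range
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(low, high)

# Data transformation: normalize age to [0, 1] and standardize income
df["age_norm"] = MinMaxScaler().fit_transform(df[["age"]]).ravel()
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Data encoding: one-hot encode the categorical 'city' column
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

print(df.head())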

Data Scales
Data scales refer to the different ways we can represent data values in a dataset.
Understanding the types of scales helps in choosing the right statistical methods and
machine learning algorithms. There are four common types of data scales:
1. Nominal Scale (Categorical Data)
 Description: This scale represents categories or labels with no specific order. It's used
for naming or classifying items.
 Example:
o Colors of cars: Red, Blue, Green.
o Types of fruits: Apple, Orange, Banana.
o Gender: Male, Female.
Key Points:
 No mathematical operations can be performed on these values.
 We can only count or compare equality, not order them.
2. Ordinal Scale
 Description: This scale represents categories that have a meaningful order or
ranking, but the intervals between values are not necessarily equal.
 Example:
o Rating a movie: Poor, Average, Good, Excellent.
o Education level: High School, Bachelor’s, Master’s, PhD.
o Customer satisfaction: Very dissatisfied, Dissatisfied, Neutral, Satisfied, Very
satisfied.
Key Points:
 We can rank the data, but we can’t measure the exact difference between them.
 For example, the difference between "Poor" and "Average" might not be the same as
between "Good" and "Excellent."
3. Interval Scale
 Description: This scale has ordered categories, and the intervals between values are
equal, but there is no true zero point.
 Example:
o Temperature (in Celsius or Fahrenheit): The difference between 10°C and
20°C is the same as between 20°C and 30°C, but 0°C does not mean "no
temperature."
o Time of the day (on a 12-hour clock): 2 PM is 2 hours after 12 PM, but there
is no "zero" time.
Key Points:
 Addition and subtraction are meaningful, but ratios (e.g., "twice as much") are not
because there is no true zero point.
4. Ratio Scale
 Description: This is the most informative scale. It has all the properties of the interval
scale, but it also has a true zero point, which means that we can compare both
differences and ratios.
 Example:
o Height: 0 cm means "no height," and 180 cm is twice as tall as 90 cm.
o Weight: 0 kg means "no weight," and 50 kg is half the weight of 100 kg.
o Distance: 0 km means "no distance," and 200 km is twice as long as 100 km.
Key Points:
 All mathematical operations (addition, subtraction, multiplication, division) are
meaningful.

Similarity and Dissimilarity Measures:


Similarity and dissimilarity measures are mathematical methods used to compare data
points or objects. They help determine how similar or different two data points are, which is
essential for tasks like clustering, classification, and recommendation systems.
1. Similarity Measures
A similarity measure quantifies how alike two data points are. Higher values indicate more
similarity. These measures are often scaled between 0 and 1 (or a percentage), where 1
indicates identical items.
Common Similarity Measures:
 Euclidean Distance (Inverse):
o Measures the straight-line distance between two points in a multi-
dimensional space.
o Formula: d(p, q) = sqrt(Σ (p_i − q_i)^2), converted to a similarity, for example as 1 / (1 + d).
o Example: The distance between (2, 3) and (6, 7) is sqrt((6 − 2)^2 + (7 − 3)^2) = sqrt(32) ≈ 5.66. The closer the points, the higher their similarity.
 Cosine Similarity:
o Measures the cosine of the angle between two vectors. It’s used for
measuring text similarity and is often used in document comparison.
o Formula: cos(θ) = (A · B) / (||A|| · ||B||)
o Example: For two documents represented as vectors of word counts, cosine similarity measures how similar they are in terms of content. If the angle between them is small, the documents are similar.
 Jaccard Similarity (Intersection over Union):
o Measures the similarity between two sets by dividing the size of their
intersection by the size of their union.
o Formula: J(A, B) = |A ∩ B| / |A ∪ B|
o Example: For two sets A = {1, 2, 3} and B = {2, 3, 4}, the Jaccard similarity is 2/4 = 0.5.
 Pearson Correlation Coefficient:
o Measures the linear correlation between two variables, indicating how closely
their values move together.
o Formula: r = Σ (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ (x_i − x̄)^2) · sqrt(Σ (y_i − ȳ)^2) )
o Example: Used to measure the similarity in trends between two data series. A
correlation of 1 indicates perfect similarity, while -1 indicates they move in
opposite directions.
2. Dissimilarity Measures
A dissimilarity measure quantifies how different two data points are. Higher values indicate
more dissimilarity. These measures are used in clustering algorithms like k-means, where we
need to separate points based on their dissimilarity.
Common Dissimilarity Measures:
 Euclidean Distance:
o Measures the straight-line distance between two points.
o Formula: d(p, q) = sqrt(Σ (p_i − q_i)^2)
o Example: In a 2D plane, the further apart two points (2,3) and (6,7) are, the
higher their dissimilarity.
 Manhattan Distance (Taxicab or L1 distance):
o Measures the distance between two points by adding the absolute
differences along each dimension.
o Formula: d(p, q) = Σ |p_i − q_i|
o Example: For two points (2, 3) and (6, 7), the Manhattan distance is |2 − 6| + |3 − 7| = 4 + 4 = 8. It's often used when movement is restricted to grid-like paths.


 Hamming Distance:
o Measures the number of positions at which the corresponding elements of
two strings or binary vectors are different.
o Example: The Hamming distance between "10101" and "10011" is 2 because
there are two positions where the bits differ.
 Jaccard Dissimilarity:
o It is the complement of Jaccard similarity.
o Formula: d_J(A, B) = 1 − J(A, B)
o Example: If two sets A and B have a Jaccard similarity of 0.5, their dissimilarity will be 1 − 0.5 = 0.5.
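As a hedged illustration, the sketch below computes several of these measures with NumPy and SciPy for the small examples used above; the arrays and sets are the same toy values, not real data.
CODE:
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, cosine

p, q = np.array([2, 3]), np.array([6, 7])

# Euclidean and Manhattan distances between (2, 3) and (6, 7)
print(euclidean(p, q))   # sqrt(32) ≈ 5.657
print(cityblock(p, q))   # |2 - 6| + |3 - 7| = 8

# Cosine similarity (SciPy's cosine() returns the dissimilarity, so subtract from 1)
print(1 - cosine(p, q))

# Jaccard similarity and dissimilarity for A = {1, 2, 3}, B = {2, 3, 4}
A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)
print(jaccard, 1 - jaccard)      # 0.5 and 0.5

# Hamming distance between "10101" and "10011" (number of differing positions)
print(sum(c1 != c2 for c1, c2 in zip("10101", "10011")))   # 2

# Pearson correlation between two series
x, y = np.array([1, 2, 3, 4, 5]), np.array([2, 4, 6, 8, 10])
print(np.corrcoef(x, y)[0, 1])   # 1.0, a perfect positive correlation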

Sampling and Quantization of data


ANS- In data science, sampling and quantization are important techniques for managing and
processing data.
Sampling
What It Is: Sampling is about choosing a smaller portion of data from a larger dataset to
analyze or work with.
Why We Do It:
 Efficiency: It’s often impractical or too time-consuming to analyze every single piece
of data, especially if you have a huge dataset.
 Insight: A well-chosen sample can give you a good idea of the whole dataset without
having to process everything.
How It Works:
 Random Sampling: Selecting data points randomly. Imagine you have a big bag of
mixed candies. You reach in and grab a handful of candies randomly. This random
sample should give you a good idea of what kinds of candies are in the bag.
 Stratified Sampling: Dividing the data into groups and sampling from each group. Suppose the bag has different types of candies (like chocolates, gummies, and lollipops). You make sure to pick a few from each type to get a representative sample of the whole bag.
 Systematic Sampling: Selecting every nth data point. This method is useful when
data is ordered. If you line up the candies and pick every 10th one, you’re using
systematic sampling. This way, you get a sample that’s evenly spaced out.
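A minimal pandas sketch of the three sampling strategies; the "bag of candies" DataFrame and its columns are assumptions made up for this example.
CODE:
import pandas as pd

# Hypothetical bag of 100 candies
df = pd.DataFrame({
    "candy_type": ["chocolate"] * 50 + ["gummy"] * 30 + ["lollipop"] * 20,
    "weight_g": range(100),
})

# Random sampling: grab 10 candies at random
random_sample = df.sample(n=10, random_state=42)

# Stratified sampling: take 10% from each candy type
stratified_sample = df.groupby("candy_type", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=42)
)

# Systematic sampling: take every 10th candy
systematic_sample = df.iloc[::10]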
Quantization
What It Is: Quantization is the process of converting continuous data (like a smooth gradient
of colors) into a set of discrete values (like a limited palette of colors).
Why We Do It:
 Simplification: Continuous data can be complex and large. Quantization makes it
simpler by reducing the number of distinct values.
 Storage and Processing: It helps in storing and processing data more efficiently by
using less space and computational power.
How It Works:
 Uniform Quantization: Divides the range of data into equal-sized intervals. Imagine
you have a range of numbers between 0 and 100, and you decide to divide this range
into 10 equal parts. Each number then gets rounded to the nearest part. So, if you
have 25, 27, and 29, they might all be rounded to 30.
 Non-Uniform Quantization: Uses intervals of varying sizes. Sometimes, you might
want to have more precision for certain ranges. For example, if you’re measuring
temperatures, you might use finer steps for lower temperatures (where changes are
more noticeable) and coarser steps for higher temperatures.
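A small NumPy sketch of uniform and non-uniform quantization; the values and bin edges are invented for illustration.
CODE:
import numpy as np

values = np.array([3, 25, 27, 29, 47, 62, 88, 95])

# Uniform quantization: 10 equal-width bins over the range 0-100
uniform_edges = np.arange(0, 101, 10)            # 0, 10, ..., 100
print(np.digitize(values, uniform_edges))        # bin index for each value

# Non-uniform quantization: finer bins at the low end, coarser at the high end
nonuniform_edges = np.array([0, 5, 10, 20, 50, 100])
print(np.digitize(values, nonuniform_edges))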

Data Transformation and Data Merging


ANS- Data Transformation
Definition: Data transformation is the process of changing data from one format or structure
into another to make it suitable for analysis or to meet specific needs.
Common Types:
1. Normalization/Standardization: Adjusting data to a common scale or format. For
instance:
o Normalization: Scaling values to a range, like 0 to 1.
o Standardization: Adjusting values to have a mean of 0 and a standard
deviation of 1.
2. Aggregation: Combining multiple data points into a summary statistic, like calculating
the average, sum, or count. For example, summing monthly sales data to get annual
sales figures.
3. Data Cleaning: Correcting or removing inaccurate, corrupted, or irrelevant data. This
might involve handling missing values, correcting typos, or removing duplicates.
4. Data Encoding: Converting categorical data into numerical format. For example,
converting “Red,” “Blue,” and “Green” into 1, 2, and 3.
5. Feature Engineering: Creating new features from existing data to improve the
performance of machine learning models. For example, combining "height" and
"weight" to create a "BMI" feature.
Example: Imagine you have a dataset with heights in centimeters and weights in kilograms.
To analyze BMI, you would transform the data by calculating BMI using the formula: BMI =
weight (kg) / (height (m))^2.
Data Merging
Definition: Data merging is the process of combining two or more datasets into a single
dataset. This helps in integrating information from different sources to create a unified view.
Common Methods:
 Joining Tables: Imagine you have two tables of data: one with customer information
and one with order details. Merging these tables will let you see which customers
made which orders.
 Inner Join: Combine only the data that matches in both tables (e.g., only show orders
for customers who exist in both tables).
 Outer Join: Combine all data, including the unmatched parts (e.g., show all
customers and their orders, including customers with no orders).
 Concatenation: Stacking datasets vertically (appending rows) or horizontally (adding
columns). Useful when combining similar datasets or adding new attributes.
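A hedged pandas sketch of the joins and concatenation described above; the customers and orders tables are invented for illustration.
CODE:
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Bhumika", "Chirag"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [250, 120, 500, 90]})

# Inner join: only customers that appear in both tables
inner = pd.merge(customers, orders, on="customer_id", how="inner")

# Outer join: all data, including unmatched rows (filled with NaN)
outer = pd.merge(customers, orders, on="customer_id", how="outer")

# Concatenation: stack two similar datasets vertically (append rows)
more_customers = pd.DataFrame({"customer_id": [5], "name": ["Divya"]})
all_customers = pd.concat([customers, more_customers], ignore_index=True)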

Data Visualization
ANS- Definition: Data visualization is the process of creating visual representations of data to
help people understand patterns, trends, and insights more easily.
Why We Use It:
 Clarity: Makes complex data more understandable at a glance.
 Patterns: Helps to quickly identify trends and outliers.
 Communication: Makes it easier to share findings with others.
Common Types of Data Visualizations:
1. Bar Charts: Show quantities of different categories with bars. Great for comparing
different groups. For example, showing sales numbers for different products.

2. Line Graphs: Display data points connected by lines. Useful for showing changes over
time, like tracking monthly sales.
3. Pie Charts: Represent parts of a whole as slices of a pie. Good for showing how
different parts contribute to the total, like the market share of different companies.

4. Histograms: Display the distribution of numerical data by grouping it into bins. Useful
for understanding the frequency of data within certain ranges, like test scores.

5. Scatter Plots: Show data points on a grid to observe the relationship between two
variables. For example, plotting height against weight to see if there's a correlation.
6. Heat Maps: Use color to represent data values. Great for visualizing data density or
intensity, like showing the number of website visits by day and time.

Example:
Imagine you have data on how many books different people read in a month. Instead of just
listing the numbers, you create a bar chart. Each bar represents a person, and the height of
the bar shows how many books they read. This makes it easy to compare the reading habits
of different people at a glance.
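The books-read example could be drawn with matplotlib roughly as follows; the names and counts are placeholders.
CODE:
import matplotlib.pyplot as plt

people = ["Aman", "Bhumika", "Chirag", "Divya"]
books_read = [3, 7, 5, 2]

plt.bar(people, books_read)           # one bar per person
plt.xlabel("Person")
plt.ylabel("Books read in a month")
plt.title("Monthly reading habits")
plt.show()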

Correlation
ANS- What Is Correlation?
Definition: Correlation is a measure of how two variables are related to each other. It tells us
whether changes in one variable are associated with changes in another.
Types of Correlation:
1. Positive Correlation: When one variable increases, the other variable also increases.
For example, height and weight often have a positive correlation.
2. Negative Correlation: When one variable increases, the other variable decreases. For
example, the amount of time spent watching TV and the amount of time spent
exercising might have a negative correlation.
3. No Correlation: No consistent relationship between the variables. For example, shoe
size and intelligence may have no correlation.
How Correlation Is Measured:
 Correlation Coefficient: A numerical value between -1 and 1 that indicates the
strength and direction of the relationship.
o +1: Perfect positive correlation (they move together perfectly).
o -1: Perfect negative correlation (one increases as the other decreases
perfectly).
o 0: No correlation (no predictable relationship).
Example:
Imagine you have data on students' hours of study and their exam scores. If you find a high
positive correlation between the hours of study and the exam scores, it means that students
who study more tend to score higher on the exam. This relationship helps you understand
how study time impacts performance.
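A minimal sketch of computing the correlation coefficient for the study-hours example; the numbers are made up.
CODE:
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_scores = np.array([52, 58, 65, 70, 78, 85])

# Correlation coefficient lies between -1 and 1
r = np.corrcoef(hours_studied, exam_scores)[0, 1]
print(round(r, 3))   # close to +1: a strong positive correlation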

PCA (Principal Component Analysis)


Definition: PCA is a statistical technique used in data science for dimensionality reduction,
which transforms a dataset into a new coordinate system, focusing on maximizing variance.
Steps Involved:
1. Standardization: Scale the dataset to have a mean of 0 and a variance of 1.
2. Covariance Matrix: Calculate the covariance matrix to identify how features vary
together.
3. Eigenvalues and Eigenvectors: Compute the eigenvalues and eigenvectors of the
covariance matrix. Eigenvectors represent directions in the new space, and
eigenvalues indicate variance.
4. Principal Components Selection: Choose the top eigenvectors (principal
components) based on their eigenvalues to reduce dimensionality.
Applications: PCA is used for data visualization, noise reduction, and as a preprocessing step
for machine learning, helping to simplify data analysis.
Benefits: It helps in reducing complexity, eliminates correlated features, and can enhance
the performance of machine learning models by focusing on the most significant data
aspects.
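A short scikit-learn sketch of the PCA steps listed above (standardize, then project onto the top components); the random data is purely illustrative.
CODE:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy dataset: 100 samples, 5 features, one deliberately correlated with another
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Step 1: standardization
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-4: covariance, eigen-decomposition, and component selection are handled internally
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)   # variance captured by each principal component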

UNIT-2
Regression Analysis
Definition: Regression analysis is a statistical method used to examine the relationship
between one dependent variable (target) and one or more independent variables
(predictors). It helps in understanding how the dependent variable changes when any of the
independent variables vary.
Purpose of Regression Analysis:
1. Prediction: To predict the value of the dependent variable based on the values of the
independent variables.
2. Understanding Relationships: To assess the strength and nature of relationships
between variables.
3. Trend Analysis: To identify trends in data over time, useful in forecasting.
4. Effect Quantification: To quantify the effect of independent variables on the
dependent variable, helping in decision-making.
Types of Regression Analysis:
1. Linear Regression: Models the relationship between the dependent and
independent variable as a straight line. It can be simple (one predictor) or multiple
(more than one predictor).
A sloped straight line represents the linear regression model.

Syntax:
y = θx + b, where
θ – the model weights (slope) or parameters
b – the bias (intercept).
CODE:
from sklearn.linear_model import LinearRegression

# Create a linear regression model
model = LinearRegression()
# Fit the model to the data
model.fit(X, y)
# Predict the response for a new data point
y_pred = model.predict(X_new)
Polynomial Regression: A form of linear regression where the relationship between
variables is modeled as an nth degree polynomial. It captures non-linear relationships.

CODE:
# scikit-learn has no PolynomialRegression class; polynomial regression is built
# by combining PolynomialFeatures with LinearRegression in a pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
# Create a degree-2 polynomial regression model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
# Fit the model to the data
model.fit(X, y)
# Predict the response for a new data point
y_pred = model.predict(X_new)
Logistic Regression: Used when the dependent variable is categorical (e.g., binary
outcomes). It estimates the probability that a given input point belongs to a certain category.
Ridge Regression: A type of linear regression that includes a penalty term to reduce
overfitting by constraining the size of the coefficients.
Lasso Regression: Similar to ridge regression, but it can reduce some coefficients to zero,
effectively performing variable selection.
Stepwise Regression: Involves automatic selection of independent variables by adding or
removing them based on their statistical significance.
Elastic Net: Combines properties of both ridge and lasso regression, balancing between
regularization and variable selection.
Cross Validation:
What is Cross-Validation?
Cross-validation is a technique used in machine learning to check how well a model will
perform on new, unseen data. It helps ensure that the model doesn't just work well on the
data it was trained on, but also performs well when given different data. The idea is to divide
the data into several parts (called "folds"), train the model on some of these parts, and test
it on the remaining parts. This process is repeated multiple times, using a different part for
testing each time, and the results are averaged to get a more accurate measure of the
model's performance.
Why is Cross-Validation Important?
The main reason for using cross-validation is to prevent overfitting. Overfitting happens
when a model learns too much from the training data, including the noise or irrelevant
details, which makes it perform poorly on new data. Cross-validation helps by giving a more
realistic estimate of how well the model will generalize to unseen data.
Types of Cross-Validation
1. Holdout Validation:
o In this method, the dataset is divided into two parts: one part is used for
training the model, and the other part is used for testing it.
o Example: If you have 100 data points, you might use 50 for training and the
other 50 for testing.
o Advantage: Simple and quick.
o Disadvantage: You might miss important patterns in the data if you only use
half for training.
2. LOOCV (Leave-One-Out Cross-Validation):
o This method trains the model on all data except for one data point, which is
used for testing. This process is repeated for every single data point.
o Example: If you have 10 data points, the model will be trained on 9 points
and tested on 1. This is done 10 times, using a different data point for testing
each time.
o Advantage: It uses almost all data for training, so the model has more
information to learn from.
o Disadvantage: It takes a lot of time because you need to train the model as
many times as there are data points.
3. Stratified Cross-Validation:
o This method is especially useful when you have imbalanced data, meaning
that some categories are more common than others. Stratified cross-
validation ensures that each fold has the same distribution of classes as the
original dataset.
o Example: If 80% of your data is class A and 20% is class B, each fold in
stratified cross-validation will maintain this ratio.
o Advantage: Ensures that each fold is representative of the entire dataset.
o Disadvantage: More complex to implement than simple k-fold cross-
validation.
4. K-Fold Cross-Validation:
o In this method, the dataset is split into k equal-sized subsets (called folds).
The model is trained on k-1 folds and tested on the remaining fold. This is
repeated k times, each time using a different fold for testing.
o Example: If k=5 and you have 100 data points, the data is divided into 5
subsets (20 points each). In each iteration, 80 points are used for training, and
20 points are used for testing.
o Advantage: More reliable results because all the data is used for both training
and testing at some point.
o Disadvantage: Training the model multiple times can take longer.
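A hedged scikit-learn sketch of k-fold and stratified cross-validation; the synthetic dataset and the choice of logistic regression are assumptions for illustration.
CODE:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Plain 5-fold cross-validation
kf_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified 5-fold keeps the class ratio the same in every fold
skf_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print(kf_scores.mean(), skf_scores.mean())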
Advantages and Disadvantages of Cross Validation
Advantages:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by providing a more
robust estimate of the model’s performance on unseen data.
2. Model Selection: Cross validation can be used to compare different models and select the
one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize the hyperparameters of
a model, such as the regularization parameter, by selecting the values that result in the best
performance on the validation set.
4. Data Efficient: Cross validation allows the use of all the available data for both training
and validation, making it a more data-efficient method compared to traditional validation
techniques.
Disadvantages:
1. Computationally Expensive: Cross validation can be computationally expensive, especially
when the number of folds is large or when the model is complex and requires a long time to
train.
2. Time-Consuming: Cross validation can be time-consuming, especially when there are
many hyperparameters to tune or when multiple models need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross validation can impact
the bias-variance tradeoff: too few folds tend to give a more biased (pessimistic) estimate of
performance, while very many folds (e.g., leave-one-out) increase the variance of the
estimate and the computation cost.
Q Difference between training data and testing data

Q Why do we need Training data and Testing data


1.Training Data
 Model Learning: It allows the model to learn patterns, relationships, and features
from the data. Without training data, the model cannot develop the ability to make
predictions.
 Parameter Adjustment: The model adjusts its parameters based on the training data
to minimize error and improve accuracy.
2. Testing Data
 Performance Evaluation: It provides an unbiased assessment of the model’s
performance on unseen data, helping to understand how well the model will
perform in real-world scenarios.
 Generalization Check: It helps determine if the model is overfitting (memorizing
training data rather than learning) or if it can generalize well to new, unseen data.
 Hyperparameter Tuning: Testing data can be used to fine-tune hyperparameters,
ensuring that the model is optimized for best performance.
Importance of Both
 Balanced Approach: Using both training and testing data ensures that the model is
both well-trained and able to perform accurately on new data, leading to better
overall performance in practical applications.
Q How Do Training and Testing Data Work?
1.Data Preparation
 Collection: Gather a dataset relevant to the problem you want to solve.
 Splitting: Divide the dataset into two parts: training data (usually 70-80%) and testing
data (usually 20-30%).
2. Training Phase
 Model Initialization: Choose a machine learning algorithm (e.g., linear regression,
decision tree).
 Training: Use the training data to teach the model. During this phase, the model
learns:
o Patterns: It identifies relationships between input features and the target
variable.
o Parameter Adjustment: The model adjusts its internal parameters to
minimize the error in its predictions on the training data.
3. Testing Phase
 Model Evaluation: After training, the model is tested using the testing data.
 Performance Metrics: The model makes predictions on the testing data, and its
performance is evaluated using metrics like accuracy, precision, recall, or mean
squared error, depending on the task.
 Generalization Check: This step helps determine if the model can generalize well to
new, unseen data rather than just memorizing the training data.
4. Iteration
 Refinement: Based on the performance on the testing data, you may need to adjust
the model, choose different algorithms, or modify the training data.
 Retraining: If changes are made, the model is retrained with the training data before
being tested again.
5.Deployment and Monitoring
 Deployment: Once the model performs well on testing data, it can be deployed for
real-world use (e.g., predictions, classifications).
 Monitoring: Automation tools often include monitoring capabilities to track model
performance over time, ensuring it remains effective with new incoming data.
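The train/test workflow above can be sketched with scikit-learn as follows; the synthetic dataset, the decision tree model, and the accuracy metric are placeholder choices.
CODE:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Data preparation: split into 80% training data and 20% testing data
X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Training phase: the model learns patterns from the training data
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

# Testing phase: evaluate on unseen data
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))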
Non-Linear Regression
Non-linear regression is a type of regression analysis used to model the relationship
between a dependent variable and one or more independent variables when that
relationship is not a straight line. In contrast to linear regression, which assumes a linear
relationship, non-linear regression can capture more complex patterns.
Key Features of Non-Linear Regression:
1. Modeling Flexibility: Non-linear regression can fit a wide variety of functional forms,
such as exponential, logarithmic, polynomial, or sinusoidal relationships.
2. Equation Form: The model is expressed as:
Y=f(x) + ϵ
where y is the dependent variable, f(x) is a non-linear function of the independent variable
x, and ϵ (epsilon) is the error term.
3. Common Examples:
o Polynomial Regression: Models relationships using polynomial equations
(e.g., quadratic or cubic functions).
o Exponential Regression: Models data that increases or decreases at a
consistent percentage rate.
o Logarithmic Regression: Useful for modeling relationships that grow rapidly
at first and then level off.
4. Applications: Non-linear regression is used in various fields, including biology
(growth models), economics (demand curves), and engineering (stress-strain
relationships).
5. Fitting the Model: Non-linear regression typically requires iterative methods (like the
Newton-Raphson method) to estimate parameters, as closed-form solutions are
often not available.
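A minimal sketch of non-linear regression using SciPy's curve_fit to estimate the parameters of an exponential model; the generated data and the chosen function are illustrative assumptions.
CODE:
import numpy as np
from scipy.optimize import curve_fit

# Non-linear model: y = a * exp(b * x)
def exponential(x, a, b):
    return a * np.exp(b * x)

# Synthetic noisy data that follows an exponential trend
x = np.linspace(0, 4, 50)
y = 2.5 * np.exp(0.8 * x) + np.random.normal(scale=2.0, size=x.size)

# Parameters are estimated iteratively, starting from an initial guess p0
params, _ = curve_fit(exponential, x, y, p0=(1.0, 0.5))
print("Estimated a, b:", params)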
Latent Variables
Latent variables are variables that are not directly observed or measured but are inferred
from other variables that are observed. They represent underlying factors or constructs that
can influence observed data.
Key Features of Latent Variables:
1. Hidden Nature: Latent variables are not directly measurable. Instead, they are
estimated based on observable indicators or measurements.
2. Examples:
o Psychological Constructs: Traits like intelligence, motivation, or satisfaction,
which cannot be directly measured but can be inferred through
questionnaires or tests.
o Market Factors: Consumer preferences or brand loyalty, which influence
purchasing behavior but are not directly observable.
3. Statistical Models: Latent variables are often used in statistical modeling techniques
such as:
o Factor Analysis: Used to identify underlying relationships between measured
variables.
o Structural Equation Modeling (SEM): Combines factor analysis and regression
to assess relationships among latent and observed variables.
4. Purpose: Latent variables help simplify complex systems by summarizing underlying
patterns and relationships, making it easier to analyze and interpret data.
5. Applications: Commonly used in fields like psychology, social sciences, economics,
and marketing research to model concepts that are not directly measurable.
Structural Equation Modeling
Structural Equation Modeling (SEM) is a comprehensive statistical technique used to analyze
complex relationships between observed and latent variables. It combines aspects of factor
analysis and multiple regression, allowing researchers to evaluate both the measurement
and structural components of their models.
Key Features of SEM:
1. Latent Variables: SEM allows for the inclusion of latent variables, which are
unobserved constructs inferred from observed indicators. This helps in modeling
theoretical concepts that are not directly measurable.
2. Model Structure: SEM consists of two main components:
o Measurement Model: Describes how latent variables are measured by
observed variables. It assesses the relationships between indicators and their
respective latent constructs.
o Structural Model: Describes the relationships between latent variables. It
assesses the direct and indirect effects among them.
3. Path Diagrams: SEM is often represented using path diagrams, which visually depict
the relationships between variables. Arrows indicate the direction and type of
relationships (e.g., causal).
4. Estimation Methods: Common methods for estimating the parameters in SEM
include Maximum Likelihood Estimation (MLE), Generalized Least Squares (GLS), and
Bayesian estimation.
5. Goodness of Fit: SEM provides various fit indices (e.g., Chi-square, RMSEA, CFI) to
assess how well the model fits the data. A good fit indicates that the model
adequately represents the observed relationships.
6. Applications: SEM is widely used in social sciences, psychology, marketing, and other
fields to test theoretical models, understand complex relationships, and validate
measurement instruments.
Steps in SEM
Steps involved in Structural Equation Modeling (SEM):
1. Define the Research Problem
 Identify Constructs: Determine the latent variables and observed indicators relevant
to your research question.
 Theoretical Framework: Develop a theoretical model that outlines expected
relationships among the variables.
2. Specify the Model
 Measurement Model: Define how latent variables are measured by observed
variables.
 Structural Model: Outline the relationships (paths) between latent variables,
indicating direct and indirect effects.
3. Develop a Path Diagram
 Create a visual representation of the model using a path diagram, showing latent and
observed variables along with arrows indicating the direction of relationships.
4. Collect Data
 Gather data through surveys, experiments, or existing datasets that include
measurements for the observed variables.
5. Estimate Model Parameters
 Use statistical software to estimate the parameters of the SEM model (e.g.,
regression weights, variances). Common methods include Maximum Likelihood
Estimation (MLE) or Bayesian estimation.
6. Assess Model Fit
 Evaluate how well the model fits the data using fit indices such as:
o Chi-Square: Tests the difference between the observed and expected
covariance matrices.
o Root Mean Square Error of Approximation (RMSEA): Indicates model fit,
with lower values suggesting better fit.
o Comparative Fit Index (CFI): Compares the fit of the specified model to a
baseline model.
7. Modify the Model (if necessary)
 If the model fit is poor, consider modifying the model based on theoretical
justification, such as adding or removing paths, or re-specifying measurement
relationships.
8. Validate the Model
 Conduct cross-validation using a different dataset (if available) to confirm the
robustness and generalizability of the model.
9. Interpret Results
 Analyze the estimated parameters and their significance to draw conclusions about
the relationships between variables and assess the implications for the research
question.
10. Report Findings
 Document the methodology, results, and conclusions in a clear and comprehensive
manner, including discussions on limitations and future research directions.
Ridge Regression
Ridge regression is a type of linear regression that includes a regularization term to prevent
overfitting and improve model generalization. It is particularly useful when dealing with
multicollinearity (when independent variables are highly correlated) or when the number of
predictors exceeds the number of observations.

Key Features of Ridge Regression:


1. Regularization:
o Ridge regression adds a penalty term to the loss function: the squared magnitude of the coefficients multiplied by a regularization parameter (λ):
Loss = Σ (y_i − ŷ_i)^2 + λ Σ β_j^2
2. Bias-Variance Trade-off:
o By adding the penalty term, ridge regression introduces a small bias but
significantly reduces variance, leading to a model that performs better on
unseen data.
3. Shrinkage of Coefficients:
o The regularization term causes the coefficients to be "shrunk" towards zero,
which can help in reducing the effect of less important predictors.
4. Interpretability:
o While ridge regression does not eliminate any coefficients (unlike Lasso
regression), it provides a more stable solution when predictors are correlated.
5. Applications:
o Commonly used in situations with many predictors, such as in genomics,
finance, and any domain where multicollinearity is a concern.
6. Choosing λ:
o The regularization parameter λ is crucial. It can be selected using techniques
like cross-validation to find the optimal value that balances bias and variance.
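A short hedged sketch of ridge regression in scikit-learn, including choosing λ (called alpha in scikit-learn) by cross-validation; the synthetic data is an assumption.
CODE:
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic regression data with many features
X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge with a fixed regularization strength
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
print("R^2 with alpha=1.0:", ridge.score(X_test, y_test))

# RidgeCV selects alpha by cross-validation from a list of candidates
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]).fit(X_train, y_train)
print("Chosen alpha:", ridge_cv.alpha_)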

UNIT-3
Time Series Analysis and Forecasting
What is a Time Series?
A time series is a sequence of data points collected, recorded, or measured at successive,
evenly-spaced time intervals.
Each data point represents observations or measurements taken over time, such as stock
prices, temperature readings, or sales figures. Time series data is commonly represented
graphically with time on the horizontal axis and the variable of interest on the vertical axis,
allowing analysts to identify trends, patterns, and changes over time.
Importance of Time Series Analysis
1. Predict Future Trends: Time series analysis enables the prediction of future trends,
allowing businesses to anticipate market demand, stock prices, and other key
variables, facilitating proactive decision-making.
2. Detect Patterns and Anomalies: By examining sequential data points, time series
analysis helps detect recurring patterns and anomalies, providing insights into
underlying behaviors and potential outliers.
3. Risk Mitigation: By spotting potential risks, businesses can develop strategies to
mitigate them, enhancing overall risk management.
4. Strategic Planning: Time series insights inform long-term strategic planning, guiding
decision-making across finance, healthcare, and other sectors.
5. Competitive Edge: Time series analysis enables businesses to optimize resource
allocation effectively, whether it's inventory, workforce, or financial assets. By staying
ahead of market trends, responding to changes, and making data-driven decisions,
businesses gain a competitive edge.

Components of Time Series Data


There are four main components of a time series:



1. Trend: Trend represents the long-term movement or directionality of the data over
time. It captures the overall tendency of the series to increase, decrease, or remain
stable. Trends can be linear, indicating a consistent increase or decrease, or
nonlinear, showing more complex patterns.
2. Seasonality: Seasonality refers to periodic fluctuations or patterns that occur at
regular intervals within the time series. These cycles often repeat annually, quarterly,
monthly, or weekly and are typically influenced by factors such as seasons, holidays,
or business cycles.
3. Cyclic variations: Cyclical variations are longer-term fluctuations in the time series
that do not have a fixed period like seasonality. These fluctuations represent
economic or business cycles, which can extend over multiple years and are often
associated with expansions and contractions in economic activity.
4. Irregularity (or Noise): Irregularity, also known as noise or randomness, refers to the
unpredictable or random fluctuations in the data that cannot be attributed to the
trend, seasonality, or cyclical variations. These fluctuations may result from random
events, measurement errors, or other unforeseen factors. Irregularity makes it
challenging to identify and model the underlying patterns in the time series data.

Time Series Visualization


Time series visualization is the graphical representation of data collected over successive
time intervals. It encompasses various techniques such as line plots, seasonal subseries
plots, autocorrelation plots, histograms, and interactive visualizations. These methods help
analysts identify trends, patterns, and anomalies in time-dependent data for better
understanding and decision-making.
Different Time series visualization graphs
1. Line Plots: Line plots display data points over time, allowing easy observation of
trends, cycles, and fluctuations.
2. Seasonal Plots: These plots break down time series data into seasonal components,
helping to visualize patterns within specific time periods.
3. Histograms and Density Plots: Shows the distribution of data values over time,
providing insights into data characteristics such as skewness and kurtosis.
4. Autocorrelation and Partial Autocorrelation Plots: These plots visualize correlation
between a time series and its lagged values, helping to identify seasonality and
lagged relationships.
5. Spectral Analysis: Spectral analysis techniques, such as periodograms and
spectrograms, visualize frequency components within time series data, useful for
identifying periodicity and cyclical patterns.
6. Decomposition Plots: Decomposition plots break down a time series into its trend,
seasonal, and residual components, aiding in understanding the underlying patterns.
These visualization techniques allow analysts to explore, interpret, and communicate
insights from time series data effectively, supporting informed decision-making and
forecasting.
Preprocessing Time Series Data
Time series preprocessing refers to the steps taken to clean, transform, and prepare time
series data for analysis or forecasting. It involves techniques aimed at improving data quality,
removing noise, handling missing values, and making the data suitable for modeling.
Preprocessing tasks may include removing outliers, handling missing values through
imputation, scaling or normalizing the data, detrending, deseasonalizing, and applying
transformations to stabilize variance. The goal is to ensure that the time series data is in a
suitable format for subsequent analysis or modeling.
 Handling Missing Values : Dealing with missing values in the time series data to
ensure continuity and reliability in analysis.
 Dealing with Outliers: Identifying and addressing observations that significantly
deviate from the rest of the data, which can distort analysis results.
 Stationarity and Transformation: Ensuring that the statistical properties of the time
series, such as mean and variance, remain constant over time. Techniques like
differencing, detrending, and deseasonalizing are used to achieve stationarity.
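A small pandas sketch of these preprocessing steps (missing-value handling and differencing towards stationarity); the monthly series is made up.
CODE:
import numpy as np
import pandas as pd

# Monthly sales series with a trend and two missing values (invented)
idx = pd.date_range("2023-01-01", periods=12, freq="MS")
sales = pd.Series([100, 104, np.nan, 115, 120, 126, np.nan, 138, 144, 150, 157, 165], index=idx)

# Handle missing values by linear interpolation
sales = sales.interpolate()

# First-order differencing removes the trend, which helps achieve stationarity
sales_diff = sales.diff().dropna()
print(sales_diff.head())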

Time Series Analysis & Decomposition


Time Series Analysis and Decomposition is a systematic approach to studying sequential data
collected over successive time intervals. It involves analyzing the data to understand its
underlying patterns, trends, and seasonal variations, as well as decomposing the time series
into its fundamental components. This decomposition typically includes identifying and
isolating elements such as trend, seasonality, and residual (error) components within the
data.
Different Time Series Analysis & Decomposition Techniques
1. Autocorrelation Analysis: A statistical method to measure the correlation between a
time series and a lagged version of itself at different time lags. It helps identify
patterns and dependencies within the time series data.
2. Partial Autocorrelation Functions (PACF): PACF measures the correlation between a
time series and its lagged values, controlling for intermediate lags, aiding in
identifying direct relationships between variables.
3. Trend Analysis: The process of identifying and analyzing the long-term movement or
directionality of a time series. Trends can be linear, exponential, or nonlinear and are
crucial for understanding underlying patterns and making forecasts.
4. Seasonality Analysis: Seasonality refers to periodic fluctuations or patterns that
occur in a time series at fixed intervals, such as daily, weekly, or yearly. Seasonality
analysis involves identifying and quantifying these recurring patterns to understand
their impact on the data.
5. Decomposition: Decomposition separates a time series into its constituent
components, typically trend, seasonality, and residual (error). This technique helps
isolate and analyze each component individually, making it easier to understand and
model the underlying patterns.
6. Spectrum Analysis: Spectrum analysis involves examining the frequency domain
representation of a time series to identify dominant frequencies or periodicities. It
helps detect cyclic patterns and understand the underlying periodic behavior of the
data.
7. Seasonal and Trend decomposition using Loess: STL decomposes a time series into
three components: seasonal, trend, and residual. This decomposition enables
modeling and forecasting each component separately, simplifying the forecasting
process.
8. Rolling Correlation: Rolling correlation calculates the correlation coefficient between
two time series over a rolling window of observations, capturing changes in the
relationship between variables over time.
9. Cross-correlation Analysis: Cross-correlation analysis measures the similarity
between two time series by computing their correlation at different time lags. It is
used to identify relationships and dependencies between different variables or time
series.
10. Box-Jenkins Method: Box-Jenkins Method is a systematic approach for analyzing and
modeling time series data. It involves identifying the appropriate autoregressive
integrated moving average (ARIMA) model parameters, estimating the model,
diagnosing its adequacy through residual analysis, and selecting the best-fitting
model.
11. Granger Causality Analysis: Granger causality analysis determines whether one time
series can predict future values of another time series. It helps infer causal
relationships between variables in time series data, providing insights into the
direction of influence.
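A hedged statsmodels sketch of decomposing a series into trend, seasonal, and residual components; the synthetic monthly series is an assumption.
CODE:
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: trend + yearly seasonality + noise
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
values = (2 * np.arange(48)                                  # trend
          + 10 * np.sin(2 * np.pi * np.arange(48) / 12)      # seasonality
          + np.random.normal(scale=2, size=48))              # noise
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
print(result.resid.dropna().head())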

What is Time Series Forecasting?


Time Series Forecasting is a statistical technique used to predict future values of a time
series based on past observations. In simpler terms, it's like looking into the future of data
points plotted over time. By analyzing patterns and trends in historical data, Time Series
Forecasting helps make informed predictions about what may happen next, assisting in
decision-making and planning for the future.
Different Time Series Forecasting Algorithms
1. Autoregressive (AR) Model: Autoregressive (AR) model is a type of time series model
that predicts future values based on linear combinations of past values of the same
time series. In an AR(p) model, the current value of the time series is modeled as a
linear function of its previous p values, plus a random error term. The order of the
autoregressive model (p) determines how many past values are used in the
prediction.
2. Autoregressive Integrated Moving Average (ARIMA): ARIMA is a widely used
statistical method for time series forecasting. It models the next value in a time series
based on linear combination of its own past values and past forecast errors. The
model parameters include the order of autoregression (p), differencing (d), and
moving average (q).
3. ARIMAX: ARIMA model extended to include exogenous variables that can improve
forecast accuracy.
4. Seasonal Autoregressive Integrated Moving Average (SARIMA): SARIMA extends
ARIMA by incorporating seasonality into the model. It includes additional seasonal
parameters (P, D, Q) to capture periodic fluctuations in the data.
5. SARIMAX: Extension of SARIMA that incorporates exogenous variables for seasonal
time series forecasting.
6. Vector Autoregression (VAR) Models: VAR models extend autoregression to
multivariate time series data by modeling each variable as a linear combination of its
past values and the past values of other variables. They are suitable for analyzing and
forecasting interdependencies among multiple time series.
7. Theta Method: A simple and intuitive forecasting technique based on extrapolation
and trend fitting.
8. Exponential Smoothing Methods: Exponential smoothing methods, such as Simple
Exponential Smoothing (SES) and Holt-Winters, forecast future values by
exponentially decreasing weights for past observations. These methods are
particularly useful for data with trend and seasonality.
9. Gaussian Processes Regression: Gaussian Processes Regression is a Bayesian non-
parametric approach that models the distribution of functions over time. It provides
uncertainty estimates along with point forecasts, making it useful for capturing
uncertainty in time series forecasting.
10. Generalized Additive Models (GAM): A flexible modeling approach that combines
additive components, allowing for nonlinear relationships and interactions.
11. Random Forests: Random Forests is a machine learning ensemble method that
constructs multiple decision trees during training and outputs the average prediction
of the individual trees. It can handle complex relationships and interactions in the
data, making it effective for time series forecasting.
12. Gradient Boosting Machines (GBM): GBM is another ensemble learning technique
that builds multiple decision trees sequentially, where each tree corrects the errors
of the previous one. It excels in capturing nonlinear relationships and is robust
against overfitting.
13. State Space Models: State space models represent a time series as a combination of
unobserved (hidden) states and observed measurements. These models capture
both the deterministic and stochastic components of the time series, making them
suitable for forecasting and anomaly detection.
14. Dynamic Linear Models (DLMs): DLMs are Bayesian state-space models that
represent time series data as a combination of latent state variables and
observations. They are flexible models capable of incorporating various trends,
seasonality, and other dynamic patterns in the data.
15. Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)
Networks: RNNs and LSTMs are deep learning architectures designed to handle
sequential data. They can capture complex temporal dependencies in time series
data, making them powerful tools for forecasting tasks, especially when dealing with
large-scale and high-dimensional data.
16. Hidden Markov Model (HMM): A Hidden Markov Model (HMM) is a statistical model
used to describe sequences of observable events generated by underlying hidden
states. In time series, HMMs infer hidden states from observed data, capturing
dependencies and transitions between states. They are valuable for tasks like speech
recognition, gesture analysis, and anomaly detection, providing a framework to
model complex sequential data and extract meaningful patterns from it.
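To make the ARIMA family described above concrete, here is a minimal sketch using the statsmodels library (introduced in the next section). The toy monthly series, the non-seasonal order (1, 1, 1), and the seasonal order (1, 1, 1, 12) are illustrative assumptions only, not recommended settings.

# Minimal sketch: fitting ARIMA and SARIMA with statsmodels (toy data, placeholder orders).
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

# A synthetic monthly series (random walk with drift) standing in for real data.
rng = np.random.default_rng(0)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 120)),
              index=pd.date_range("2015-01-01", periods=120, freq="MS"))

# ARIMA(p=1, d=1, q=1): one autoregressive term, one difference, one moving-average term.
arima_fit = ARIMA(y, order=(1, 1, 1)).fit()
print(arima_fit.forecast(steps=12))        # 12-step-ahead point forecasts

# SARIMA via SARIMAX: adds seasonal (P, D, Q, s) terms; exogenous regressors could be
# passed through the exog= argument to obtain ARIMAX/SARIMAX behaviour.
sarima_fit = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(sarima_fit.forecast(steps=12))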
Top Python Libraries for Time Series Analysis & Forecasting
Python Libraries for Time Series Analysis & Forecasting encompass a suite of powerful tools
and frameworks designed to facilitate the analysis and forecasting of time series data. These
libraries offer a diverse range of capabilities, including statistical modeling, machine learning
algorithms, deep learning techniques, and probabilistic forecasting methods. With their
user-friendly interfaces and extensive documentation, these libraries serve as invaluable
resources for both beginners and experienced practitioners in the field of time series
analysis and forecasting.
1. Statsmodels: Statsmodels is a Python library for statistical modeling and hypothesis
testing. It includes a wide range of statistical methods and models, including time
series analysis tools like ARIMA, SARIMA, and VAR. Statsmodels is useful for
performing classical statistical tests and building traditional time series models.
2. Pmdarima: Pmdarima is a Python library that provides an interface to ARIMA models
in a manner similar to that of scikit-learn. It automates the process of selecting
optimal ARIMA parameters and fitting models to time series data.
3. Prophet: Prophet is a forecasting tool developed by Facebook that is specifically
designed for time series forecasting at scale. It provides a simple yet powerful
interface for fitting and forecasting time series data, with built-in support for
handling seasonality, holidays, and trend changes.
4. tslearn: tslearn is a Python library for time series learning, which provides various
algorithms and tools for time series classification, clustering, and regression. It offers
implementations of state-of-the-art algorithms, such as dynamic time warping (DTW)
and shapelets, for analyzing and mining time series data.
5. ARCH: ARCH is a Python library for estimating and forecasting volatility models
commonly used in financial econometrics. It provides tools for fitting autoregressive
conditional heteroskedasticity (ARCH) and generalized autoregressive conditional
heteroskedasticity (GARCH) models to time series data.
6. GluonTS: GluonTS is a Python library for probabilistic time series forecasting
developed by Amazon. It provides a collection of state-of-the-art deep learning
models and tools for building and training probabilistic forecasting models for time
series data.
7. PyFlux: PyFlux is a Python library for time series analysis and forecasting, which
provides implementations of various time series models, including ARIMA, GARCH,
and stochastic volatility models. It offers an intuitive interface for fitting and
forecasting time series data with Bayesian inference methods.
8. Sktime: Sktime is a Python library for machine learning with time series data, which
provides a unified interface for building and evaluating machine learning models for
time series forecasting, classification, and regression tasks. It integrates seamlessly
with scikit-learn and offers tools for handling time series data efficiently.
9. PyCaret: PyCaret is an open-source, low-code machine learning library in Python that
automates the machine learning workflow. It supports time series forecasting tasks
and provides tools for data preprocessing, feature engineering, model selection, and
evaluation in a simple and streamlined manner.
10. Darts: Darts is a Python library for time series forecasting. It provides a flexible and
modular framework for developing and evaluating forecasting models, including classical
and deep learning-based approaches. Darts emphasizes simplicity, scalability, and
reproducibility in time series analysis and forecasting tasks.
11. Kats: Kats, short for "Kits to Analyze Time Series," is an open-source Python library
developed by Facebook. It provides a comprehensive toolkit for time series analysis,
offering a wide range of functionalities to handle various aspects of time series data.
Kats includes tools for time series forecasting, anomaly detection, feature
engineering, and model evaluation. It aims to simplify the process of working with
time series data by providing an intuitive interface and a collection of state-of-the-art
algorithms.
12. AutoTS: AutoTS, or Automated Time Series, is a Python library developed to simplify
time series forecasting by automating the model selection and parameter tuning
process. It employs machine learning algorithms and statistical techniques to
automatically identify the most suitable forecasting models and parameters for a
given dataset. This automation saves time and effort by eliminating the need for
manual model selection and tuning.
13. Scikit-learn: Scikit-learn is a popular machine learning library in Python that provides
a wide range of algorithms and tools for data mining and analysis. While not
specifically tailored for time series analysis, Scikit-learn offers regression, classification,
and clustering algorithms that can be adapted to forecasting tasks, for example by
framing forecasting as regression on lagged features.
14. TensorFlow: TensorFlow is an open-source machine learning framework developed
by Google. It is widely used for building and training deep learning models, including
recurrent neural networks (RNNs) and long short-term memory networks (LSTMs),
which are commonly used for time series forecasting tasks.
15. Keras: Keras is a high-level neural networks API written in Python, which runs on top
of TensorFlow. It provides a user-friendly interface for building and training neural
networks, including recurrent and convolutional neural networks, for various
machine learning tasks, including time series forecasting.
16. PyTorch: PyTorch is another popular deep learning framework that is widely used for
building neural network models. It offers dynamic computation graphs and a flexible
architecture, making it suitable for prototyping and experimenting with complex
models for time series forecasting.
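As a taste of how one of these libraries is used in practice, below is a minimal Prophet sketch. The synthetic daily series is a placeholder, and the example assumes the prophet package (formerly fbprophet) is installed.

# Minimal Prophet sketch on a synthetic daily series (placeholder data).
import numpy as np
import pandas as pd
from prophet import Prophet

dates = pd.date_range("2022-01-01", periods=365, freq="D")
values = 10 + 0.05 * np.arange(365) + 2 * np.sin(2 * np.pi * np.arange(365) / 7)
df = pd.DataFrame({"ds": dates, "y": values})      # Prophet expects columns named ds and y

model = Prophet()                                  # trend and seasonality handled automatically
model.fit(df)
future = model.make_future_dataframe(periods=30)   # extend the frame 30 days ahead
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())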
UNIT-4
Basic Concept of Classification (Data Mining)
Data Mining: Data mining in general terms means mining or digging deep into data that is in
different forms to gain patterns, and to gain knowledge on that pattern. In the process of
data mining, large data sets are first sorted, then patterns are identified and relationships
are established to perform data analysis and solve problems.
Classification is a task in data mining that involves assigning a class label to each instance in a
dataset based on its features. The goal of classification is to build a model that accurately
predicts the class labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class classification.
Binary classification involves classifying instances into two classes, such as “spam” or “not
spam”, while multi-class classification involves classifying instances into more than two
classes.
The process of building a classification model typically involves the following steps:
Data Collection:
The first step in building a classification model is data collection. In this step, the data
relevant to the problem at hand is collected. The data should be representative of the
problem and should contain all the necessary attributes and labels needed for classification.
The data can be collected from various sources, such as surveys, questionnaires, websites,
and databases.
Data Preprocessing:
The second step in building a classification model is data preprocessing. The collected data
needs to be preprocessed to ensure its quality. This involves handling missing values, dealing
with outliers, and transforming the data into a format suitable for analysis. Data
preprocessing also involves converting the data into numerical form, as most classification
algorithms require numerical input.
Handling Missing Values: Missing values in the dataset can be handled by replacing them
with the mean, median, or mode of the corresponding feature or by removing the entire
record.
Dealing with Outliers: Outliers in the dataset can be detected using various statistical
techniques such as z-score analysis, boxplots, and scatterplots. Outliers can be removed
from the dataset or replaced with the mean, median, or mode of the corresponding feature.
Data Transformation: Data transformation involves scaling or normalizing the data to bring it
into a common scale. This is done to ensure that all features have the same level of
importance in the analysis.
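A short sketch of these preprocessing steps using pandas and scikit-learn is shown below; the DataFrame, the column names, and the outlier threshold are made-up illustrations rather than a fixed recipe.

# Illustrative preprocessing: missing values, outliers, and scaling (placeholder data).
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, np.nan, 41, 29],
                   "income": [30000, 42000, 39000, 1_000_000, 35000]})

# Handling missing values: replace NaN with the column mean (median or mode are also common).
df["age"] = df["age"].fillna(df["age"].mean())

# Dealing with outliers: z-score analysis (a threshold of 3 is typical on real data;
# 1.5 is used here only because this toy sample is tiny). Outliers are replaced by the median.
z = (df["income"] - df["income"].mean()) / df["income"].std()
df.loc[z.abs() > 1.5, "income"] = df["income"].median()

# Data transformation: scale features to zero mean and unit variance.
scaled = StandardScaler().fit_transform(df[["age", "income"]])
print(scaled)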
Feature Selection:
The third step in building a classification model is feature selection. Feature selection
involves identifying the most relevant attributes in the dataset for classification. This can be
done using various techniques, such as correlation analysis, information gain, and principal
component analysis.
Correlation Analysis: Correlation analysis involves identifying the correlation between the
features in the dataset. Features that are highly correlated with each other can be removed
as they do not provide additional information for classification.
Information Gain: Information gain is a measure of the amount of information that a feature
provides for classification. Features with high information gain are selected for classification.
Principal Component Analysis:
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of the
dataset. PCA transforms the original features into a smaller set of uncorrelated components
that retain most of the variance in the data, effectively removing redundant information.
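The sketch below illustrates these three feature selection ideas with pandas and scikit-learn; the synthetic feature matrix and labels are placeholders, and mutual information is used as a close stand-in for information gain.

# Illustrative feature selection on a synthetic dataset (placeholder data).
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = pd.DataFrame({"f1": rng.normal(size=200)})
X["f2"] = X["f1"] * 0.95 + rng.normal(scale=0.1, size=200)   # highly correlated with f1
X["f3"] = rng.normal(size=200)                               # independent noise
y = (X["f1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Correlation analysis: highly correlated feature pairs are candidates for removal.
print(X.corr())

# Information gain (approximated by mutual information) between each feature and the label.
print(mutual_info_classif(X, y, random_state=0))

# PCA: project the three features onto the two components carrying the most variance.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)        # (200, 2)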
Model Selection:
The fourth step in building a classification model is model selection. Model selection
involves selecting the appropriate classification algorithm for the problem at hand. There are
several algorithms available, such as decision trees, support vector machines, and neural
networks.
Decision Trees: Decision trees are a simple yet powerful classification algorithm. They divide
the dataset into smaller subsets based on the values of the features and construct a tree-like
model that can be used for classification.
Support Vector Machines: Support Vector Machines (SVMs) are a popular classification
algorithm used for both linear and nonlinear classification problems. SVMs are based on the
concept of maximum margin, which involves finding the hyperplane that maximizes the
distance between the two classes.
Neural Networks:
Neural Networks are a powerful classification algorithm that can learn complex patterns in
the data. They are inspired by the structure of the human brain and consist of multiple layers
of interconnected nodes.
Model Training:
The fifth step in building a classification model is model training. Model training involves
using the selected classification algorithm to learn the patterns in the data. The data is
divided into a training set and a validation set. The model is trained using the training set,
and its performance is evaluated on the validation set.
Model Evaluation:
The sixth step in building a classification model is model evaluation. Model evaluation
involves assessing the performance of the trained model on a held-out test set. This is done
to ensure that the model generalizes well to new, unseen data.
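A minimal scikit-learn sketch tying together model selection, training, and evaluation might look as follows; the Iris dataset and the two candidate models are stand-ins chosen only for illustration.

# Minimal train/evaluate sketch with scikit-learn (Iris used as placeholder data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model selection: try two of the algorithms mentioned above and compare them.
for model in (DecisionTreeClassifier(random_state=0), SVC(kernel="rbf")):
    model.fit(X_train, y_train)                       # model training
    y_pred = model.predict(X_test)                    # model evaluation on held-out data
    print(type(model).__name__, accuracy_score(y_test, y_pred))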
Classification is a widely used technique in data mining and is applied in a variety of
domains, such as email filtering, sentiment analysis, and medical diagnosis.
Classifiers can be categorized into two major types:
1. Discriminative: A discriminative classifier models the decision boundary between the
classes directly, i.e., it predicts a single class label for each row of data by relying only on
the observed data. It therefore depends heavily on the quality of the data rather than on
assumptions about the underlying class distributions.
Example: Logistic Regression
2. Generative: A generative classifier models the distribution of each individual class and
tries to learn the process that generates the data behind the scenes by estimating the
class distributions and their parameters. It can then be used to predict unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data: suppose there are 100 emails, of
which 25% are spam (Class A) and 75% are non-spam (Class B).
Now a user wants to check whether an email containing the word "cheap" should be
flagged as spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word "cheap".
In Class B (the 75 non-spam emails), only 5 out of 75 contain the word "cheap"; the
remaining 70 do not.
So, if an email contains the word "cheap", what is the probability of it being spam?
(= 80%, as worked out in the short calculation below)
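The 80% answer follows from Bayes' theorem; the short calculation below simply reproduces it from the counts given above.

# Reproducing the spam example with Bayes' theorem.
# Counts from the text: 25 spam and 75 non-spam emails; the word "cheap" appears in
# 20 of the spam emails and in 5 of the non-spam emails.
p_spam, p_not_spam = 25 / 100, 75 / 100
p_cheap_given_spam = 20 / 25
p_cheap_given_not_spam = 5 / 75

p_cheap = p_cheap_given_spam * p_spam + p_cheap_given_not_spam * p_not_spam
p_spam_given_cheap = (p_cheap_given_spam * p_spam) / p_cheap
print(p_spam_given_cheap)     # 0.8, i.e. 80%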
Classifiers Of Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
Associated Tools and Languages: Used to mine/ extract useful information from raw data.
 Main Languages used: R, SAS, Python, SQL
 Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka
 Libraries used: Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK, TensorFlow,
Seaborn, Basemap, etc.
Real-Life Examples:
 Market Basket Analysis:
It is a modeling technique based on identifying combinations of items that are
frequently bought together in transactions.
Example: Amazon and many other retailers use this technique. While viewing a
product, suggestions are shown for items that other customers have frequently
bought together with it in the past.
 Weather Forecasting:
Changing patterns in weather conditions need to be observed based on parameters
such as temperature, humidity, and wind direction, and previous records are also
required in order to predict future conditions accurately.
Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed the Support
Vector Machine. Consider the below diagram, in which there are two different categories
that are classified using a decision boundary or hyperplane:
Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it
can learn the different features of cats and dogs, and then we test it with this strange
creature. The SVM creates a decision boundary between these two classes (cat and dog) and
chooses the extreme cases (support vectors), so it will see the extreme cases of cat and dog.
On the basis of the support vectors, it will classify the creature as a cat. Consider the below
diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data, and the classifier used is called the Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features present in the dataset:
if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if
there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
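In the standard notation (not introduced elsewhere in this text, but widely used), the hyperplane can be written as w·x + b = 0; a new point is classified by the sign of w·x + b, the support vectors are the points for which |w·x + b| = 1 under the usual scaling, and the margin that SVM maximizes equals 2/||w||.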
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1 and
x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or
blue. Consider the below image:

Since this is a 2-d space, we can easily separate these two classes by just using a straight
line. But there can be multiple lines that can separate these classes. Consider the below
image:
The SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the points of both classes that lie
closest to the line. These points are called support vectors. The distance between these
vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this
margin. The hyperplane with the maximum margin is called the optimal hyperplane.
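A minimal scikit-learn sketch of a linear SVM is shown below; the two-feature blob dataset and the query point stand in for the green/blue example above and are purely illustrative.

# Linear SVM on a toy two-feature, two-class dataset (placeholder for the example above).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, n_features=2, random_state=6)

clf = SVC(kernel="linear", C=1.0)     # C controls the margin/misclassification trade-off
clf.fit(X, y)

print(clf.support_vectors_)           # the extreme points that define the hyperplane
print(clf.coef_, clf.intercept_)      # w and b of the hyperplane w·x + b = 0
print(clf.predict([[3.0, -4.0]]))     # classify a new (made-up) point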
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d space, the decision boundary looks like a plane parallel to the x-axis. If we
convert it back to 2-d space by taking the slice z = 1, the boundary becomes a circle:
hence we get a circumference of radius 1 in the case of non-linear data.
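The z = x² + y² lift described above can be written down directly. The sketch below applies it to a toy ring-shaped dataset and, for comparison, lets the RBF kernel perform a similar non-linear separation implicitly; the dataset and parameter values are illustrative assumptions.

# The explicit z = x^2 + y^2 lift versus an implicit (RBF) kernel on toy circular data.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# Explicit third dimension: points from the inner and outer ring now differ in z.
z = X[:, 0] ** 2 + X[:, 1] ** 2
X_lifted = np.column_stack([X, z])
print(SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y))   # separable after the lift

# The RBF kernel achieves a comparable non-linear separation without building z by hand.
print(SVC(kernel="rbf").fit(X, y).score(X, y))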
Clustering in Machine Learning
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a
group that has less or no similarities with another group."
It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color,
behavior, etc., and divides the data points into groups according to the presence or absence
of those patterns.
It is an unsupervised learning method, hence no supervision is provided to the algorithm,
and it deals with the unlabelled dataset.
The clustering technique is commonly used for statistical data analysis.
Example: Let's understand the clustering technique with the real-world example of a
shopping mall. When we visit a mall, we can observe that things with similar usage are
grouped together: t-shirts are grouped in one section and trousers in another, and similarly,
in the fruit and vegetable section, apples, bananas, mangoes, etc., are kept in separate
groups so that we can easily find what we are looking for. The clustering technique works in
the same way. Another example of clustering is grouping documents according to their
topic.
The clustering technique can be widely used in various tasks. Some most common uses of
this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general usages, it is used by Amazon in its recommendation system to
provide recommendations based on a user's past product searches. Netflix also uses this
technique to recommend movies and web series to its users based on their watch history.
The below diagram explains the working of the clustering algorithm. We can see the
different fruits are divided into several groups with similar properties.
Types of Clustering Methods
The clustering methods are broadly divided into Hard clustering (each data point belongs to
only one group) and Soft clustering (a data point can also belong to more than one group).
But various other approaches to clustering also exist. Below are the main clustering methods
used in Machine Learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also known as
the centroid-based method. The most common example of partitioning clustering is the K-
Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where k defines the number of
pre-defined groups. The cluster centers are chosen in such a way that each data point is
closer to its own cluster centroid than to the centroid of any other cluster.
Density-Based Clustering
The density-based clustering method connects the highly-dense areas into clusters, and the
arbitrarily shaped distributions are formed as long as the dense region can be connected.
This algorithm does it by identifying different clusters in the dataset and connects the areas
of high densities into clusters. The dense areas in data space are divided from each other by
sparser areas.
These algorithms can face difficulty in clustering the data points if the dataset has varying
densities and high dimensions.
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the
probability that each data point belongs to a particular distribution. The grouping is done by
assuming that the data comes from a mixture of distributions, most commonly the Gaussian
distribution.
The example of this type is the Expectation-Maximization Clustering algorithm that uses
Gaussian Mixture Models (GMM).
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no
requirement to pre-specify the number of clusters to be created. In this technique, the
dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. Any desired number of clusters can then be obtained by cutting the tree at
the correct level. The most common example of this method is the Agglomerative
Hierarchical algorithm.
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than
one group or cluster. Each data point has a set of membership coefficients, which indicate its
degree of membership in each cluster. The Fuzzy C-means algorithm is the example of this
type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
The clustering algorithms can be divided based on the models explained above. Many
different clustering algorithms have been published, but only a few are commonly used. The
choice of algorithm depends on the kind of data we are using: for example, some algorithms
need the number of clusters to be specified for the given dataset, whereas others work by
finding the minimum distance between the observations of the dataset.
Here we discuss the most popular clustering algorithms that are widely used in machine
learning (a short code sketch follows this list):
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of
equal variances. The number of clusters must be specified in this algorithm. It is fast
with fewer computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model, that works
on updating the candidates for centroid to be the center of the points within a given
region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications
with Noise. It is an example of a density-based model similar to the mean-shift, but
with some remarkable advantages. In this algorithm, the areas of high density are
separated by the areas of low density. Because of this, the clusters can be found in
any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative for the k-means algorithm or for those cases where k-means can fail. In
GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can
be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not
require the number of clusters to be specified. In this, data points exchange messages
between pairs of points until convergence. It has O(N²T) time complexity, which is the
main drawback of this algorithm.
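Here is the short sketch referred to above: three of the listed algorithms are run on a synthetic blob dataset with scikit-learn. The parameter values (k = 4, eps = 0.6, min_samples = 5) are illustrative choices, not defaults to copy.

# Illustrative runs of K-Means, DBSCAN and agglomerative clustering (synthetic data).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# K-Means: the number of clusters must be specified up front.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no cluster count needed; eps and min_samples control what counts as "dense".
dbscan_labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)    # label -1 marks noise

# Agglomerative (bottom-up) hierarchical clustering.
agglo_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X)

print(set(kmeans_labels), set(dbscan_labels), set(agglo_labels))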
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data
sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers
based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a
GIS database. This can be very useful for determining the purpose for which a particular
piece of land is most suitable.