DATA SCIENCE 1(7th sem)
4. High-Level Language
Python is a high-level language. When we write programs in Python, we do not need to
remember the system architecture, nor do we need to manage the memory.
9. Interpreted Language:
Python is an interpreted language: code is executed line by line. Unlike languages such as
C, C++, and Java, there is no separate compilation step, which makes it easier to debug our
code.
Data Scales
Data scales refer to the different ways we can represent data values in a dataset.
Understanding the types of scales helps in choosing the right statistical methods and
machine learning algorithms. There are four common types of data scales:
1. Nominal Scale (Categorical Data)
Description: This scale represents categories or labels with no specific order. It's used
for naming or classifying items.
Example:
o Colors of cars: Red, Blue, Green.
o Types of fruits: Apple, Orange, Banana.
o Gender: Male, Female.
Key Points:
No mathematical operations can be performed on these values.
We can only count or compare equality, not order them.
2. Ordinal Scale
Description: This scale represents categories that have a meaningful order or
ranking, but the intervals between values are not necessarily equal.
Example:
o Rating a movie: Poor, Average, Good, Excellent.
o Education level: High School, Bachelor’s, Master’s, PhD.
o Customer satisfaction: Very dissatisfied, Dissatisfied, Neutral, Satisfied, Very
satisfied.
Key Points:
We can rank the data, but we can’t measure the exact difference between them.
For example, the difference between "Poor" and "Average" might not be the same as
between "Good" and "Excellent."
3. Interval Scale
Description: This scale has ordered categories, and the intervals between values are
equal, but there is no true zero point.
Example:
o Temperature (in Celsius or Fahrenheit): The difference between 10°C and
20°C is the same as between 20°C and 30°C, but 0°C does not mean "no
temperature."
o Time of the day (on a 12-hour clock): 2 PM is 2 hours after 12 PM, but there
is no "zero" time.
Key Points:
Addition and subtraction are meaningful, but ratios (e.g., "twice as much") are not
because there is no true zero point.
4. Ratio Scale
Description: This is the most informative scale. It has all the properties of the interval
scale, but it also has a true zero point, which means that we can compare both
differences and ratios.
Example:
o Height: 0 cm means "no height," and 180 cm is twice as tall as 90 cm.
o Weight: 0 kg means "no weight," and 50 kg is half the weight of 100 kg.
o Distance: 0 km means "no distance," and 200 km is twice as long as 100 km.
Key Points:
All mathematical operations (addition, subtraction, multiplication, division) are
meaningful.
1. Similarity Measures
A similarity measure quantifies how alike two data points are. Higher values indicate greater
similarity.
Common Similarity Measures:
Euclidean Distance (as a similarity indicator):
o Example: The distance between (2, 3) and (6, 7) is calculated using the Euclidean
distance formula. The closer the points, the higher their similarity.
Cosine Similarity:
o Measures the cosine of the angle between two vectors. It’s used for
measuring text similarity and is often used in document comparison.
o Formula: cos(θ) = (A · B) / (||A|| ||B||)
Jaccard Similarity:
o Measures the overlap between two sets as the size of their intersection divided by
the size of their union.
o Formula: J(A, B) = |A ∩ B| / |A ∪ B|
o Example: For two sets A = {1, 2, 3} and B = {2, 3, 4}, the Jaccard similarity is
2/4 = 0.5.
Pearson Correlation Coefficient:
o Measures the linear correlation between two variables, indicating how closely
their values move together.
o Formula: r = cov(X, Y) / (σX · σY), i.e. the covariance of the two variables divided by
the product of their standard deviations.
o Example: Used to measure the similarity in trends between two data series. A
correlation of 1 indicates perfect similarity, while -1 indicates they move in
opposite directions.
2. Dissimilarity Measures
A dissimilarity measure quantifies how different two data points are. Higher values indicate
more dissimilarity. These measures are used in clustering algorithms like k-means, where we
need to separate points based on their dissimilarity.
Common Dissimilarity Measures:
Euclidean Distance:
o Measures the straight-line distance between two points.
o Formula: d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
o Example: In a 2D plane, the further apart two points (2,3) and (6,7) are, the
higher their dissimilarity.
Manhattan Distance (Taxicab or L1 distance):
o Measures the distance between two points by adding the absolute
differences along each dimension.
o Formula: d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|
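To make these measures concrete, here is a small illustrative sketch (using NumPy and SciPy,
with the two hypothetical points from the examples above) that computes the Euclidean
distance, Manhattan distance, and cosine similarity:
import numpy as np
from scipy.spatial import distance

# Two hypothetical 2-D points
p = np.array([2, 3])
q = np.array([6, 7])

# Euclidean distance: straight-line distance
euclidean = np.sqrt(np.sum((p - q) ** 2))   # about 5.66

# Manhattan (L1) distance: sum of absolute differences
manhattan = np.sum(np.abs(p - q))           # 8

# Cosine similarity: 1 minus SciPy's cosine *distance*
cosine_sim = 1 - distance.cosine(p, q)

print(euclidean, manhattan, cosine_sim)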
Data Visualization
ANS- Definition: Data visualization is the process of creating visual representations of data to
help people understand patterns, trends, and insights more easily.
Why We Use It:
Clarity: Makes complex data more understandable at a glance.
Patterns: Helps to quickly identify trends and outliers.
Communication: Makes it easier to share findings with others.
Common Types of Data Visualizations:
1. Bar Charts: Show quantities of different categories with bars. Great for comparing
different groups. For example, showing sales numbers for different products.
2. Line Graphs: Display data points connected by lines. Useful for showing changes over
time, like tracking monthly sales.
3. Pie Charts: Represent parts of a whole as slices of a pie. Good for showing how
different parts contribute to the total, like the market share of different companies.
4. Histograms: Display the distribution of numerical data by grouping it into bins. Useful
for understanding the frequency of data within certain ranges, like test scores.
5. Scatter Plots: Show data points on a grid to observe the relationship between two
variables. For example, plotting height against weight to see if there's a correlation.
6. Heat Maps: Use color to represent data values. Great for visualizing data density or
intensity, like showing the number of website visits by day and time.
Example:
Imagine you have data on how many books different people read in a month. Instead of just
listing the numbers, you create a bar chart. Each bar represents a person, and the height of
the bar shows how many books they read. This makes it easy to compare the reading habits
of different people at a glance.
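As an illustrative sketch (the names and counts below are made up, not from the notes), the
book-reading example could be drawn as a bar chart with Matplotlib:
import matplotlib.pyplot as plt

# Hypothetical data: books read in a month by each person
people = ["Asha", "Ben", "Chen", "Dia"]
books = [3, 7, 5, 2]

plt.bar(people, books)   # one bar per person, height = books read
plt.xlabel("Person")
plt.ylabel("Books read this month")
plt.title("Reading habits at a glance")
plt.show()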
Correlation
ANS- What Is Correlation?
Definition: Correlation is a measure of how two variables are related to each other. It tells us
whether changes in one variable are associated with changes in another.
Types of Correlation:
1. Positive Correlation: When one variable increases, the other variable also increases.
For example, height and weight often have a positive correlation.
2. Negative Correlation: When one variable increases, the other variable decreases. For
example, the amount of time spent watching TV and the amount of time spent
exercising might have a negative correlation.
3. No Correlation: No consistent relationship between the variables. For example, shoe
size and intelligence may have no correlation.
How Correlation Is Measured:
Correlation Coefficient: A numerical value between -1 and 1 that indicates the
strength and direction of the relationship.
o +1: Perfect positive correlation (they move together perfectly).
o -1: Perfect negative correlation (one increases as the other decreases
perfectly).
o 0: No correlation (no predictable relationship).
Example:
Imagine you have data on students' hours of study and their exam scores. If you find a high
positive correlation between the hours of study and the exam scores, it means that students
who study more tend to score higher on the exam. This relationship helps you understand
how study time impacts performance.
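A small sketch of computing the correlation coefficient for the study-hours example; the
numbers below are hypothetical:
import numpy as np

# Hypothetical data: hours studied and the corresponding exam scores
hours = np.array([1, 2, 3, 4, 5, 6])
scores = np.array([52, 55, 61, 70, 74, 80])

# np.corrcoef returns the correlation matrix; element [0, 1] is the coefficient
r = np.corrcoef(hours, scores)[0, 1]
print(r)   # close to +1, i.e. a strong positive correlation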
UNIT-2
Regression Analysis
Definition: Regression analysis is a statistical method used to examine the relationship
between one dependent variable (target) and one or more independent variables
(predictors). It helps in understanding how the dependent variable changes when any of the
independent variables vary.
Purpose of Regression Analysis:
1. Prediction: To predict the value of the dependent variable based on the values of the
independent variables.
2. Understanding Relationships: To assess the strength and nature of relationships
between variables.
3. Trend Analysis: To identify trends in data over time, useful in forecasting.
4. Effect Quantification: To quantify the effect of independent variables on the
dependent variable, helping in decision-making.
Types of Regression Analysis:
1. Linear Regression: Models the relationship between the dependent and
independent variable as a straight line. It can be simple (one predictor) or multiple
(more than one predictor).
A sloped straight line represents the linear regression model.
Equation:
y = θx + b, where
θ – the model weight (slope) or parameters
b – the bias (intercept).
CODE:
from sklearn.linear_model import LinearRegression
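Building on the import above, a minimal sketch of fitting and using a simple linear regression;
the X and y arrays are hypothetical sample data:
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: one feature, one target
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression()
model.fit(X, y)                       # learn θ (coef_) and b (intercept_)
print(model.coef_, model.intercept_)
y_pred = model.predict([[6]])         # predict the response for a new point
print(y_pred)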
CODE:
# scikit-learn has no PolynomialRegression class; polynomial regression is
# built from PolynomialFeatures followed by LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
# Create a degree-2 polynomial regression model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
# Fit the model to the data
model.fit(X, y)
# Predict the response for a new data point
y_pred = model.predict(X_new)
Logistic Regression: Used when the dependent variable is categorical (e.g., binary
outcomes). It estimates the probability that a given input point belongs to a certain category.
Ridge Regression: A type of linear regression that includes a penalty term to reduce
overfitting by constraining the size of the coefficients.
Lasso Regression: Similar to ridge regression, but it can reduce some coefficients to zero,
effectively performing variable selection.
Stepwise Regression: Involves automatic selection of independent variables by adding or
removing them based on their statistical significance.
Elastic Net: Combines properties of both ridge and lasso regression, balancing between
regularization and variable selection.
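As an illustrative sketch of the regularized variants listed above in scikit-learn; the alpha
values are arbitrary choices for demonstration, not recommendations:
from sklearn.linear_model import Ridge, Lasso, ElasticNet

# alpha controls the strength of the penalty term
ridge = Ridge(alpha=1.0)                      # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.1)                      # L1 penalty: can set coefficients to zero
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)    # mix of L1 and L2 penalties

# Each is fitted and used exactly like LinearRegression:
# ridge.fit(X, y); ridge.predict(X_new)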
Cross Validation:
What is Cross-Validation?
Cross-validation is a technique used in machine learning to check how well a model will
perform on new, unseen data. It helps ensure that the model doesn't just work well on the
data it was trained on, but also performs well when given different data. The idea is to divide
the data into several parts (called "folds"), train the model on some of these parts, and test
it on the remaining parts. This process is repeated multiple times, using a different part for
testing each time, and the results are averaged to get a more accurate measure of the
model's performance.
Why is Cross-Validation Important?
The main reason for using cross-validation is to prevent overfitting. Overfitting happens
when a model learns too much from the training data, including the noise or irrelevant
details, which makes it perform poorly on new data. Cross-validation helps by giving a more
realistic estimate of how well the model will generalize to unseen data.
Types of Cross-Validation
1. Holdout Validation:
o In this method, the dataset is divided into two parts: one part is used for
training the model, and the other part is used for testing it.
o Example: If you have 100 data points, you might use 50 for training and the
other 50 for testing.
o Advantage: Simple and quick.
o Disadvantage: You might miss important patterns in the data if you only use
half for training.
2. LOOCV (Leave-One-Out Cross-Validation):
o This method trains the model on all data except for one data point, which is
used for testing. This process is repeated for every single data point.
o Example: If you have 10 data points, the model will be trained on 9 points
and tested on 1. This is done 10 times, using a different data point for testing
each time.
o Advantage: It uses almost all data for training, so the model has more
information to learn from.
o Disadvantage: It takes a lot of time because you need to train the model as
many times as there are data points.
3. Stratified Cross-Validation:
o This method is especially useful when you have imbalanced data, meaning
that some categories are more common than others. Stratified cross-
validation ensures that each fold has the same distribution of classes as the
original dataset.
o Example: If 80% of your data is class A and 20% is class B, each fold in
stratified cross-validation will maintain this ratio.
o Advantage: Ensures that each fold is representative of the entire dataset.
o Disadvantage: More complex to implement than simple k-fold cross-
validation.
4. K-Fold Cross-Validation:
o In this method, the dataset is split into k equal-sized subsets (called folds).
The model is trained on k-1 folds and tested on the remaining fold. This is
repeated k times, each time using a different fold for testing.
o Example: If k=5 and you have 100 data points, the data is divided into 5
subsets (20 points each). In each iteration, 80 points are used for training, and
20 points are used for testing (a code sketch follows this list).
o Advantage: More reliable results because all the data is used for both training
and testing at some point.
o Disadvantage: Training the model multiple times can take longer.
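A minimal sketch of k-fold cross-validation with scikit-learn; the iris dataset and the decision
tree classifier are placeholders chosen for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier()

# cv=5 splits the data into 5 folds; each fold is used once for testing
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of performance on unseen data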
Advantages and Disadvantages of Cross Validation
Advantages:
1. Overcoming Overfitting: Cross validation helps to prevent overfitting by providing a more
robust estimate of the model’s performance on unseen data.
2. Model Selection: Cross validation can be used to compare different models and select the
one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize the hyperparameters of
a model, such as the regularization parameter, by selecting the values that result in the best
performance on the validation set.
4. Data Efficient: Cross validation allows the use of all the available data for both training
and validation, making it a more data-efficient method compared to traditional validation
techniques.
Disadvantages:
1. Computationally Expensive: Cross validation can be computationally expensive, especially
when the number of folds is large or when the model is complex and requires a long time to
train.
2. Time-Consuming: Cross validation can be time-consuming, especially when there are
many hyperparameters to tune or when multiple models need to be compared.
3. Bias-Variance Tradeoff: The choice of the number of folds in cross validation can impact
the bias-variance tradeoff, i.e., too few folds may result in high variance, while too many
folds may result in high bias.
Q Difference between training data and testing data
UNIT-3
Time Series Analysis and Forecasting
What is a Time Series?
A time series is a sequence of data points collected, recorded, or measured at successive,
evenly-spaced time intervals.
Each data point represents observations or measurements taken over time, such as stock
prices, temperature readings, or sales figures. Time series data is commonly represented
graphically with time on the horizontal axis and the variable of interest on the vertical axis,
allowing analysts to identify trends, patterns, and changes over time.
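For illustration only (the figures below are made up), a small pandas sketch of creating and
plotting a monthly sales time series with time on the horizontal axis:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures, indexed by evenly spaced dates
dates = pd.date_range("2024-01-01", periods=6, freq="MS")
sales = pd.Series([120, 135, 150, 148, 170, 185], index=dates)

sales.plot()                        # time on the x-axis, sales on the y-axis
plt.ylabel("Sales")
plt.title("Monthly sales (hypothetical)")
plt.show()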
Importance of Time Series Analysis
1. Predict Future Trends: Time series analysis enables the prediction of future trends,
allowing businesses to anticipate market demand, stock prices, and other key
variables, facilitating proactive decision-making.
2. Detect Patterns and Anomalies: By examining sequential data points, time series
analysis helps detect recurring patterns and anomalies, providing insights into
underlying behaviors and potential outliers.
3. Risk Mitigation: By spotting potential risks, businesses can develop strategies to
mitigate them, enhancing overall risk management.
4. Strategic Planning: Time series insights inform long-term strategic planning, guiding
decision-making across finance, healthcare, and other sectors.
5. Competitive Edge: Time series analysis enables businesses to optimize resource
allocation effectively, whether it's inventory, workforce, or financial assets. By staying
ahead of market trends, responding to changes, and making data-driven decisions,
businesses gain a competitive edge.
UNIT-4
Basic Concept of Classification (Data Mining)
Data Mining: Data mining, in general terms, means digging deep into data in its different
forms to discover patterns and to gain knowledge from those patterns. In the process of
data mining, large data sets are first sorted, then patterns are identified and relationships
are established to perform data analysis and solve problems.
Classification is a task in data mining that involves assigning a class label to each instance in a
dataset based on its features. The goal of classification is to build a model that accurately
predicts the class labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class classification.
Binary classification involves classifying instances into two classes, such as “spam” or “not
spam”, while multi-class classification involves classifying instances into more than two
classes.
The process of building a classification model typically involves the following steps:
Data Collection:
The first step in building a classification model is data collection. In this step, the data
relevant to the problem at hand is collected. The data should be representative of the
problem and should contain all the necessary attributes and labels needed for classification.
The data can be collected from various sources, such as surveys, questionnaires, websites,
and databases.
Data Preprocessing:
The second step in building a classification model is data preprocessing. The collected data
needs to be preprocessed to ensure its quality. This involves handling missing values, dealing
with outliers, and transforming the data into a format suitable for analysis. Data
preprocessing also involves converting the data into numerical form, as most classification
algorithms require numerical input.
Handling Missing Values: Missing values in the dataset can be handled by replacing them
with the mean, median, or mode of the corresponding feature or by removing the entire
record.
Dealing with Outliers: Outliers in the dataset can be detected using various statistical
techniques such as z-score analysis, boxplots, and scatterplots. Outliers can be removed
from the dataset or replaced with the mean, median, or mode of the corresponding feature.
Data Transformation: Data transformation involves scaling or normalizing the data to bring it
into a common scale. This is done to ensure that all features have the same level of
importance in the analysis.
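A brief illustrative sketch of these preprocessing steps with pandas and scikit-learn; the
column names, values, and the z-score cutoff of 2 are hypothetical choices:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset with one missing age and one extreme income value
df = pd.DataFrame({
    "age": [25, 30, None, 28, 31, 27, 29],
    "income": [40, 42, 39, 41, 38, 43, 400],
})

# Handling missing values: replace with the mean of the column
df["age"] = df["age"].fillna(df["age"].mean())

# Dealing with outliers: keep rows whose income z-score is at most 2
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[z.abs() <= 2]

# Data transformation: scale all features onto a common scale
scaled = StandardScaler().fit_transform(df)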
Feature Selection:
The third step in building a classification model is feature selection. Feature selection
involves identifying the most relevant attributes in the dataset for classification. This can be
done using various techniques, such as correlation analysis, information gain, and principal
component analysis.
Correlation Analysis: Correlation analysis involves identifying the correlation between the
features in the dataset. Features that are highly correlated with each other can be removed
as they do not provide additional information for classification.
Information Gain: Information gain is a measure of the amount of information that a feature
provides for classification. Features with high information gain are selected for classification.
Principal Component Analysis:
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of the
dataset. PCA identifies the most important features in the dataset and removes the
redundant ones.
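An illustrative sketch of two of these ideas in scikit-learn, using the iris dataset as a stand-in:
information-gain-style feature scoring via mutual information, and dimensionality reduction
with PCA:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Score features by mutual information (an information-gain-style measure)
selector = SelectKBest(mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)      # keeps the 2 highest-scoring features

# Principal Component Analysis: project onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)           # variance captured by each component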
Model Selection:
The fourth step in building a classification model is model selection. Model selection
involves selecting the appropriate classification algorithm for the problem at hand. There are
several algorithms available, such as decision trees, support vector machines, and neural
networks.
Decision Trees: Decision trees are a simple yet powerful classification algorithm. They divide
the dataset into smaller subsets based on the values of the features and construct a tree-like
model that can be used for classification.
Support Vector Machines: Support Vector Machines (SVMs) are a popular classification
algorithm used for both linear and nonlinear classification problems. SVMs are based on the
concept of maximum margin, which involves finding the hyperplane that maximizes the
distance between the two classes.
Neural Networks:
Neural Networks are a powerful classification algorithm that can learn complex patterns in
the data. They are inspired by the structure of the human brain and consist of multiple layers
of interconnected nodes.
Model Training:
The fifth step in building a classification model is model training. Model training involves
using the selected classification algorithm to learn the patterns in the data. The data is
divided into a training set and a validation set. The model is trained using the training set,
and its performance is evaluated on the validation set.
Model Evaluation:
The sixth step in building a classification model is model evaluation. Model evaluation
involves assessing the performance of the trained model on a test set. This is done to ensure
that the model generalizes well to unseen data.
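A compact sketch of the training and evaluation steps with scikit-learn; the iris dataset and
the decision tree are placeholders chosen for illustration:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Model training: split the data, then fit the chosen classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Model evaluation: assess performance on the held-out test set
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))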
Classification is a widely used technique in data mining and is applied in a variety of
domains, such as email filtering, sentiment analysis, and medical diagnosis.
1. Discriminative: It is a basic type of classifier that determines just one class for each row
of data. It models the decision boundary directly from the observed data, so it depends
heavily on the quality of the data rather than on assumed distributions.
Example: Logistic Regression
2. Generative: It models the distribution of individual classes and tries to learn the
model that generates the data behind the scenes by estimating assumptions and
distributions of the model. Used to predict the unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data. Suppose there are 100 emails, split in a
1:3 ratio, i.e. Class A: 25% (spam emails) and Class B: 75% (non-spam emails).
Now the user wants to check whether an email containing the word "cheap" should be
flagged as spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word "cheap".
In Class B (the 75 non-spam emails), only 5 out of 75 contain the word "cheap" (the
remaining 70 do not).
So, if an email contains the word "cheap", what is the probability of it being spam?
P(spam | "cheap") = 20 / (20 + 5) = 80%
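A tiny sketch of the same calculation using Bayes' rule, with the counts from the example
above hard-coded for illustration:
# Counts from the hypothetical example above
p_spam = 0.25                      # P(spam)
p_not_spam = 0.75                  # P(not spam)
p_cheap_given_spam = 20 / 25       # P("cheap" | spam)
p_cheap_given_not_spam = 5 / 75    # P("cheap" | not spam)

# Bayes' rule: P(spam | "cheap")
numerator = p_cheap_given_spam * p_spam
denominator = numerator + p_cheap_given_not_spam * p_not_spam
print(numerator / denominator)     # 0.8, i.e. 80%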
Classifiers Of Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
Associated Tools and Languages: Used to mine/ extract useful information from raw data.
Main Languages used: R, SAS, Python, SQL
Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka
Libraries used: Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK, TensorFlow,
Seaborn, Basemap, etc.
Real–Life Examples :
Market Basket Analysis:
It is a modeling technique based on analysing transactions to find combinations of items
that are frequently bought together.
Example: Amazon and many other retailers use this technique. While a customer is viewing
a product, suggestions are shown for items that other people have bought along with it in
the past.
Weather Forecasting:
Changing patterns in weather conditions need to be observed based on parameters
such as temperature, humidity, and wind direction. Accurate prediction also requires
the use of previous records.
Example: SVM can be understood with the example used in the KNN classifier. Suppose we
see a strange cat that also has some features of dogs. If we want a model that can accurately
identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
We first train the model with many images of cats and dogs so that it can learn their
different features, and then test it with this strange creature. The support vector machine
draws a decision boundary between the two classes (cat and dog) and chooses the extreme
cases (the support vectors); on the basis of these support vectors, it classifies the animal as
a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such data
is termed linearly separable data, and the classifier used is called the Linear SVM
classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which means
that if a dataset cannot be classified by using a straight line, then such data is termed
non-linear data and the classifier used is called the Non-linear SVM classifier.
Since the data lies in 2-D space, a single straight line can separate the two classes. But there
can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the points from both
classes that lie closest to the boundary. These points are called support vectors. The
distance between the support vectors and the hyperplane is called the margin, and the goal
of SVM is to maximize this margin. The hyperplane with maximum margin is called the
optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the dataset into classes in the following way. Consider the below
image:
Since we are in 3-D space, the decision boundary looks like a plane parallel to the x-axis. If
we convert it back to 2-D space with z = 1, it becomes a circle of radius 1, which separates
the non-linear data.
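An illustrative sketch of linear vs. non-linear SVM in scikit-learn, using a synthetic circular
dataset similar to the picture described above; the parameter values are arbitrary choices:
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic non-linearly separable data: one class inside a circle, one outside
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)   # straight-line boundary: fits poorly here
rbf_svm = SVC(kernel="rbf").fit(X, y)         # kernel trick: handles the circular boundary

print(linear_svm.score(X, y))   # typically low training accuracy on this data
print(rbf_svm.score(X, y))      # typically close to 1.0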
Hierarchical Clustering
Hierarchical clustering can be used as an alternative for the partitioned clustering as there is
no requirement of pre-specifying the number of clusters to be created. In this technique, the
dataset is divided into clusters to create a tree-like structure, which is also called
a dendrogram. The observations or any number of clusters can be selected by cutting the
tree at the correct level. The most common example of this method is the Agglomerative
Hierarchical algorithm.
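A brief sketch of agglomerative hierarchical clustering and a dendrogram with SciPy, on a
tiny hypothetical dataset:
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Hypothetical 2-D points
X = np.array([[1, 2], [1, 3], [8, 8], [9, 8], [5, 1]])

Z = linkage(X, method="ward")   # build the cluster hierarchy bottom-up
dendrogram(Z)                   # the tree-like structure (dendrogram)
plt.show()

# "Cut the tree" into 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)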
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than
one group or cluster. Each data point has a set of membership coefficients that indicate its
degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this
type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
Clustering Algorithms
Clustering algorithms can be divided based on the models explained above. Many clustering
algorithms have been published, but only a few are commonly used. The choice of algorithm
depends on the kind of data we are using: some algorithms require the number of clusters
to be specified in advance, whereas others work from the distances between the
observations in the dataset.
Here we are discussing mainly popular Clustering algorithms that are widely used in machine
learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of
equal variances. The number of clusters must be specified in this algorithm. It is fast
with fewer computations required, with the linear complexity of O(n).
2. Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the
smooth density of data points. It is an example of a centroid-based model, that works
on updating the candidates for centroid to be the center of the points within a given
region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications
with Noise. It is an example of a density-based model similar to the mean-shift, but
with some remarkable advantages. In this algorithm, the areas of high density are
separated by the areas of low density. Because of this, the clusters can be found in
any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative to the k-means algorithm, or for those cases where k-means fails. In GMM,
it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical algorithm
performs the bottom-up hierarchical clustering. In this, each data point is treated as a
single cluster at the outset and then successively merged. The cluster hierarchy can
be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms as it does not
require to specify the number of clusters. In this, each data point sends a message
between the pair of data points until convergence. It has O(N²T) time complexity,
which is the main drawback of this algorithm.
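A minimal sketch of two of the algorithms above (K-Means and DBSCAN) in scikit-learn, on
synthetic blob data; the parameter values are illustrative choices:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic dataset with 3 well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# K-Means: the number of clusters must be specified
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: density-based, no cluster count needed; eps is the neighbourhood radius
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print(kmeans_labels[:10], dbscan_labels[:10])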
Applications of Clustering
Below are some commonly known applications of clustering technique in Machine Learning:
o In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data
sets into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the customers
based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a
GIS database. This is useful for determining the purpose for which a particular piece of
land is most suitable.