Data Analytics and Visualization Lab
LAB MANUAL
SEMESTER - 5th
RUNGTA COLLEGE
Rungta Educational Campus,
Kohka-Kurud Road, Bhilai,
Chhattisgarh, India
Phone No. 0788-6666666
MANAGED BY: SANTOSH RUNGTA GROUP OF INSTITUTIONS
Prepared By
Mr. LAKSHMAN SAHU
(Assistant Professor)
RCET, BHILAI DEPARTMENT OF CSE – DATA SCIENCE
RUNGTA COLLEGE OF ENGINEERING & TECHNOLOGY, BHILAI
LAB MANUAL
DOs:
▪ Come prepared in the lab regarding the experiment to be performed in the lab.
▪ Take help from the Manual / Work Book for preparation of the experiment.
▪ For any abnormal working of the machine consult the Faculty In-charge/ Lab
Assistant.
▪ Shut down the machine and switch off the power supply after performing the
experiment.
DON’Ts:
▪ Do not tamper with the instruments in the Lab and do not disturb their settings.
LIST OF EXPERIMENTS
AS PER THE SYLLABUS PRESCRIBED BY THE UNIVERSITY
LIST OF EXPERIMENTS
AS PER RUNGTA COLLEGE OF ENGINEERING & TECHNOLOGY
LIST OF EXPERIMENTS
Exp. No.    Name of Experiment    Page No.
3    Write a program for Creating line charts, bar plots, scatter plots, and histograms; Plotting multiple graphs in a single figure.    21
4    Write a program for Hypothesis testing using t-tests, ANOVA, and chi-square tests.    28
5    Write a program for Regression Analysis, fitting a linear model and making predictions.    35
7    Write a program for Model evaluation using accuracy, precision, recall, and F1-score.    46
Experiment No. 1
Aim: Write a program in Python for Cleaning and handling missing values
in a dataset and data normalization.
Theory: Data cleaning is one of the important parts of machine learning. It plays a significant part in
building a model. It surely isn’t the fanciest part of machine learning and at the same time, there aren’t any
hidden tricks or secrets to uncover. However, the success or failure of a project relies on proper data
cleaning. Professional data scientists usually invest a very large portion of their time in this step because of
the belief that “Better data beats fancier algorithms”.
If we have a well-cleaned dataset, there are chances that we can achieve good results with simple
algorithms also, which can prove very beneficial at times, especially in terms of computation when the
dataset size is large. Obviously, different types of data will require different types of cleaning. However, this
systematic approach can always serve as a good starting point.
Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying and
removing any missing, duplicate, or irrelevant data. The goal of data cleaning is to ensure that the data is
accurate, consistent, and free of errors, as incorrect or inconsistent data can negatively impact the
performance of the ML model.
Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the data science
pipeline that involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in the
data to improve its quality and usability. Data cleaning is essential because raw data is often noisy,
incomplete, and inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.
The following are the most common steps involved in data cleaning:
import pandas as pd
import numpy as np
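The dataset-loading step is not shown in the manual. Judging from the df.info() output below, the data appears to be the Titanic passenger dataset; a minimal loading sketch, assuming a local CSV file with the hypothetical name titanic.csv, would be:
df = pd.read_csv('titanic.csv')   # hypothetical file name
df.head()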
Output:
1. Inspection and exploration of data:
This step involves understanding the data by inspecting its structure and identifying missing values, outliers,
and inconsistencies.
df.duplicated()
Output:
0 False
1 False
...
889 False
890 False
Length: 891, dtype: bool
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the above data info, we can see that the Age and Cabin columns have fewer non-null counts than the others, i.e. they contain missing values. Some of the columns are categorical (data type object), while others hold integer and float values.
df.describe()
Output:
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)
Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
# Number of unique values in each categorical column
df[cat_col].nunique()
Output:
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
2. Removal of unwanted observations:
This includes deleting duplicate/redundant or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don't actually fit the specific problem that you're trying to solve.
• Redundant observations reduce efficiency to a great extent, as the repeated data may add weight towards the correct or the incorrect side, thereby producing unreliable results.
• Irrelevant observations are any type of data that is of no use to us and can be removed directly.
Now we have to decide, based on the subject of analysis, which factors are important for our discussion. As we know, our machine learning models don't understand text data, so we must either drop the categorical columns or convert their values into numerical types. Here we drop the Name column, because the Name is always unique and does not have a great influence on the target variable. For the Ticket column, let's first print the first 50 unique tickets.
df['Ticket'].unique()[:50]
Output:
From the above tickets, we can observe that each ticket is made of two parts; for example, the first value 'A/5 21171' is a combination of 'A/5' and '21171', and this may influence the target variable. Splitting it would be a case of feature engineering, where we derive new features from a column or a group of columns. In the current case, however, we simply drop the "Name" and "Ticket" columns.
df1 = df.drop(columns=['Name','Ticket'])
df1.shape
Output:
(891, 10)
3. Handling missing data:
Missing data is a common issue in real-world datasets, and it can occur due to various reasons such as human errors, system failures, or data collection issues. Various techniques can be used to handle missing data, such as imputation, deletion, or substitution.
Let's check the percentage of missing values column-wise. df.isnull() checks whether each value is null and returns boolean values, .sum() adds up the number of null rows per column, and dividing by the total number of rows in the dataset and multiplying by 100 gives the percentage of null values per column.
round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
We cannot just ignore or remove the missing observations. They must be handled carefully as they can be an indication of something important.
The two most common ways to deal with missing data are:
1. Deleting the observations or the entire column.
2. Imputing the missing values (for example with the mean, median, or mode).
As we can see from the above result, Cabin has 77% null values, Age has 19.87%, and Embarked has 0.22% null values. It is not a good idea to impute 77% null values, so we will drop the Cabin column. The Embarked column has only 0.22% null values, so we drop the rows where Embarked is null.
df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape
Output:
(889, 9)
From the above describe table, we can see that there is very little difference between the mean and the median of Age, i.e. 29.6 and 28. So, here we can use either mean imputation or median imputation.
Note:
# Mean imputation
df3 = df2.fillna(df2.Age.mean())
# Let's check the null values again
df3.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
4. Handling outliers:
Outliers are extreme values that deviate significantly from the majority of the data. They can negatively
impact the analysis and model performance. Techniques such as clustering, interpolation, or transformation
can be used to handle outliers.
To check the outliers, We generally use a box plot. A box plot, also referred to as a box-and-whisker plot, is
a graphical representation of a dataset’s distribution. It shows a variable’s median, quartiles, and potential
outliers. The line inside the box denotes the median, while the box itself denotes the interquartile range
(IQR). The whiskers extend to the most extreme non-outlier values within 1.5 times the IQR. Individual
points beyond the whiskers are considered potential outliers. A box plot offers an easy-to-understand
overview of the range of the data and makes it possible to identify outliers or skewness in the distribution.
import matplotlib.pyplot as plt

plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Output:
As we can see from the above box-and-whisker plot, the Age data has outlier values: the values less than 5 and more than 55 are potential outliers.
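The outlier-handling code itself is not included in the manual; the following is a minimal sketch that drops Age outliers using the IQR rule consistent with the whisker definition above (the 1.5 × IQR bounds and the name df4 are assumptions, not the original code):
# compute the IQR bounds for Age
Q1 = df3['Age'].quantile(0.25)
Q3 = df3['Age'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# keep only the rows whose Age lies within the whiskers
df4 = df3[(df3['Age'] >= lower) & (df3['Age'] <= upper)]
print(df4.shape)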
5. Data transformation
Data transformation involves converting the data from one form to another to make it more suitable for
analysis. Techniques such as normalization, scaling, or encoding can be used to transform the data.
• Data validation and verification: Data validation and verification involve ensuring that the data is accurate and
consistent by comparing it with external sources or expert knowledge.
For the machine learning prediction, we first separate the independent and target features. Here we consider only 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', and 'Embarked' as the independent features and Survived as the target variable, because PassengerId will not affect the survival rate.
X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']
• Data formatting: Data formatting involves converting the data into a standard format or structure that can be easily
processed by the algorithms or models used for analysis. Here we will discuss commonly used data formatting techniques
i.e. Scaling and Normalization.
Scaling:
• Scaling involves transforming the values of features to a specific range. It maintains the shape of the original distribution
while changing the scale.
• Scaling is particularly useful when features have different scales, and certain algorithms are sensitive to the magnitude of
the features.
• Common scaling methods include Min-Max scaling and Standardization (Z-score scaling).
Min-Max Scaling:
• Min-Max scaling rescales the values to a specified range, typically between 0 and 1.
• It preserves the original distribution and ensures that the minimum value maps to 0 and the maximum value maps to 1.
# Numerical columns
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
x1 = X.copy()

# Min-Max scaling: learning the statistical parameters from the data and transforming it
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
Output:
Standardization (Z-score Scaling):
• Standardization rescales the values so that each feature has a mean of 0 and a standard deviation of 1:
Z = (X - μ) / σ
Where,
• X = Data
• μ = Mean value of X
• σ = Standard deviation of X
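A standardization sketch analogous to the Min-Max step above can be written with sklearn's StandardScaler (the name x2 is a hypothetical copy introduced here for illustration):
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
x2 = X.copy()
# standardize only the numerical columns
x2[num_col_] = std_scaler.fit_transform(x2[num_col_])
x2.head()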
Some commonly used data cleansing tools are:
• OpenRefine
• Trifacta Wrangler
• TIBCO Clarity
• Cloudingo
• IBM Infosphere Quality Stage
Advantages of data cleaning in machine learning:
1. Improved model performance: Data cleaning helps improve the performance of the ML model by removing errors,
inconsistencies, and irrelevant data, which can help the model to better learn from the data.
2. Increased accuracy: Data cleaning helps ensure that the data is accurate, consistent, and free of errors, which can help
improve the accuracy of the ML model.
3. Better representation of the data: Data cleaning allows the data to be transformed into a format that better represents the
underlying relationships and patterns in the data, making it easier for the ML model to learn from the data.
4. Improved data quality: Data cleaning helps to improve the quality of the data, making it more reliable and accurate. This
ensures that the machine learning models are trained on high-quality data, which can lead to better predictions and
outcomes.
5. Improved data security: Data cleaning can help to identify and remove sensitive or confidential information that could
compromise data security. By eliminating this information, data cleaning can help to ensure that only the necessary and
relevant data is used for machine learning.
Disadvantages of data cleaning in machine learning:
1. Time-consuming: Data cleaning can be a time-consuming task, especially for large and complex datasets.
2. Error-prone: Data cleaning can be error-prone, as it involves transforming and cleaning the data, which can result in the
loss of important information or the introduction of new errors.
3. Limited understanding of the data: Data cleaning can lead to a limited understanding of the data, as the transformed data
may not be representative of the underlying relationships and patterns in the data.
4. Data loss: Data cleaning can result in the loss of important information that may be valuable for machine learning
analysis. In some cases, data cleaning may result in the removal of data that appears to be irrelevant or inconsistent, but
which may contain valuable insights or patterns.
5. Cost and resource-intensive: Data cleaning can be a resource-intensive process that requires significant time, effort, and
expertise. It can also require the use of specialized software tools, which can add to the cost and complexity of data
cleaning.
6. Overfitting: Overfitting occurs when a machine learning model is trained too closely on a particular dataset, resulting in
poor performance when applied to new or different data. Data cleaning can inadvertently contribute to overfitting by
removing too much data, leading to a loss of information that could be important for model training and performance.
# Load the dataset (replace 'your_dataset.csv' with the actual file path)
df = pd.read_csv('your_dataset.csv')
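Only the loading line of the final program survives above. The following is a minimal end-to-end sketch for this experiment; the file name 'your_dataset.csv' remains a placeholder, and the column-handling rules mirror the steps discussed earlier rather than reproducing the original program:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Load the dataset (replace 'your_dataset.csv' with the actual file path)
df = pd.read_csv('your_dataset.csv')

# 1. Remove duplicate rows
df = df.drop_duplicates()

# 2. Handle missing values: drop columns that are mostly empty, then impute
#    numeric columns with the mean and categorical columns with the mode
missing_pct = df.isnull().sum() / df.shape[0] * 100
df = df.drop(columns=missing_pct[missing_pct > 70].index)
for col in df.columns:
    if df[col].dtype != 'object':
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])

# 3. Normalize the numeric columns to the range [0, 1] with Min-Max scaling
num_cols = [col for col in df.columns if df[col].dtype != 'object']
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

print(df.head())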
Experiment No. 2
Aim: Write a program to compute descriptive statistics (mean, median, mode, standard deviation, variance, interquartile range, and skewness) for a dataset.
Theory: Descriptive Statistics is the building block of data science. Advanced analytics is often
incomplete without analyzing descriptive statistics of the key metrics. In simple terms, descriptive statistics
can be defined as the measures that summarize a given data, and these measures can be broken down further
into the measures of central tendency and the measures of dispersion.
Measures of central tendency include mean, median, and the mode, while the measures of variability include
standard deviation, variance, and the interquartile range. In this guide, you will learn how to compute these
measures of descriptive statistics and use them to interpret the data.
1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness
Data
In this guide, we will be using fictitious data of loan applicants containing 600 observations and 10
variables, as described below:
import pandas as pd
import numpy as np
import statistics as st
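The loading code is not shown in the manual; a minimal sketch, assuming a CSV file with the hypothetical name loan_data.csv, that would produce the shape and info output below is:
df = pd.read_csv('loan_data.csv')   # hypothetical file name
print(df.shape)
print(df.info())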
Output:
(600, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status 600 non-null object
Dependents 600 non-null int64
Is_graduate 600 non-null object
Income 600 non-null int64
Loan_amount 600 non-null int64
Term_months 600 non-null int64
Credit_score 600 non-null object
approval_status 600 non-null object
Age 600 non-null int64
Sex 600 non-null object
dtypes: int64(5), object(5)
memory usage: 47.0+ KB
None
Five of the variables are categorical (labelled as 'object') while the remaining five are numerical (labelled as
'int').
Measures of central tendency describe the center of the data, and are often represented by the mean, the
median, and the mode.
Mean
Mean represents the arithmetic average of the data. The line of code below prints the mean of the numerical
variables in the data. From the output, we can infer that the average age of the applicant is 49 years, the
average annual income is USD 705,541, and the average tenure of loans is 183 months. The command
df.mean(axis = 0) will also give the same output.
df.mean()
Output:
Dependents 0.748333
Income 705541.333333
Loan_amount 323793.666667
Term_months 183.350000
Age 49.450000
It is also possible to calculate the mean of a particular variable in a data, as shown below, where we
calculate the mean of the variables 'Age' and 'Income'.
print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())
Output:
49.45
705541.33
In the previous sections, we computed the column-wise mean. It is also possible to calculate the mean of the
rows by specifying the (axis = 1) argument. The code below calculates the mean of the first five rows.
df.mean(axis = 1)[0:5]
Output:
0 70096.0
1 161274.0
2 125113.4
3 119853.8
4 120653.8
dtype: float64
Median
In simple terms, median represents the 50th percentile, or the middle value of the data, that separates the
distribution into two halves. The line of code below prints the median of the numerical variables in the data.
The command df.median(axis = 0) will also give the same output.
df.median()
Output:
Dependents 0.0
Income 508350.0
Loan_amount 76000.0
Term_months 192.0
Age 51.0
dtype: float64
From the output, we can infer that the median age of the applicants is 51 years, the median annual income is
USD 508,350, and the median tenure of loans is 192 months. There is a difference between the mean and the
median values of these variables, which is because of the distribution of the data. We will learn more about
this in the subsequent sections.
It is also possible to calculate the median of a particular variable, as shown below for 'Age' and 'Income'.
print(df.loc[:,'Age'].median())
print(df.loc[:,'Income'].median())
Output:
51.0
508350.0
The row-wise median of the first five rows can be computed by specifying the (axis = 1) argument.
df.median(axis = 1)[0:5]
Output:
0    102.0
1    192.0
2    192.0
3    192.0
4    192.0
dtype: float64
Mode
Mode represents the most frequent value of a variable in the data. This is the only central tendency measure
that can be used with categorical variables, unlike the mean and the median which can be used only with
quantitative data.
The line of code below prints the mode of all the variables in the data. The .mode() function returns the most
common value or most repeated value of a variable. The command df.mode(axis = 0) will also give the
same output.
df.mode()
Output:
The interpretation of the mode is simple. The output above shows that most of the applicants are married, as depicted by the 'Marital_status' value of "Yes". A similar interpretation can be done for the other categorical variables like 'Sex' and 'Credit_score'. For numerical variables, the mode value represents the value that occurs most frequently. For example, the mode value of 55 for the variable 'Age' means that the highest number (or frequency) of applicants are 55 years old.
import numpy as np
from scipy import stats
# Sample data
data = [22, 25, 28, 31, 35, 25, 30, 27, 29, 32]
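The computation and printing code for this sample is not reproduced in the manual; a minimal sketch covering the measures listed at the start of this experiment could look like the following:
import statistics as st

# central tendency
print("Mean               :", np.mean(data))
print("Median             :", np.median(data))
print("Mode               :", st.mode(data))

# dispersion
print("Standard deviation :", np.std(data, ddof=1))
print("Variance           :", np.var(data, ddof=1))
q1, q3 = np.percentile(data, [25, 75])
print("Interquartile range:", q3 - q1)

# shape of the distribution
print("Skewness           :", stats.skew(data))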
Output:
Experiment No. 3
Aim: Write a program for Creating line charts, bar plots, scatter plots, and
histograms, Plotting multiple graphs in a single figure.
Theory: Matplotlib is a data visualization library in Python. The pyplot, a sublibrary of matplotlib, is a
collection of functions that helps in creating a variety of charts. Line charts are used to represent the
relation between two data X and Y on a different axis. Here we will see some of the examples of a line chart
in Python:
Line plots
First import Matplotlib.pyplot library for plotting functions. Also, import the Numpy library as per
requirement. Then define data values x and y.
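The plotting code for this basic example is not included in the manual; a minimal sketch is:
import matplotlib.pyplot as plt
import numpy as np

# define data values
x = np.array([1, 2, 3, 4])   # X-axis points
y = x * 2                    # Y-axis points

plt.plot(x, y)   # plot the line chart
plt.show()       # display the figure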
Output:
We can see in the above output image that there are no labels on the x-axis and the y-axis. Since labeling is necessary for understanding the chart dimensions, in the following example we will see how to add labels and a title to the chart.
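The labelled example itself is missing from the manual; a minimal sketch that adds axis labels, a title, and a legend (and draws two lines in the same figure) could be:
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.5)

plt.plot(x, x ** 2, label='y = x^2')
plt.plot(x, x ** 3, label='y = x^3')

plt.xlabel("X-axis data")
plt.ylabel("Y-axis data")
plt.title("Line chart with labels and a title")
plt.legend()
plt.show()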
Bar Plots
A bar plot or bar chart is a graph that represents the category of data with rectangular bars whose lengths and heights are proportional to the values they represent. The bar plots can be plotted horizontally or vertically. A bar chart describes the comparisons between the discrete categories. One of the axes of the plot represents the specific categories being compared, while the other axis represents the measured values.
The matplotlib API in Python provides the bar() function, which can be used in a MATLAB-style call or through the object-oriented API. The syntax of the bar() function is as follows:
plt.bar(x, height, width, bottom, align)
The function creates a bar plot bounded with a rectangle depending on the given parameters. Following is a simple example of a bar plot, which represents the number of students enrolled in different courses of an institute.
import numpy as np
import matplotlib.pyplot as plt

# example data (illustrative values)
courses = ['C', 'C++', 'Java', 'Python', 'DSA']
values = [20, 15, 30, 35, 25]

plt.bar(courses, values, color='maroon')
plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("Students enrolled in different courses")
plt.show()
Output-
Here plt.bar(courses, values, color='maroon') is used to specify that the bar chart is to be plotted using the courses list on the X-axis and the values on the Y-axis. The color attribute is used to set the color of the bars (maroon in this case).
Scatter plots
Scatter plots are used to observe relationship between variables and uses dots to represent the relationship
between them. The scatter() method in the matplotlib library is used to draw a scatter plot. Scatter plots are
widely used to represent relation among variables and how change in one affects the other.
Syntax
The syntax for the scatter() method is given below:
matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None, marker=None, cmap=None, vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None)
Except for x_axis_data and y_axis_data, all the other parameters are optional and their default value is None. Below are the scatter plot examples with various parameters.
Example 1: This is the most basic example of a scatter plot.
import matplotlib.pyplot as plt
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]  # illustrative y values
plt.scatter(x, y, c="blue")
plt.show()
Output
Histogram
A histogram is basically used to represent data provided in the form of some groups. It is an accurate method for the graphical representation of numerical data distribution. It is a type of bar plot where the X-axis represents the bin ranges while the Y-axis gives information about frequency.
Creating a Histogram
To create a histogram, the first step is to create bins of the ranges, then distribute the whole range of the values into a series of intervals, and count the values which fall into each of the intervals. Bins are clearly identified as consecutive, non-overlapping intervals of variables. The matplotlib.pyplot.hist() function is used to compute and create a histogram of x.
Attribute : description
x : array or sequence of arrays
bins : optional parameter; an integer, a sequence, or a string
density : optional parameter; contains boolean values
range : optional parameter; represents the upper and lower range of the bins
histtype : optional parameter; sets the type of histogram [bar, barstacked, step, stepfilled]; default is "bar"
align : optional parameter; controls the plotting of the histogram [left, right, mid]
Let's create a basic histogram of some values. The below code creates a simple histogram:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt

# Creating dataset
a = np.array([22, 87, 5, 43, 56,
              73, 55, 54, 11,
              20, 51, 5, 79, 31,
              27])

# Creating histogram
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(a, bins=[0, 25, 50, 75, 100])

# Show plot
plt.show()
Output :
The subplots() function is a wrapper function which allows the programmer to plot more than one graph in a single figure by just calling it once.
Parameters:
1. nrows, ncols: These give the number of rows and columns respectively. Both parameters are optional, and their default value is 1.
2. sharex, sharey: These parameters specify which axis properties are shared between the x and y axes. Possible values for them are 'row', 'col', 'none', or the default value, which is False.
3. squeeze: This boolean parameter specifies whether to squeeze out, meaning remove, the extra dimensions from the returned array of axes. Its default value is True.
4. subplot_kw: This parameter allows us to pass keywords to each subplot, and its default value is None.
5. gridspec_kw: This parameter passes keyword arguments to the GridSpec constructor that creates the grid the subplots are placed on; its default value is None.
6. **fig_kw: This allows us to pass any other additional keyword arguments to the underlying figure() call.
Example :
# importing libraries
import matplotlib.pyplot as plt
import numpy as np
import math
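# NOTE: the body of this example was missing from the manual; the following is a
# minimal sketch (assumed, not the original code) that plots multiple graphs in
# one figure using plt.subplots().
x = np.arange(0.1, 10, 0.1)

# a 2 x 2 grid of subplots sharing the x-axis
fig, axs = plt.subplots(2, 2, sharex=True)

axs[0, 0].plot(x, np.sin(x))
axs[0, 0].set_title('sin(x)')

axs[0, 1].plot(x, np.cos(x))
axs[0, 1].set_title('cos(x)')

axs[1, 0].plot(x, np.tan(x))
axs[1, 0].set_title('tan(x)')

axs[1, 1].plot(x, [math.log(v) for v in x])
axs[1, 1].set_title('log(x)')

fig.tight_layout()
plt.show()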
Output
In Matplotlib, there is another function very similar to subplots, called subplot2grid(). It is almost the same as the subplots function but provides more flexibility to arrange the plot objects according to the needs of the programmer.
Parameters:
1. shape
This parameter is a sequence of two integer values which gives the shape of the grid in which the axes are placed. The first entry is the number of rows, the second the number of columns.
2. loc
Like the shape parameter, loc is a sequence of 2 integer values, where the first entry is the row and the second is the column at which to place the axis within the grid.
3. rowspan
This parameter takes an integer value indicating the number of rows the axis spans, i.e. how far it extends downwards.
4. colspan
This parameter takes an integer value indicating the number of columns the axis spans, i.e. how far it extends towards the right.
5. fig
This is an optional parameter that takes the Figure to place the axis in. It defaults to the current figure.
6. **kwargs
This allows us to pass any other additional keyword argument to the function call.
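The manual does not include an example for subplot2grid(); a minimal sketch using the parameters described above (grid shape, loc, rowspan and colspan) could look like this:
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.1)

# a 3 x 3 grid: one wide plot on top, one tall plot on the left, one larger plot
ax1 = plt.subplot2grid(shape=(3, 3), loc=(0, 0), colspan=3)
ax2 = plt.subplot2grid(shape=(3, 3), loc=(1, 0), rowspan=2)
ax3 = plt.subplot2grid(shape=(3, 3), loc=(1, 1), colspan=2, rowspan=2)

ax1.plot(x, np.sin(x))
ax2.plot(x, np.cos(x))
ax3.plot(x, np.sin(x) * np.cos(x))

plt.tight_layout()
plt.show()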
Experiment No. 4
Aim: Write a program for Hypothesis testing using t-tests, ANOVA, and
chi-square tests.
Theory: Statistics is an important part of data science, where we use statistical methods to draw assertions about a population from sample data. To make assumptions about the population, we frame hypotheses about its parameters. A hypothesis is a statement about a given problem.
Hypothesis testing is a statistical method that is used in making a statistical decision using experimental
data. Hypothesis testing is basically an assumption that we make about a population parameter. It evaluates
two mutually exclusive statements about a population to determine which statement is best supported by the
sample data.
Example: You say that the average age of students in the class is 30, or that a boy is taller than a girl. All of these are assumptions that we are making, and we need some statistical way to prove them; we need a mathematical conclusion that whatever we are assuming is true.
Hypothesis testing is an important procedure in statistics. Hypothesis testing evaluates two mutually
exclusive population statements to determine which statement is most supported by sample data. When we
say that the findings are statistically significant, it is thanks to hypothesis testing.
• Null hypothesis(H0): In statistics, the null hypothesis is a general given statement or default position that there is no
relationship between two measured cases or no relationship among groups. In other words, it is a basic assumption or
made based on the problem knowledge.
• Alternative hypothesis(H1): The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to
the null hypothesis.
• Level of significance: It refers to the degree of significance with which we accept or reject the null hypothesis. Since 100% accuracy is not possible when accepting a hypothesis, we select a level of significance that is usually 5%. It is normally denoted with α (alpha) and generally it is 0.05 or 5%, which means your output should be 95% likely to give a similar kind of result in each sample.
• P-value: The P-value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a given problem is true. If your P-value is less than the chosen significance level, then you reject the null hypothesis, i.e. accept that your sample supports the alternative hypothesis.
• Step 1 – We first identify the problem about which we want to make an assumption, keeping in mind that the null and alternative assumptions should be contradictory to one another.
• Step 2 – We consider the statistical assumptions, such as whether the data is normally distributed and whether the observations are statistically independent.
• Step 3 – We decide the test statistic and the test data on which we will check our hypothesis.
Example: Given a coin, it is not known whether it is fair or tricky, so let's decide the null and alternate hypotheses: H0 – the coin is fair, H1 – the coin is tricky.
• Toss the coin once and assume the result is heads: under H0, the p-value is 1/2.
• Toss the coin a 2nd time and assume the result is again heads: now the p-value = 1/2 × 1/2 = 1/4.
Similarly, if we toss 6 consecutive times and get all heads, the p-value = (1/2)^6 = 1/64 ≈ 0.016. If we set our significance level at 0.05 as the error rate we allow, we see we are beyond that level, i.e. our null hypothesis does not hold good, so we reject it and propose that this coin is a tricky coin, which is reasonable because it gave us 6 consecutive heads.
To validate our hypothesis about a population parameter we use statistical functions. We use the z-score, the p-value, and the level of significance (alpha) to gather evidence for our hypothesis. The z-score of a sample mean is:
z = (x̄ - μ) / (σ / √n)
where x̄ is the sample mean, μ is the hypothesised population mean, σ is the standard deviation, and n is the sample size.
We will use the scipy python library to compute the p-value and z-score for our sample dataset. Scipy is a
mathematical library in Python that is mostly used for mathematical equations and computations. In this
code, we will create a function hypothesis_test in which we will pass arguments like pop_mean(population
parameter upon which we are checking our hypothesis), sample dataset, level of confidence(alpha value),
and type of testing (whether it’s a one-tailed test or two-tailed test).
import numpy as np
To evaluate our hypothesis test function we will create a sample dataset of 20 points having 4.5 as the mean
and 2 as the standard deviation. Here, We will consider that our population has a mean equals to 5 .
np.random.seed(0)
sample = np.random.normal(loc=4.5, scale=2, size=20)
pop_mean = 5.0
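The hypothesis_test function described above is not reproduced in the manual; the following is a minimal sketch of what such a function could look like, assuming a one-sample z-test computed with scipy.stats.norm (the argument names pop_mean, alpha, and tails follow the description above and are not the original code):
from scipy import stats

def hypothesis_test(sample, pop_mean, alpha=0.05, tails=2):
    # sample statistics
    n = len(sample)
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)

    # z-score of the sample mean against the hypothesised population mean
    z_score = (sample_mean - pop_mean) / (sample_std / np.sqrt(n))

    # p-value from the standard normal distribution (one- or two-tailed)
    if tails == 2:
        p_value = 2 * (1 - stats.norm.cdf(abs(z_score)))
    else:
        p_value = 1 - stats.norm.cdf(abs(z_score))

    # decision against the chosen significance level
    if p_value < alpha:
        decision = "Reject the null hypothesis"
    else:
        decision = "Fail to reject the null hypothesis"
    return z_score, p_value, decision

# alpha of 0.5 and a two-tailed test, as used in the discussion below
print(hypothesis_test(sample, pop_mean, alpha=0.5, tails=2))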
Output :
In the above example, we get a p-value of 0.101 from the dataset, which is less than our chosen significance level (alpha value) of 0.5; hence, in this case, we reject our null hypothesis that the population mean is 5.0.
What if we get a p-value greater than our significance level but we still reject our null hypothesis? In that case we will be making an error. Based on the error we make, errors are defined in two types.
• Type I error: When we reject the null hypothesis, although that hypothesis was true. Type I error is denoted by alpha.
• Type II errors: When we accept the null hypothesis but it is false. Type II errors are denoted by beta.
• Above we have given an overview of hypothesis testing: what it is and the errors related to it.
• Next, we discuss the different techniques for hypothesis testing, mainly from a theoretical standpoint, and when to use which one.
• What is P-value?
• The job of the p-value is to decide whether we should accept our Null Hypothesis or reject it. The
lower the p-value, the more surprising the evidence is, the more ridiculous our null hypothesis
looks. And when we feel ridiculous about our null hypothesis we simply reject it and accept our
Alternate Hypothesis.
• If we found the p-value is lower than the predetermined significance value(often called alpha or
threshold value) then we reject the null hypothesis. The alpha should always be set before an
experiment to avoid bias.
• For example, we generally consider a large population data to be in Normal Distribution so while
selecting alpha for that distribution we select it as 0.05 (it means we are accepting if it lies in the 95
percent of our distribution). This means that if our p-value is less than 0.05 we will reject the null
hypothesis.
• The significance of the p-value comes in after performing a statistical test, so knowing when to use which technique is important. The following lists when to perform which statistical technique for hypothesis testing.
• Chi-Square Test
• The Chi-Square test is used when we perform hypothesis testing on two categorical variables from a single population, i.e. to compare categorical variables from a single population. With it, we find whether there is any significant association between the two categorical variables.
• The hypothesis being tested for chi-square is:
• Null: Variable A and Variable B are independent.
• Alternate: Variable A and Variable B are not independent.
• T-Test
• The T-test is an inferential statistic that is used to determine the difference or to compare the means
of two groups of samples which may be related to certain features. It is performed on continuous
variables.
• There are three different versions of t-tests:
• → One sample t-test, which tells whether the means of the sample and the population are different.
• → Two sample t-test, also known as the Independent t-test; it compares the means of two independent groups and determines whether there is statistical evidence that the associated population means are significantly different.
• → Paired t-test, used when you want to compare means of different samples from the same group, or means from the same group at different times.
1. Independent t-test: It performs an independent t-test between group1 and group2 and prints the
calculated t-statistic and p-value.
2. One-way ANOVA: It performs a one-way ANOVA on group1, group2, and group3 and displays
the calculated F-statistic and p-value.
3. Chi-square Test: It performs a chi-square test on the provided contingency table (observed) and
prints the chi-square statistic, p-value, degrees of freedom, and the expected frequencies table.
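The code for these three tests is not included in the manual; a minimal sketch using scipy.stats, with illustrative sample groups and an illustrative contingency table (not the original data), could be:
import numpy as np
from scipy import stats

# Illustrative sample data for three groups
group1 = [23, 25, 29, 34, 30, 27, 26, 31]
group2 = [31, 33, 35, 28, 36, 30, 32, 34]
group3 = [40, 38, 41, 39, 44, 37, 42, 43]

# 1. Independent (two-sample) t-test between group1 and group2
t_stat, t_p = stats.ttest_ind(group1, group2)
print("t-statistic:", t_stat, "p-value:", t_p)

# 2. One-way ANOVA across group1, group2 and group3
f_stat, f_p = stats.f_oneway(group1, group2, group3)
print("F-statistic:", f_stat, "p-value:", f_p)

# 3. Chi-square test of independence on an illustrative contingency table
observed = np.array([[20, 15], [30, 35]])
chi2, chi_p, dof, expected = stats.chi2_contingency(observed)
print("chi-square statistic:", chi2, "p-value:", chi_p)
print("degrees of freedom:", dof)
print("expected frequencies:\n", expected)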
Output:
Experiment No. 5
Aim: Write a program for Regression Analysis, fitting a linear model and
making predictions.
Theory: Simple linear regression is an approach for predicting a response using a single feature. It is one of the
most basic machine learning models that a machine learning enthusiast gets to know about. In linear
regression, we assume that the two variables i.e. dependent and independent variables are linearly related.
Hence, we try to find a linear function that predicts the response value(y) as accurately as possible as a
function of the feature or independent variable(x). Let us consider a dataset where we have a value of
response y for every feature x:
for n observations (in the above example, n=10). A scatter plot of the above dataset looks like this:-
Now, the task is to find a line that best fits the above scatter plot so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset). This line is called a regression line. The equation of the regression line is represented as:
h(x_i) = b_0 + b_1 * x_i
Here, h(x_i) is the predicted response for the i-th observation, and b_0 and b_1 are the regression coefficients (the intercept and the slope of the regression line respectively).
To create our model, we must "learn" or estimate the values of the regression coefficients b_0 and b_1. And once we've estimated these coefficients, we can use the model to predict responses!
Here we use the principle of Least Squares. Now consider:
y_i = b_0 + b_1 * x_i + e_i
Here, e_i is the residual error in the i-th observation. So, our aim is to minimize the total residual error. We define the squared error or cost function J as:
J(b_0, b_1) = (1 / 2n) * Σ e_i^2
And our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum. Without going into the mathematical details, the least-squares solution is:
b_1 = SS_xy / SS_xx
b_0 = ȳ - b_1 * x̄
where SS_xy = Σ (x_i - x̄)(y_i - ȳ), SS_xx = Σ (x_i - x̄)^2, and x̄, ȳ are the means of x and y.
We can use the Python language to learn the coefficient of linear regression models. For plotting the input
data and best-fitted line we will use the matplotlib library. It is one of the most used Python libraries for
plotting graphs.
import numpy as np
import matplotlib.pyplot as plt
# putting labels
plt.xlabel('x')
plt.ylabel('y')
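The helper functions estimate_coef() and plot_regression_line() that main() below relies on are not reproduced in the manual (the stray label-setting lines above belong inside the plotting routine). A minimal sketch of these helpers, based on the least-squares formulas derived above, could be:
def estimate_coef(x, y):
    # number of observations
    n = np.size(x)
    # means of x and y
    m_x, m_y = np.mean(x), np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1] * x
    # plotting the regression line and labelling the axes
    plt.plot(x, y_pred, color="g")
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()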
def main():
# observations / data
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
# estimating coefficients
b = estimate_coef(x, y)
print("Estimated coefficients:\nb_0 = {} \
\nb_1 = {}".format(b[0], b[1]))
if __name__ == "__main__":
main()
Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
# Make predictions
y_pred = model.predict(X_test)
In this program:
1. We generate synthetic data for demonstration purposes, where the relationship between X and y
follows a linear model (y = 2X + 1 with some added noise).
2. We split the data into training and testing sets using train_test_split.
3. We create a LinearRegression model and fit it to the training data using fit.
4. We use the trained model to make predictions on the testing data using predict.
5. We plot the original data points and the regression line to visualize the linear regression.
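Only the prediction step of this program survives above; a minimal sketch of the full scikit-learn program described by the five steps (with synthetic data following y = 2X + 1 plus noise, and hypothetical variable names) could be:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# 1. Generate synthetic data following y = 2X + 1 with some added noise
rng = np.random.RandomState(42)
X = rng.rand(100, 1) * 10
y = 2 * X.ravel() + 1 + rng.normal(scale=1.0, size=100)

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Create a LinearRegression model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)

# 4. Make predictions on the testing data
y_pred = model.predict(X_test)

# 5. Plot the original data points and the fitted regression line
x_line = np.linspace(0, 10, 100).reshape(-1, 1)
plt.scatter(X, y, s=10, label='data')
plt.plot(x_line, model.predict(x_line), color='red', label='fitted line')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()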
Output:
Experiment No. 6
Aim: Write a program for binary classification, training classification models (such as logistic regression) and comparing their performance.
Theory: In machine learning, binary classification is a supervised learning algorithm that categorizes
new observations into one of two classes.
The following are a few binary classification applications, where the 0 and 1 columns are two possible
classes for each observation:
Application Observation 0 1
Medical Diagnosis Patient Healthy Diseased
Email Analysis Email Not Spam Spam
Financial Data Analysis Transaction Not Fraud Fraud
Marketing Website visitor Won't Buy Will Buy
Image Classification Image Hotdog Not Hotdog
Quick example
In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input
features and predict whether the patient is healthy or has the disease. The possible outcomes of the diagnosis
are positive and negative.
If the model successfully predicts the patients as positive, this case is called True Positive (TP). If the model
successfully predicts patients as negative, this is called True Negative (TN). The binary classifier may
misdiagnose some patients as well. If a diseased patient is classified as healthy by a negative test result, this
error is called False Negative (FN). Similarly, If a healthy patient is classified as diseased by a positive test
result, this error is called False Positive(FP).
• True Positive (TP): The patient is diseased and the model predicts "diseased"
• False Positive (FP): The patient is healthy but the model predicts "diseased"
• True Negative (TN): The patient is healthy and the model predicts "healthy"
• False Negative (FN): The patient is diseased and the model predicts "healthy"
After obtaining these values, we can compute the accuracy score of the binary classifier as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
In machine learning, many methods utilize binary classification. The most common are:
• Logistic Regression
• Support Vector Machines
• Decision Trees
• Random Forest
• Naive Bayes
• K-Nearest Neighbors
The following Python example will demonstrate using binary classification in a logistic regression problem.
For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor
observations and corresponding labels for whether the tumor was malignant or benign.
First, we'll import a few libraries and then load the data. When loading the data, we'll specify as_frame=True so we can work with pandas objects.
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)
The dataset contains a DataFrame for the observation data and a Series for the target data.
Let's see what the first few rows of observations look like:
dataset['data'].head()
Out:
[Output not reproduced: the first 5 rows of the 30 feature columns, from mean radius, mean texture, mean perimeter, mean area and mean smoothness through worst concavity, worst symmetry and worst fractal dimension.]
5 rows × 30 columns
The output shows five observations with a column for each feature we'll use to predict malignancy.
dataset['target'].head()
Out:
0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64
The targets for the first five observations are all zero. In this dataset, a target of 0 denotes a malignant tumor and 1 denotes a benign tumor. Here's how many malignant and benign tumors are in our dataset:
dataset['target'].value_counts()
Out:
1 357
0 212
Name: target, dtype: int64
So we have 357 benign tumors, denoted as 1, and 212 malignant tumors, denoted as 0. So, we have a binary classification problem.
To perform binary classification using logistic regression with sklearn, we must accomplish the following
steps.
We'll store the rows of observations in a variable X and the corresponding class of those observations (0 or
1) in a variable y.
X = dataset['data']
y = dataset['target']
We use 75% of data for training and 25% for testing. Setting random_state=0 will ensure your results are
the same as ours.
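The splitting code itself is not shown in the manual; a minimal sketch using scikit-learn's train_test_split (whose default test_size of 0.25 gives the 75/25 split described above) is:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)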
Note that we normalize after splitting the data. It's good practice to apply any data transformations to
training and testing data separately to prevent data leakage.
ss_train = StandardScaler()
X_train = ss_train.fit_transform(X_train)
ss_test = StandardScaler()
X_test = ss_test.fit_transform(X_test)
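The model-creation step is likewise missing; a minimal sketch, assuming the logistic regression classifier named in the text, is:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)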
This step effectively trains the model to predict the targets from the data.
With the model trained, we now ask the model to predict targets based on the test data.
predictions = model.predict(X_test)
Step 6: Calculate the accuracy score by comparing the actual values and predicted values.
We can now calculate how well the model performed by comparing the model's predictions to the true target
values, which we reserved in the y_test variable.
First, we'll calculate the confusion matrix to get the necessary parameters:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)
To quickly train each model in a loop, we'll initialize each model and store it by name in a dictionary:
models = {}
# Logistic Regression
from sklearn.linear_model import LogisticRegression
models['Logistic Regression'] = LogisticRegression()
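# Support Vector Machines (added so that the results table below has a matching
# entry; the exact SVM class used originally is not shown, so SVC is assumed)
from sklearn.svm import SVC
models['Support Vector Machines'] = SVC()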
# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()
# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()
# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()
Now that we've initialized the models, we'll loop over each one, train it by calling .fit(), make predictions, calculate metrics, and store each result in a dictionary.
from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy, precision, recall = {}, {}, {}

for key in models.keys():
    # Train the classifier and make predictions on the test set
    models[key].fit(X_train, y_train)
    predictions = models[key].predict(X_test)

    # Calculate metrics
    accuracy[key] = accuracy_score(y_test, predictions)
    precision[key] = precision_score(y_test, predictions)
    recall[key] = recall_score(y_test, predictions)
With all metrics stored, we can use pandas to view the data as a table:
import pandas as pd
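# Assemble the stored metrics into a DataFrame (one possible construction; the
# original code for building df_model is not shown in the manual)
df_model = pd.DataFrame(index=models.keys(), columns=['Accuracy', 'Precision', 'Recall'])
df_model['Accuracy'] = accuracy.values()
df_model['Precision'] = precision.values()
df_model['Recall'] = recall.values()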
df_model
Out:
Accuracy Precision Recall
Logistic Regression 0.958042 0.955556 0.977273
Support Vector Machines 0.937063 0.933333 0.965517
Decision Trees 0.902098 0.866667 0.975000
Random Forest 0.972028 0.966667 0.988636
Naive Bayes 0.937063 0.955556 0.945055
import matplotlib.pyplot as plt

ax = df_model.plot.barh()
ax.legend(
    ncol=len(models.keys()),
    bbox_to_anchor=(0, 1),
    loc='lower left',
    prop={'size': 14}
)
plt.tight_layout()
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
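# NOTE: the data-loading, splitting and model-training code for this final program
# is not shown in the manual; assuming X_train_scaled, X_test_scaled, y_train and
# y_test exist, the two classifiers referenced below could be trained and scored as:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()
logreg.fit(X_train_scaled, y_train)
logreg_accuracy = accuracy_score(y_test, logreg.predict(X_test_scaled))

rf = RandomForestClassifier()
rf.fit(X_train_scaled, y_train)
rf_accuracy = accuracy_score(y_test, rf.predict(X_test_scaled))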
# Print accuracy
print("Logistic Regression Accuracy:", logreg_accuracy)
print("Random Forest Accuracy:", rf_accuracy)
Output:
Logistic Regression Accuracy: 1.0
Random Forest Accuracy: 1.0
Experiment No. 7
Aim: Write a program for Model evaluation using accuracy, precision, recall, and F1-score.
Theory: Classification models are used in classification problems to predict the target class of a data sample. A classification model predicts the probability that each instance belongs to one class or another. It is important to evaluate the performance of a classification model in order to reliably use these models in production for solving real-world problems. Performance measures in machine learning classification are used to assess how well the models perform in a given context. These performance metrics include accuracy, precision, recall, and F1-score. Because it helps us understand the strengths and limitations of these models when making predictions in new situations, measuring model performance is essential in machine learning. Here, we will explore these four classification model performance metrics through a Python sklearn example:
• Accuracy score
• Precision score
• Recall score
• F1-Score
As a data scientist, you must get a good understanding of the concepts above in relation to measuring classification models' performance. Before we get into the details of the performance metrics listed above, let's understand key terminologies such as true positive, false positive, true negative, and false negative with the help of the confusion matrix. These terminologies will be used across the different performance metrics.
Before we get into the definitions, let's work with the sklearn breast cancer dataset for classifying whether a particular instance of data belongs to the benign or malignant breast cancer class. You can load the dataset using the following code:
import pandas as pd
import numpy as np
from sklearn import datasets
#
# Load the breast cancer data set
#
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
The target labels in the breast cancer dataset are Benign (1) and Malignant (0). There are 212 records with
labels as malignant and 357 records with labels as benign. Let’s create a training and test split where 30% of
the dataset is set aside for testing purposes.
Splitting the breast cancer dataset into training and test sets results in a test set consisting of 64 records labelled as malignant and 107 records labelled as benign. Taking the benign class (label 1) as the positive class, the actual positives are 107 records and the actual negatives are 64 records. Let's train the model and get the confusion matrix. Here is the code for training the model and printing the confusion matrix.
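The training code itself is not reproduced in the manual; a minimal sketch that would produce a confusion matrix of the kind discussed below is given here. It assumes a logistic regression classifier, standard scaling, and a 30% test split; the exact classifier and random_state used originally are not stated.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# 70/30 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# scale the features, then train the classifier
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

model = LogisticRegression()
model.fit(X_train_std, y_train)

# predictions and confusion matrix on the test set
y_pred = model.predict(X_test_std)
print(confusion_matrix(y_test, y_pred))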
The predicted results in the above confusion matrix can be read in the following manner, treating label 1 as the positive class.
• True Positive (TP): True positive measures the extent to which the model correctly predicts the positive class. That is,
the model predicts that the instance is positive, and the instance is actually positive. True positives are relevant when we
want to know how many positives our model correctly predicts. For example, in a binary classification problem with
classes “A” and “B”, if our goal is to predict class “A” correctly, then a true positive would be the number of instances of
class “A” that our model correctly predicted as class “A”. Taking a real-world example, if the model is designed to
predict whether an email is spam or not, a true positive would occur when the model correctly predicts that an email is a
spam. The true positive rate is the percentage of all instances that are correctly classified as belonging to a certain class.
True positives are important because they indicate how well our model performs on positive instances. In the above
confusion matrix, out of 107 actual positives, 104 are correctly predicted positives. Thus, the value of True Positive is
104.
• False Positive (FP): False positives occur when the model predicts that an instance belongs to a class that it actually does
not. False positives can be problematic because they can lead to incorrect decision-making. For example, if a medical
diagnosis model has a high false positive rate, it may result in patients undergoing unnecessary treatment. False positives
can be detrimental to classification models because they lower the overall accuracy of the model. There are a few ways to
measure false positives, including false positive rates. The false positive rate is the proportion of all negative examples
that are predicted as positive. While false positives may seem like they would be bad for the model, in some cases they
can be desirable. For example, in medical applications, it is often better to err on the side of caution and have a few false
positives than to miss a diagnosis entirely. However, in other applications, such as spam filtering, false positives can be
very costly. Therefore, it is important to carefully consider the trade-offs involved when choosing between different
classification models. In the above example, the false positive represents the number of negatives (out of 64) that get
falsely predicted as positive. Out of 64 actual negatives, 3 is falsely predicted as positive. Thus, the value of False
Positive is 3.
• True Negative (TN): True negatives are the outcomes that the model correctly predicts as negative. For example, if the
model is predicting whether or not a person has a disease, a true negative would be when the model predicts that the
person does not have the disease and they actually don’t have the disease. True negatives are one of the measures used to
assess how well a classification model is performing. In general, a high number of true negatives indicates that the model
is performing well. True negative is used in conjunction with false negative, true positive, and false positive to compute a
variety of performance metrics such as accuracy, precision, recall, and F1 score. While true negative provides valuable
insight into the classification model’s performance, it should be interpreted in the context of other metrics to get a
complete picture of the model's accuracy. Out of 64 actual negatives, 61 are correctly predicted as negative. Thus, the value of True Negative is 61.
• False Negative (FN): A false negative occurs when a model predicts an instance as negative when it is actually positive.
False negatives can be very costly, especially in the field of medicine. For example, if a cancer screening test predicts
that a patient does not have cancer when they actually do, this could lead to the disease progressing without treatment.
False negatives can also occur in other fields, such as security or fraud detection; in these cases, a false negative may allow a threat or a fraudulent transaction to go undetected. In the above confusion matrix, out of 107 actual positives, 3 are falsely predicted as negative; thus, the value of False Negative is 3.
Given the above definitions, let’s try and understand the concept of accuracy, precision, recall, and f1-score.
The model precision score measures the proportion of positively predicted labels that are actually correct.
Precision is also known as the positive predictive value. Precision is used in conjunction with the recall to
trade-off false positives and false negatives. Precision is affected by the class distribution. If there are more
samples in the minority class, then precision will be lower. Precision can be thought of as a measure of
exactness or quality. If we want to minimize false positives, we would choose a model with high precision.
Conversely, if we want to minimize false negatives, we would choose a model with high recall. Precision is
mainly used when we need to predict the positive class and there is a greater cost associated with false
positives than with false negatives such as in medical diagnosis or spam filtering. For example, if a model is
99% accurate but only has 50% precision, that means that half of the time when it predicts an email is a
spam, it is actually not spam.
The precision score is a useful measure of the success of prediction when the classes are very imbalanced. Mathematically, it represents the ratio of true positives to the sum of true positives and false positives:
Precision = TP / (TP + FP)
From the above formula, you could notice that the value of false-positive would impact the precision score.
Thus, while building predictive models, you may choose to focus appropriately to build models with lower
false positives if a high precision score is important for the business requirements.
The precision score from the above confusion matrix will come out to be the following:
Precision = 104 / (104 + 3) = 0.972
The same score can be obtained by using the precision_score method from sklearn.metrics.
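A minimal usage sketch, assuming the y_test and y_pred variables from the training sketch earlier in this experiment:
from sklearn.metrics import precision_score
print(precision_score(y_test, y_pred))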
The precision score can be used in the scenario where the machine learning model is required to identify all
positive examples without any false positives. For example, machine learning models are used in medical
diagnosis applications where the doctor wants machine learning models that will not provide a label of
pneumonia if the patient does not have this disease. Oncologists ideally want models that can identify all
cancerous lesions without any false-positive results, and hence one could use a precision score in such cases.
Note that a greater number of false positives will result in a lot of stress for the patients in general although
that may not turn out to be fatal from a health perspective. Further tests will be able to negate the false
positive prediction.
The other example where the precision score can be useful is credit card fraud detection. In credit card fraud
detection problems, classification models are evaluated using the precision score to determine how many
positive samples were correctly classified by the classification model. You would not like to have a high
number of false positives or else you might end up blocking many credit cards and hence a lot of frustrations
with the end-users.
Model recall score represents the model’s ability to correctly predict the positives out of actual positives.
This is unlike precision which measures how many predictions made by models are actually positive out of
all positive predictions made. For example: If your machine learning model is trying to identify positive
reviews, the recall score would be what percent of those positive reviews did your machine learning model
correctly predict as a positive. In other words, it measures how good our machine learning model is at
identifying all actual positives out of all positives that exist within a dataset. Recall is also known as
sensitivity or the true positive rate.
The higher the recall score, the better the machine learning model is at identifying actual positive examples. A high recall score indicates that the model is good at identifying positive examples. Conversely, a low recall score indicates that the model is not good at identifying positive examples.
Recall is often used in conjunction with other performance metrics, such as precision and accuracy, to get a
complete picture of the model's performance. Mathematically, it represents the ratio of true positives to the sum of true positives and false negatives:
Recall = TP / (TP + FN)
From the above formula, you could notice that the value of false-negative would impact the recall score.
Thus, while building predictive models, you may choose to focus appropriately to build models with lower
false negatives if a high recall score is important for the business requirements.
The recall score from the above confusion matrix will come out to be the following:
Recall = 104 / (104 + 3) = 0.972
The same score can be obtained by using the recall_score method from sklearn.metrics.
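Again assuming the y_test and y_pred variables from the training sketch above:
from sklearn.metrics import recall_score
print(recall_score(y_test, y_pred))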
Recall score can be used in the scenario where the labels are not equally divided among classes. For
example, if there is a class imbalance ratio of 20:80 (imbalanced data), then the recall score will be more
useful than accuracy because it can provide information about how well the machine learning model
identified rarer events.
Different real-world scenarios when recall scores can be used as evaluation metrics
Recall score is an important metric to consider when measuring the effectiveness of your machine learning
models. It can be used in a variety of real-world scenarios, and it’s important to always aim to improve
recall and precision scores together. The following are examples of some real-world scenarios where recall
scores can be used as evaluation metrics:
• In medical diagnosis, the recall score should be an extremely high otherwise greater number of false negatives would
prove to be fatal to the life of patients. The lower recall score would mean a greater false negative which essentially
would mean that some patients who are positive are termed as falsely negative. That would mean that patients would get
assured that he/she is not suffering from the disease and therefore he/she won’t take any further action. That could result
in the disease getting aggravated and prove fatal to life.
o Lets understand with an example of detection of breast cancer through mammography screening. ML
models can be trained on large datasets of mammography images to assist radiologists in interpreting them. A
high recall score is important in this scenario because it indicates that the model is able to correctly identify all
The precision-recall tradeoff is a common issue that arises when evaluating the performance of a
classification model. Precision and recall are two metrics that are often used to evaluate the performance of a
classifier, and they are often in conflict with each other.
Precision measures the proportion of true positive predictions made by the model, i.e. the number of correct
positive predictions divided by the total number of positive predictions: Precision = TP / (TP + FP). It is a
useful metric for evaluating the model's ability to avoid false positives.
Recall, on the other hand, measures the proportion of actual positive cases that were correctly predicted by the
model, i.e. the number of correct positive predictions divided by the total number of actual positive cases:
Recall = TP / (TP + FN). It is a useful metric for evaluating the model's ability to avoid false negatives.
In general, increasing the precision of a model will decrease its recall, and vice versa. This is because
precision and recall are inversely related – improving one will typically result in a decrease in the other. For
example, a model with a high precision will make few false positive predictions, but it may also miss some
true positive cases. On the other hand, a model with a high recall will correctly identify most of the true
positive cases, but it may also make more false positive predictions.
In order to evaluate a classification model, it is important to consider both precision and recall, rather than
just one of these metrics. The appropriate balance between precision and recall will depend on the specific
goals and requirements of the model, as well as the characteristics of the dataset. In some cases, it may be
more important to have a high precision (e.g. in medical diagnosis), while in others, a high recall may be
more important (e.g. in fraud detection).
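As a hedged sketch of the tradeoff, classifying the same illustrative probability scores at two different decision thresholds shows precision rising while recall falls:
from sklearn.metrics import precision_score, recall_score
y_true  = [1, 1, 1, 0, 0, 1, 0, 1]                     # actual labels (illustrative)
y_score = [0.9, 0.4, 0.8, 0.2, 0.3, 0.7, 0.6, 0.95]    # predicted probabilities (illustrative)
for threshold in (0.5, 0.8):
    y_pred = [1 if p >= threshold else 0 for p in y_score]
    print(threshold, precision_score(y_true, y_pred), recall_score(y_true, y_pred))
Raising the threshold from 0.5 to 0.8 increases precision (1.0 instead of 0.8) but lowers recall (0.6 instead of 0.8) on this toy data.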
To balance precision and recall, practitioners often use the F1 score, which is a combination of the two
metrics. The F1 score is calculated as the harmonic mean of precision and recall, and it provides a balance
between the two metrics. However, even the F1 score is not a perfect solution, as it can be difficult to
determine the optimal balance between precision and recall for a given application.
Model accuracy is a machine learning classification model performance metric that is defined as the ratio of
true positives and true negatives to all positive and negative observations. In other words, accuracy tells us
how often we can expect our machine learning model will correctly predict an outcome out of the total
number of times it made predictions. For example, assume that you tested your machine learning model on a
dataset of 100 records and that the model predicted 90 of those records correctly. The accuracy, in this case,
would be 90/100 = 90%. A 90% accuracy rate sounds great, but accuracy alone can be misleading, as discussed below.
Mathematically, accuracy represents the ratio of the sum of true positives and true negatives to all predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The accuracy score from above confusion matrix will come out to be the following:
The same score can be obtained by using accuracy_score method from sklearn.metrics
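A minimal sketch with illustrative labels:
from sklearn.metrics import accuracy_score
y_true = [1, 1, 1, 0, 0, 1, 0, 1]      # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]      # model predictions (illustrative)
print(accuracy_score(y_true, y_pred))  # (TP + TN) / total = 6 / 8 = 0.75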
The following are some of the issues with the accuracy metric:
• The same accuracy value for two different models may hide very different behaviour towards different classes.
• In the case of an imbalanced dataset, accuracy is not the most effective metric to use.
One should be cautious when relying on the accuracy metric alone to evaluate model performance. Consider
two confusion matrices (left and right) that both correspond to an accuracy of 60%; the two models nevertheless
behave very differently. The model represented by the left confusion matrix has a weak positive recognition
rate, while the right confusion matrix represents a model with a strong positive recognition rate. Since the
accuracy is 60% for both models, one needs to dig deeper than the accuracy metric to understand model
performance.
The accuracy metric is also not reliable for models trained on imbalanced or skewed datasets. Take
a scenario of a dataset with a 95% imbalance (95% of the data belongs to the negative class). A classifier that
simply predicts the negative class most of the time will have very high accuracy, while a better classifier
that actually deals with the class imbalance is likely to have a worse accuracy score. In such a
scenario of an imbalanced dataset, another metric, AUC (the area under the ROC curve), is more robust than
the accuracy score because it takes the class distribution into consideration. The ROC curve is a plot that
shows the relationship between the true positive rate and the false positive rate of a classification model.
The area under the ROC curve (AUC) quantifies the overall performance of the model; a model with a higher
AUC is considered a better classifier. A much better way to evaluate the performance of a classifier is also
to look at the confusion matrix.
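As a hedged sketch (using illustrative labels and probability scores), both the confusion matrix and the AUC can be obtained from sklearn.metrics:
from sklearn.metrics import confusion_matrix, roc_auc_score
y_true  = [1, 1, 1, 0, 0, 1, 0, 1]                     # actual labels (illustrative)
y_score = [0.9, 0.4, 0.8, 0.2, 0.3, 0.7, 0.6, 0.95]    # predicted probabilities (illustrative)
y_pred  = [1 if p >= 0.5 else 0 for p in y_score]      # hard predictions at a 0.5 threshold
print(confusion_matrix(y_true, y_pred))                # rows = actual class, columns = predicted class
print(roc_auc_score(y_true, y_score))                  # area under the ROC curve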
The accuracy metric only considers the number of correct predictions (true positives and true negatives)
made by the model. It does not take into account the relative importance of different types of errors, such as
false positives and false negatives. For example, if a model is being used to predict whether a patient has a
certain disease, a false positive (predicting that a patient has the disease when they actually do not) may be
less severe than a false negative (predicting that a patient does not have the disease when they actually do).
In this case, using accuracy as the sole evaluation metric may not provide a clear picture of the model’s
performance.
What is F1-Score?
Model F1 score represents the model score as a function of precision and recall. The F1 score is a machine
learning model performance metric that gives equal weight to both precision and recall, making it an
alternative to the accuracy metric (it does not require us to know the total number of observations). It is often
used as a single value that provides high-level information about the model's output quality. It is a useful
measure in scenarios where optimizing only one of precision or recall causes the overall model performance
to suffer. The following points describe the issues with optimizing either precision or recall alone:
• Optimizing for recall helps with minimizing the chance of not detecting a malignant cancer. However, this comes at the
cost of predicting malignant cancer in patients even though the patients are healthy (a high number of FP).
• Optimizing for precision helps with being correct when the model predicts that a patient has malignant cancer. However,
this comes at the cost of missing malignant cancer more frequently (a high number of FN).
The F1 score is calculated as the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). The F1 score from the above confusion matrix will come out to be the following:
The same score can be obtained by using f1_score method from sklearn.metrics
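A minimal sketch with illustrative labels, showing that f1_score equals the harmonic mean of precision and recall:
from sklearn.metrics import f1_score, precision_score, recall_score
y_true = [1, 1, 1, 0, 0, 1, 0, 1]    # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]    # model predictions (illustrative)
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(f1_score(y_true, y_pred))      # 0.8 on this toy data
print(2 * p * r / (p + r))           # same value computed from the formula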
# Standardize features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
Output:
Experiment No. 8
Aim: Write a program to use the K-means clustering algorithm.
Theory: The K-means clustering algorithm computes the centroids and iterates until it finds the optimal
centroids. It assumes that the number of clusters is already known. It is also called a flat clustering
algorithm. The number of clusters identified from the data by the algorithm is represented by 'K' in K-means.
In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the squared
distances between the data points and the centroid is minimized. It is to be understood that less variation
within a cluster means more similar data points within the same cluster.
We can understand the working of K-Means clustering algorithm with the help of following steps −
• Step 1 − First, we need to specify the number of clusters, K, that need to be generated by this algorithm.
• Step 2 − Next, randomly select K data points and assign each data point to a cluster. In simple
words, classify the data based on the number of data points.
• Step 3 − Now it will compute the cluster centroids.
• Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the assignment of
data points to the clusters no longer changes −
4.1 − First, the sum of squared distances between data points and centroids is computed.
4.2 − Now, assign each data point to its closest cluster (centroid).
4.3 − At last, compute the centroids for the clusters by taking the average of all data points of that cluster.
K-means follows Expectation-Maximization approach to solve the problem. The Expectation-step is used
for assigning the data points to the closest cluster and the Maximization-step is used for computing the
centroid of each cluster.
While working with K-means algorithm we need to take care of the following things −
• While working with clustering algorithms including K-Means, it is recommended to standardize the
data because such algorithms use distance-based measurement to determine the similarity between
data points.
• Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in
a local optimum and may not converge to the global optimum. That is why it is recommended to use
different initializations of centroids.
Implementation in Python
The following two examples of implementing K-Means clustering algorithm will help us in its better
understanding −
Example 1
It is a simple example to understand how k-means works. In this example, we are first going to generate a 2D
dataset containing four different blobs and after that apply the k-means algorithm to see the result.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
The following code will generate the 2D dataset containing four blobs −
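The generating code is not reproduced above; a minimal sketch using sklearn's make_blobs (the parameter values are illustrative):
from sklearn.datasets import make_blobs
# Generate a 2D dataset with four blobs
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.show()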
Next, make an object of KMeans along with providing number of clusters, train the model and do the
prediction as follows −
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
Now, with the help of following code we can plot and visualize the cluster’s centers picked by k-means
Python estimator −
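A minimal sketch of the plotting step (the marker sizes and colormap are illustrative choices):
# Colour each point by its predicted cluster and mark the centroids
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()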
Example 2
Let us move to another example in which we are going to apply K-means clustering on the simple digits dataset.
K-means will try to identify similar digits without using the original label information.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans
Next, load the digits dataset from sklearn and make an object of it. We can also find the number of rows and
columns in this dataset as follows −
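A minimal sketch of loading the dataset:
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.data.shape)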
Output
(1797, 64)
The above output shows that this dataset has 1797 samples with 64 features.
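A minimal sketch of the clustering step that produces the next output (random_state is an illustrative choice):
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
print(kmeans.cluster_centers_.shape)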
Output
(10, 64)
The above output shows that K-means created 10 cluster centers, each with 64 features.
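A minimal sketch of the plotting code that produces the image below (the cluster centers are reshaped back into 8×8 pixel images):
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)
plt.show()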
Output
As output, we will get the following image showing the cluster centers learned by k-means.
The following lines of code will match the learned cluster labels with the true labels of the digits −
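A sketch of one common way to do this (it assumes the digits and clusters variables from the sketches above): each cluster is assigned the most frequent true digit label within it, and the match is then scored.
from scipy.stats import mode
from sklearn.metrics import accuracy_score
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]   # most common true label in cluster i
print(accuracy_score(digits.target, labels))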
Output
0.7935447968836951
Applications of K-Means Clustering
K-means clustering performs well enough for many practical tasks. It can be used in the
following applications −
• Market segmentation
• Document Clustering
• Image segmentation
• Image compression
• Customer segmentation
• Analyzing the trend on dynamic data
Output:
Experiment No. 9
Aim: Write a program for Text pre-processing using tokenization, stop word
removal, and stemming.
Text preprocessing is an essential step in natural language processing (NLP) that involves cleaning and
transforming unstructured text data to prepare it for analysis. It includes tokenization, stemming,
lemmatization, stop-word removal, and part-of-speech tagging. In this experiment, we will introduce the basics of
text preprocessing and provide Python code examples to illustrate how to implement these tasks using the
NLTK library. By the end of this experiment, you will better understand how to prepare text data for NLP
tasks.
Natural Language Processing (NLP) is a branch of Data Science which deals with Text data. Apart from
numerical data, Text data is available to a great extent which is used to analyze and solve business problems.
But before using the data for analysis or prediction, processing the data is important.
To prepare the text data for model building we perform text preprocessing. It is the very first step of
NLP projects. Some of the preprocessing steps are: removing punctuation, lower casing, tokenization,
stop-word removal, stemming, and lemmatization.
We need to use the required steps based on our dataset. Here, we will use the SMS Spam data to
understand the steps involved in Text Preprocessing in NLP.
Let’s start by importing the pandas library and reading the data.
The data has 5572 rows and 2 columns. You can check the shape of the data using the data.shape attribute. Let's
also check the distribution of the dependent variable between spam and ham.
Removing Punctuation
In this step, all the punctuation is removed from the text. The string library of Python contains a pre-defined
set of punctuation characters (string.punctuation): !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
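A minimal sketch of this step (the helper function name is illustrative; the clean_msg column name follows the text below):
import string
# Remove every punctuation character and store the result in a new clean_msg column
def remove_punctuation(text):
    return "".join(ch for ch in text if ch not in string.punctuation)
data['clean_msg'] = data['v2'].apply(remove_punctuation)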
We can see in the above output that all the punctuation has been removed from v2 and stored in the clean_msg
column.
Lower Casing
It is one of the most common text preprocessing steps, where the text is converted into the same case,
preferably lower case. But it is not necessary to do this step every time you work on an NLP problem,
as for some problems lower casing can lead to loss of information.
For example, if in any project we are dealing with the emotions of a person, then the words written in upper
cases can be a sign of frustration or excitement.
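A one-line sketch of this step (it assumes the clean_msg column from the previous step; msg_lower follows the text below):
# Convert the cleaned messages to lower case and store them in msg_lower
data['msg_lower'] = data['clean_msg'].str.lower()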
Output: All the text of the clean_msg column is converted into lower case and stored in the msg_lower column.
Tokenization
In this step, the text is split into smaller units. We can use either sentence tokenization or word tokenization
based on our problem statement.
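A minimal sketch using NLTK's word tokenizer (the msg_tokenized column name is an assumption for illustration):
from nltk.tokenize import word_tokenize
# nltk.download('punkt') may be needed the first time
data['msg_tokenized'] = data['msg_lower'].apply(word_tokenize)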
Stop Word Removal
Stopwords are commonly used words that are removed from the text as they do not add any value to the
analysis. These words carry little or no meaning.
NLTK library consists of a list of words that are considered stopwords for the English language. Some of
them are : [i, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours,
yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can,
will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, couldn, couldn’t, didn, didn’t]
But it is not necessary to use the provided list as stopwords as they should be chosen wisely based on the
project. For example, ‘How’ can be a stop word for a model but can be important for some other problem
where we are working on the queries of the customers. We can create a customized list of stop words for
different problems.
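A minimal sketch of stop word removal (it assumes the msg_tokenized column created above; the no_stopwords column name follows the text below):
from nltk.corpus import stopwords
# nltk.download('stopwords') may be needed the first time
stop_words = set(stopwords.words('english'))
data['no_stopwords'] = data['msg_tokenized'].apply(
    lambda tokens: [word for word in tokens if word not in stop_words])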
Output: Stop words that are present in the nltk library such as in, until, to, I, here are removed from the
tokenized text and the rest are stored in the no_stopwords column.
Stemming
It is also known as the text standardization step where the words are stemmed or reduced to their
root/base form. For example, words like 'programmer', 'programming', and 'program' will be stemmed to
'program'.
But the disadvantage of stemming is that the stemmed root form may lose its meaning or may not be a proper
English word. We will see this in the steps below.
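A minimal sketch using NLTK's PorterStemmer (the msg_stemmed column name is an assumption for illustration):
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Stem each token that survived stop word removal
data['msg_stemmed'] = data['no_stopwords'].apply(
    lambda tokens: [stemmer.stem(word) for word in tokens])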
Output: In the below output, we can see how some words are stemmed to their base form.
crazy-> crazi
available-> avail
entry-> entri
early-> earli
Lemmatization
It reduces the word to its base form but makes sure that the word does not lose its meaning. Lemmatization uses
a pre-defined dictionary that stores the context of words and checks the word in the dictionary while reducing it.
The difference between Stemming and Lemmatization can be understood with the example provided below.
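A minimal sketch using NLTK's WordNetLemmatizer (the msg_lemmatized column name is an assumption for illustration):
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet') may be needed the first time
lemmatizer = WordNetLemmatizer()
data['msg_lemmatized'] = data['no_stopwords'].apply(
    lambda tokens: [lemmatizer.lemmatize(word) for word in tokens])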
Output: The difference between Stemming and Lemmatization can be seen in the below output.
In the first row, 'crazy' has been changed to 'crazi', which has no meaning, but with lemmatization it remains
the same, i.e. 'crazy'.
In the last row, 'goes' has been changed to 'goe' by stemming, but lemmatization converts it into 'go', which is
meaningful.
After all the text processing steps are performed, the final acquired data is converted into the numeric form
using Bag of words or TF-IDF.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
input_text = "Text preprocessing is an essential step in NLP."        # sample text (illustrative)
tokens = word_tokenize(input_text)                                    # Tokenization
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]  # Stop word removal
stemmer = PorterStemmer()                                             # Stemming
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
Output:
Experiment No. 10
Theory: Time series is a series of data points in which each data point is associated with a timestamp. A
simple example is the price of a stock in the stock market at different points of time on a given day. Another
example is the amount of rainfall in a region at different months of the year.
In the below example we take the value of stock prices every day for a quarter for a particular stock symbol.
We capture these values as a CSV file and then organize them into a dataframe using the pandas library. We then
set the date field as the index of the dataframe by converting the ValueDate column into the index and
deleting the old ValueDate column.
Sample Data
Below is the sample data for the price of the stock on different days of a given quarter. The data is saved in a
file named as stock.csv
ValueDate Price
01-01-2018, 1042.05
02-01-2018, 1033.55
03-01-2018, 1029.7
04-01-2018, 1021.3
05-01-2018, 1015.4
...
...
...
...
23-03-2018, 1161.3
26-03-2018, 1167.6
27-03-2018, 1155.25
28-03-2018, 1154
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns=['ValueDate', 'Price'])
df.index = pd.to_datetime(df['ValueDate'], dayfirst=True)  # set the date field as the index
del df['ValueDate']                                        # delete the old ValueDate column
df.plot(figsize=(15, 6))
plt.show()
Output:
Experiment No. 11
Theory: The method of communication with the help of which humans can speak, read, and write, is
language. In other words, we humans can think, make plans, make decisions in our natural language. Here
the big question is, in the era of artificial intelligence, machine learning and deep learning, can humans
communicate in natural language with computers/machines? Developing NLP applications is a huge
challenge for us because computers require structured data, but on the other hand, human speech is
unstructured and often ambiguous in nature.
Natural Language Processing (NLP) is that subfield of computer science, more specifically of AI, which enables
computers/machines to understand, process and manipulate human language. In simple words, NLP is a way
for machines to analyze, understand and derive meaning from human natural languages like Hindi, English,
French, Dutch, etc.
Before diving deep into the working of NLP, we must understand how human beings use language. Every day,
we humans use hundreds or thousands of words, and other humans interpret them and answer accordingly. It is
simple communication for humans, isn't it? But we know words run much deeper than that, and we always
derive context from what we say and how we say it. That is why we can say that, rather than focusing on
voice modulation, NLP draws on contextual patterns.
How do humans know which word means what? The answer is that we learn through experience. But how do
machines/computers learn the same?
• First, we need to feed the machines enough data so that they can learn from experience.
• Then the machine will create word vectors, using deep learning algorithms, from the data we fed
earlier as well as from its surrounding data.
• Then, by performing simple algebraic operations on these word vectors, the machine is able to
provide answers the way human beings would.
Components of NLP
Morphological Processing
Morphological processing is the first component of NLP. It includes breaking of chunks of language input
into sets of tokens corresponding to paragraphs, sentences and words. For example, a word like “everyday”
can be broken into two sub-word tokens as “every-day”.
Syntax analysis
Syntax Analysis, the second component, is one of the most important components of NLP. The purposes of
this component are to check that a sentence is well formed and to break it up into a structure that shows the
syntactic relationships between the different words.
Semantic analysis
Semantic Analysis is the third component of NLP which is used to check the meaningfulness of the text. It
includes drawing exact meaning, or we can say dictionary meaning from the text. E.g. The sentences like
“It’s a hot ice-cream.” would be discarded by semantic analyzer.
Pragmatic analysis
Pragmatic analysis is the fourth component of NLP. It includes fitting the actual objects or events that exist
in a given context to the object references obtained by the previous component, i.e. semantic analysis.
Applications of NLP
NLP, an emerging technology, derives various forms of AI that we see these days. For today's and
tomorrow's increasingly cognitive applications, the use of NLP in creating a seamless and interactive
interface between humans and machines will continue to be a top priority. The following are some of the very
useful applications of NLP.
Machine Translation
Machine translation (MT) is one of the most important applications of natural language processing. MT is
basically the process of translating text from one source language into another language. A machine translation
system can be either bilingual or multilingual.
Fighting Spam
Due to the enormous increase in unwanted emails, spam filters have become important because they are the first line
of defense against this problem. By considering false positives and false negatives as the main issues,
the functionality of NLP can be used to develop spam filtering systems.
N-gram modelling, Word Stemming and Bayesian classification are some of the existing NLP models that
can be used for spam filtering.
Search Engines
Most of the search engines like Google, Yahoo, Bing, WolframAlpha, etc., base their machine translation
(MT) technology on NLP deep learning models. Such deep learning models allow algorithms to read text on a
webpage, interpret its meaning and translate it to another language.
Automatic Summarization
Automatic text summarization is a technique which creates a short, accurate summary of longer text
documents. Hence, it helps us in getting relevant information in less time. In this digital era, we are in
serious need of automatic text summarization because the flood of information over the internet
is not going to stop. NLP and its functionalities play an important role in developing automatic text
summarization.
Grammar Correction
Spelling correction & grammar correction is a very useful feature of word processor software like Microsoft
Word. Natural language processing (NLP) is widely used for this purpose.
Question-answering
Question-answering, another main application of natural language processing (NLP), focuses on building
systems which automatically answer questions posted by users in their natural language.
Sentiment analysis
Sentiment analysis is another important application of natural language processing (NLP). As its name
implies, sentiment analysis is used to identify the sentiment or opinion expressed in a piece of text, even
when it is not stated explicitly.
Online E-commerce companies like Amazon, ebay, etc., are using sentiment analysis to identify the opinion
and sentiment of their customers online. It will help them to understand what their customers think about
their products and services.
Speech engines
Speech engines like Siri, Google Voice, Alexa are built on NLP so that we can communicate with them in
our natural language.
Implementing NLP
In order to build the above-mentioned applications, we need to have a specific skill set with a great
understanding of language and tools to process the language efficiently. To achieve this, various tools are
available; some of them are open-sourced while others are developed by organizations to build their own
NLP applications. Some widely used NLP tools are NLTK, spaCy, Gensim, Stanford CoreNLP, Apache
OpenNLP and GATE.
Among the above-mentioned NLP tools, NLTK scores very high when it comes to ease of use and
explanation of the concepts. The learning curve of Python is very fast, and since NLTK is written in Python,
it also comes with a very good learning kit. NLTK covers most of the common tasks like tokenization,
stemming, lemmatization, punctuation handling, character count and word count. It is very elegant and easy to
work with.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Sample text (illustrative)
text = "Natural language processing with NLTK makes text analysis easy and fun."
# Tokenization (requires nltk.download('punkt'))
tokens = word_tokenize(text)
# Stop word removal (requires nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
# Frequency distribution
freq_dist = FreqDist(stemmed_tokens)
# Sentiment analysis (requires nltk.download('vader_lexicon'))
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(text)
# Print results
print("Original Text:", text)
print("Tokens:", tokens)
print("Filtered Tokens (without stop words):", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Most Common Words:", freq_dist.most_common(5))
print("Sentiment Analysis Score:", sentiment_score)
Output:
Experiment No. 12
Theory:
NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and
deriving information from the text data in a smart and efficient manner. By utilizing NLP and its
components, one can organize the massive chunks of text data, perform numerous automated tasks and solve
a wide range of problems such as – automatic summarization, machine translation, named entity recognition,
relationship extraction, sentiment analysis, speech recognition, and topic segmentation etc.
Before moving further, I would like to explain some terms that are used in this experiment.
Download NLTK data: run the Python shell (in a terminal) and write the following code:
```
import nltk
nltk.download()
```
Follow the instructions on screen and download the desired package or collection. Other libraries can be
directly installed using pip.
Text Preprocessing
Since text is the most unstructured form of all available data, various types of noise are present in it, and
the data is not readily analyzable without pre-processing. The entire process of cleaning and standardizing
text, making it noise-free and ready for analysis, is known as text preprocessing. It consists predominantly of
three steps:
• Noise Removal
• Lexicon Normalization
• Object Standardization
Noise Removal
Any piece of text which is not relevant to the context of the data and the end-output can be specified as the
noise.
For example – language stopwords (commonly used words of a language – is, am, the, of, in etc), URLs or
links, social media entities (mentions, hashtags), punctuations and industry specific words. This step deals
with removal of all types of noisy entities present in the text.
A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate the text object
by tokens (or by words), eliminating those tokens which are present in the noise dictionary.
Python Code:
Another approach is to use regular expressions while dealing with special patterns of noise. The following
Python code removes a regex pattern (here, hashtags) from the input text:
```
import re
regex_pattern = r"#[\w]*"                                  # matches hashtags
sample_text = "remove this #hashtag from the input text"   # illustrative input
print(re.sub(regex_pattern, "", sample_text))
```
Lexicon Normalization
Another type of textual noise is the multiple representations exhibited by a single word.
For example, "play", "player", "played", "plays" and "playing" are different variations of the word "play".
Though they mean different things, contextually they are all similar. This step converts all such disparities of a
word into their normalized form (also known as the lemma). The two most common practices are:
• Stemming: Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s” etc) from a
word.
• Lemmatization: Lemmatization, on the other hand, is an organized & step by step procedure of obtaining the root form
of the word, it makes use of vocabulary (dictionary importance of words) and morphological analysis (word structure and
grammar relations).
Below is the sample code that performs lemmatization and stemming using python’s popular library –
NLTK.
```
# requires nltk.download('wordnet') for the lemmatizer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
lem = WordNetLemmatizer()
stem = PorterStemmer()
word = "multiplying"
lem.lemmatize(word, "v")
>> "multiply"
stem.stem(word)
>> "multipli"
```
Object Standardization
Text data often contains words or phrases which are not present in any standard lexical dictionaries. These
pieces are not recognized by search engines and models.
Some of the examples are acronyms, hashtags with attached words, and colloquial slang. With the help of
regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below
uses a dictionary lookup method to replace social media slang in a text.
```
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome', 'luv': 'love'}  # ... extend as needed
def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text
```
Apart from the three steps discussed so far, other types of text preprocessing include handling encoding/decoding
noise, grammar checking, and spelling correction.
To analyse preprocessed data, it needs to be converted into features. Depending upon the usage, text
features can be constructed using assorted techniques: syntactical parsing, entities / n-grams / word-based
features, statistical features, and word embeddings. The following sections explain some of these techniques in detail.
Syntactical parsing involves the analysis of words in the sentence for grammar and their arrangement in a
manner that shows the relationships among the words. Dependency Grammar and Part of Speech tags are
the important attributes of text syntactics.
Dependency Trees – Sentences are composed of words sewn together. The relationship among the
words in a sentence is determined by the basic dependency grammar. Dependency grammar is a class of
syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items
(words). Every relation can be represented in the form of a triplet (relation, governor, dependent). For
example: consider the sentence – “Bills on ports and immigration were submitted by Senator Brownback,
Republican of Kansas.” The relationship among the words can be observed in the form of a tree
representation as shown:
This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output
which can be used as features for many NLP problems like entity-wise sentiment analysis, actor and entity
identification, and text classification. The Python wrapper StanfordCoreNLP (by the Stanford NLP Group)
and NLTK dependency grammars can be used to generate dependency trees.
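As a hedged alternative sketch, spaCy (which is also used later in this experiment) can produce the same kind of (dependent, relation, governor) triplets; it assumes the en_core_web_sm model has been downloaded:
```
import spacy
nlp = spacy.load("en_core_web_sm")   # python -m spacy download en_core_web_sm
doc = nlp("Bills on ports and immigration were submitted by Senator Brownback.")
for token in doc:
    print(token.text, token.dep_, token.head.text)   # dependent, relation, governor
```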
Part of Speech Tagging
Apart from grammar relations, every word in a sentence is also associated with a part-of-speech (POS) tag
(noun, verb, adjective, adverb, etc.). The POS tag defines the usage and function of a word in the sentence.
The following NLTK example tags a sentence:
```
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))
>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'),
('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]
```
A. Word sense disambiguation: Some words have multiple meanings according to their usage. For example,
consider the two sentences below:
I. "Please book my flight for Delhi."
II. "I will read this book in the flight."
"Book" is used in different contexts, and the part-of-speech tag for the two cases is different. In
sentence I, the word "book" is used as a verb, while in sentence II it is used as a noun. (The Lesk algorithm
is also used for similar purposes.)
B. Improving word-based features: A learning model could lose the different contexts of a word when using
the word alone as a feature; however, if the part-of-speech tag is linked with it, the context is preserved, thus
making stronger features. For example:
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1),
(“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
C. Normalization and Lemmatization: POS tags are the basis of the lemmatization process for converting a
word to its base form (lemma).
D. Efficient stopword removal: POS tags are also useful in the efficient removal of stopwords.
For example, there are some tags which always define the low-frequency / less important words of a
language, such as (IN – "within", "upon", "except"), (CD – "one", "two", "hundred"), (MD – "may",
"must", etc.)
import spacy
nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
# Sample text
text = ("Apple Inc. was founded by Steve Jobs and Steve Wozniak. "
        "It is headquartered in Cupertino, California.")
doc = nlp(text)
for ent in doc.ents:                 # print the named entities found in the text
    print(ent.text, ent.label_)
Output:
References
• https://www.geeksforgeeks.org/data-cleansing-introduction/
• https://www.pluralsight.com/guides/interpreting-data-using-descriptive-statistics-python
• https://www.geeksforgeeks.org/plotting-histogram-in-python-using-matplotlib/
• https://medium.datadriveninvestor.com/p-value-t-test-chi-square-test-anova-when-to-use-which-strategy-32907734aa0e
• https://www.geeksforgeeks.org/linear-regression-python-implementation/
• https://www.learndatasci.com/glossary/binary-classification/
• https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
• https://www.geeksforgeeks.org/k-means-clustering-introduction/
• https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
• https://www.tutorialspoint.com/python_data_science/python_time_series.htm