
RUNGTA COLLEGE OF ENGINEERING & TECHNOLOGY, BHILAI

RUNGTA COLLEGE OF ENGINEERING & TECHNOLOGY


DEPARTMENT OF CSE (Data Science)

LAB MANUAL

DATA ANALYTICS &


VISUALIZATION LAB

SEMESTER - 5 th

RUNGTA COLLEGE
Rungta Educational Campus,
Kohka-Kurud Road, Bhilai,
Chhattisgarh, India
Phone No. 0788-6666666
MANAGED BY: SANTOSH RUNGTA GROUP OF INSTITUTIONS

Prepared By
Mr. LAKSHMAN SAHU
(Assistant Professor)

PREPARED AS PER THE SYLLABUS PRESCRIBED BY

CHHATTISGARH SWAMI VIVEKANAND TECHNICAL UNIVERSITY, BHILAI


List of DOs & DON’Ts.

(Give instructions as per Department of CSE Data Science Laboratories)

DOs:

▪ Remove your shoes outside the laboratory.

▪ Come to the lab prepared for the experiment to be performed.

▪ Take help from the Manual / Work Book for preparation of the experiment.

▪ For any abnormal working of the machine consult the Faculty In-charge/ Lab
Assistant.

▪ Shut down the machine and switch off the power supply after performing the
experiment.

▪ Maintain silence and proper discipline in the lab.

▪ Enter your machine number in the Login register.

DON’Ts:

▪ Do not bring any magnetic material in the lab.

▪ Do not eat or drink anything in the lab.

▪ Do not tamper with the instruments in the lab and do not disturb their settings.


LIST OF EXPERIMENTS
AS PER THE SYLLABUS PRESCRIBED BY THE UNIVERSITY


LIST OF EXPERIMENTS
AS PER RUNGTA COLLEGE OF ENGINEERING & TECHNOLOGY

(MINIMUM 10% MORE THAN THE PRESCRIBED SYLLABUS)

Exp. No.  Name of Experiment

1.  Write a program in Python for cleaning and handling missing values in a dataset and data normalization.
2.  Write a program to perform descriptive statistics: mean, median, mode, variance, etc.
3.  Write a program for creating line charts, bar plots, scatter plots, and histograms, and plotting multiple graphs in a single figure.
4.  Write a program for hypothesis testing using t-tests, ANOVA, and chi-square tests.
5.  Write a program for regression analysis, fitting a linear model and making predictions.
6.  Write a program for binary classification using machine learning algorithms.
7.  Write a program for model evaluation using accuracy, precision, recall, and F1-score.
8.  Write a program to use the K-means clustering algorithm.
9.  Write a program for text pre-processing using tokenization, stop word removal, and stemming.
10. Write a program to work with time-series data in Python.
11. Write a program to work with NLTK in Python.
12. Write a program to work with NLP in Python.


Experiment No. 1

Aim: Write a program in Python for cleaning and handling missing values
in a dataset and data normalization.

Theory: Data cleaning is one of the important parts of machine learning. It plays a significant part in
building a model. It surely isn’t the fanciest part of machine learning and at the same time, there aren’t any
hidden tricks or secrets to uncover. However, the success or failure of a project relies on proper data
cleaning. Professional data scientists usually invest a very large portion of their time in this step because of
the belief that “Better data beats fancier algorithms”.

If we have a well-cleaned dataset, there is a good chance of achieving good results even with simple
algorithms, which can be very beneficial, especially in terms of computation when the dataset is large.
Obviously, different types of data will require different types of cleaning. However, the systematic approach
below can always serve as a good starting point.

Steps Involved in Data Cleaning

Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves identifying and
removing any missing, duplicate, or irrelevant data. The goal of data cleaning is to ensure that the data is
accurate, consistent, and free of errors, as incorrect or inconsistent data can negatively impact the
performance of the ML model.

Data cleaning, also known as data cleansing or data preprocessing, is a crucial step in the data science
pipeline that involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in the
data to improve its quality and usability. Data cleaning is essential because raw data is often noisy,
incomplete, and inconsistent, which can negatively impact the accuracy and reliability of the insights
derived from it.

The following are the most common steps involved in data cleaning:

Data Cleaning

• Import the necessary libraries


• Load the dataset
• Check the data information using df.info()

import pandas as pd
import numpy as np

# Load the dataset


df = pd.read_csv('train.csv')
df.head()

Output:

   PassengerId  Survived  Pclass                                              Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                           Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                            Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1      Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                          Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

1. Data inspection and exploration:

This step involves understanding the data by inspecting its structure and identifying missing values, outliers,
and inconsistencies.

• Check the duplicate rows.

df.duplicated()

Output:

0 False
1 False
...
889 False
890 False
Length: 891, dtype: bool
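If any duplicates were found, they could be counted and dropped. A minimal sketch, assuming the same df as above:

# Count duplicate rows and keep only the first occurrence of each row
print(df.duplicated().sum())
df = df.drop_duplicates()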

• Check the data information using df.info()

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

From the above data info, we can see that Age and Cabin have fewer non-null values than the other columns.
Some of the columns are categorical with data type object, while others hold integer and float values.

Let’s see the descriptive structure of the data using df.describe()

df.describe()

Output:

PassengerId Survived Pclass Age SibSp Parch Fare


count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Check the categorical and numerical columns

# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)

Output:

Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch',
'Fare']

Check the total number of unique values in the Categorical columns


df[cat_col].nunique()

Output:

Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64

2. Removal of unwanted observations

This includes deleting duplicate/ redundant or irrelevant values from your dataset. Duplicate observations
most frequently arise during data collection and Irrelevant observations are those that don’t actually fit the
specific problem that you’re trying to solve.

• Redundant observations greatly reduce efficiency, because the repeated data may add weight towards the correct side or
towards the incorrect side, thereby producing unreliable results.
• Irrelevant observations are any type of data that is of no use to us and can be removed directly.

Now we have to decide, based on the subject of analysis, which factors are important for our discussion.
Since machine learning models cannot work with raw text, we have to either drop the categorical columns or
convert their values into numerical types. Here we drop the Name column, because names are always unique
and have little influence on the target variable. For Ticket, let's first print the first 50 unique ticket values.

df['Ticket'].unique()[:50]

Output:

array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450',


'330877', '17463', '349909', '347742', '237736', 'PP 9549',
'113783', 'A/5. 2151', '347082', '350406', '248706', '382652',
'244373', '345763', '2649', '239865', '248698', '330923', '113788',
'347077', '2631', '19950', '330959', '349216', 'PC 17601',
'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789', '2677',
'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371', '14311',
'2662', '349237', '3101295'], dtype=object)

From the above tickets, we can observe that many values are composed of two parts; for example, 'A/5 21171'
joins 'A/5' and '21171', and such parts may influence the target variable. Splitting them would be a case of
feature engineering, where we derive new features from a column or a group of columns. In the current case,
we simply drop the "Name" and "Ticket" columns.

Drop Name and Ticket columns.

df1 = df.drop(columns=['Name','Ticket'])
df1.shape

Output:

(891, 10)

3. Handling missing data:

Missing data is a common issue in real-world datasets, and it can occur due to various reasons such as
human errors, system failures, or data collection issues. Various techniques can be used to handle missing
data, such as imputation, deletion, or substitution.

Let's check the percentage of missing values column-wise. df.isnull() checks whether each value is null and
returns boolean values, .sum() adds up the number of null rows per column, and dividing by the total number
of rows and multiplying by 100 gives the percentage of null values, i.e. how many values per 100 are null.

round((df1.isnull().sum()/df1.shape[0])*100,2)

Output:

PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64

We cannot just ignore or remove the missing observations. They must be handled carefully, as they can be an
indication of something important.

The two most common ways to deal with missing data are:

• Dropping observations with missing values.


o The fact that the value was missing may be informative in itself.
o Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!

As we can see from the above result, Cabin has 77.1% null values, Age has 19.87%, and Embarked has 0.22%.
It is not a good idea to fill in 77% null values, so we drop the Cabin column. The Embarked column has only
0.22% null values, so we drop the rows where Embarked is null.

df2 = df1.drop(columns='Cabin')
df2.dropna(subset=['Embarked'], axis=0, inplace=True)
df2.shape

Output:

(889, 9)

• Imputing the missing values from past observations.


o Again, “missingness” is almost always informative in itself, and you should tell your algorithm if a value was
missing.
o Even if you build a model to impute your values, you’re not adding any real information. You’re just reinforcing
the patterns already provided by other features.

From the describe table above, we can see that there is very little difference between the mean and the median
of Age, i.e. 29.7 and 28. So here we can use either mean imputation or median imputation.

Note:

• Mean imputation is suitable when the data is normally distributed and has no extreme outliers.
• Median imputation is preferable when the data contains outliers or is skewed.

# Mean imputation
df3 = df2.fillna(df2.Age.mean())
# Let's check the null values again
df3.isnull().sum()

Output:

PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
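Since the Age column turns out to contain outliers (see the next step), median imputation is often the safer choice. A minimal sketch, assuming the same df2 as above:

# Median imputation as an alternative to the mean
df3_median = df2.fillna({'Age': df2['Age'].median()})
print(df3_median.isnull().sum())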

4. Handling outliers:

Outliers are extreme values that deviate significantly from the majority of the data. They can negatively
impact the analysis and model performance. Techniques such as clustering, interpolation, or transformation
can be used to handle outliers.

To check the outliers, We generally use a box plot. A box plot, also referred to as a box-and-whisker plot, is
a graphical representation of a dataset’s distribution. It shows a variable’s median, quartiles, and potential
outliers. The line inside the box denotes the median, while the box itself denotes the interquartile range
(IQR). The whiskers extend to the most extreme non-outlier values within 1.5 times the IQR. Individual
points beyond the whiskers are considered potential outliers. A box plot offers an easy-to-understand
overview of the range of the data and makes it possible to identify outliers or skewness in the distribution.

Let’s plot the box plot for Age column data.

import matplotlib.pyplot as plt

plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()

Output:

Box plot of the Age column

As we can see from the box-and-whisker plot above, the Age column contains outlier values: ages less than 5
and greater than 55 are outliers.

# calculate summary statistics


mean = df3['Age'].mean()
std = df3['Age'].std()

# Calculate the lower and upper bounds


lower_bound = mean - std*2
upper_bound = mean + std*2

print('Lower Bound :',lower_bound)


print('Upper Bound :',upper_bound)

# Drop the outliers


df4 = df3[(df3['Age'] >= lower_bound)
& (df3['Age'] <= upper_bound)]

Output:

Lower Bound : 3.705400107925648


Upper Bound : 55.578785285332785

Similarly, we can remove the outliers of the remaining columns.
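The bounds above use the mean ± 2 standard deviations. The 1.5 × IQR rule described alongside the box plot can be applied in the same way; a minimal sketch, assuming the same df3 as above:

# IQR-based outlier bounds for the Age column
q1 = df3['Age'].quantile(0.25)
q3 = df3['Age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df4_iqr = df3[(df3['Age'] >= lower) & (df3['Age'] <= upper)]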

5. Data transformation

Data transformation involves converting the data from one form to another to make it more suitable for
analysis. Techniques such as normalization, scaling, or encoding can be used to transform the data.

• Data validation and verification: Data validation and verification involve ensuring that the data is accurate and
consistent by comparing it with external sources or expert knowledge.

For the machine learning prediction, we first separate the independent and target features. Here we will
consider only 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', and 'Embarked' as the independent features and
'Survived' as the target variable, because PassengerId does not affect the survival rate.

X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']

• Data formatting: Data formatting involves converting the data into a standard format or structure that can be easily
processed by the algorithms or models used for analysis. Here we will discuss commonly used data formatting techniques
i.e. Scaling and Normalization.

Scaling:
• Scaling involves transforming the values of features to a specific range. It maintains the shape of the original distribution
while changing the scale.
• Scaling is particularly useful when features have different scales, and certain algorithms are sensitive to the magnitude of
the features.
• Common scaling methods include Min-Max scaling and Standardization (Z-score scaling).

Min-Max Scaling:

• Min-Max scaling rescales the values to a specified range, typically between 0 and 1.
• It preserves the original distribution and ensures that the minimum value maps to 0 and the maximum value maps to 1.

from sklearn.preprocessing import MinMaxScaler

# initialising the MinMaxScaler


scaler = MinMaxScaler(feature_range=(0, 1))

# Numerical columns
num_col_ = [col for col in X.columns if X[col].dtype != 'object']
x1 = X.copy()  # work on a copy so the original X is not modified
# learning the statistical parameters for each of the data and transforming
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()

Output:

Pclass Sex Age SibSp Parch Fare Embarked


0 1.0 male 0.271174 0.125 0.0 0.014151 S
1 0.0 female 0.472229 0.125 0.0 0.139136 C
2 1.0 female 0.321438 0.000 0.0 0.015469 S
3 0.0 female 0.434531 0.125 0.0 0.103644 S
4 1.0 male 0.434531 0.000 0.0 0.015713 S

Standardization (Z-score scaling):

• Standardization transforms the values to have a mean of 0 and a standard deviation of 1.


• It centers the data around the mean and scales it based on the standard deviation.
• Standardization makes the data more suitable for algorithms that assume a Gaussian distribution or require features to
have zero mean and unit variance.

Z = (X - μ) / σ

Where,

• X = Data
• μ = Mean value of X
• σ = Standard deviation of X
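As an illustrative sketch (assuming the same X and num_col_ defined above), Z-score scaling can be done with scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# Standardize the numerical columns to zero mean and unit variance
std_scaler = StandardScaler()
x2 = X.copy()
x2[num_col_] = std_scaler.fit_transform(x2[num_col_])
x2.head()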

Some data cleansing tools:

• OpenRefine
• Trifacta Wrangler
• TIBCO Clarity
• Cloudingo
• IBM Infosphere Quality Stage

Advantages of Data Cleaning in Machine Learning:

1. Improved model performance: Data cleaning helps improve the performance of the ML model by removing errors,
inconsistencies, and irrelevant data, which can help the model to better learn from the data.
2. Increased accuracy: Data cleaning helps ensure that the data is accurate, consistent, and free of errors, which can help
improve the accuracy of the ML model.
3. Better representation of the data: Data cleaning allows the data to be transformed into a format that better represents the
underlying relationships and patterns in the data, making it easier for the ML model to learn from the data.
4. Improved data quality: Data cleaning helps to improve the quality of the data, making it more reliable and accurate. This
ensures that the machine learning models are trained on high-quality data, which can lead to better predictions and
outcomes.
5. Improved data security: Data cleaning can help to identify and remove sensitive or confidential information that could
compromise data security. By eliminating this information, data cleaning can help to ensure that only the necessary and
relevant data is used for machine learning.

Disadvantages of Data Cleaning in Machine Learning:

1. Time-consuming: Data cleaning can be a time-consuming task, especially for large and complex datasets.
2. Error-prone: Data cleaning can be error-prone, as it involves transforming and cleaning the data, which can result in the
loss of important information or the introduction of new errors.
3. Limited understanding of the data: Data cleaning can lead to a limited understanding of the data, as the transformed data
may not be representative of the underlying relationships and patterns in the data.
4. Data loss: Data cleaning can result in the loss of important information that may be valuable for machine learning
analysis. In some cases, data cleaning may result in the removal of data that appears to be irrelevant or inconsistent, but
which may contain valuable insights or patterns.
5. Cost and resource-intensive: Data cleaning can be a resource-intensive process that requires significant time, effort, and
expertise. It can also require the use of specialized software tools, which can add to the cost and complexity of data
cleaning.
6. Overfitting: Overfitting occurs when a machine learning model is trained too closely on a particular dataset, resulting in
poor performance when applied to new or different data. Data cleaning can inadvertently contribute to overfitting by
removing too much data, leading to a loss of information that could be important for model training and performance.

Sample Source Code:


import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the dataset (replace 'your_dataset.csv' with the actual file path)
df = pd.read_csv('your_dataset.csv')

# Display the initial state of the dataset


print("Initial Dataset:")
print(df.head())

# Handling missing values


df.dropna(inplace=True) # Remove rows with missing values

# Display the dataset after handling missing values


print("\nDataset after Handling Missing Values:")
print(df.head())

# Data normalization using Min-Max scaling


scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(df.iloc[:, 1:])  # assuming the first column is not numeric

# Create a new DataFrame with normalized data


normalized_df = pd.DataFrame(normalized_data, columns=df.columns[1:])

# Display the normalized dataset


print("\nNormalized Dataset:")
print(normalized_df.head())


Experiment No. 2

Aim: Write a program to perform descriptive statistics: mean, median,


mode, variance, etc.

Theory: Descriptive Statistics is the building block of data science. Advanced analytics is often
incomplete without analyzing descriptive statistics of the key metrics. In simple terms, descriptive statistics
can be defined as the measures that summarize a given data, and these measures can be broken down further
into the measures of central tendency and the measures of dispersion.

Measures of central tendency include mean, median, and the mode, while the measures of variability include
standard deviation, variance, and the interquartile range. In this guide, you will learn how to compute these
measures of descriptive statistics and use them to interpret the data.

We will cover the topics given below:

1. Mean
2. Median
3. Mode
4. Standard Deviation
5. Variance
6. Interquartile Range
7. Skewness

We will begin by loading the dataset to be used in this guide.

Data

In this guide, we will be using fictitious data of loan applicants containing 600 observations and 10
variables, as described below:

1. Marital_status: Whether the applicant is married ("Yes") or not ("No").


2. Dependents: Number of dependents of the applicant.
3. Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
4. Income: Annual Income of the applicant (in USD).
5. Loan_amount: Loan amount (in USD) for which the application was submitted.
6. Term_months: Tenure of the loan (in months).
7. Credit_score: Whether the applicant's credit score was good ("Satisfactory") or not ("Not_satisfactory").
8. Age: The applicant’s age in years.
9. Sex: Whether the applicant is female (F) or male (M).
10. approval_status: Whether the loan application was approved ("Yes") or not ("No").

Let's start by loading the required libraries and the data.

import pandas as pd
import numpy as np
import statistics as st

# Load the data


df = pd.read_csv("data_desc.csv")
print(df.shape)

print(df.info())

Output:

(600, 10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 10 columns):
Marital_status 600 non-null object
Dependents 600 non-null int64
Is_graduate 600 non-null object
Income 600 non-null int64
Loan_amount 600 non-null int64
Term_months 600 non-null int64
Credit_score 600 non-null object
approval_status 600 non-null object
Age 600 non-null int64
Sex 600 non-null object
dtypes: int64(5), object(5)
memory usage: 47.0+ KB
None

Five of the variables are categorical (labelled as 'object') while the remaining five are numerical (labelled as
'int').

Measures of Central Tendency

Measures of central tendency describe the center of the data, and are often represented by the mean, the
median, and the mode.

Mean

Mean represents the arithmetic average of the data. The line of code below prints the mean of the numerical
variables in the data. From the output, we can infer that the average age of the applicant is 49 years, the
average annual income is USD 705,541, and the average tenure of loans is 183 months. The command
df.mean(axis = 0) will also give the same output.

df.mean()

Output:

Dependents 0.748333
Income 705541.333333
Loan_amount 323793.666667
Term_months 183.350000
Age 49.450000

dtype: float64

It is also possible to calculate the mean of a particular variable in the data, as shown below, where we
calculate the mean of the variables 'Age' and 'Income'.

print(df.loc[:,'Age'].mean())
print(df.loc[:,'Income'].mean())

Output:

49.45
705541.33

In the previous sections, we computed the column-wise mean. It is also possible to calculate the mean of the
rows by specifying the (axis = 1) argument. The code below calculates the mean of the first five rows.

df.mean(axis = 1)[0:5]

Output:

0 70096.0
1 161274.0
2 125113.4
3 119853.8
4 120653.8
dtype: float64

Median

In simple terms, median represents the 50th percentile, or the middle value of the data, that separates the
distribution into two halves. The line of code below prints the median of the numerical variables in the data.
The command df.median(axis = 0) will also give the same output.

df.median()

Output:

Dependents 0.0
Income 508350.0
Loan_amount 76000.0
Term_months 192.0
Age 51.0
dtype: float64

From the output, we can infer that the median age of the applicants is 51 years, the median annual income is
USD 508,350, and the median tenure of loans is 192 months. There is a difference between the mean and the
median values of these variables, which is because of the distribution of the data. We will learn more about
this in the subsequent sections.

It is also possible to calculate the median of a particular variable in the data, as shown in the first two lines of
code below. We can also calculate the median of the rows by specifying the (axis = 1) argument. The third
line below calculates the median of the first five rows.

#to calculate a median of a particular column


print(df.loc[:,'Age'].median())
print(df.loc[:,'Income'].median())

df.median(axis = 1)[0:5]

Output:

51.0
508350.0

0 102.0
1 192.0
2 192.0
3 192.0
4 192.0
dtype: float64

Mode

Mode represents the most frequent value of a variable in the data. This is the only central tendency measure
that can be used with categorical variables, unlike the mean and the median which can be used only with
quantitative data.

The line of code below prints the mode of all the variables in the data. The .mode() function returns the most
common value or most repeated value of a variable. The command df.mode(axis = 0) will also give the
same output.

df.mode()

Output:

|   | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|---|----------------|------------|-------------|--------|-------------|-------------|--------------|-----------------|-----|-----|
| 0 | Yes            | 0          | Yes         | 333300 | 70000       | 192.0       | Satisfactory | Yes             | 55  | M   |

The interpretation of the mode is simple. The output above shows that most of the applicants are married, as
depicted by the 'Marital_status' value of "Yes". A similar interpretation can be made for the other categorical
variables like 'Sex' and 'Credit_score'. For numerical variables, the mode represents the value that occurs
most frequently. For example, the mode value of 55 for the variable 'Age' means that the highest number (or
frequency) of applicants are 55 years old.
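The topic list above also includes standard deviation, variance, the interquartile range, and skewness. A minimal sketch of computing these with pandas, using the Income column of the same loan dataset:

# Measures of dispersion for the Income column
print(df['Income'].var())    # variance
print(df['Income'].std())    # standard deviation
print(df['Income'].quantile(0.75) - df['Income'].quantile(0.25))  # interquartile range
print(df['Income'].skew())   # skewness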


Sample Source Code:

import numpy as np
from scipy import stats

# Sample data
data = [22, 25, 28, 31, 35, 25, 30, 27, 29, 32]

# Calculate mean, median, and mode


mean = np.mean(data)
median = np.median(data)
mode = stats.mode(data, keepdims=True).mode[0]  # keepdims=True keeps array output across SciPy versions

# Calculate variance and standard deviation


variance = np.var(data)
std_deviation = np.std(data)

# Display descriptive statistics


print("Data:", data)
print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)
print("Variance:", variance)
print("Standard Deviation:", std_deviation)

Output:


Experiment No. 3

Aim: Write a program for Creating line charts, bar plots, scatter plots, and
histograms, Plotting multiple graphs in a single figure.

Theory: Matplotlib is a data visualization library in Python. pyplot, a sublibrary of Matplotlib, is a
collection of functions that helps in creating a variety of charts. Line charts are used to represent the
relationship between two data series, X and Y, plotted on the two axes. Here we will see some examples of a
line chart in Python:

Line plots

First import Matplotlib.pyplot library for plotting functions. Also, import the Numpy library as per
requirement. Then define data values x and y.

# importing the required libraries


import matplotlib.pyplot as plt
import numpy as np

# define data values


x = np.array([1, 2, 3, 4]) # X-axis points
y = x*2 # Y-axis points

plt.plot(x, y) # Plot the chart


plt.show() # display

Output:

Simple line plot between X and Y data

We can see in the above output image that there are no labels on the x-axis and y-axis. Since labeling is
necessary for understanding the chart dimensions, the following example shows how to add labels and a title
to the chart.
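A minimal sketch, reusing the same x and y arrays, that adds the labels and a title:

plt.plot(x, y)
plt.xlabel("X-axis values")   # label for the horizontal axis
plt.ylabel("Y-axis values")   # label for the vertical axis
plt.title("Simple line chart")
plt.show()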

Bar Plots

A bar plot or bar chart is a graph that represents the category of data with rectangular bars with lengths and
heights that is proportional to the values which they represent. The bar plots can be plotted horizontally or
vertically. A bar chart describes the comparisons between the discrete categories. One of the axis of the plot
represents the specific categories being compared, while the other axis represents the measured values

corresponding to those categories.

Creating a bar plot

The matplotlib API in Python provides the bar() function, which can be used in MATLAB style or through the
object-oriented API. The syntax of the bar() function is as follows:

plt.bar(x, height, width, bottom, align)

The function creates a bar plot bounded with a rectangle depending on the given parameters. Following is a
simple example of the bar plot, which represents the number of students enrolled in different courses of an
institute.

import numpy as np
import matplotlib.pyplot as plt

# creating the dataset


data = {'C':20, 'C++':15, 'Java':30,
'Python':35}
courses = list(data.keys())
values = list(data.values())

fig = plt.figure(figsize = (10, 5))

# creating the bar plot


plt.bar(courses, values, color ='maroon',
width = 0.4)

plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")
plt.title("Students enrolled in different courses")
plt.show()

Output-

Here, plt.bar(courses, values, color='maroon') specifies that the bar chart is plotted using the courses as the
X-axis and the values as the Y-axis. The color attribute sets the color of the bars (maroon in this case).
plt.xlabel("Courses offered") and plt.ylabel("No. of students enrolled") label the corresponding axes,
plt.title() sets the title of the graph, and plt.show() displays the graph produced by the previous commands.

Scatter plots

Scatter plots use dots to represent the relationship between variables; they are widely used to observe how a
change in one variable affects another. The scatter() method in the matplotlib library is used to draw a
scatter plot.
Syntax
The syntax for scatter() method is given below:

matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None, marker=None, cmap=None,


vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None)

The scatter() method takes in the following parameters:

• x_axis_data- An array containing x-axis data


• y_axis_data- An array containing y-axis data
• s- marker size (can be scalar or array of size equal to size of x or y)
• c- color of sequence of colors for markers
• marker- marker style
• cmap- cmap name
• linewidths- width of marker border
• edgecolor- marker border color
• alpha- blending value, between 0 (transparent) and 1 (opaque)

Except x_axis_data and y_axis_data all other parameters are optional and their default value is None. Below
are the scatter plot examples with various parameters.
Example 1: This is the most basic example of a scatter plot.

import matplotlib.pyplot as plt

x =[5, 7, 8, 7, 2, 17, 2, 9,
4, 11, 12, 9, 6]

y =[99, 86, 87, 88, 100, 86,


103, 87, 94, 78, 77, 85, 86]

plt.scatter(x, y, c ="blue")

# To show the plot


plt.show()

Output


Histogram

A histogram is basically used to represent data provided in the form of groups. It is an accurate method for
the graphical representation of numerical data distribution. It is a type of bar plot where the X-axis represents
the bin ranges while the Y-axis gives information about frequency.

Creating a Histogram

To create a histogram, the first step is to create bins of the ranges, then distribute the whole range of the
values into a series of intervals and count the values which fall into each of the intervals. Bins are clearly
identified as consecutive, non-overlapping intervals of a variable. The matplotlib.pyplot.hist() function is used
to compute and create the histogram of x.

The following table shows the parameters accepted by matplotlib.pyplot.hist() function :

Attribute   Description
x           array or sequence of arrays
bins        optional parameter; contains integer or sequence or strings
density     optional parameter; contains boolean values
range       optional parameter; represents upper and lower range of bins
histtype    optional parameter; used to create type of histogram [bar, barstacked, step, stepfilled], default is "bar"
align       optional parameter; controls the plotting of histogram [left, right, mid]
weights     optional parameter; contains array of weights having same dimensions as x
bottom      location of the baseline of each bin
rwidth      optional parameter; relative width of the bars with respect to bin width
color       optional parameter; used to set color or sequence of color specs
label       optional parameter; string or sequence of strings to match with multiple datasets
log         optional parameter; used to set histogram axis on log scale

Let's create a basic histogram of some random values. The code below creates a simple histogram:

from matplotlib import pyplot as plt


import numpy as np

# Creating dataset
a = np.array([22, 87, 5, 43, 56,
73, 55, 54, 11,
20, 51, 5, 79, 31,
27])

# Creating histogram
fig, ax = plt.subplots(figsize =(10, 7))
ax.hist(a, bins = [0, 25, 50, 75, 100])

# Show plot
plt.show()

Output :


Plotting multiple graphs in a single figure


In Matplotlib, we can draw multiple graphs in a single figure in two ways. One is by using the subplot()
function, and the other is by superimposing the second graph on the first, i.e. all graphs appear on the same
axes. We will look into both ways one by one.

Multiple Plots using subplot () Function

A subplot () function is a wrapper function which allows the programmer to plot more than one graph in a
single figure by just calling it once.

Syntax: matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False, squeeze=True,


subplot_kw=None, gridspec_kw=None, **fig_kw)

Parameters:

1. nrows, ncols: These gives the number of rows and columns respectively. Also, it must be noted that both these
parameters are optional and the default value is 1.
2. sharex, sharey: These parameters specify the properties that are shared among the x and y axes. Possible values for
them can be row, col, none, or the default value, which is False.
3. squeeze: This parameter is a boolean value specified, which asks the programmer whether to squeeze out, meaning
remove the extra dimension from the array. It has a default value False.
4. subplot_kw: This parameters allow us to add keywords to each subplot and its default value is None.
5. gridspec_kw: This allows us to add grids on each subplot and has a default value of None.
6. **fig_kw: This allows us to pass any other additional keyword argument to the function call and has a default value of
None.

Example :

# importing libraries
import matplotlib.pyplot as plt
import numpy as np
import math

# Get the angles from 0 to 2 pi (360 degrees) as an ndarray


X = np.arange(0, math.pi*2, 0.05)

# Using the built-in trigonometric functions we can directly compute
# the corresponding waves for the given angles
Y1 = np.sin(X)
Y2 = np.cos(X)
Y3 = np.tan(X)
Y4 = np.tanh(X)

# Initialise the subplot function using number of rows and columns


figure, axis = plt.subplots(2, 2)

# For Sine Function


axis[0, 0].plot(X, Y1)
axis[0, 0].set_title("Sine Function")

# For Cosine Function


axis[0, 1].plot(X, Y2)
axis[0, 1].set_title("Cosine Function")

# For Tangent Function


axis[1, 0].plot(X, Y3)
axis[1, 0].set_title("Tangent Function")

# For Tanh Function


axis[1, 1].plot(X, Y4)

axis[1, 1].set_title("Tanh Function")

# Combine all the operations and display


plt.show()

Output

Multiple plots using subplot() function
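The second approach mentioned above, superimposing graphs on the same axes, can be sketched as follows (an illustrative example with two trigonometric curves):

import matplotlib.pyplot as plt
import numpy as np

X = np.arange(0, 2 * np.pi, 0.05)

# Both curves are drawn on the same axes; the legend tells them apart
plt.plot(X, np.sin(X), label="sin(X)")
plt.plot(X, np.cos(X), label="cos(X)")
plt.legend()
plt.show()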

In Matplotlib, there is another function very similar to subplot, called subplot2grid(). It is almost the same as
the subplot function but provides more flexibility to arrange the plot objects according to the needs of the
programmer (a short example follows the parameter list below).

This function is written as follows:

Syntax: matplotlib.pyplot.subplot2grid(shape, loc, rowspan=1, colspan=1, fig=None, **kwargs)

Parameter:

1. shape
This parameter is a sequence of two integer values which tells the shape of the grid for which we need to place the axes.
The first entry is for row, whereas the second entry is for column.
2. loc
Like the shape parameter, loc is a sequence of two integer values, where the first entry is the row and the second is the
column at which to place the axis within the grid.
3. rowspan
This parameter takes an integer value indicating the number of rows the axis should span, i.e. how far it extends
downwards.
4. colspan
This parameter takes an integer value indicating the number of columns the axis should span, i.e. how far it extends
towards the right.
5. fig
This is an optional parameter and takes Figure to place axis in. It defaults to current figure.
6. **kwargs
This allows us to pass any other additional keyword argument to the function call and has a default value of None.
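A minimal sketch of subplot2grid() on a 3 x 3 grid (an illustrative example, with arbitrarily chosen spans):

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.1)

# One wide axis across the top row, one tall axis on the left, one large axis on the right
ax1 = plt.subplot2grid((3, 3), (0, 0), colspan=3)
ax2 = plt.subplot2grid((3, 3), (1, 0), rowspan=2)
ax3 = plt.subplot2grid((3, 3), (1, 1), rowspan=2, colspan=2)

ax1.plot(x, np.sin(x))
ax2.plot(x, np.cos(x))
ax3.plot(x, np.sin(x) * np.cos(x))

plt.tight_layout()
plt.show()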


Experiment No. 4

Aim: Write a program for Hypothesis testing using t-tests, ANOVA, and
chi-square tests.
Statistics is an important part of data science, where we use statistical methods to draw assertions from
population data; to reason about the population, we make hypotheses about population parameters.
A hypothesis is a statement about a given problem.

What is Hypothesis Testing

Hypothesis testing is a statistical method that is used in making a statistical decision using experimental
data. Hypothesis testing is basically an assumption that we make about a population parameter. It evaluates
two mutually exclusive statements about a population to determine which statement is best supported by the
sample data.

Example: You say an average student in the class is 30 or a boy is taller than a girl. All of these is an
assumption that we are assuming and we need some statistical way to prove these. We need some
mathematical conclusion whatever we are assuming is true.

Need for Hypothesis Testing

Hypothesis testing is an important procedure in statistics. Hypothesis testing evaluates two mutually
exclusive population statements to determine which statement is most supported by sample data. When we
say that the findings are statistically significant, it is thanks to hypothesis testing.

Parameters of hypothesis testing

• Null hypothesis(H0): In statistics, the null hypothesis is a general given statement or default position that there is no
relationship between two measured cases or no relationship among groups. In other words, it is a basic assumption or
made based on the problem knowledge.

Example: A company production is = 50 units/per day etc.

• Alternative hypothesis(H1): The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to
the null hypothesis.

Example: A company’s production is not equal to 50 units/per day etc.

• Level of significance: It refers to the degree of significance at which we accept or reject the null hypothesis. 100%
accuracy is not possible for accepting a hypothesis, so we select a level of significance that is usually 5%. This is
normally denoted by alpha (α) and is generally 0.05 or 5%, which means your output should be 95% confident to give a
similar kind of result in each sample.
• P-value The P value, or calculated probability, is the probability of finding the observed/extreme results when the null
hypothesis(H0) of a study-given problem is true. If your P-value is less than the chosen significance level then you reject
the null hypothesis i.e. accept that your sample claims to support the alternative hypothesis.

Steps in Hypothesis Testing

• Step 1 – We first identify the problem about which we want to make an assumption, keeping in mind that the null and
alternative assumptions should contradict each other.
• Step 2 – We consider statistical assumptions, such as whether the data is normally distributed and whether the
observations are statistically independent.
• Step 3 – We decide the test data on which we will check our hypothesis.

• Step 4 – The data for the test is evaluated; in this step we compute various statistics such as the z-score and mean
values.
• Step 5 – In this stage, we decide whether to accept (fail to reject) or reject the null hypothesis.

Example: Given a coin, it is not known whether it is fair or tricky, so let's define the null and alternative
hypotheses:

• Null Hypothesis (H0): the coin is a fair coin.
• Alternative Hypothesis (H1): the coin is a tricky coin.
• Level of significance: α = 5% (0.05).
• Toss the coin the 1st time and assume the result is heads: P-value = 1/2 (as heads and tails have equal probability).
• Toss the coin a 2nd time and assume the result is again heads: now P-value = 1/2 × 1/2 = 1/4.

Similarly, we toss 6 consecutive times and get heads every time, so now P-value = (1/2)^6 = 1/64 ≈ 0.016. But we set our
significance level as the error rate we allow, and here we see we are below that level, i.e. our null hypothesis
does not hold, so we reject it and propose that this coin is a tricky coin, which is plausible because it gave us
6 consecutive heads.

Formula For Hypothesis Testing

To validate our hypothesis about a population parameter we use statistical functions. We use the z-score, the
p-value, and the level of significance (alpha) to provide evidence for our hypothesis:

z = (x̄ - μ) / (σ / √n)

where,

x̄ is the sample mean,

μ represents the population mean,

σ is the standard deviation, and

n is the size of the sample.

Python Implementation of Hypothesis Testing

We will use the scipy python library to compute the p-value and z-score for our sample dataset. Scipy is a
mathematical library in Python that is mostly used for mathematical equations and computations. In this
code, we will create a function hypothesis_test in which we will pass arguments like pop_mean(population
parameter upon which we are checking our hypothesis), sample dataset, level of confidence(alpha value),
and type of testing (whether it’s a one-tailed test or two-tailed test).

The information we are using in this Hypothesis test is

Level of significance (alpha) – 0.05

Null hypothesis – population mean = 5.0

Alternative hypothesis – population mean != 5.0

import numpy as np

from scipy.stats import norm

def hypothesis_test(sample, pop_mean, alpha=0.05, two_tailed=True):
    # length of the sample dataset
    n = len(sample)
    # mean and standard deviation of the sample
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)

    # Calculate the test statistic
    z = (sample_mean - pop_mean) / (sample_std / np.sqrt(n))

    # Calculate the p-value based on the test type
    if two_tailed:
        p_value = 2 * (1 - norm.cdf(abs(z)))
    else:
        if z < 0:
            p_value = norm.cdf(z)
        else:
            p_value = 1 - norm.cdf(z)

    # Determine whether to reject or fail to
    # reject the null hypothesis
    if p_value < alpha:
        result = "reject"
    else:
        result = "fail to reject"

    return z, p_value, result

Evaluate Hypothesis Function on Sample Dataset

To evaluate our hypothesis test function we will create a sample dataset of 20 points having 4.5 as the mean
and 2 as the standard deviation. Here, We will consider that our population has a mean equals to 5 .

np.random.seed(0)
sample = np.random.normal(loc=4.5, scale=2, size=20)
pop_mean = 5.0

# Test the null hypothesis that


# the population mean is equal to 5.0
z, p_value, result = hypothesis_test(sample, pop_mean)

print(f"Test statistic: {z:.4f}")


print(f"P-value: {p_value:.4f}")
print(f"Result: {result} null hypothesis at alpha=0.05")

Output :

Test statistic: 1.6372


P-value: 0.1016
Result: fail to reject null hypothesis at alpha=0.05

In the above example, we get a p-value of 0.1016 from the dataset, which is greater than our level of
significance (alpha = 0.05); hence, in this case, we fail to reject the null hypothesis that the population mean
is 5.0.

What if we reject the null hypothesis even though it is actually true, or accept it even though it is false? In
either case we are making an error. Based on the error we make, we define two types of errors.

Error in Hypothesis Testing

• Type I error: When we reject the null hypothesis, although that hypothesis was true. Type I error is denoted by alpha.
• Type II errors: When we accept the null hypothesis but it is false. Type II errors are denoted by beta.


• So far we have given an overview of hypothesis testing, what it is, and the errors related to it.
• Next, we discuss different techniques for hypothesis testing, mainly from a theoretical standpoint,
and when to use which.
• What is P-value?
• The job of the p-value is to decide whether we should accept our Null Hypothesis or reject it. The
lower the p-value, the more surprising the evidence is, the more ridiculous our null hypothesis
looks. And when we feel ridiculous about our null hypothesis we simply reject it and accept our
Alternate Hypothesis.
• If we found the p-value is lower than the predetermined significance value(often called alpha or
threshold value) then we reject the null hypothesis. The alpha should always be set before an
experiment to avoid bias.
• For example, we generally consider a large population data to be in Normal Distribution so while
selecting alpha for that distribution we select it as 0.05 (it means we are accepting if it lies in the 95
percent of our distribution). This means that if our p-value is less than 0.05 we will reject the null
hypothesis.

• The significance of the p-value comes into play after performing statistical tests, and knowing when to
use which technique is important. The following sections list when to perform which statistical
technique for hypothesis testing.
• Chi-Square Test
• The Chi-Square test is used when we perform hypothesis testing on two categorical variables from a
single population, i.e. to compare categorical variables from a single population. With it we find
whether there is any significant association between the two categorical variables.



• The hypothesis being tested for chi-square is
• Null: Variable A and Variable B are independent.
• Alternate: Variable A and Variable B are not independent.
• T-Test
• The T-test is an inferential statistic that is used to determine the difference between, or to compare, the
means of two groups of samples which may be related by certain features. It is performed on continuous
variables.
• There are three different versions of t-tests (a short sketch of the first and third follows below):
• → One-sample t-test, which tells whether the means of the sample and the population are different.
• → Two-sample t-test, also known as the independent t-test: it compares the means of two independent
groups and determines whether there is statistical evidence that the associated population means are
significantly different.
• → Paired t-test, used when you want to compare means of different samples from the same group, or
means from the same group at different times.
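A minimal sketch of the one-sample and paired variants using scipy.stats (the arrays below are illustrative, not from the manual; the independent two-sample version is demonstrated in the sample source code at the end of this experiment):

import numpy as np
from scipy import stats

before = np.array([120, 115, 130, 140, 125])   # illustrative paired measurements
after = np.array([118, 112, 128, 135, 121])

# One-sample t-test: is the mean of 'before' different from 128?
t1, p1 = stats.ttest_1samp(before, popmean=128)

# Paired t-test: the same subjects measured at two different times
t2, p2 = stats.ttest_rel(before, after)

print("One-sample t-test:", t1, p1)
print("Paired t-test:", t2, p2)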

• ANOVA Test
• It is also called an analysis of variance and is used to compare multiple (three or more) samples
with a single test. It is used when the categorical feature has more than two categories.
• The hypothesis being tested in ANOVA is
• Null: All samples are the same, i.e. all sample means are equal
• Alternate: At least one pair of samples is significantly different

Sample Source Code:


import numpy as np
from scipy import stats

# Generate sample data for t-test and ANOVA


group1 = np.array([25, 28, 30, 27, 32])
group2 = np.array([22, 23, 26, 29, 31])
group3 = np.array([18, 21, 24, 26, 30])

# Perform independent t-test


t_statistic, p_value = stats.ttest_ind(group1, group2)
print("Independent t-test:")
print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Perform one-way ANOVA


f_statistic, p_value_anova = stats.f_oneway(group1, group2, group3)
print("\nOne-way ANOVA:")
print("F-statistic:", f_statistic)
print("P-value:", p_value_anova)

# Create a contingency table for chi-square test


observed = np.array([[10, 15, 20],
[25, 30, 35],
[5, 10, 15]])

# Perform chi-square test


chi2_stat, p_value_chi2, dof, expected = stats.chi2_contingency(observed)
print("\nChi-square Test:")
print("Chi-square statistic:", chi2_stat)
print("P-value:", p_value_chi2)
print("Degrees of freedom:", dof)
print("Expected frequencies table:")
print(expected)

This program showcases hypothesis testing for different scenarios:

1. Independent t-test: It performs an independent t-test between group1 and group2 and prints the
calculated t-statistic and p-value.
2. One-way ANOVA: It performs a one-way ANOVA on group1, group2, and group3 and displays
the calculated F-statistic and p-value.
3. Chi-square Test: It performs a chi-square test on the provided contingency table (observed) and
prints the chi-square statistic, p-value, degrees of freedom, and the expected frequencies table.


Output:


Experiment No. 5

Aim: Write a program for Regression Analysis, fitting a linear model and
making predictions.

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature. It is one of the
most basic machine learning models that a machine learning enthusiast gets to know about. In linear
regression, we assume that the two variables i.e. dependent and independent variables are linearly related.
Hence, we try to find a linear function that predicts the response value(y) as accurately as possible as a
function of the feature or independent variable(x). Let us consider a dataset where we have a value of
response y for every feature x:

For generality, we define:

x as feature vector, i.e x = [x_1, x_2, …., x_n],

y as response vector, i.e y = [y_1, y_2, …., y_n]

for n observations (in the example used in the code below, n = 10). A scatter plot of the dataset looks like this:

Scatter plot for the randomly generated data

Now, the task is to find a line that fits best in the above scatter plot so that we can predict the response for
any new feature value (i.e. a value of x not present in the dataset). This line is called a regression line. The
equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i

Here,

• h(x_i) represents the predicted response value for the ith observation.

• b_0 and b_1 are regression coefficients and represent the y-intercept and slope of the regression line respectively.

To create our model, we must "learn" or estimate the values of the regression coefficients b_0 and b_1. Once
we have estimated these coefficients, we can use the model to predict responses!
In this article, we are going to use the principle of least squares.

Now consider:

y_i = b_0 + b_1 * x_i + e_i, i.e. e_i = y_i - h(x_i)

Here, e_i is the residual error in the ith observation. So, our aim is to minimize the total residual error. We define
the squared error or cost function J as:

J(b_0, b_1) = (1 / 2n) * Σ e_i^2

and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum. Without going into the
mathematical details, we present the result here:

b_1 = SS_xy / SS_xx
b_0 = ȳ - b_1 * x̄

where SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ (x_i - x̄)(y_i - ȳ) = Σ x_i*y_i - n*x̄*ȳ

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ (x_i - x̄)^2 = Σ x_i^2 - n*x̄^2

Python Implementation of Linear Regression

We can use the Python language to learn the coefficient of linear regression models. For plotting the input
data and best-fitted line we will use the matplotlib library. It is one of the most used Python libraries for
plotting graphs.

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
          \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:

Estimated coefficients:
b_0 = 1.2363636363636363
b_1 = 1.1696969696969697

And the graph obtained looks like this:


Scatterplot of the points along with the regression line

Sample Source Code:


import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate sample data


np.random.seed(0)
X = np.random.rand(50, 1) * 10 # Independent variable (feature)
y = 2 * X + 1 + np.random.randn(50, 1) * 2 # Dependent variable (target)

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Fit a linear regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Plot the data and regression line


plt.scatter(X, y, label="Original Data")
plt.plot(X_test, y_pred, color='red', label="Regression Line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression")
plt.legend()
plt.show()

In this program:

1. We generate synthetic data for demonstration purposes, where the relationship between X and y
follows a linear model (y = 2X + 1 with some added noise).
2. We split the data into training and testing sets using train_test_split.
3. We create a LinearRegression model and fit it to the training data using fit.
4. We use the trained model to make predictions on the testing data using predict.
5. We plot the original data points and the regression line to visualize the linear regression.
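As a short follow-up (a minimal sketch that assumes the model, X_test, y_test and y_pred objects created in the
program above), the fitted intercept and slope can be inspected and the model can be applied to a new, unseen
value of X:

import numpy as np
from sklearn.metrics import r2_score

# learned coefficients (should be close to the true values 1 and 2 used to generate the data)
print("Intercept (b_0):", model.intercept_[0])
print("Slope (b_1):", model.coef_[0][0])

# goodness of fit on the held-out test set
print("R^2 on test data:", r2_score(y_test, y_pred))

# predict the response for a hypothetical new observation x = 7.5
x_new = np.array([[7.5]])
print("Prediction for x = 7.5:", model.predict(x_new)[0][0])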


Output:


Experiment No. 6

Aim: Write a program for binary Classification using Machine Learning Algorithms.

Theory: In machine learning, binary classification is a supervised learning algorithm that categorizes
new observations into one of two classes.

The following are a few binary classification applications, where the 0 and 1 columns are two possible
classes for each observation:

Application                Observation       0            1
Medical Diagnosis          Patient           Healthy      Diseased
Email Analysis             Email             Not Spam     Spam
Financial Data Analysis    Transaction       Not Fraud    Fraud
Marketing                  Website visitor   Won't Buy    Will Buy
Image Classification       Image             Hotdog       Not Hotdog

Quick example

In a medical diagnosis, a binary classifier for a specific disease could take a patient's symptoms as input
features and predict whether the patient is healthy or has the disease. The possible outcomes of the diagnosis
are positive and negative.

Evaluation of binary classifiers

If the model successfully predicts the patients as positive, this case is called True Positive (TP). If the model
successfully predicts patients as negative, this is called True Negative (TN). The binary classifier may
misdiagnose some patients as well. If a diseased patient is classified as healthy by a negative test result, this
error is called False Negative (FN). Similarly, if a healthy patient is classified as diseased by a positive test
result, this error is called False Positive (FP).

We can evaluate a binary classifier based on the following parameters:

• True Positive (TP): The patient is diseased and the model predicts "diseased"
• False Positive (FP): The patient is healthy but the model predicts "diseased"
• True Negative (TN): The patient is healthy and the model predicts "healthy"
• False Negative (FN): The patient is diseased and the model predicts "healthy"

After obtaining these values, we can compute the accuracy score of the binary classifier as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The following confusion matrix layout represents the above parameters:

                      Predicted Positive   Predicted Negative
Actual Positive              TP                    FN
Actual Negative              FP                    TN

In machine learning, many methods utilize binary classification. The most common are:

• Support Vector Machines


• Naive Bayes
• Nearest Neighbor

• Decision Trees
• Logistic Regression
• Neural Networks

The following Python example will demonstrate using binary classification in a logistic regression problem.

A Python example for binary classification

For our data, we will use the breast cancer dataset from scikit-learn. This dataset contains tumor
observations and corresponding labels for whether the tumor was malignant or benign.

First, we'll import a few libraries and then load the data. When loading the data, we'll specify
as_frame=True so we can work with pandas objects.

import matplotlib.pyplot as plt


from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)

The dataset contains a DataFrame for the observation data and a Series for the target data.

Let's see what the first few rows of observations look like:

dataset['data'].head()
Out:
   mean radius  mean texture  mean perimeter  mean area  ...  worst symmetry  worst fractal dimension
0        17.99         10.38          122.80     1001.0  ...          0.4601                  0.11890
1        20.57         17.77          132.90     1326.0  ...          0.2750                  0.08902
2        19.69         21.25          130.00     1203.0  ...          0.3613                  0.08758
3        11.42         20.38           77.58      386.1  ...          0.6638                  0.17300
4        20.29         14.34          135.10     1297.0  ...          0.2364                  0.07678

5 rows × 30 columns

The output shows five observations with a column for each feature we'll use to predict malignancy.

Now, for the targets:

dataset['target'].head()
Out:
0 0

1 0
2 0
3 0
4 0
Name: target, dtype: int32

The targets for the first five observations are all zero, meaning the tumors are malignant. Here’s how many
malignant and benign tumors are in our dataset:

dataset['target'].value_counts()
Out:
1 357
0 212
Name: target, dtype: int64

So we have 357 benign tumors, denoted as 1, and 212 malignant tumors, denoted as 0. Thus, we have a binary
classification problem.

To perform binary classification using logistic regression with sklearn, we must accomplish the following
steps.

Step 1: Define explanatory and target variables

We'll store the rows of observations in a variable X and the corresponding class of those observations (0 or
1) in a variable y.

X = dataset['data']
y = dataset['target']

Step 2: Split the dataset into training and testing sets

We use 75% of data for training and 25% for testing. Setting random_state=0 will ensure your results are
the same as ours.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                     random_state=0)

Step 3: Normalize the data for numerical stability

Note that we normalize after splitting the data. It's good practice to fit the scaler on the training data only
and then apply the same fitted transformation to the test data, which prevents information from the test set
leaking into training.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Step 4: Fit a logistic regression model to the training data

This step effectively trains the model to predict the targets from the data.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

Step 5: Make predictions on the testing data

With the model trained, we now ask the model to predict targets based on the test data.

predictions = model.predict(X_test)

Step 6: Calculate the accuracy score by comparing the actual values and predicted values.

We can now calculate how well the model performed by comparing the model's predictions to the true target
values, which we reserved in the y_test variable.

First, we'll calculate the confusion matrix to get the necessary parameters:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, predictions)

TN, FP, FN, TP = confusion_matrix(y_test, predictions).ravel()

print('True Positive(TP) = ', TP)


print('False Positive(FP) = ', FP)
print('True Negative(TN) = ', TN)
print('False Negative(FN) = ', FN)
Out:
True Positive(TP) = 86
False Positive(FP) = 2
True Negative(TN) = 51
False Negative(FN) = 4

With these values, we can now calculate an accuracy score:

accuracy = (TP + TN) / (TP + FP + TN + FN)

print('Accuracy of the binary classifier = {:0.3f}'.format(accuracy))


Out:
Accuracy of the binary classifier = 0.958
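For reference, the same value can be obtained directly from sklearn's accuracy_score instead of computing it
from the confusion-matrix counts. The snippet below is a minimal sketch that assumes the y_test and
predictions variables created in the steps above.

from sklearn.metrics import accuracy_score

# should print the same 0.958 obtained from the manual calculation
print('Accuracy = {:0.3f}'.format(accuracy_score(y_test, predictions)))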

Initializing each binary classifier

To quickly train each model in a loop, we'll initialize each model and store it by name in a dictionary:

models = {}

# Logistic Regression
from sklearn.linear_model import LogisticRegression
models['Logistic Regression'] = LogisticRegression()

# Support Vector Machines


from sklearn.svm import LinearSVC

models['Support Vector Machines'] = LinearSVC()

# Decision Trees
from sklearn.tree import DecisionTreeClassifier
models['Decision Trees'] = DecisionTreeClassifier()

# Random Forest
from sklearn.ensemble import RandomForestClassifier
models['Random Forest'] = RandomForestClassifier()

# Naive Bayes
from sklearn.naive_bayes import GaussianNB
models['Naive Bayes'] = GaussianNB()

# K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
models['K-Nearest Neighbor'] = KNeighborsClassifier()

Performance evaluation of each binary classifier

Now that we've initialized the models, we'll loop over each one, train it by calling .fit(), make predictions,
calculate metrics, and store each result in a dictionary.

from sklearn.metrics import accuracy_score, precision_score, recall_score

accuracy, precision, recall = {}, {}, {}

for key in models.keys():
    # Fit the classifier
    models[key].fit(X_train, y_train)

    # Make predictions
    predictions = models[key].predict(X_test)

    # Calculate metrics
    accuracy[key] = accuracy_score(predictions, y_test)
    precision[key] = precision_score(predictions, y_test)
    recall[key] = recall_score(predictions, y_test)

With all metrics stored, we can use pandas to view the data as a table:

import pandas as pd

df_model = pd.DataFrame(index=models.keys(),
                        columns=['Accuracy', 'Precision', 'Recall'])
df_model['Accuracy'] = accuracy.values()
df_model['Precision'] = precision.values()
df_model['Recall'] = recall.values()

df_model
Out:
                         Accuracy  Precision    Recall
Logistic Regression      0.958042   0.955556  0.977273
Support Vector Machines  0.937063   0.933333  0.965517
Decision Trees           0.902098   0.866667  0.975000
Random Forest            0.972028   0.966667  0.988636
Naive Bayes              0.937063   0.955556  0.945055
K-Nearest Neighbor       0.951049   0.988889  0.936842

Finally, here's a quick bar chart to compare the classifiers' performance:

ax = df_model.plot.barh()
ax.legend(
ncol=len(models.keys()),
bbox_to_anchor=(0, 1),
loc='lower left',
prop={'size': 14}
)
plt.tight_layout()

Sample Source Code:


# Importing necessary libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int) # Binary classification: Setosa or not

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the Logistic Regression model


logreg_model = LogisticRegression()
logreg_model.fit(X_train_scaled, y_train)

# Initialize and train the Random Forest model


rf_model = RandomForestClassifier()
rf_model.fit(X_train_scaled, y_train)

# Predictions on the test set


logreg_pred = logreg_model.predict(X_test_scaled)
rf_pred = rf_model.predict(X_test_scaled)

# Evaluate model performance


logreg_accuracy = accuracy_score(y_test, logreg_pred)
rf_accuracy = accuracy_score(y_test, rf_pred)

# Print accuracy
print("Logistic Regression Accuracy:", logreg_accuracy)
print("Random Forest Accuracy:", rf_accuracy)

Output:
Logistic Regression Accuracy: 1.0
Random Forest Accuracy: 1.0


Experiment No. 7

Aim: Write a program for Model evaluation using accuracy, precision, recall, and F1-score.

Theory: Classification models are used in classification problems to predict the target class of the data
sample. The classification model predicts the probability that each instance belongs to one class or another.
It is important to evaluate the performance of the classifications model in order to reliably use these models
in production for solving real-world problems. Performance measures in machine learning classification
models are used to assess how well machine learning classification models perform in a given context.
These performance metrics include accuracy, precision, recall, and F1-score. Understanding model
performance is essential because it helps us see the strengths and limitations of these models when making
predictions in new situations. In this experiment, we will explore these four classification performance
metrics through Python sklearn examples.

• Accuracy score
• Precision score
• Recall score
• F1-Score

As a data scientist, you must get a good understanding of concepts related to the above in relation to
measuring classification models’ performance. Before we get into the details of the performance metrics as
listed above, let's understand key terminologies such as true positive, false positive, true negative and false
negative with the help of confusion matrix. These terminologies will be used across different performance
metrics.


Terminologies – True Positive, False Positive, True Negative, False Negative

Before we get into the definitions, let's work with the sklearn breast cancer dataset for classifying whether a
particular instance of data belongs to benign or malignant breast cancer class. You can load the dataset
using the following code:

import pandas as pd
import numpy as np
from sklearn import datasets
#
# Load the breast cancer data set
#
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target

The target labels in the breast cancer dataset are Benign (1) and Malignant (0). There are 212 records with
labels as malignant and 357 records with labels as benign. Let’s create a training and test split where 30% of
the dataset is set aside for testing purposes.

from sklearn.model_selection import train_test_split
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=1, stratify=y)

Splitting the breast cancer dataset into training and test sets results in a test set consisting of 107 records
labeled as benign and 64 records labeled as malignant. Treating benign (label 1) as the positive class, the
actual positives are 107 records and the actual negatives are 64 records. Let's train the model and get the
confusion matrix. Here is the code for training the model and printing the confusion matrix.

from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt
#
# Standardize the data set
#
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)
#
# Fit the SVC model
#
svc = SVC(kernel='linear', C=10.0, random_state=1)
svc.fit(X_train, y_train)
#
# Get the predictions
#
y_pred = svc.predict(X_test)
#
# Calculate the confusion matrix
#
conf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
#
# Print the confusion matrix using Matplotlib
#
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, cmap=plt.cm.Oranges, alpha=0.3)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')

plt.xlabel('Predictions', fontsize=18)
plt.ylabel('Actuals', fontsize=18)
plt.title('Confusion Matrix', fontsize=18)
plt.show()

The following confusion matrix is printed:

             Predicted 0   Predicted 1
Actual 0          61             3
Actual 1           3           104

The results in this confusion matrix can be read in the following manner, given that label 1 is treated as the
positive class.

• True Positive (TP): True positive measures the extent to which the model correctly predicts the positive class. That is,
the model predicts that the instance is positive, and the instance is actually positive. True positives are relevant when we
want to know how many positives our model correctly predicts. For example, in a binary classification problem with
classes “A” and “B”, if our goal is to predict class “A” correctly, then a true positive would be the number of instances of
class “A” that our model correctly predicted as class “A”. Taking a real-world example, if the model is designed to
predict whether an email is spam or not, a true positive would occur when the model correctly predicts that an email is a
spam. The true positive rate is the percentage of all instances that are correctly classified as belonging to a certain class.
True positives are important because they indicate how well our model performs on positive instances. In the above
confusion matrix, out of 107 actual positives, 104 are correctly predicted positives. Thus, the value of True Positive is
104.
• False Positive (FP): False positives occur when the model predicts that an instance belongs to a class that it actually does
not. False positives can be problematic because they can lead to incorrect decision-making. For example, if a medical
diagnosis model has a high false positive rate, it may result in patients undergoing unnecessary treatment. False positives
can be detrimental to classification models because they lower the overall accuracy of the model. There are a few ways to
measure false positives, including false positive rates. The false positive rate is the proportion of all negative examples
that are predicted as positive. While false positives may seem like they would be bad for the model, in some cases they
can be desirable. For example, in medical applications, it is often better to err on the side of caution and have a few false
positives than to miss a diagnosis entirely. However, in other applications, such as spam filtering, false positives can be
very costly. Therefore, it is important to carefully consider the trade-offs involved when choosing between different
classification models. In the above example, the false positive represents the number of negatives (out of 64) that get
falsely predicted as positive. Out of 64 actual negatives, 3 are falsely predicted as positive. Thus, the value of False
Positive is 3.
• True Negative (TN): True negatives are the outcomes that the model correctly predicts as negative. For example, if the
model is predicting whether or not a person has a disease, a true negative would be when the model predicts that the
person does not have the disease and they actually don’t have the disease. True negatives are one of the measures used to
assess how well a classification model is performing. In general, a high number of true negatives indicates that the model
is performing well. True negative is used in conjunction with false negative, true positive, and false positive to compute a
variety of performance metrics such as accuracy, precision, recall, and F1 score. While true negative provides valuable
insight into the classification model’s performance, it should be interpreted in the context of other metrics to get a
complete picture of the model’s accuracy. Out of 64 actual negatives, 61 are correctly predicted negative. Thus, the value
of True Negative is 61.
• False Negative (FN): A false negative occurs when a model predicts an instance as negative when it is actually positive.
False negatives can be very costly, especially in the field of medicine. For example, if a cancer screening test predicts
that a patient does not have cancer when they actually do, this could lead to the disease progressing without treatment.
False negatives can also occur in other fields, such as security or fraud detection. In these cases, a false negative may

result in someone being granted access or approving a transaction that should not have been allowed. False negatives are
often more serious than false positives, and so it is important to take them into account when evaluating the performance
of a classification model. This value represents the number of positives (out of 107) that get falsely predicted as negative.
Out of 107 actual positives, 3 are falsely predicted as negative. Thus, the value of False Negative is 3.

Given the above definitions, let’s try and understand the concept of accuracy, precision, recall, and f1-score.

What is Precision Score?

The model precision score measures the proportion of positively predicted labels that are actually correct.
Precision is also known as the positive predictive value. Precision is used in conjunction with the recall to
trade-off false positives and false negatives. Precision is affected by the class distribution. If there are more
samples in the minority class, then precision will be lower. Precision can be thought of as a measure of
exactness or quality. If we want to minimize false positives, we would choose a model with high precision.
Conversely, if we want to minimize false negatives, we would choose a model with high recall. Precision is
mainly used when we need to predict the positive class and there is a greater cost associated with false
positives than with false negatives such as in medical diagnosis or spam filtering. For example, if a model is
99% accurate but only has 50% precision, that means that half of the time when it predicts an email is a
spam, it is actually not spam.

The precision score is a useful measure of the success of prediction when the classes are very
imbalanced. Mathematically, it represents the ratio of true positive to the sum of true positive and false
positive.

Precision Score = TP / (FP + TP)

From the above formula, you could notice that the value of false-positive would impact the precision score.
Thus, while building predictive models, you may choose to focus appropriately to build models with lower
false positives if a high precision score is important for the business requirements.

The precision score from the above confusion matrix will come out to be the following:

Precision score = 104 / (3 + 104) = 104/107 = 0.972

The same score can be obtained by using the precision_score method from sklearn.metrics

print('Precision: %.3f' % precision_score(y_test, y_pred))

Different real-world scenarios when precision scores can be used as evaluation metrics

The precision score can be used in the scenario where the machine learning model is required to identify all
positive examples without any false positives. For example, machine learning models are used in medical
diagnosis applications where the doctor wants machine learning models that will not provide a label of
pneumonia if the patient does not have this disease. Oncologists ideally want models that can identify all
cancerous lesions without any false-positive results, and hence one could use a precision score in such cases.
Note that a greater number of false positives will result in a lot of stress for the patients in general although
that may not turn out to be fatal from a health perspective. Further tests will be able to negate the false
positive prediction.

The other example where the precision score can be useful is credit card fraud detection. In credit card fraud
detection problems, classification models are evaluated using the precision score to determine how many
positive samples were correctly classified by the classification model. You would not like to have a high
number of false positives or else you might end up blocking many credit cards and hence a lot of frustrations
with the end-users.

Another example where you would want greater precision is spam filters. A greater number of false
positives in a spam filter would mean that one or more important emails could be tagged as spam and moved
to spam folders. This could hamper in so many different ways including impact on your day-to-day work.

What is Recall Score?

Model recall score represents the model’s ability to correctly predict the positives out of actual positives.
This is unlike precision which measures how many predictions made by models are actually positive out of
all positive predictions made. For example: If your machine learning model is trying to identify positive
reviews, the recall score would be what percent of those positive reviews did your machine learning model
correctly predict as a positive. In other words, it measures how good our machine learning model is at
identifying all actual positives out of all positives that exist within a dataset. Recall is also known as
sensitivity or the true positive rate.

The higher the recall score, the better the machine learning model is at identifying positive examples.
Conversely, a low recall score indicates that the model misses many of the positive examples.

Recall is often used in conjunction with other performance metrics, such as precision and accuracy, to get a
complete picture of the model’s performance. Mathematically, it represents the ratio of true positive to the
sum of true positive and false negative.

Recall Score = TP / (FN + TP)

From the above formula, you could notice that the value of false-negative would impact the recall score.
Thus, while building predictive models, you may choose to focus appropriately to build models with lower
false negatives if a high recall score is important for the business requirements.

The recall score from the above confusion matrix will come out to be the following:

Recall score = 104 / (3 + 104) = 104/107 = 0.972

The same score can be obtained by using the recall_score method from sklearn.metrics

print('Recall: %.3f' % recall_score(y_test, y_pred))

Recall score can be used in the scenario where the labels are not equally divided among classes. For
example, if there is a class imbalance ratio of 20:80 (imbalanced data), then the recall score will be more
useful than accuracy because it can provide information about how well the machine learning model
identified rarer events.

Different real-world scenarios when recall scores can be used as evaluation metrics

Recall score is an important metric to consider when measuring the effectiveness of your machine learning
models. It can be used in a variety of real-world scenarios, and it’s important to always aim to improve
recall and precision scores together. The following are examples of some real-world scenarios where recall
scores can be used as evaluation metrics:

• In medical diagnosis, the recall score should be extremely high; otherwise, a greater number of false negatives would
prove to be fatal to the lives of patients. A lower recall score would mean more false negatives, which essentially
would mean that some patients who are positive are termed as falsely negative. That would mean that patients would get
assured that he/she is not suffering from the disease and therefore he/she won’t take any further action. That could result
in the disease getting aggravated and prove fatal to life.
o Lets understand with an example of detection of breast cancer through mammography screening. ML
models can be trained on large datasets of mammography images to assist radiologists in interpreting them. A
high recall score is important in this scenario because it indicates that the model is able to correctly identify all

cases of breast cancer, including those that may be difficult for a human radiologist to detect. A model with a
low recall score may miss some cases of breast cancer, leading to delayed diagnosis and potentially worse
outcomes for patients.
• In manufacturing, it is important to identify defects in products as early as possible to avoid producing faulty products
and wasting resources. Machine learning models can be trained on large datasets of images or sensor data from the
production line to identify defects and anomalies. A high recall score is important in this scenario because it indicates
that the model is able to correctly identify all instances of defects, including those that may be rare or difficult to detect.
A model with a low recall score may miss some instances of defects, leading to faulty products and potential safety issues
for consumers. For example, in the automotive industry, machine learning models can be used to identify defects in
car parts such as engines or brakes. A high recall score is critical in ensuring that all defects are identified, allowing for
prompt repair or replacement of the faulty parts before they cause safety issues for drivers.
• In a credit card fraud detection system, you would want to have a higher recall score of the predictive models predicting
fraud transactions. A lower recall score would mean a higher false-negative which would mean greater fraud and hence
loss to business in terms of upset users.
• In sentiment analysis, the recall score determines how many relevant tweets or comments are found while the precision
score is the fraction of retrieved tweets that are actually tagged as positive. A high recall score will benefit from a
focused analysis.

Precision – Recall Tradeoff

The precision-recall tradeoff is a common issue that arises when evaluating the performance of a
classification model. Precision and recall are two metrics that are often used to evaluate the performance of a
classifier, and they are often in conflict with each other.

Precision measures the proportion of true positive predictions made by the model (i.e. the number of correct
positive predictions divided by the total number of positive predictions). It is a useful metric for evaluating
the model’s ability to avoid false positives.

Recall, on the other hand, measures the proportion of true positive cases that were correctly predicted by the
model (i.e. the number of correct positive predictions divided by the total number of true positive cases). It
is a useful metric for evaluating the model’s ability to avoid false negatives.

In general, increasing the precision of a model will decrease its recall, and vice versa. This is because
precision and recall are inversely related – improving one will typically result in a decrease in the other. For
example, a model with a high precision will make few false positive predictions, but it may also miss some
true positive cases. On the other hand, a model with a high recall will correctly identify most of the true
positive cases, but it may also make more false positive predictions.

In order to evaluate a classification model, it is important to consider both precision and recall, rather than
just one of these metrics. The appropriate balance between precision and recall will depend on the specific
goals and requirements of the model, as well as the characteristics of the dataset. In some cases, it may be
more important to have a high precision (e.g. in medical diagnosis), while in others, a high recall may be
more important (e.g. in fraud detection).

To balance precision and recall, practitioners often use the F1 score, which is a combination of the two
metrics. The F1 score is calculated as the harmonic mean of precision and recall, and it provides a balance
between the two metrics. However, even the F1 score is not a perfect solution, as it can be difficult to
determine the optimal balance between precision and recall for a given application.
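One way to see this tradeoff concretely is to sweep the decision threshold and plot precision against recall. The
snippet below is an illustrative sketch only; it assumes the svc model and the train/test split created earlier in
this experiment, and it uses decision_function scores because precision_recall_curve needs continuous scores
rather than hard labels.

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# continuous scores for the test set (higher means more confidently positive)
scores = svc.decision_function(X_test)

# precision and recall at every possible threshold
precision_vals, recall_vals, thresholds = precision_recall_curve(y_test, scores)

plt.plot(recall_vals, precision_vals)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall tradeoff')
plt.show()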

What is Accuracy Score?

Model accuracy is a machine learning classification model performance metric that is defined as the ratio of
true positives and true negatives to all positive and negative observations. In other words, accuracy tells us
how often we can expect our machine learning model will correctly predict an outcome out of the total
number of times it made predictions. For example: Let’s assume that you were testing your machine
learning model with a dataset of 100 records and that your machine learning model correctly predicted 90 of
those 100 instances. The accuracy metric, in this case, would be: (90/100) = 90%. The accuracy rate is great
but it doesn’t tell us anything about the errors our machine learning models make on new data we haven’t
seen before.

Mathematically, it represents the ratio of the sum of true positive and true negatives out of all the
predictions.

Accuracy Score = (TP + TN)/ (TP + FN + TN + FP)

The accuracy score from above confusion matrix will come out to be the following:

Accuracy score = (104 + 61) / (104 + 3 + 61 + 3) = 165/171 = 0.965

The same score can be obtained by using accuracy_score method from sklearn.metrics

print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))

Caution with Accuracy Metrics / Score

The following are some of the issues with accuracy metrics / score:

• The same accuracy score for two different models may hide very different behavior towards different classes.
• In the case of an imbalanced dataset, accuracy is not the most effective metric to use.

One should be cautious when relying on the accuracy metric alone to evaluate model performance. Consider two
models whose confusion matrices both yield an accuracy of 60%: one model may have a weak positive
recognition rate while the other has a strong positive recognition rate, even though the accuracy is 60% for
both. Thus, one needs to dig deeper into the confusion matrix to understand model performance beyond the
accuracy score.

The accuracy metrics is also not reliable for the models trained on imbalanced or skewed datasets. Take
a scenario of dataset with 95% imbalance (95% data is negative class). The accuracy of the classifier will be
very high as it will be correctly doing right prediction issuing negative most of the time. A better classifier
that actually deals with the class imbalance issue, is likely to have a worse accuracy metrics score. In such
scenario of imbalanced dataset, another metrics AUC (the area under ROC curve) is more robust than
the accuracy metrics score. The AUC takes into the consideration, the class distribution in imbalanced
dataset. The ROC curve is a plot that shows the relationship between the true positive rate and the false
positive rate of a classification model. The area under the ROC curve (AUC) is a metric that quantifies the
overall performance of the model. A model with a higher AUC is considered to be a better classifier. Also, a
much better way to evaluate the performance of a classifier is to look at the confusion matrix.
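As an illustration of the AUC point above, the area under the ROC curve can be computed with sklearn in a
couple of lines. This is a minimal sketch that assumes the svc model and the test split used earlier in this
experiment; decision_function scores are used because AUC needs ranked scores rather than hard class labels.

from sklearn.metrics import roc_auc_score

# ranked scores for the positive class on the test set
scores = svc.decision_function(X_test)

print('ROC AUC: %.3f' % roc_auc_score(y_test, scores))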

Accuracy metrics only considers the number of correct predictions (true positives and true negatives)
made by the model. It does not take into account the relative importance of different types of errors, such as
false positives and false negatives. For example, if a model is being used to predict whether a patient has a
certain disease, a false positive (predicting that a patient has the disease when they actually do not) may be
less severe than a false negative (predicting that a patient does not have the disease when they actually do).
In this case, using accuracy as the sole evaluation metric may not provide a clear picture of the model’s
performance.


What is F1-Score?

Model F1 score represents the model score as a function of precision and recall score. F-score is a machine
learning model performance metric that gives equal weight to both the Precision and Recall for measuring
its performance in terms of accuracy, making it an alternative to Accuracy metrics (it doesn’t require us to
know the total number of observations). It’s often used as a single value that provides high-level information
about the model’s output quality. This is a useful measure of the model in the scenarios where one tries to
optimize either of precision or recall score and as a result, the model performance suffers. The following
represents the aspects relating to issues with optimizing either precision or recall score:

• Optimizing for recall helps with minimizing the chance of not detecting a malignant cancer. However, this comes at the
cost of predicting malignant cancer in patients although the patients are healthy (a high number of FP).
• Optimize for precision helps with correctness if the patient has malignant cancer. However, this comes at the cost of
missing malignant cancer more frequently (a high number of FN).

Mathematically, it can be represented as a harmonic mean of precision and recall score.

F1 Score = 2 * Precision Score * Recall Score / (Precision Score + Recall Score)

The F1 score from the above confusion matrix will come out to be the following:

F1 score = (2 * 0.972 * 0.972) / (0.972 + 0.972) = 1.89 / 1.944 = 0.972

The same score can be obtained by using f1_score method from sklearn.metrics

print('F1 Score: %.3f' % f1_score(y_test, y_pred))

Sample Source Code:


# Importing necessary libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix)

# Load the Iris dataset


iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int) # Binary classification: Setosa or not

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the Logistic Regression model


model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Predictions on the test set


y_pred = model.predict(X_test_scaled)

# Calculate evaluation metrics

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print evaluation metrics


print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

# Print detailed classification report


print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Print confusion matrix


conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

Output:


Experiment No. 8
Aim: Write a program to use the K-means clustering algorithm.
The K-means clustering algorithm computes the centroids and iterates until it finds the optimal centroids. It
assumes that the number of clusters is already known. It is also called a flat clustering algorithm. The
number of clusters identified from the data by the algorithm is represented by ‘K’ in K-means.

In this algorithm, the data points are assigned to a cluster in such a manner that the sum of the squared
distance between the data points and centroid would be minimum. It is to be understood that less variation
within the clusters will lead to more similar data points within same cluster.

Working of K-Means Algorithm

We can understand the working of K-Means clustering algorithm with the help of following steps −

• Step 1 − First, we need to specify the number of clusters, K, to be generated by this algorithm.
• Step 2 − Next, randomly assign each data point to one of the K clusters. In simple
words, start from a random partition of the data into K groups.
• Step 3 − Now it will compute the cluster centroids.
• Step 4 − Next, keep iterating the following until we find the optimal centroids, i.e. until the assignment of
data points to the clusters no longer changes −

4.1 − First, the sum of squared distance between data points and centroids would be computed.

4.2 − Now, we have to assign each data point to the cluster that is closer than other cluster (centroid).

4.3 − At last compute the centroids for the clusters by taking the average of all data points of that cluster.

K-means follows Expectation-Maximization approach to solve the problem. The Expectation-step is used
for assigning the data points to the closest cluster and the Maximization-step is used for computing the
centroid of each cluster.
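The two steps can also be written out directly in NumPy. The following is only a minimal illustrative sketch
(not the scikit-learn implementation used below), assuming a small 2-D array of points; it shows the
Expectation step (assignment to the nearest centroid) and the Maximization step (centroid update).

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k distinct points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Expectation step: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization step: recompute each centroid as the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# tiny usage example with two obvious groups of points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = simple_kmeans(X, k=2)
print(centroids)
print(labels)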

While working with K-means algorithm we need to take care of the following things −

• While working with clustering algorithms including K-Means, it is recommended to standardize the
data because such algorithms use distance-based measurement to determine the similarity between
data points.
• Due to the iterative nature of K-Means and the random initialization of centroids, K-Means may get stuck in
a local optimum and may not converge to the global optimum. That is why it is recommended to use
different initializations of centroids (see the short sketch below).
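Both points can be handled in scikit-learn itself. The short sketch below is illustrative (the parameter values
are assumptions): it standardizes the features inside a pipeline and runs several random centroid
initializations via n_init, keeping the best run.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data just for the illustration
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

# scale the features, then run K-means with 10 different random initializations
pipeline = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=4, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)
print(labels[:10])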

Implementation in Python

The following two examples of implementing K-Means clustering algorithm will help us in its better
understanding −

Example 1

It is a simple example to understand how k-means works. In this example, we are going to first generate 2D
dataset containing 4 different blobs and after that will apply k-means algorithm to see the result.

First, we will start by importing the necessary packages −

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

The following code will generate the 2D, containing four blobs −

from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn

X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

Next, the following code will help us to visualize the dataset −

plt.scatter(X[:, 0], X[:, 1], s=20);


plt.show()

Next, make an object of KMeans along with providing number of clusters, train the model and do the
prediction as follows −

kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

Now, with the help of following code we can plot and visualize the cluster’s centers picked by k-means
Python estimator −

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')


centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9);
plt.show()

Example 2

Let us move to another example in which we are going to apply K-means clustering on simple digits dataset.
K-means will try to identify similar digits without using the original label information.

First, we will start by importing the necessary packages −

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
from sklearn.cluster import KMeans

Next, load the digit dataset from sklearn and make an object of it. We can also find number of rows and
columns in this dataset as follows −

from sklearn.datasets import load_digits


digits = load_digits()
digits.data.shape

Output
(1797, 64)

The above output shows that this dataset is having 1797 samples with 64 features.

We can perform the clustering as we did in Example 1 above −

kmeans = KMeans(n_clusters=10, random_state=0)


clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

Output
(10, 64)

The above output shows that K-means created 10 clusters with 64 features.

fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

Output

As output, we will get the following image showing the cluster centers learned by k-means.

The following lines of code will match the learned cluster labels with the true labels found in them −

from scipy.stats import mode

labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

Next, we can check the accuracy as follows −

from sklearn.metrics import accuracy_score


accuracy_score(digits.target, labels)

Output
0.7935447968836951

The above output shows that the accuracy is around 80%.

Advantages and Disadvantages


Advantages

The following are some advantages of K-Means clustering algorithms −

• It is very easy to understand and implement.


• If we have a large number of variables then K-means would be faster than Hierarchical clustering.
• On re-computation of centroids, an instance can change the cluster.
• Tighter clusters are formed with K-means as compared to Hierarchical clustering.

Disadvantages

The following are some disadvantages of K-Means clustering algorithms −

• It is a bit difficult to predict the number of clusters, i.e. the value of k (the elbow-method sketch below can help with this).
• Output is strongly impacted by initial inputs like the number of clusters (value of k).
• The order of the data can have a strong impact on the final output.
• It is very sensitive to rescaling. If we rescale our data by means of normalization or standardization, then the output will completely change.
• It is not good at clustering when the clusters have a complicated geometric shape.
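A common heuristic for choosing the value of k, mentioned in the first point above, is the elbow method: run
K-means for several values of k and plot the within-cluster sum of squared distances (inertia_), looking for the
point where the curve bends. The sketch below is illustrative only and uses synthetic data similar to the
earlier examples.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

inertias = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow method for choosing k')
plt.show()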

Applications of K-Means Clustering Algorithm

The main goals of cluster analysis are −

• To get a meaningful intuition from the data we are working with.


• Cluster-then-predict where different models will be built for different subgroups.

To fulfill the above-mentioned goals, K-means clustering is performing well enough. It can be used in
following applications −

• Market segmentation
• Document Clustering
• Image segmentation
• Image compression
• Customer segmentation
• Analyzing the trend on dynamic data


Sample Source Code:


# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic data


data, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Initialize and fit K-means model


num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters)
kmeans.fit(data)

# Predict cluster labels for the data points


labels = kmeans.predict(data)

# Get cluster centers


cluster_centers = kmeans.cluster_centers_

# Visualize the data and cluster centers


plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], marker='X', s=200, c='red')
plt.title("K-means Clustering")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

Output:


Experiment No. 9
Aim: Write a program for Text pre-processing using tokenization, stop word
removal, and stemming.
Text preprocessing is an essential step in natural language processing (NLP) that involves cleaning and
transforming unstructured text data to prepare it for analysis. It includes tokenization, stemming,
lemmatization, stop-word removal, and part-of-speech tagging. In this experiment, we will introduce the basics of
text preprocessing and provide Python code examples to illustrate how to implement these tasks using the
NLTK library. By the end, you will better understand how to prepare text data for NLP tasks.

What is Text Preprocessing in NLP?

Natural Language Processing (NLP) is a branch of Data Science which deals with Text data. Apart from
numerical data, Text data is available to a great extent which is used to analyze and solve business problems.
But before using the data for analysis or prediction, processing the data is important.

To prepare the text data for the model building we perform text preprocessing. It is the very first step of
NLP projects. Some of the preprocessing steps are:

• Removing punctuations like . , ! $( ) * % @


• Removing URLs
• Removing Stop words
• Lower casing
• Tokenization
• Stemming
• Lemmatization

SMS Spam Data for Text Preprocessing

We need to use the required steps based on our dataset. In this experiment, we will use SMS Spam data to
understand the steps involved in Text Preprocessing in NLP.

Let’s start by importing the pandas library and reading the data.
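The import-and-read step itself is not shown in this excerpt; the snippet below is a minimal sketch, assuming
the SMS Spam Collection file has been downloaded and saved locally as spam.csv (the file name and encoding
are assumptions, adjust them to your copy of the dataset).

import pandas as pd

# read the SMS spam dataset (file name and encoding are assumptions)
data = pd.read_csv('spam.csv', encoding='latin-1')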



#expanding the display of the text sms column
pd.set_option('display.max_colwidth', None)   # None means no truncation (-1 is deprecated in newer pandas)
#using only v1 and v2 column
data = data[['v1','v2']]
data.head()

The data has 5572 rows and 2 columns. You can check the shape of data using data.shape function. Let’s
check the dependent variable distribution between spam and ham.

#checking the count of the dependent variable


data['v1'].value_counts()

Steps to Clean the Data


Punctuation Removal

In this step, all the punctuation is removed from the text. The string library of Python contains a pre-defined
list of punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

#library that contains punctuation
import string
string.punctuation

#defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree

#storing the punctuation free text
data['clean_msg'] = data['v2'].apply(lambda x: remove_punctuation(x))
data.head()


We can see in the above output, all the punctuations are removed from v2 and stored in the clean_msg
column.

Lowering the Text

It is one of the most common text preprocessing Python steps where the text is converted into the same case
preferably lower case. But it is not necessary to do this step every time you are working on an NLP problem
as for some problems lower casing can lead to loss of information.

For example, if in any project we are dealing with the emotions of a person, then the words written in upper
cases can be a sign of frustration or excitement.

data['msg_lower']= data['clean_msg'].apply(lambda x: x.lower())

Output: All the text of clean_msg column are converted into lower case and stored in msg_lower column

Tokenization

In this step, the text is split into smaller units. We can use either sentence tokenization or word tokenization
based on our problem statement.

#defining function for tokenization
import re
def tokenization(text):
    tokens = re.split(r'\W+', text)
    return tokens

#applying function to the column
data['msg_tokenied'] = data['msg_lower'].apply(lambda x: tokenization(x))

Output: Sentences are tokenized into words.

Stop Word Removal

Stopwords are the commonly used words and are removed from the text as they do not add any value to the
analysis. These words carry less or no meaning.

NLTK library consists of a list of words that are considered stopwords for the English language. Some of
them are : [i, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours,
yourself, yourselves, he, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can,
will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren’t, could, couldn’t, didn’t, didn’t]

However, it is not necessary to use the provided list as-is; stopwords should be chosen wisely based on the
project. For example, 'how' can be a stop word for one model but can be important for another problem where
we are working on customer queries. We can create a customized list of stop words for different problems.

#importing the nltk library
import nltk
#stop words present in the library
stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
#defining the function to remove stopwords from tokenized text
def remove_stopwords(text):
    output = [i for i in text if i not in stopwords]
    return output
#applying the function
data['no_stopwords'] = data['msg_tokenized'].apply(lambda x: remove_stopwords(x))

Output: Stop words present in the NLTK list, such as in, until, to, I, here, are removed from the tokenized
text and the remaining words are stored in the no_stopwords column.


Stemming

Stemming is also known as the text standardization step, where words are reduced to their root/base form. For
example, words like 'programmer', 'programming' and 'program' will be stemmed to 'program'.

The disadvantage of stemming is that the root form may lose its meaning or may not be reduced to a proper
English word. We will see this in the steps below.

#importing the stemming function from the nltk library
from nltk.stem.porter import PorterStemmer
#defining the object for stemming
porter_stemmer = PorterStemmer()
#defining a function for stemming
def stemming(text):
    stem_text = [porter_stemmer.stem(word) for word in text]
    return stem_text
#applying the function on the column produced by stop word removal
data['msg_stemmed'] = data['no_stopwords'].apply(lambda x: stemming(x))

Output: In the examples below, we can see how some words are stemmed to their base form.

crazy-> crazi

available-> avail

entry-> entri

early-> earli


Now let’s see how Lemmatization is different from Stemming.

Lemmatization

Lemmatization also reduces a word to its base form, but makes sure that it does not lose its meaning. It uses a
pre-defined dictionary that stores the context of words and checks the word against this dictionary while reducing it.

The difference between Stemming and Lemmatization can be understood with the example provided below.

Original Word     After Stemming     After Lemmatization
goose             goos               goose
geese             gees               goose
from nltk.stem import WordNetLemmatizer
#defining the object for Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
#defining the function for lemmatization
def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text
data['msg_lemmatized'] = data['no_stopwords'].apply(lambda x: lemmatizer(x))

Output: The difference between Stemming and Lemmatization can be seen in the below output.

In the first row, crazy has been changed to crazi, which has no meaning, whereas with lemmatization it remains
the same, i.e. crazy.

In the last row, goes has been changed to goe by stemming, whereas lemmatization converts it into go, which is
meaningful.


After all the text processing steps are performed, the cleaned data is converted into numeric form using
Bag of Words or TF-IDF.
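
As an illustration of this final step, below is a minimal TF-IDF sketch using scikit-learn. The two sample
messages are illustrative assumptions, and scikit-learn (version 1.0 or later for get_feature_names_out) is an
additional dependency not used elsewhere in this experiment.

# minimal TF-IDF sketch; the sample messages are illustrative, not from the dataset
from sklearn.feature_extraction.text import TfidfVectorizer

clean_messages = ["free entry in a weekly competition", "i will call you later tonight"]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(clean_messages)   # sparse document-term matrix
print(vectorizer.get_feature_names_out())                 # learned vocabulary (scikit-learn >= 1.0)
print(tfidf_matrix.toarray())                             # TF-IDF weights per message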

Sample Source Code:


# Importing necessary libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# Download NLTK resources if not already downloaded


nltk.download('punkt')
nltk.download('stopwords')

# Sample input text


input_text = "Text pre-processing is an important step in natural language processing."

# Tokenization
tokens = word_tokenize(input_text)

# Remove stop words


stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# Print pre-processed text

print("Original Text:", input_text)
print("Tokens:", tokens)
print("Filtered Tokens (without stop words):", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)

Output:


Experiment No. 10

Aim: Write a program to work with time-series data in Python.

Theory: A time series is a series of data points in which each data point is associated with a timestamp. A
simple example is the price of a stock in the stock market at different points of time on a given day. Another
example is the amount of rainfall in a region in different months of the year.

In the example below, we take the value of a stock price every day for a quarter for a particular stock symbol.
We capture these values in a CSV file and then organize them into a dataframe using the pandas library. We then
set the date field as the index of the dataframe by converting the ValueDate column to datetime, assigning it
as the index, and deleting the original column.

Sample Data

Below is the sample data for the price of the stock on different days of a given quarter. The data is saved in a
file named stock.csv.

ValueDate, Price
01-01-2018, 1042.05
02-01-2018, 1033.55
03-01-2018, 1029.7
04-01-2018, 1021.3
05-01-2018, 1015.4
...
...
...
...
23-03-2018, 1161.3
26-03-2018, 1167.6
27-03-2018, 1155.25
28-03-2018, 1154

Creating Time Series


from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('path_to_file/stock.csv')
df = pd.DataFrame(data, columns = ['ValueDate', 'Price'])

# Set the date as the index (the dates in the file are in DD-MM-YYYY format)
df['ValueDate'] = pd.to_datetime(df['ValueDate'], dayfirst=True)
df.index = df['ValueDate']
del df['ValueDate']

df.plot(figsize=(15, 6))
plt.show()

Its output is as follows −


Sample Source Code:


# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate a simple time-series dataset


date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = np.random.randn(len(date_rng))
df = pd.DataFrame(data, columns=['value'], index=date_rng)

# Display the time-series data


print("Time-Series Data:")
print(df)

# Plot the time-series data


plt.figure(figsize=(10, 6))
plt.plot(df.index, df['value'], marker='o')
plt.title("Time-Series Data")
plt.xlabel("Date")
plt.ylabel("Value")
plt.grid(True)
plt.show()

Output:
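
The program prints the generated dataframe and then displays a line plot of the series. As an extension of this
experiment, pandas also provides resampling and rolling-window operations that are commonly used on
time-series data; the sketch below is illustrative and assumes the df created in the sample code above.

# resampling and rolling-window operations on the time series created above
three_day_mean = df.resample('3D').mean()            # average value over 3-day buckets
rolling_mean = df['value'].rolling(window=3).mean()  # 3-point moving average

print(three_day_mean)
print(rolling_mean)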


Experiment No. 11

Aim: Write a program to work with NLTK in python.

Theory: Language is the method of communication with the help of which humans can speak, read, and write. In
other words, we humans think, make plans and take decisions in our natural language. Here the big question is:
in the era of artificial intelligence, machine learning and deep learning, can humans communicate with
computers/machines in natural language? Developing NLP applications is a huge challenge because computers
require structured data, while human speech is unstructured and often ambiguous in nature.

Natural Language Processing (NLP) is the subfield of computer science, more specifically of AI, which enables
computers/machines to understand, process and manipulate human language. In simple words, NLP is a way for
machines to analyze, understand and derive meaning from human natural languages like Hindi, English, French,
Dutch, etc.

How does it work?

Before diving deep into the working of NLP, we must understand how human beings use language. Every day, we
humans use hundreds or thousands of words, and other humans interpret them and answer accordingly. It is
simple communication for humans, isn't it? But we know words run much deeper than that; we always derive a
context from what we say and how we say it. That is why we can say that, rather than focusing on voice
modulation, NLP draws on contextual patterns.

Let us understand it with an example −

Man is to woman as king is to what?

We can interpret it easily and answer as follows: man relates to king, so woman relates to queen. Hence the
answer is Queen.

How do humans know which word means what? The answer is that we learn through experience. But how do
machines/computers learn the same?

Let us understand it with following easy steps −

• First, we need to feed the machines with enough data so that they can learn from experience.
• Then the machine will create word vectors, using deep learning algorithms, from the data we fed earlier as
well as from its surrounding data.
• Then, by performing simple algebraic operations on these word vectors, the machine is able to provide
answers the way human beings do (see the sketch after this list).
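
The following is a minimal sketch of this word-vector idea using Gensim and a small pretrained GloVe model.
The model name and the download step are assumptions for illustration; the example is not part of the
prescribed experiment.

# word-vector analogy sketch with Gensim (downloads the ~66 MB 'glove-wiki-gigaword-50' model)
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
# king - man + woman ≈ queen
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # expected to rank 'queen' highest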

Components of NLP

The components of natural language processing (NLP) are described below −


Morphological Processing

Morphological processing is the first component of NLP. It involves breaking chunks of language input into sets
of tokens corresponding to paragraphs, sentences and words. For example, a word like “everyday” can be broken
into two sub-word tokens, “every” and “day”.

Syntax analysis

Syntax Analysis, the second component, is one of the most important components of NLP. The purposes of
this component are as follows −

• To check whether a sentence is well formed.
• To break it up into a structure that shows the syntactic relationships between the different words.
• For example, a sentence like “The school goes to the student” would be rejected by the syntax analyzer.

Semantic analysis

Semantic Analysis is the third component of NLP, which is used to check the meaningfulness of the text. It
involves drawing the exact (dictionary) meaning from the text. For example, a sentence like “It’s a hot
ice-cream” would be discarded by the semantic analyzer.

Pragmatic analysis

Pragmatic analysis is the fourth component of NLP. It involves fitting the actual objects or events that exist
in a given context to the object references obtained by the previous component, i.e. semantic analysis. For
example, the sentence “Put the fruits in the basket on the table” can have two semantic interpretations, so the
pragmatic analyzer will choose between these two possibilities.

Examples of NLP Applications

NLP, an emerging technology, drives many of the forms of AI we see these days. For today’s and tomorrow’s
increasingly cognitive applications, the use of NLP in creating a seamless and interactive interface between
humans and machines will continue to be a top priority. Following are some very useful applications of NLP.

Machine Translation

Machine translation (MT) is one of the most important applications of natural language processing. MT is
basically the process of translating text from one source language into another language. A machine
translation system can be either bilingual or multilingual.

Fighting Spam

Due to the enormous increase in unwanted emails, spam filters have become important as the first line of
defense against this problem. Treating false positives and false negatives as the main issues, the
functionality of NLP can be used to develop spam filtering systems.

N-gram modelling, word stemming and Bayesian classification are some of the existing NLP techniques that can
be used for spam filtering; a minimal sketch of the Bayesian approach is shown below.
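
The sketch below uses scikit-learn's CountVectorizer and MultinomialNB. The four training messages and their
labels are illustrative assumptions, not real data, and scikit-learn is an additional dependency.

# minimal Bayesian spam-filtering sketch; training messages and labels are illustrative
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["win a free prize now", "lowest price guaranteed, click here",
            "are we still meeting for lunch", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)     # bag-of-words counts
clf = MultinomialNB().fit(X, labels)       # Bayesian classifier

test = vectorizer.transform(["claim your free prize"])
print(clf.predict(test))                   # expected: ['spam']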

Information retrieval & Web search

Most search engines, such as Google, Yahoo, Bing and WolframAlpha, base their machine translation (MT)
technology on NLP deep learning models. Such deep learning models allow algorithms to read text on a webpage,
interpret its meaning and translate it into another language.

Automatic Text Summarization

Automatic text summarization is a technique which creates a short, accurate summary of longer text documents.
Hence, it helps us get relevant information in less time. In this digital era, we are in serious need of
automatic text summarization because the flood of information over the internet is not going to stop. NLP and
its functionalities play an important role in developing automatic text summarization.

Grammar Correction

Spelling correction & grammar correction is a very useful feature of word processor software like Microsoft
Word. Natural language processing (NLP) is widely used for this purpose.

Question-answering

Question-answering, another main application of natural language processing (NLP), focuses on building systems
which automatically answer the questions posed by users in their natural language.

Sentiment analysis

Sentiment analysis is another important application of natural language processing (NLP). As its name implies,
sentiment analysis is used to −

• Identify the sentiments among several posts, and
• Identify the sentiment where the emotions are not expressed explicitly.

Online e-commerce companies like Amazon, eBay, etc., use sentiment analysis to identify the opinions and
sentiments of their customers online. It helps them understand what their customers think about their products
and services.

Speech engines

Speech engines like Siri, Google Voice and Alexa are built on NLP so that we can communicate with them in our
natural language.

Implementing NLP

In order to build the above-mentioned applications, we need a specific skill set, with a good understanding of
language and of tools to process language efficiently. To achieve this, various tools are available; some of
them are open-sourced, while others are developed by organizations to build their own NLP applications.
Following is a list of some NLP tools −

• Natural Language Toolkit (NLTK)
• MALLET
• GATE
• OpenNLP
• UIMA
• Gensim
• Stanford toolkit

Most of these tools are written in Java.

Natural Language Tool Kit (NLTK)

Among the above-mentioned NLP tools, NLTK scores very high when it comes to ease of use and explanation of
concepts. Python has a gentle learning curve, and since NLTK is written in Python, it is also very easy to
learn. NLTK covers most of the common tasks such as tokenization, stemming, lemmatization, punctuation
handling, character counts and word counts. It is very elegant and easy to work with.

Sample Source Code:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download NLTK resources if not already downloaded


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

# Sample text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Tokenization
tokens = word_tokenize(text)

# Remove stop words


stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# Frequency distribution
freq_dist = FreqDist(stemmed_tokens)

# Sentiment analysis
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(text)

# Print results
print("Original Text:", text)
print("Tokens:", tokens)
print("Filtered Tokens (without stop words):", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Most Common Words:", freq_dist.most_common(5))
print("Sentiment Analysis Score:", sentiment_score)

Output:


Experiment No. 12

Aim: Write a program to work with NLP in python.

Theory:

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and
deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one
can organize massive chunks of text data, perform numerous automated tasks and solve a wide range of problems
such as automatic summarization, machine translation, named entity recognition, relationship extraction,
sentiment analysis, speech recognition and topic segmentation.

Before moving further, some terms used in this experiment are explained below:

• Tokenization – the process of converting a text into tokens
• Tokens – words or entities present in the text
• Text object – a sentence, a phrase, a word or an article

Steps to install NLTK and its data:

Install pip (most modern Python installations already include it); if needed, run in a terminal:

python -m ensurepip --upgrade

Install NLTK; run in a terminal:

pip install -U nltk

Download the NLTK data; run a Python shell (in the terminal) and write the following code:

```
import nltk
nltk.download()
```

Follow the instructions on screen and download the desired package or collection. Other libraries can be
directly installed using pip.

Text Preprocessing

Since text is the most unstructured form of all available data, various types of noise are present in it and
the data is not readily analyzable without pre-processing. The entire process of cleaning and standardizing
text, making it noise-free and ready for analysis, is known as text preprocessing.

It predominantly comprises three steps:

• Noise Removal
• Lexicon Normalization
• Object Standardization

These three steps together form the text preprocessing pipeline.

Noise Removal

Any piece of text which is not relevant to the context of the data and the end output can be regarded as noise.

For example – language stopwords (commonly used words of a language – is, am, the, of, in, etc.), URLs or
links, social media entities (mentions, hashtags), punctuation and industry-specific words. This step deals
with the removal of all types of noisy entities present in the text.

A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the text
object by tokens (or by words), eliminating those tokens which are present in the noise dictionary.

Following is the python code for the same purpose.

Python Code:


Another approach is to use regular expressions while dealing with special patterns of noise. Regular
expressions are explained in detail in one of the referenced articles. The following Python code removes a
regex pattern from the input text:

```
# Sample code to remove a regex pattern
import re

def _remove_regex(input_text, regex_pattern):
    urls = re.finditer(regex_pattern, input_text)
    for i in urls:
        input_text = re.sub(i.group().strip(), '', input_text)
    return input_text

regex_pattern = r"#[\w]*"

_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern)
>>> "remove this from analytics vidhya"
```

Lexicon Normalization

Another type of textual noise is the multiple representations exhibited by a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”.
Though they mean different things, contextually they are all similar. This step converts all the disparities of
a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature
engineering with text, as it converts high dimensional features (N different features) into a low dimensional
space (1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:

• Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.)
from a word.
• Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the
root form of a word; it makes use of vocabulary (the dictionary importance of words) and morphological analysis
(word structure and grammar relations).

Below is the sample code that performs lemmatization and stemming using python’s popular library –
NLTK.

```
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "multiplying"
lem.lemmatize(word, "v")
>> "multiply"
stem.stem(word)
>> "multipli"
```


Object Standardization

Text data often contains words or phrases which are not present in any standard lexical dictionary. These
pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular
expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a
dictionary lookup method to replace social media slang in a text.

```
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', 'awsm': 'awesome',
               'luv': 'love'}  # ... extend with more slang entries as needed

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")
>> "Retweet this is a retweeted tweet by Shivam Bansal"
```

Apart from the three steps discussed so far, other types of text preprocessing include encoding/decoding noise
removal, grammar checking, and spelling correction. A detailed discussion of preprocessing and its methods is
available in the references.

3. Text to Features (Feature Engineering on Text Data)

To analyse preprocessed data, it needs to be converted into features. Depending upon the usage, text features
can be constructed using assorted techniques: syntactical parsing, entities / n-grams / word-based features,
statistical features, and word embeddings. Read on to understand these techniques in detail.

3.1 Syntactic Parsing

Syntactical parsing involves the analysis of words in a sentence for grammar and their arrangement in a manner
that shows the relationships among the words. Dependency grammar and part-of-speech tags are the important
attributes of text syntactics.

Dependency Trees – Sentences are composed of words sewn together. The relationships among the words in a
sentence are determined by the basic dependency grammar. Dependency grammar is a class of syntactic text
analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every
relation can be represented in the form of a triplet (relation, governor, dependent). For example, consider the
sentence “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.” The
relationships among the words can be represented as a tree: “submitted” is the root word of the sentence and is
linked to two sub-trees (the subject and object subtrees). Each subtree is itself a dependency tree with
relations such as (“Bills” <-> “ports” <by> “proposition” relation), (“ports” <-> “immigration” <by>
“conjugation” relation).

This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output,
which can be used as features for many NLP problems like entity-wise sentiment analysis, actor and entity
identification, and text classification. The Python wrapper StanfordCoreNLP (by the Stanford NLP Group) and
NLTK dependency grammars can be used to generate dependency trees; spaCy also provides a dependency parser, as
sketched below.
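
As an illustration, the sketch below uses spaCy (which is also used in the sample source code at the end of
this experiment) to print the dependency relation and governor (head) of each token in the example sentence.
It assumes the en_core_web_sm model has been installed.

```
# dependency parsing sketch with spaCy (requires: python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.")

# print each token with its dependency label and its head (governor)
for token in doc:
    print(token.text, token.dep_, token.head.text)
```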

Part-of-speech tagging – Apart from the grammar relations, every word in a sentence is also associated with a
part-of-speech (POS) tag (noun, verb, adjective, adverb, etc.). The POS tags define the usage and function of a
word in the sentence. The Penn Treebank tag set, defined by the University of Pennsylvania, lists all possible
POS tags. The following code uses NLTK to perform POS-tagging annotation on input text. (NLTK provides several
implementations; the default one is the perceptron tagger.)

```
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))
>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'),
     ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]
```

Part of Speech tagging is used for many important purposes in NLP:

A. Word sense disambiguation: Some words have multiple meanings according to their usage. For example, in the
two sentences below:

I. “Please book my flight for Delhi”

II. “I am going to read this book in the flight”

“Book” is used in different contexts, and the part-of-speech tag for the two cases is different. In sentence I,
the word “book” is used as a verb, while in II it is used as a noun. (The Lesk algorithm is also used for
similar purposes; a sketch follows.)
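
Below is a minimal sketch of the Lesk algorithm with NLTK; it assumes the punkt and wordnet resources have been
downloaded, and the senses it selects may not always match intuition.

```
# word sense disambiguation sketch using NLTK's Lesk implementation
# (requires: nltk.download('punkt'); nltk.download('wordnet'); nltk.download('omw-1.4'))
from nltk import word_tokenize
from nltk.wsd import lesk

sent1 = word_tokenize("Please book my flight for Delhi")
sent2 = word_tokenize("I am going to read this book in the flight")

sense1 = lesk(sent1, "book")
sense2 = lesk(sent2, "book")
print(sense1, "-", sense1.definition())
print(sense2, "-", sense2.definition())
```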

B. Improving word-based features: A learning model can learn different contexts of a word when words are used
as features; however, if the part-of-speech tag is linked with them, the context is preserved, thus making
stronger features. For example:

Sentence – “book my flight, I will read this book”

Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)

Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1),
(“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
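
The sketch below builds such POS-linked token counts with NLTK and collections.Counter. It assumes the punkt
and averaged_perceptron_tagger resources are available; the exact tags produced may differ slightly from the
hand-written example above.

```
# building POS-linked token features
# (requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'))
from collections import Counter
from nltk import word_tokenize, pos_tag

sentence = "book my flight, I will read this book"
tagged = pos_tag(word_tokenize(sentence))

# join each word with its POS tag so "book" as a verb and as a noun stay distinct features
features = Counter("{}_{}".format(word, tag) for word, tag in tagged if word.isalpha())
print(features)
```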

C. Normalization and Lemmatization: POS tags are the basis of the lemmatization process for converting a word
to its base form (lemma).

D. Efficient stopword removal: POS tags are also useful for the efficient removal of stopwords.

For example, there are some tags which always identify the low-frequency / less important words of a language,
such as (IN – “within”, “upon”, “except”), (CD – “one”, “two”, “hundred”), (MD – “may”, “must”, etc.). A sketch
of tag-based filtering follows.
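
Below is a minimal sketch of this idea; the set of tags treated as removable (IN, CD, MD, DT) is an assumption
chosen for illustration.

```
# POS-tag-based stopword removal sketch
# (requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'))
from nltk import word_tokenize, pos_tag

REMOVABLE_TAGS = {"IN", "CD", "MD", "DT"}   # prepositions, numbers, modals, determiners

def pos_based_filter(text):
    tagged = pos_tag(word_tokenize(text))
    return [word for word, tag in tagged if tag not in REMOVABLE_TAGS]

# expected output (may vary slightly with tagger version): ['I', 'submit', 'report', 'deadline']
print(pos_based_filter("I must submit one report within the deadline"))
```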


Sample Source Code:

import spacy

# Load the spaCy NLP model


nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Apple Inc. was founded by Steve Jobs and Steve Wozniak. It is headquartered in Cupertino, California."

# Process the text with spaCy


doc = nlp(text)

# Extract and print named entities


print("Named Entities:")
for entity in doc.ents:
    print(entity.text, "-", entity.label_)

Output:


Reference

• https://www.geeksforgeeks.org/data-cleansing-introduction/
• https://www.pluralsight.com/guides/interpreting-data-using-descriptive-statistics-python
• https://www.geeksforgeeks.org/plotting-histogram-in-python-using-matplotlib/
• https://medium.datadriveninvestor.com/p-value-t-test-chi-square-test-anova-when-to-use-which-strategy-32907734aa0e
• https://www.geeksforgeeks.org/linear-regression-python-implementation/
• https://www.learndatasci.com/glossary/binary-classification/
• https://vitalflux.com/accuracy-precision-recall-f1-score-python-example/
• https://www.geeksforgeeks.org/k-means-clustering-introduction/
• https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
• https://www.tutorialspoint.com/python_data_science/python_time_series.htm
