ADS LAB Merged

EXPERIMENT 1

AIM: Explore the descriptive statistics on the given dataset

Theory:

Descriptive Statistics
Descriptive statistics summarize and describe the main features of a dataset. They provide a
quick overview of the data and help in understanding its distribution, central tendency, and
variability. Key measures include:
1.​ Measures of Central Tendency:
○​ Mean: The average value of the dataset.
○​ Median: The middle value when the data is sorted.
○​ Mode: The most frequently occurring value.
2.​ Measures of Dispersion:
○​ Range: The difference between the maximum and minimum values.
○​ Variance: The average of the squared differences from the mean.
○​ Standard Deviation: The square root of variance, indicating the spread of data.
○ Interquartile Range (IQR): The difference between the third quartile (Q3) and the
first quartile (Q1).
3.​ Shape of the Distribution:
○​ Skewness: Measures the asymmetry of the data distribution.
○​ Kurtosis: Measures the "tailedness" of the data distribution.
4.​ Other Measures:
○​ Coefficient of Variation (CV): The ratio of the standard deviation to the mean,
expressed as a percentage.
○​ Trimmed Mean: The mean after removing a certain percentage of the
smallest and largest values.
○​ Sum of Squares: The sum of squared deviations from the mean.
5.​ Visualizations:
○​ Box-and-Whisker Plot: Displays the distribution of data based on quartiles and
identifies outliers.
○​ Scatter Plot: Shows the relationship between two numeric variables.
○​ Correlation Matrix: Displays the correlation coefficients between numeric
variables.

Inferential Statistics
Inferential statistics make inferences about a population based on a sample of data. They
help in testing hypotheses and drawing conclusions.
1.​ Distributions:
○​ Normal Distribution: A symmetric, bell-shaped distribution where most values
cluster around the mean.
○​ Poisson Distribution: A discrete distribution that describes the probability of a
given number of events occurring in a fixed interval.
2.​ Population Parameters and Sampling Errors:
○​ Population Parameters: Characteristics of the entire population (e.g.,
population mean, population variance).
○​ Sampling Errors: Differences between the sample statistic and the population
parameter.
3.​ Confidence Intervals:
○​ A range of values within which the population parameter is expected to lie,
with a certain level of confidence (e.g., 95%).
4.​ Hypothesis Testing:
○​ Null Hypothesis (H0): A statement that there is no effect or no difference.
○​ Alternative Hypothesis (H1): A statement that contradicts the null hypothesis.
○​ Type I Error: Rejecting the null hypothesis when it is true (false positive).
○​ Type II Error: Failing to reject the null hypothesis when it is false (false
negative).
○​ Z-Test: A statistical test used when the sample size is large and the
population variance is known.
○​ T-Test: A statistical test used when the sample size is small and the
population variance is unknown.
○​ ANOVA (Analysis of Variance): A test used to compare the means of three or
more groups.

Code & Output:

#importing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
df = pd.read_csv("loan_data_set.csv")
print(df.head())

#output:
A. DESCRIPTIVE STATISTICS

print("Descriptive Statistics:")
print(df.describe())

print("\nMean:")
print(df.mean(numeric_only=True))

print("\nMedian:")
print(df.median(numeric_only=True))

print("\nMode:")
print(df.mode().iloc[0])
print("\nMin:")
print(df.min(numeric_only=True))

print("\nMax:")
print(df.max(numeric_only=True))

print("\nSum:")
print(df.sum(numeric_only=True))
print("\nRange:")
print(df.max(numeric_only=True) - df.min(numeric_only=True))

Q1 = df.quantile(0.25, numeric_only=True)
Q3 = df.quantile(0.75, numeric_only=True)
IQR = Q3 - Q1
print("\nFirst Quartile (Q1):")
print(Q1)

print("\nThird Quartile (Q3):")


print(Q3)

print("\nInterquartile Range (IQR):")


print(IQR)
print("\nStandard Deviation:")
print(df.std(numeric_only=True))

variance = df.var(numeric_only=True)

print("Variance of Numeric Columns:")


print(variance)

print("\nCorrelation Matrix:")
corr = df.corr(numeric_only=True)
print(corr)
# Standard Error of the Mean (SEM)
print("\nStandard Error of the Mean (SEM):")
print(df.sem(numeric_only=True))

CV = lambda x: np.std(x) / np.mean(x) * 100

# Apply CV only to numeric columns


numeric_df = df.select_dtypes(include=[np.number])  # Select only numeric columns
cv_results = numeric_df.apply(CV)

print("Coefficient of Variation:")
print(cv_results)

print("\nMissing Values:")
print(df.isnull().sum())

# Total Rows
print("\nTotal Rows (N total):")
print(len(df))
# Cumulative Sum
print("\nCumulative Sum:")
print(df.select_dtypes(include='number').cumsum())

# Percent and Cumulative Percent


print("\nPercent and Cumulative Percent:")
print(df.value_counts(normalize=True) * 100)
# Trimmed Mean
print("\nTrimmed Mean (5%):")
print(stats.trim_mean(df['ApplicantIncome'], 0.05))

# Sum of Squares
print("\nSum of Squares:")
print((df.select_dtypes(include='number') ** 2).sum())

# Skewness
print("\nSkewness:")
print(df.skew(numeric_only=True))

# Kurtosis
print("\nKurtosis:")
print(df.kurtosis(numeric_only=True))
# Box-and-Whisker Plot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x="ApplicantIncome")
plt.title("Box-and-Whisker Plot of ApplicantIncome")
plt.show()

# Scatter Plot
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x="ApplicantIncome", y="LoanAmount")
plt.title("Scatter Plot of ApplicantIncome vs LoanAmount")
plt.show()

# Correlation Matrix Heatmap


plt.figure(figsize=(10, 6))
sns.heatmap(corr, annot=True, cmap="Blues")
plt.title("Correlation Matrix Heatmap")
plt.show()

B. INFERENTIAL STATISTICS

# Check for Normal Distribution


plt.figure(figsize=(10, 6))
sns.histplot(df['ApplicantIncome'], kde=True)
plt.title("Distribution of ApplicantIncome")
plt.show()

# Shapiro-Wilk Test for Normality


shapiro_test = stats.shapiro(df['ApplicantIncome'])
print("\nShapiro-Wilk Test for Normality:")
print(f"Test Statistic: {shapiro_test[0]}, p-value:
{shapiro_test[1]}")

# Poisson Distribution (Example for LoanAmount)


loan_amount = df['LoanAmount'].dropna()  # Drop missing values before the test
poisson_test = stats.kstest(loan_amount, 'poisson', args=(loan_amount.mean(),))
print("\nPoisson Distribution Test for LoanAmount:")
print(f"Test Statistic: {poisson_test[0]}, p-value: {poisson_test[1]}")

# Z-Test (Example: Compare the ApplicantIncome mean to a hypothetical population mean)
population_mean = 5000  # Hypothetical population mean
sample = df['ApplicantIncome']
z_stat = (sample.mean() - population_mean) / (sample.std(ddof=1) / np.sqrt(len(sample)))
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))  # Two-tailed p-value
print("\nZ-Test Results:")
print(f"Test Statistic: {z_stat}, p-value: {p_value}")

# T-Test (Example: Compare ApplicantIncome for two groups)


group1 = df[df['Loan_Status'] == 'Y']['ApplicantIncome']
group2 = df[df['Loan_Status'] == 'N']['ApplicantIncome']
t_test = stats.ttest_ind(group1, group2)
print("\nT-Test Results:")
print(f"Test Statistic: {t_test[0]}, p-value: {t_test[1]}")

# ANOVA (Example: Compare ApplicantIncome across Property_Area categories)
anova_test = stats.f_oneway(
df[df['Property_Area'] == 'Urban']['ApplicantIncome'],
df[df['Property_Area'] == 'Rural']['ApplicantIncome'],
df[df['Property_Area'] == 'Semiurban']['ApplicantIncome']
)
print("\nANOVA Test Results:")
print(f"Test Statistic: {anova_test[0]}, p-value:
{anova_test[1]}")
EXPERIMENT 2

AIM: Apply Data Cleaning Techniques

Theory:
Imputation is the process of replacing missing or null values (like NaN or NA) in a dataset
with estimated or calculated values. This helps in maintaining the completeness of the data
for accurate analysis and modeling.

Imputation is important because:

● Maintains Data Integrity: Ensures the dataset is complete for analysis.
● Improves Model Performance: Most machine learning models can't handle missing values; imputation helps in making accurate predictions.
● Avoids Data Bias: Proper imputation prevents bias in the analysis caused by missing values.

Data Imputation Techniques:

1.​ Deletion of Rows with Missing Data


○​ What It Is: Remove rows containing any missing values.
○​ When to Use: When the number of missing values is very small.
○​ Advantages: Simple and easy to implement.
○​ Disadvantages: Can lead to data loss and bias if many rows are deleted.
2.​ Mean/Median Imputation
○​ What It Is: Replace missing values with the mean or median of the
non-missing values in that column.
○​ When to Use: When the data is numerical and not heavily skewed.
○​ Advantages: Easy to implement and preserves the overall data distribution.
○​ Disadvantages: Can distort relationships between variables if the data is
skewed.
3.​ Mode Imputation
○​ What It Is: Replace missing values with the most frequent value (mode) in the
column.
○​ When to Use: For categorical data.
○​ Advantages: Simple and maintains the existing category distribution.
○​ Disadvantages: Can create over-representation of the mode, leading to biased
results.
4.​ Arbitrary Value Imputation
○​ What It Is: Replace missing values with a fixed, arbitrary value (e.g., -999).
○​ When to Use: When you want to flag missing values with a distinct value.
○​ Advantages: Maintains dataset size and highlights missingness.
○​ Disadvantages: Can introduce outliers and affect model performance.
5.​ End of Tail Imputation
○​ What It Is: Replace missing values with extreme values (e.g., Mean +
3*Standard Deviation).
○​ When to Use: When missing values are likely to be extreme.
○​ Advantages: Highlights missingness as an outlier.
○​ Disadvantages: Can distort statistical analysis and model predictions.
6.​ Random Sample Imputation
○​ What It Is: Replace missing values with random samples from the
non-missing data in the column.
○​ When to Use: When you want to maintain the existing distribution.
○​ Advantages: Preserves the distribution and variability.
○​ Disadvantages: Introduces randomness, which may reduce model stability.
7.​ Frequent Category Imputation
○​ What It Is: Replace missing values with the most frequent category.
○​ When to Use: For categorical data with a dominant category.
○​ Advantages: Maintains data consistency.
○​ Disadvantages: May introduce bias by over-representing the frequent
category.
8.​ Adding a New Category as “Missing”
○​ What It Is: Treat missing values as a separate category (e.g., "Unknown" or
"Missing").
○​ When to Use: For categorical data where missingness may have meaning.
○​ Advantages: Captures the information about missingness.
○​ Disadvantages: Increases the number of categories, possibly leading to
model complexity.
9.​ Regression Imputation
○​ What It Is: Predicts missing values using a regression model based on other
variables.
○​ When to Use: When there is a strong correlation between variables.
○​ Advantages: Leverages relationships between variables for accurate
imputation.
○​ Disadvantages: Can introduce bias if the regression model is not accurate.

Code & Output:


# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import seaborn as sns
import matplotlib.pyplot as plt
import random

# Load the dataset


df = pd.read_csv('loan_data_set.csv')

# Display basic information about the dataset


print("Dataset Overview:")
print(df.info())
print("\nFirst 5 rows of the dataset:\n", df.head())

# Visualize missing values using Heatmap


plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
# Visualize the count of missing values in each column using a bar plot
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]
missing_values.sort_values(ascending=False, inplace=True)

plt.figure(figsize=(12, 6))
sns.barplot(x=missing_values.index, y=missing_values.values,
palette='viridis')
plt.xticks(rotation=45)
plt.title('Missing Values by Column')
plt.xlabel('Column')
plt.ylabel('Number of Missing Values')
plt.show()
# Separate numeric and categorical columns
numeric_cols = df.select_dtypes(include=['float64',
'int64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

# 1. Deletion of rows with missing data


df_deletion = df.dropna()
print("\nAfter Deletion of Rows with Missing Data:\n",
df_deletion.head())

# 2. Mean/Median Imputation for numeric data


mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
df_mean = df.copy()
df_median = df.copy()

df_mean[numeric_cols] = mean_imputer.fit_transform(df[numeric_cols])
df_median[numeric_cols] = median_imputer.fit_transform(df[numeric_cols])

print("\nAfter Mean Imputation:\n", df_mean.head())


print("\nAfter Median Imputation:\n", df_median.head())

# 3. Mode Imputation for categorical data


mode_imputer = SimpleImputer(strategy='most_frequent')
df_mode = df.copy()
df_mode[categorical_cols] = mode_imputer.fit_transform(df[categorical_cols])

print("\nAfter Mode Imputation:\n", df_mode.head())


# 4. Arbitrary Value Imputation (e.g., -999 for numeric data)
df_arbitrary = df.copy()
df_arbitrary[numeric_cols] = df_arbitrary[numeric_cols].fillna(-999)

print("\nAfter Arbitrary Value Imputation:\n", df_arbitrary.head())

# 5. End of Tail Imputation (using mean + 3*std)


df_tail = df.copy()
for col in numeric_cols:
    end_of_tail_value = df[col].mean() + 3 * df[col].std()
    df_tail[col].fillna(end_of_tail_value, inplace=True)

print("\nAfter End of Tail Imputation:\n", df_tail.head())

# 6. Random Sample Imputation for numeric data


df_random = df.copy()
for col in numeric_cols:
    random_sample = df[col].dropna().sample(df[col].isnull().sum(), random_state=42)
    random_sample.index = df[df[col].isnull()].index
    df_random.loc[df[col].isnull(), col] = random_sample

print("\nAfter Random Sample Imputation:\n", df_random.head())


# 7. Frequent Category Imputation for categorical data
df_frequent = df.copy()
for col in categorical_cols:
    most_frequent = df[col].mode()[0]
    df_frequent[col].fillna(most_frequent, inplace=True)

print("\nAfter Frequent Category Imputation:\n", df_frequent.head())

# 8. Adding a New Category as "Missing" for categorical data


df_new_category = df.copy()
df_new_category[categorical_cols] = df_new_category[categorical_cols].fillna("Missing")

print("\nAfter Adding New Category 'Missing':\n", df_new_category.head())

# 9. Random Sample Imputation for categorical data


df_random_cat = df.copy()
for col in categorical_cols:
    random_sample = df[col].dropna().sample(df[col].isnull().sum(), random_state=42)
    random_sample.index = df[df[col].isnull()].index
    df_random_cat.loc[df[col].isnull(), col] = random_sample

print("\nAfter Random Sample Imputation (Categorical):\n", df_random_cat.head())

# 10. Regression Imputation using Iterative Imputer


iterative_imputer = IterativeImputer(random_state=42)
df_regression = df.copy()
df_regression[numeric_cols] = iterative_imputer.fit_transform(df[numeric_cols])

print("\nAfter Regression Imputation:\n", df_regression.head())


# Visualization of the dataset after various imputation techniques
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
fig.suptitle('Distribution of Imputed Data', fontsize=20)

sns.histplot(df_mean[numeric_cols[0]], ax=axes[0, 0], color='blue')
axes[0, 0].set_title('Mean Imputation')

sns.histplot(df_median[numeric_cols[0]], ax=axes[0, 1], color='green')
axes[0, 1].set_title('Median Imputation')

sns.histplot(df_arbitrary[numeric_cols[0]], ax=axes[1, 0], color='red')
axes[1, 0].set_title('Arbitrary Value Imputation')

sns.histplot(df_tail[numeric_cols[0]], ax=axes[1, 1], color='orange')
axes[1, 1].set_title('End of Tail Imputation')

sns.histplot(df_random[numeric_cols[0]], ax=axes[2, 0], color='purple')
axes[2, 0].set_title('Random Sample Imputation')

sns.histplot(df_regression[numeric_cols[0]], ax=axes[2, 1], color='brown')
axes[2, 1].set_title('Regression Imputation')

plt.tight_layout()
plt.show()
EXPERIMENT 3

AIM: Explore Data Visualization Techniques

Theory:

Importance of Data Visualization


What is the Purpose of Data Visualization?

Data visualization is the graphical representation of information and data using visual
elements such as charts, graphs, and maps. The primary purpose of data visualization is to
simplify complex data and make it more understandable, accessible, and useful for
decision-making.

Key Purposes:

1.​ Simplifies Complex Data – Large datasets can be difficult to interpret in raw form.
Visualizations help summarize and present data in an intuitive way.
2.​ Enhances Understanding – Patterns, trends, and correlations are easier to recognize
when displayed visually.
3.​ Improves Decision-Making – Businesses and analysts can make informed decisions
quickly based on visual insights.
4.​ Identifies Patterns & Trends – Helps in detecting trends, correlations, and outliers
that may not be obvious in numerical data.
5.​ Facilitates Communication – Visual data representation is more effective for
presentations and reports, making it easier to share insights with stakeholders.
6.​ Increases Engagement – Interactive and visually appealing dashboards keep users
engaged and interested in the data.

Benefits of Data Visualization Tools

Data visualization tools help users create insightful visual representations of data with
minimal effort. These tools come with built-in features for customization, interactivity, and
analytics.

Key Benefits:

● Faster Analysis – Converts raw data into meaningful visuals within seconds, making analysis quicker and more efficient.
● Improved Accuracy – Reduces errors caused by manual data interpretation and helps in making more precise predictions.
● Better Data Storytelling – Enables users to tell compelling stories through visuals that highlight important insights.
● Enhanced Productivity – Saves time by automating data representation, allowing teams to focus on decision-making rather than data processing.
● Real-time Data Monitoring – Many tools offer live dashboards that update dynamically, helping organizations track performance in real time.
● Customizable & Interactive – Users can filter, drill down, and explore data interactively for deeper insights.
● Supports Multiple Data Sources – Most visualization tools integrate with databases, spreadsheets, APIs, and cloud storage for seamless data analysis.

Popular Data Visualization Tools:

●​ Power BI – Ideal for business intelligence and analytics.


●​ Tableau – Advanced analytics and interactive dashboards.
●​ Matplotlib & Seaborn (Python) – Great for scientific and statistical visualization.
●​ Google Data Studio – Free tool for creating interactive reports.
●​ Excel – Basic visualization for small datasets.

Code & Output:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from bokeh.plotting import figure, output_notebook, show
import missingno as msno

from pandas.plotting import scatter_matrix


# Load dataset
df = pd.read_csv("loan_data_set.csv")
# Display first few rows
df.head()

#Univariate Visualization
#Histogram
plt.figure(figsize=(8, 5))
sns.histplot(df['LoanAmount'], kde=True, bins=20)
plt.title("Histogram of Loan Amount")
plt.xlabel("Loan Amount")
plt.ylabel("Frequency")
plt.show()
#BarChart
plt.figure(figsize=(8, 5))
sns.countplot(x="Education", data=df)
plt.title("Bar Chart of Education Levels")
plt.xlabel("Education")
plt.ylabel("Count")
plt.show()

#Quartile Plot(Box Plot)


plt.figure(figsize=(8, 5))
sns.boxplot(x=df["LoanAmount"])
plt.title("Box Plot of Loan Amount")
plt.show()
#Distribution Chart
plt.figure(figsize=(8, 5))
sns.kdeplot(df['LoanAmount'], fill=True)  # 'shade' is deprecated in newer seaborn; use fill
plt.title("Density Distribution of Loan Amount")
plt.xlabel("Loan Amount")
plt.show()

#Multivariate Visualization
#Scatter Plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x='ApplicantIncome', y='LoanAmount',
hue='Education', data=df)
plt.title("Scatter Plot: Applicant Income vs Loan Amount")
plt.xlabel("Applicant Income")
plt.ylabel("Loan Amount")
plt.show()

#Scatter Matrix
scatter_matrix(df[['ApplicantIncome', 'CoapplicantIncome',
'LoanAmount']], figsize=(8, 8), diagonal='kde')
plt.show()
#Bubble Chart
fig = px.scatter(df, x="ApplicantIncome", y="LoanAmount",
                 size="CoapplicantIncome", color="Education",
                 title="Bubble Chart: Loan Amount vs Applicant Income")
fig.show()

#density chart
plt.figure(figsize=(8, 5))
# Corrected kdeplot syntax
sns.kdeplot(x=df['ApplicantIncome'], y=df['LoanAmount'],
cmap="Reds", fill=True)
plt.title("Density Chart: Applicant Income vs Loan Amount")
plt.xlabel("Applicant Income")
plt.ylabel("Loan Amount")
plt.show()

#Heat Map
# Select only numeric columns
numeric_df = df.select_dtypes(include=['number'])
plt.figure(figsize=(8, 5))
# Generate heatmap on numeric data only
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm",
linewidths=0.5)
plt.title("Heat Map of Correlation Matrix")
plt.show()
EXPERIMENT 4

AIM: Implement and explore performance evaluation metrics for data models (Supervised)

Theory:

Supervised learning is a type of machine learning where a model is trained on a labeled


dataset. The model learns patterns from the input data (features) and their corresponding
correct outputs (labels). Once trained, the model makes predictions on unseen data.

To measure how well the model performs, we use different evaluation metrics, which vary
depending on the type of supervised learning problem:

●​ Classification (predicting categories, e.g., spam detection)


● Regression (predicting continuous values, e.g., house prices)

Classification Metrics
Classification models predict discrete categories (e.g., spam detection, disease
classification). The following metrics evaluate their performance:

1. Accuracy

Measures the proportion of correct predictions out of all predictions.

2. Precision (Positive Predictive Value - PPV)

Measures how many predicted positive cases are actually positive.

3. Error Rate

Represents the proportion of incorrect predictions.


4. Sensitivity (Recall or True Positive Rate - TPR)

Measures how well the model identifies actual positive cases.

5. Specificity (True Negative Rate - TNR)

Measures how well the model identifies actual negative cases.

6. Receiver Operating Characteristic (ROC) Curve & AUC

●​ ROC Curve: A plot of True Positive Rate (TPR) vs. False Positive Rate (FPR).
●​ AUC (Area Under Curve): Measures the model’s ability to distinguish between
classes.

7. F1 Score

A harmonic mean of Precision and Recall, balancing both metrics.

8. Geometric Mean (G-Mean)

Measures the balance between Sensitivity and Specificity, ensuring both positive and
negative cases are correctly classified.
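The lab code in this experiment reports accuracy, precision, recall, and F1 only. Error rate, specificity, and G-mean follow directly from the confusion matrix; the sketch below is illustrative only and uses hypothetical toy labels rather than the lab dataset.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # hypothetical true labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy                      # proportion of incorrect predictions
sensitivity = tp / (tp + fn)                   # recall / true positive rate
specificity = tn / (tn + fp)                   # true negative rate
g_mean = np.sqrt(sensitivity * specificity)    # balance of sensitivity and specificity
print(f"Error Rate: {error_rate:.3f}, Specificity: {specificity:.3f}, G-Mean: {g_mean:.3f}")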

Regression Metrics
Regression models predict continuous values (e.g., stock prices, temperature). The following
metrics evaluate their performance:
1. Pearson Correlation Coefficient (r)

Measures the strength and direction of the relationship between actual and predicted values.

2. Coefficient of Determination (R² Score)

Indicates how much variance in the target variable is explained by the model.

3. Mean Squared Error (MSE)

The average squared differences between actual and predicted values.

4. Root Mean Squared Error (RMSE)

The square root of MSE, measuring the standard deviation of errors.

5. Root Mean Squared Relative Error (RMSRE)

Similar to RMSE but considers relative errors.

6. Mean Absolute Error (MAE)

The average of absolute differences between actual and predicted values.


7. Mean Absolute Percentage Error (MAPE)

Measures prediction accuracy as a percentage of actual values.
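The code below computes MAE, MSE, RMSE, and R² with scikit-learn; MAPE and RMSRE can be computed directly from their definitions. A minimal sketch with hypothetical arrays (not taken from the lab dataset):

import numpy as np

y_true = np.array([120.0, 150.0, 90.0, 200.0])   # hypothetical actual values
y_pred = np.array([110.0, 160.0, 95.0, 190.0])   # hypothetical predictions

mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100      # Mean Absolute Percentage Error
rmsre = np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))   # Root Mean Squared Relative Error
print(f"MAPE: {mape:.2f}%, RMSRE: {rmsre:.4f}")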

Code & Output:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load dataset
df = pd.read_csv("supermarket_sales.csv") # Update the path if
required

# Display first few rows


print("Dataset Preview:\n", df.head())
# Convert categorical target column into numeric (e.g., "Payment" as classification label)
df['Payment'] = df['Payment'].astype('category').cat.codes  # Convert to numeric labels

# Feature selection for Classification


X_class = df[['Unit price', 'Quantity', 'Tax 5%', 'gross income']]
# Select numerical features
y_class = df['Payment'] # Classification Target

# Split classification data


X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(X_class, y_class, test_size=0.2, random_state=42)

# Train Logistic Regression Model


clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_class, y_train_class)
y_pred_class = clf.predict(X_test_class)

# Classification Metrics Calculation


accuracy = accuracy_score(y_test_class, y_pred_class)
precision = precision_score(y_test_class, y_pred_class,
average='weighted')
recall = recall_score(y_test_class, y_pred_class,
average='weighted')
f1 = f1_score(y_test_class, y_pred_class, average='weighted')

# Confusion Matrix
conf_matrix = confusion_matrix(y_test_class, y_pred_class)

# ROC Curve
fpr, tpr, _ = roc_curve(y_test_class,
clf.predict_proba(X_test_class)[:, 1], pos_label=1)
roc_auc = auc(fpr, tpr)

# Print Classification Metrics


print("\n--- Classification Metrics ---")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall (Sensitivity): {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
# Plot Confusion Matrix
plt.figure(figsize=(5,4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap="Blues")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()

# Plot ROC Curve


plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
# Feature selection for Regression
X_reg = df[['Unit price', 'Quantity', 'Tax 5%']]  # Independent variables
y_reg = df['Total']  # Dependent variable

# Split regression data


X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Train Linear Regression Model


reg = LinearRegression()
reg.fit(X_train_reg, y_train_reg)
y_pred_reg = reg.predict(X_test_reg)

# Regression Metrics Calculation


mae = mean_absolute_error(y_test_reg, y_pred_reg)
mse = mean_squared_error(y_test_reg, y_pred_reg)
rmse = np.sqrt(mse)
r2 = r2_score(y_test_reg, y_pred_reg)

# Print Regression Metrics


print("\n--- Regression Metrics ---")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"R² Score: {r2:.4f}")
# Plot Regression Predictions vs Actual
plt.figure(figsize=(6,4))
plt.scatter(y_test_reg, y_pred_reg, alpha=0.6, color="blue")
plt.plot(y_test_reg, y_test_reg, color="red", linestyle="--")  # Perfect fit line
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.title("Regression Predictions vs Actual")
plt.show()
EXPERIMENT 5

AIM: Implement and explore performance evaluation metrics for data models
(Unsupervised Learning)

Theory:

Clustering is an unsupervised learning technique used to group similar data points based on
certain features. Unlike classification, clustering does not rely on predefined labels, making
evaluation more challenging. To assess the performance of clustering models, various
internal and external evaluation metrics are used. Below are the key metrics used in
clustering evaluation:

1. Rand Index (RI): The Rand Index (RI) measures the similarity between the predicted
clustering assignments and the true class labels. It evaluates how well the clustering model
has assigned data points by comparing pairs of samples.

2. Adjusted Rand Index (ARI): The Adjusted Rand Index (ARI) is an improved version of the
Rand Index that accounts for the probability of random clustering. It ensures that the score
remains close to zero for random assignments and one for perfect clustering.

3. Mutual Information (MI): Mutual Information (MI) measures the amount of shared
information between the true labels and predicted clusters. It is based on entropy and
evaluates how much knowing one variable (true labels) reduces uncertainty about the other
variable (predicted clusters).
4. Silhouette Coefficient: The Silhouette Coefficient is an internal clustering metric that
measures how similar a data point is to its own cluster compared to other clusters. It is
based on the distances between data points.
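The Code section below implements these four metrics from scratch. For comparison, scikit-learn provides reference implementations of the same ideas (rand_score requires scikit-learn 0.24 or newer); the short sketch below uses hypothetical label lists, not the Iris clustering from the lab code:

from sklearn.metrics import rand_score, adjusted_rand_score, mutual_info_score

labels_true = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth labels
labels_pred = [0, 0, 1, 2, 2, 2]   # hypothetical cluster assignments

print("Rand Index:", rand_score(labels_true, labels_pred))
print("Adjusted Rand Index:", adjusted_rand_score(labels_true, labels_pred))
print("Mutual Information:", mutual_info_score(labels_true, labels_pred))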

Code:

import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from collections import Counter
import math

# Load the Iris dataset from CSV


file_path = "Iris.csv" # Update this with the actual path if
needed
df = pd.read_csv(file_path)

# Drop the "Id" column if it exists (depends on the dataset


format)
if "Id" in df.columns:
df = df.drop(columns=["Id"])

# Extract features and true labels


X = df.iloc[:, :-1].values # Features (Sepal & Petal dimensions)
y_true = df.iloc[:, -1].astype('category').cat.codes.values  # Convert species names to numeric labels

# Standardize the features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans Clustering


kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X_scaled)

### **1. Implementing Rand Index (RI) and Adjusted Rand Index (ARI)**
def rand_index(y_true, y_pred):
    pairs = list(combinations(range(len(y_true)), 2))
    a = sum((y_true[i] == y_true[j]) and (y_pred[i] == y_pred[j]) for i, j in pairs)
    b = sum((y_true[i] != y_true[j]) and (y_pred[i] != y_pred[j]) for i, j in pairs)
    return (a + b) / len(pairs)

def adjusted_rand_index(y_true, y_pred):
    n = len(y_true)
    clusters_true = Counter(y_true)
    clusters_pred = Counter(y_pred)

    sum_comb_c = sum(math.comb(v, 2) for v in clusters_true.values())  # Combinations in actual clusters
    sum_comb_k = sum(math.comb(v, 2) for v in clusters_pred.values())  # Combinations in predicted clusters
    sum_comb_ck = sum(math.comb(sum((y_true[i] == c and y_pred[i] == k) for i in range(n)), 2)
                      for c in clusters_true for k in clusters_pred)

    expected_index = (sum_comb_c * sum_comb_k) / math.comb(n, 2)
    max_index = 0.5 * (sum_comb_c + sum_comb_k)

    return (sum_comb_ck - expected_index) / (max_index - expected_index)

### **2. Implementing Mutual Information (MI)**
def mutual_information(y_true, y_pred):
    n = len(y_true)
    clusters_true = Counter(y_true)
    clusters_pred = Counter(y_pred)

    mi = 0
    for c in clusters_true:
        for k in clusters_pred:
            n_ck = sum((y_true[i] == c and y_pred[i] == k) for i in range(n))
            if n_ck > 0:
                p_ck = n_ck / n
                p_c = clusters_true[c] / n
                p_k = clusters_pred[k] / n
                mi += p_ck * math.log(p_ck / (p_c * p_k))

    return mi

### **3. Implementing Silhouette Coefficient**
def silhouette_coefficient(X, labels):
    n = len(X)
    unique_clusters = set(labels)

    def euclidean_distance(p1, p2):
        return np.linalg.norm(p1 - p2)

    silhouette_scores = []

    for i in range(n):
        same_cluster = [X[j] for j in range(n) if labels[j] == labels[i] and i != j]
        other_clusters = {c: [X[j] for j in range(n) if labels[j] == c]
                          for c in unique_clusters if c != labels[i]}

        if same_cluster:
            a_i = np.mean([euclidean_distance(X[i], p) for p in same_cluster])  # Intra-cluster distance
        else:
            a_i = 0

        b_i = min(np.mean([euclidean_distance(X[i], p) for p in cluster])
                  for cluster in other_clusters.values()) if other_clusters else 0

        s_i = (b_i - a_i) / max(a_i, b_i) if max(a_i, b_i) > 0 else 0
        silhouette_scores.append(s_i)

    return np.mean(silhouette_scores)

# Compute evaluation metrics


rand_idx = rand_index(y_true, y_pred)
adjusted_rand = adjusted_rand_index(y_true, y_pred)
mutual_info = mutual_information(y_true, y_pred)
silhouette_coeff = silhouette_coefficient(X_scaled, y_pred)

# Print results
print(f"Rand Index: {rand_idx:.4f}")
print(f"Adjusted Rand Index: {adjusted_rand:.4f}")
print(f"Mutual Information: {mutual_info:.4f}")
print(f"Silhouette Coefficient: {silhouette_coeff:.4f}")

Output:
EXPERIMENT 6

AIM: Implement time series forecasting

Theory:

Time series analysis and forecasting are essential for understanding patterns in data
collected over time. By studying historical trends, businesses and researchers can make
informed decisions about future events. The key reasons for conducting time series analysis
and forecasting include:

1. Differentiating Between Random Fluctuations and Outliers

Time series data often contains short-term irregular variations that can mislead
decision-making. By analyzing the data, we can determine whether a sudden change is a
natural fluctuation or an outlier. Outliers are extreme values that deviate significantly from
the trend and may be caused by unexpected events, errors in data collection, or external
influences. Identifying outliers helps in making more accurate predictions and avoiding
misleading conclusions.

2. Separating Genuine Insights from Seasonal Variations

Many datasets exhibit seasonality, meaning they follow a recurring pattern over specific time
intervals (e.g., daily, monthly, yearly). For example, sales of winter clothing increase in colder
months and decline in summer. Without time series analysis, it can be difficult to distinguish
whether a trend is genuine growth or just a seasonal effect. By decomposing the data into
trend, seasonal, and residual components, we can better understand the underlying behavior
and make accurate forecasts.

3. Understanding How Data Changes Over Time

Time series analysis helps in recognizing how a variable evolves over different periods. For
instance, stock prices, temperature variations, or economic indicators change continuously,
and analyzing past patterns helps predict future movements. By examining trends,
seasonality, and irregular components, we can determine whether a process is stable,
increasing, or declining over time, which aids in better planning and decision-making.

4. Identifying the Direction of Change in Data

One of the primary objectives of time series forecasting is to understand the direction in
which data is heading. Trends indicate whether a variable is experiencing consistent growth,
decline, or stability over time. For example, if a company's revenue shows a long-term
upward trend despite short-term fluctuations, it suggests business growth. Identifying these
trends helps organizations plan future strategies, allocate resources effectively, and respond
to market changes efficiently.
Code & Output:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import warnings
warnings.filterwarnings("ignore")

# Load dataset
df = pd.read_csv("supermarket_sales.csv", parse_dates=['Date'])
df.set_index('Date', inplace=True)
print(df.info())

# Decompose time series data


result = seasonal_decompose(df['Unit price'], model='additive',
period=12)
result.plot()
plt.show()
# ACF plot
plt.figure(figsize=(12, 5))
plot_acf(df['Quantity'].dropna(), lags=30)
plt.show()

# PACF plot
plt.figure(figsize=(12, 5))
plot_pacf(df['Quantity'].dropna(), lags=30)
plt.show()
# ADF test
adf_result = adfuller(df['Unit price'].dropna())
print(f"ADF Statistic: {adf_result[0]}")
print(f"p-value: {adf_result[1]}")
print("Critical Values:")
for key, value in adf_result[4].items():
print(f" {key}: {value}")

# Time series forecasting using Linear Regression


x = df['Quantity'].values.reshape(-1, 1)
y = df['Total'].values.reshape(-1, 1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(y_pred)
# Moving average smoothing model
df['Moving_Avg'] = df['Quantity'].rolling(window=3).mean()
plt.figure(figsize=(10, 5))
plt.plot(df.index, df['Quantity'], label='Original')
plt.plot(df.index, df['Moving_Avg'], label='Moving Average',
linestyle='dashed')
plt.legend()
plt.show()

# ARIMA model
train_size = int(len(df) * 0.8)
train, test = df.iloc[:train_size], df.iloc[train_size:]

p, d, q = 1, 1, 1
arima_model = ARIMA(train['Unit price'], order=(p, d, q))
arima_fit = arima_model.fit()
predictions = arima_fit.forecast(steps=len(test))

plt.figure(figsize=(10, 5))
plt.plot(train.index, train['Unit price'], label='Train Data')
plt.plot(test.index, test['Unit price'], label='Test Data')
plt.plot(test.index, predictions, label='Predictions',
linestyle='dashed')
plt.legend()
plt.show()

# Model Evaluation
mae = mean_absolute_error(y_test, y_pred)
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Absolute Error (MAE): {mae}")


print(f"Mean Absolute Percentage Error (MAPE): {mape}%")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
EXPERIMENT 7

AIM: Outlier detection using distance based/density based method

Theory:

An outlier is a data point that significantly deviates from the rest of the dataset. It does not
conform to the general pattern of the data and may result from measurement errors,
variability in data, or genuine anomalies. Outliers can distort statistical analyses, affect
machine learning models, and lead to incorrect conclusions if not properly handled.

Types of Outliers

1.​ Global Outliers (Point Anomalies)​

○​ A data point that differs significantly from the entire dataset.​

○​ Example:​

■​ In a customer database, most users have an account balance between


$1,000 and $10,000, but one customer has a balance of
$500,000—this is a global outlier.​

2.​ Contextual Outliers (Conditional Anomalies)​

○​ A data point that is normal in one context but abnormal in another.​

○​ Example:​

■​ A high electricity bill of $500 in summer for an individual might be


normal due to air conditioning use, but the same bill in winter could
indicate an anomaly.​

3.​ Collective Outliers (Group Anomalies)​

○​ A group of data points that together deviate from the norm, even though
individual values might not be outliers.​

○​ Example:​

■​ A sudden burst of small transactions within seconds on a bank


account (e.g., 50 transfers of $5) may indicate a fraud attempt, even
though individual transactions appear normal.
Why is Outlier Detection Important?

1. Improves Data Quality and Accuracy

●​ Outliers can skew trends and forecasts in datasets, affecting data interpretation.​

●​ In business analytics, they can lead to poor strategic decisions based on incorrect
insights.​

2. Impact on Machine Learning Models

●​ Supervised learning: Outliers can bias model training, leading to incorrect predictions.​

●​ Unsupervised learning: Clustering algorithms like K-Means and DBSCAN may fail to
correctly classify data if outliers are present.​

3. Fraud Detection & Anomaly Detection

●​ Outlier detection is widely used in fraud detection for banking, insurance, and
cybersecurity.​

●​ Example:​

○​ A credit card being used in two different countries within minutes may
indicate fraud.​

4. Industrial & Medical Applications

●​ In manufacturing, outlier detection helps identify faulty products on an assembly line.​

●​ In healthcare, unusual spikes in patient vital signs could signal a medical emergency.
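In addition to the distance- and density-based methods used in the code below, a simple statistical baseline is the IQR rule, which flags points more than 1.5 × IQR outside the quartiles. A minimal sketch, assuming the same Churn_Modelling.csv file and its CreditScore column used in the lab code:

import pandas as pd

df = pd.read_csv("Churn_Modelling.csv")
q1, q3 = df['CreditScore'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # standard 1.5*IQR fences
outliers = df[(df['CreditScore'] < lower) | (df['CreditScore'] > upper)]
print(f"IQR fences: ({lower:.1f}, {upper:.1f}), flagged outliers: {len(outliers)}")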

Code & Output:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor
from sklearn.cluster import DBSCAN

# Load dataset
df = pd.read_csv("Churn_Modelling.csv")
# Distribution plot
sns.displot(df['Balance'], kde=True)
plt.xlabel('Balance')
plt.ylabel('Density')
plt.title('Distribution of Balance')
plt.show()

# Boxplot (Whisker plot)


sns.boxplot(y=df['CreditScore'])
plt.xlabel('CreditScore')
plt.title('Boxplot of CreditScore')
plt.show()
# Scatter plot
plt.scatter(df['CreditScore'], df['Age'])
plt.xlabel('CreditScore')
plt.ylabel('Age')
plt.title('Scatter plot of CreditScore vs Age')
plt.show()
# Pie-chart for Tenure
data = df['Tenure'].value_counts().reset_index()
data.columns = ['Tenure', 'Count'] # Renaming for clarity

plt.pie(data['Count'], labels=data['Tenure'], autopct="%1.2f%%")


plt.legend()
plt.title("Distribution of Tenure")
plt.show()

# Distance-Based Outlier Detection


X = df[['CreditScore', 'Age']].values
nbrs = NearestNeighbors(n_neighbors=3)
nbrs.fit(X)
dist, indexes = nbrs.kneighbors(X)

plt.plot(dist.mean(axis=1))
plt.title("Mean Distance to Nearest Neighbors")
plt.show()
outlier_index = np.where(dist.mean(axis=1) >
np.percentile(dist.mean(axis=1), 95))
outlier_values = df.iloc[outlier_index]
print("Distance-based Outliers:")
print(outlier_values[['CreditScore', 'Age']])

# Plot outliers
plt.scatter(df['CreditScore'], df['Age'], color="b", label='Normal Data')
plt.scatter(outlier_values['CreditScore'], outlier_values['Age'], color="r", label='Outliers')
plt.xlabel('Credit Score')
plt.ylabel('Age')
plt.legend()
plt.title("Distance-Based Outlier Detection")
plt.show()
# Density-Based Outlier Detection (LOF)
lof = LocalOutlierFactor(n_neighbors=3, contamination=0.05)
preds = lof.fit_predict(X)
outliers = np.where(preds == -1)[0]
outlier_values = df.iloc[outliers]
print("LOF Outliers:")
print(outlier_values[['CreditScore', 'Age']])

# Plot LOF outliers


plt.scatter(df['CreditScore'], df['Age'], color="b", label='Normal Data')
plt.scatter(outlier_values['CreditScore'], outlier_values['Age'], color="r", label='LOF Outliers')
plt.xlabel('Credit Score')
plt.ylabel('Age')
plt.legend()
plt.title("Local Outlier Factor (LOF) Detection")
plt.show()
# Density-Based Outlier Detection (DBSCAN)
dbscan = DBSCAN(eps=5, min_samples=10)
df['Cluster'] = dbscan.fit_predict(df[['CreditScore', 'Age']])
colors = df['Cluster']

plt.scatter(df['CreditScore'], df['Age'], c=colors, cmap='rainbow')
plt.xlabel('Credit Score')
plt.ylabel('Age')
plt.title("DBSCAN Clustering for Outlier Detection")
plt.colorbar(label='Cluster')
plt.show()
# Removing Outliers (Trimming)
upper_limit = df['CreditScore'].quantile(0.99)
lower_limit = df['CreditScore'].quantile(0.01)

trimmed_df = df[(df['CreditScore'] <= upper_limit) & (df['CreditScore'] >= lower_limit)]

sns.boxplot(y=trimmed_df['CreditScore'])
plt.title('Boxplot after Trimming')
plt.show()
# Removing Outliers (Winsorization)
original_creditscore = df['CreditScore'].copy()  # Keep a copy so the "before" plot shows unwinsorized data
df['CreditScore'] = np.where(df['CreditScore'] >= upper_limit, upper_limit,
                             np.where(df['CreditScore'] <= lower_limit, lower_limit, df['CreditScore']))

# Boxplot before and after Winsorization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(y=original_creditscore, ax=axes[0])
axes[0].set_title('Boxplot before Winsorization')
sns.boxplot(y=df['CreditScore'], ax=axes[1])
axes[1].set_title('Boxplot after Winsorization')
plt.show()
EXPERIMENT 8

AIM: Use SMOTE technique to generate synthetic data (to solve the problem of class
imbalance)

Theory:
Class imbalance occurs when one class in a dataset has significantly more samples than
another. This imbalance can lead to biased models that favor the majority class while poorly
predicting the minority class. Several techniques can be used to handle class imbalance:

1. Data-Level Approaches

These methods focus on modifying the dataset before training the model.

a. Oversampling:

●​ Increases the number of samples in the minority class.​

●​ Example: SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic


samples instead of duplicating existing ones.​

b. Undersampling:

●​ Reduces the number of samples in the majority class to balance the dataset.​

●​ Example: Random Undersampling (RUS) removes some majority class samples.​

●​ Risk: May lose valuable information.​

c. Hybrid Methods:

●​ Combine oversampling and undersampling to achieve better balance.​

●​ Example: SMOTE + Tomek Links removes overlapping majority class samples after
applying SMOTE.

2. Algorithm-Level Approaches

These methods modify the learning algorithm to be more sensitive to class imbalance.

a. Cost-Sensitive Learning:

●​ Assigns a higher misclassification cost to the minority class.​

●​ Used in Decision Trees, SVMs, and Neural Networks.​


b. Class Weight Adjustment:

●​ Adjusts class weights in algorithms like Random Forest and Logistic Regression
(class_weight="balanced" in Scikit-Learn).​

c. Anomaly Detection Methods:

●​ Treats the minority class as an anomaly.​

●​ Example: Isolation Forest, One-Class SVM.​

3. Ensemble Methods

These methods combine multiple models to improve minority class prediction.

a. Balanced Bagging & Boosting:

●​ Randomly samples from both classes in every iteration.​

●​ Example: Balanced Random Forest, AdaBoost with balanced weights.​

b. EasyEnsemble & BalanceCascade:

●​ Train multiple classifiers on different balanced subsets.​

Working of SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is an oversampling technique that generates synthetic samples for the minority
class instead of simply duplicating existing data points.

Steps in SMOTE Algorithm:

1. Select a minority class sample randomly.
2. Find its k-nearest neighbors (default k = 5).
3. Randomly choose one of these k-nearest neighbors.
4. Generate a new synthetic point along the line connecting the chosen sample and its neighbor.
5. Repeat until the desired class balance is achieved.
Mathematical Formula:
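For a minority class sample x_i and a randomly chosen minority neighbor x_zi, the synthetic sample is
x_new = x_i + λ × (x_zi − x_i), where λ is drawn uniformly at random from [0, 1].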

Code & Output:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from imblearn.over_sampling import SMOTE
# Load the churn modeling dataset
data = pd.read_csv('Churn_Modelling.csv')
print("Initial dataset shape:", data.shape)
print("First few rows:")
print(data.head())

# Selecting features and target variable


target_column = 'Exited'
features = ['CreditScore', 'Age', 'Tenure', 'Balance',
'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
X = data[features]
y = data[target_column]
# 1) Analyze the class label distribution
plt.figure(figsize=(6, 4))
sns.countplot(x=y, palette='coolwarm')
plt.title("Class Distribution Before SMOTE")
plt.show()
print("Class distribution before SMOTE:")
print(y.value_counts())

# Splitting dataset before handling imbalance


X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42, stratify=y)

# 2) Train model without handling imbalance


clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("\nEvaluation before applying SMOTE:")
print(classification_report(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred, average='weighted'))
# 3) Apply SMOTE to handle class imbalance
# Ensure y_train is properly formatted
y_train = np.array(y_train)

# Apply SMOTE to handle class imbalance


smote = SMOTE(sampling_strategy="auto", k_neighbors=3,
random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Analyzing class distribution after SMOTE


plt.figure(figsize=(6, 4))
sns.barplot(x=np.unique(y_resampled), y=np.bincount(y_resampled),
palette="coolwarm")
plt.title("Class Distribution After SMOTE")
plt.xlabel("Class Labels")
plt.ylabel("Count")
plt.show()

# Print class distribution


print("Class distribution after SMOTE:")
print(np.unique(y_resampled, return_counts=True))
# 4) Train model after applying SMOTE
clf_smote = RandomForestClassifier(random_state=42)
clf_smote.fit(X_resampled, y_resampled)
y_pred_smote = clf_smote.predict(X_test)
print("\nEvaluation after applying SMOTE:")
print(classification_report(y_test, y_pred_smote))
print("F1-Score:", f1_score(y_test, y_pred_smote,
average='weighted'))

print("\nComparison before and after SMOTE:")


print("Before SMOTE F1-Score:", f1_score(y_test, y_pred,
average='weighted'))
print("After SMOTE F1-Score:", f1_score(y_test, y_pred_smote,
average='weighted'))
EXPERIMENT 9

AIM: Inferential statistics - Hypothesis testing, Z test, T-test , Anova

Theory:

When working with data, understanding its statistical properties is crucial for making
informed decisions. Various inferential statistics techniques help analyze and interpret the
data, estimate population parameters, and test hypotheses. Below are key statistical
methods and concepts used in data analysis.

1. Inferential Statistics

Inferential statistics allows us to draw conclusions about a population based on a sample. It


includes distribution analysis, confidence intervals, and hypothesis testing to make
predictions and validate findings.

a) Distributions

A distribution represents how data values are spread over a range. Common probability
distributions include:

1. Normal Distribution (Gaussian Distribution):

●​ Bell-shaped curve, symmetric around the mean.​

●​ Many natural phenomena (e.g., height, weight) follow this distribution.​

●​ Defined by mean (μ) and standard deviation (σ).​

2. Poisson Distribution:

●​ Used for modeling rare events (e.g., number of calls at a help desk per hour).​

●​ Defined by a single parameter: λ (mean occurrence rate per interval).​

3. Population Parameters vs. Sample Statistics:

●​ Population parameters (e.g., population mean μ, population variance σ²) describe the
entire dataset.​

●​ Sample statistics (e.g., sample mean X̄, sample variance S²) estimate population
parameters based on a sample.​
4. Sampling Errors:

●​ The difference between a sample statistic and the true population parameter.​

●​ Caused by variability in the sample selection process.​

b) Confidence Intervals (CI)

A confidence interval provides a range within which we expect the true population parameter
to lie with a certain probability (e.g., 95%).

Formula for Confidence Interval for the Mean:
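CI = X̄ ± z* × (σ / √n), where z* is the critical value for the chosen confidence level (t* and the sample standard deviation s are used when σ is unknown or the sample is small).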

Example: A 95% confidence interval of (50, 60) means we are 95% confident that the
population mean lies between 50 and 60.

c) Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions about a population


parameter based on sample data.

Steps in Hypothesis Testing:

1. Define Null Hypothesis (H₀) and Alternative Hypothesis (H₁).​


2. Choose a significance level (α), usually 0.05 (5%).​
3. Select the appropriate statistical test (Z-test, t-test, ANOVA, etc.).​
4. Calculate the test statistic and p-value.​
5. Compare the p-value with α:

●​ If p ≤ α, reject H₀ (significant result).​

●​ If p > α, fail to reject H₀ (not significant).​


1) Z-Test

●​ Used when sample size (n) > 30 and population standard deviation is known.​

●​ Tests whether the sample mean significantly differs from the population mean.​

Formula for Z-test:
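Z = (X̄ − μ) / (σ / √n), where X̄ is the sample mean, μ the hypothesized population mean, σ the population standard deviation, and n the sample size.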

Example: Checking if the average height of students differs from 170 cm.

2) T-Test

●​ Used when sample size (n) < 30 or population standard deviation is unknown.​

●​ Compare the means of one or two samples.​

Types of t-tests:​
1. One-sample t-test: Compares sample mean to a known population mean.​
2. Independent (two-sample) t-test: Compares means of two independent groups.​
3. Paired t-test: Compares means of the same group before and after a treatment.

Formula for t-test:
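t = (X̄ − μ) / (s / √n) for a one-sample test, where s is the sample standard deviation; the two-sample test uses the difference of group means divided by the pooled standard error.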

Example: Comparing the effectiveness of two different drugs on blood pressure reduction.
3) ANOVA (Analysis of Variance)

●​ Used to compare means of three or more groups.​

●​ Determines if at least one group mean is significantly different from the others.​

Types of ANOVA:​
1. One-Way ANOVA: Compares means across one factor (e.g., test scores of students in
three different schools).​
2. Two-Way ANOVA: Compares means across two factors (e.g., test scores based on both
school and gender).

Formula for ANOVA (F-statistic):
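F = MS_between / MS_within, the ratio of the variance between group means to the variance within groups.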

●​ A higher F-value suggests significant differences between groups.​

Example: Testing if different fertilizers affect plant growth differently.

Code & Output:


import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
sales_df = pd.read_csv('supermarket_sales.csv')
print("Initial dataset shape:", sales_df.shape)
print("First few rows of the dataset:\n", sales_df.head())
# a) Distributions - Checking Normal and Poisson Distributions
plt.figure(figsize=(10, 5))
plt.hist(sales_df['Total'], bins=30, alpha=0.7, color='blue',
label='Total Sales')
plt.title("Distribution of Total Sales")
plt.xlabel("Total Sales")
plt.ylabel("Frequency")
plt.legend()
plt.show()

# Checking Population Parameters and Sampling Errors


sample = sales_df['Total'].sample(50, random_state=42)
print("Sample Mean:", sample.mean())
print("Population Mean:", sales_df['Total'].mean())
print("Sampling Error:", abs(sample.mean() -
sales_df['Total'].mean()))
# b) Confidence Intervals
confidence = 0.95
mean_total = np.mean(sample)
stdev_total = np.std(sample, ddof=1)
n = len(sample)
margin_of_error = stats.t.ppf((1 + confidence) / 2, n - 1) * (stdev_total / np.sqrt(n))
print("\nConfidence Interval for Total Sales:")
print(f"{confidence*100}% CI: ({mean_total - margin_of_error}, {mean_total + margin_of_error})")

# c) Hypothesis Testing

# Z-Test: Comparing mean 'Total' sales between Male and Female customers
print("\nZ-Test (Male vs Female Total Sales):")
print("Null Hypothesis (H0): There is no significant difference in
total sales between Male and Female customers.")
print("Alternative Hypothesis (H1): There is a significant
difference in total sales between Male and Female customers.")

male_total = sales_df[sales_df['Gender'] == 'Male']['Total']


female_total = sales_df[sales_df['Gender'] == 'Female']['Total']

z_stat, z_p_value = stats.ttest_ind(male_total, female_total)  # With large samples, the independent t-test approximates a z-test


print("Z-Statistic:", z_stat)
print("P-Value:", z_p_value)
print("Conclusion:", "Reject Null Hypothesis" if z_p_value < 0.05
else "Fail to Reject Null Hypothesis")

# T-Test: Comparing average 'Unit price' between Member and Normal customers
print("\nT-Test (Member vs Normal Customer - Unit Price):")
print("Null Hypothesis (H0): There is no significant difference in unit price between Member and Normal customers.")
print("Alternative Hypothesis (H1): There is a significant difference in unit price between Member and Normal customers.")
member_price = sales_df[sales_df['Customer type'] == 'Member']['Unit price']
normal_price = sales_df[sales_df['Customer type'] == 'Normal']['Unit price']

t_stat, t_p_value = stats.ttest_ind(member_price, normal_price)


print("T-Statistic:", t_stat)
print("P-Value:", t_p_value)
print("Conclusion:", "Reject Null Hypothesis" if t_p_value < 0.05
else "Fail to Reject Null Hypothesis")

# ANOVA: Comparing 'Rating' across Branches A, B, and C
print("\nANOVA (Comparing Rating across Branches A, B, and C):")
print("Null Hypothesis (H0): There is no significant difference in customer ratings across different branches.")
print("Alternative Hypothesis (H1): There is a significant difference in customer ratings across different branches.")

branch_a = sales_df[sales_df['Branch'] == 'A']['Rating']


branch_b = sales_df[sales_df['Branch'] == 'B']['Rating']
branch_c = sales_df[sales_df['Branch'] == 'C']['Rating']

anova_stat, anova_p_value = stats.f_oneway(branch_a, branch_b, branch_c)
print("ANOVA Statistic:", anova_stat)
print("P-Value:", anova_p_value)
print("Conclusion:", "Reject Null Hypothesis" if anova_p_value <
0.05 else "Fail to Reject Null Hypothesis")

# Boxplot for Rating Distribution across Branches


plt.figure(figsize=(8, 5))
sales_df.boxplot(column='Rating', by='Branch', grid=False,
patch_artist=True)
plt.title("Rating Distribution across Branches")
plt.suptitle("")
plt.xlabel("Branch")
plt.ylabel("Rating")
plt.show()
print("Hypothesis Testing Completed Successfully!")
EXPERIMENT 10

AIM: Demonstration of selected case study

Theory:

Code & Output:


Case Study: Exploratory Data Analysis of Zomato Restaurant Listings in
India

1. Objective
The purpose of this case study is to conduct a comprehensive exploratory data analysis
(EDA) on a dataset containing restaurant-related information from Zomato. As a group, our
objective is to examine the structure and characteristics of the data, identify underlying
patterns, and gain meaningful insights. We aim to explore aspects such as:
- Popular types of cuisines and restaurant categories
- Average ratings across different cities
- Distribution of cost for two people
- Influence of online delivery and table booking on customer ratings

Through this study, we intend to develop proficiency in using data analysis tools,
conducting descriptive statistics, and presenting data-driven conclusions with appropriate
visualizations.

2. Dataset Description
Dataset Used: Zomato Restaurants Dataset – Kaggle Link
(https://www.kaggle.com/datasets/PromptCloudHQ/zomato-restaurants-data)

This dataset consists of approximately 9500 records of restaurants listed on Zomato. The
data covers various attributes such as restaurant name, location, cuisines offered, average
cost for two people, rating metrics, availability of online delivery and table booking, and
customer votes. These features offer a balanced mix of numerical and categorical data for
analysis.

The dataset primarily represents restaurant data from Indian cities, though some entries
from other countries are included. It serves as a suitable dataset to understand consumer
preferences, market segmentation, and regional trends in the restaurant industry.

3. Tools and Technologies Required


To conduct this analysis, we as a group made use of the following tools and libraries:

- Python
- Jupyter Notebook or Google Colab
- Pandas and NumPy for data handling
- Matplotlib and Seaborn for data visualization
4. Steps in the Case Study

Step 1: Data Loading and Initial Exploration


 - Load dataset using Pandas.
 - Display first few records using head().
 - Check data types, null values, and basic statistics using .info(), .describe(),
.isnull().sum().

Step 2: Data Cleaning


 - Drop irrelevant columns like Restaurant ID, Address, etc.
 - Handle missing values (either fill or drop).
 - Convert data types if needed.
 - Normalize text fields (lowercase, remove extra spaces).

Step 3: Univariate Analysis


 - Distribution of restaurant ratings (pie chart and histogram).
 - Most common cuisines (bar chart of top 10).
 - Analyze average cost for two (histogram).

Step 4: Bivariate Analysis


 - Boxplot of rating vs city.
 - Average rating for online delivery vs non-online delivery restaurants.
 - Comparison of average rating for restaurants with and without table booking.
 - Scatter plot of average cost vs rating.

Step 5: City-Wise Analysis


 - Most restaurants per city (bar chart).
 - Average cost per city (bar chart).
 - Average rating per city (bar chart).

Step 6: Insights and Observations


 - Summarize key findings such as city-wise trends, cuisine popularity, impact of
delivery/table booking on ratings.

Step 7: Conclusion
 - Explain how users and restaurant owners can benefit from these insights.
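A minimal sketch of Steps 1-5 is given below. The file name (zomato.csv), the encoding, and the column names (Cuisines, City, Aggregate rating, Has Online delivery) are assumptions based on the Kaggle dataset description and may need to be adjusted to match the downloaded file:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Steps 1-2: load and inspect (assumed file name and encoding)
df = pd.read_csv("zomato.csv", encoding="latin-1")
print(df.info())
print(df.isnull().sum())

# Step 3: top 10 cuisines (assumed 'Cuisines' column with comma-separated values)
df['Cuisines'].str.split(', ').explode().value_counts().head(10).plot(kind='bar')
plt.title("Top 10 Cuisines")
plt.show()

# Step 4: rating vs online delivery (assumed column names)
sns.boxplot(x='Has Online delivery', y='Aggregate rating', data=df)
plt.title("Rating vs Online Delivery")
plt.show()

# Step 5: average rating for the 10 cities with the most restaurants
top_cities = df['City'].value_counts().head(10).index
df[df['City'].isin(top_cities)].groupby('City')['Aggregate rating'].mean().plot(kind='bar')
plt.title("Average Rating per City")
plt.show()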

5. Deliverables
- Python notebook with code and plots
- PDF report or PPT summarizing:
• Objectives
• Visualizations
• Key insights
• Conclusion
6. Optional Extensions
- Create interactive visualizations using Plotly.
- Use WordCloud to visualize frequent cuisine words.
- Use geopandas or folium to plot restaurant locations (if latitude/longitude data is
available).

7. Sample Visualizations
Here are some sample visualizations that can enhance the report and provide clear insights:

 - Bar chart: Top 10 cuisines served across restaurants


 - Pie chart: Distribution of rating text (Excellent, Good, etc.)
 - Box plot: Aggregate rating comparison across cities
 - Scatter plot: Relationship between average cost for two and restaurant rating
 - Heatmap: Correlation between numerical variables like votes, rating, and cost
8. Learning Outcomes
Upon completion of this case study, our group will have achieved the following learning
outcomes:

 - Perform end-to-end EDA on a real-world dataset


 - Identify key patterns and draw actionable insights from data
 - Generate visual reports using matplotlib and seaborn
 - Understand relationships between different types of restaurant features
