
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

BELAGAVI, KARNATAKA-590018

Department of Computer Science and Engineering

Laboratory Report on

“COMPUTATIONAL STATISTICS LAB”


A Laboratory report submitted to Visvesvaraya Technological University in partial fulfillment of the
requirements for the award of the Degree of

BACHELOR OF TECHNOLOGY
In

“Computer Science and Business Systems”

Submitted By

NAME: Laxmi B Magadum

USN: 2VX22CB023

Department of Computer Science and Engineering

VTU Belagavi-590018.
ACADEMIC YEAR 2024-2025
Visvesvaraya Technological University, Belagavi

Department of Computer Science and Engineering

CERTIFICATE

This is to certify that Mr/Ms. Laxmi B Magadum, bearing

USN 2VX22CB023 and studying in V semester B.Tech ("Computer Science and Business Systems"), has
presented and successfully completed the Laboratory Report titled "COMPUTATIONAL STATISTICS LAB"
in the presence of the undersigned examiners, in partial fulfillment of the requirements for the award of the B.Tech. degree of
VTU, Belagavi, for the academic year 2024-25.

________________________ _____________________________

Staff In Charge Course Coordinator


Dept of CSE, VTU, Belagavi. Dept of CSE, VTU Belagavi.

Examiner- 1 Examiner- 2
Name: Name:

Signature with Date: Signature with Date:


INDEX

SL NO.  DATE  NAME OF THE EXPERIMENT  PAGE NO.  SIGN

1. Program on data wrangling: Combining and merging datasets, Reshaping and Pivoting.
2. Program on Data Transformation: String Manipulation, Regular Expressions.
3. Program on Time series: GroupBy mechanics to display in data vector, multivariate time series and forecasting formats.
4. Program to measure central tendency and measures of dispersion: Mean, Median, Mode, Standard Deviation, Variance, Mean deviation and Quartile deviation for a frequency distribution/data.
5. Program to perform cross validation for a given dataset to measure Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R² Error using Validation Set, Leave One Out Cross-Validation (LOOCV) and K-fold Cross-Validation approaches.
6. Program to display Normal, Binomial, Poisson and Bernoulli distributions for a given frequency distribution and analyze the results.
7. Program to implement one sample, two sample and paired sample t-tests for sample data and analyse the results.
8. Program to implement One-way and Two-way ANOVA tests and analyse the results.
9. Program to implement correlation, rank correlation and regression and plot x-y plots and heat maps of correlation matrices.
10. Program to implement PCA for the Wisconsin dataset, visualize and analyze the results.
11. Program to implement the working of linear discriminant analysis using the iris dataset and visualize the results.
12. Program to implement multiple linear regression using the iris dataset, visualize and analyze the results.

1) Program on data wrangling: Combining and merging datasets, Reshaping and Pivoting.

Data wrangling is a critical process in data analysis, where data is transformed into a
structured, usable format. This program demonstrates key operations in data
wrangling: combining and merging datasets, reshaping, pivoting, handling missing
data, and generating summary statistics.
1. Combining and Merging Datasets
Combining and merging datasets are essential for integrating multiple data sources.
Merging: Combines datasets based on a common key using an inner join, retaining
only matching rows. This ensures focused analysis on shared data points.
Concatenation: Stacks datasets vertically, appending new records to create a unified
dataset.
2. Reshaping Data with Melt
Reshaping is used to change the layout of a dataset to suit specific analytical needs.
The melt operation converts wide-format data into long format, turning columns into
rows. This format is ideal for grouping, filtering, and visualizing data across variables.
3. Pivoting Data
Pivoting reverses the melting process, converting long-format data back into wide
format. It summarizes data by using one column as the index and another as the
columns. This transformation is particularly useful for summarizing data in a
matrix-like format, which is easier to interpret for certain statistical analyses.
4. Handling Missing Data
Missing values are replaced with column means to ensure completeness. This
maintains data integrity for further analysis.
5. Summary Statistics
The program concludes by calculating summary statistics (e.g., mean, standard
deviation, min, max) for the filled dataset. Summary statistics provide insights into
the dataset's central tendency, dispersion, and overall distribution.

# Import necessary libraries


import pandas as pd
import numpy as np
# 1. Combining and Merging Datasets
# Create two sample DataFrames
sales_data_1 = pd.DataFrame({
'OrderID': [1, 2, 3, 4],
'Product': ['Laptop', 'Tablet', 'Smartphone', 'Headphones'],
'Sales': [1200, 800, 1500, 300]
})

sales_data_2 = pd.DataFrame({
'OrderID': [3, 4, 5, 6],
'Product': ['Smartphone', 'Headphones', 'Smartwatch', 'Tablet'],
'Sales': [1500, 300, 200, 900]
})
# Display the DataFrames
print("Sales Data 1:\n", sales_data_1)
print("\nSales Data 2:\n", sales_data_2)
# Merge DataFrames based on 'OrderID' using an inner join
merged_data = pd.merge(sales_data_1, sales_data_2, on='OrderID', how='inner',
                       suffixes=('_left', '_right'))
print("\nMerged Data (Inner Join):\n", merged_data)
# Concatenate the DataFrames vertically
combined_data = pd.concat([sales_data_1, sales_data_2], ignore_index=True)
print("\nCombined Data (Concatenated Vertically):\n", combined_data)

# 2. Reshaping Data with Melt


# Create a sample DataFrame for reshaping
reshaping_data = pd.DataFrame({
'Month': ['Jan', 'Feb', 'Mar'],
'Product_A': [100, 150, 130],
'Product_B': [90, 80, 120]
})
print("\nReshaping Data (Original):\n", reshaping_data)
# Melt the DataFrame to reshape it from wide to long format
melted_data = pd.melt(reshaping_data, id_vars=['Month'], var_name='Product',
                      value_name='Sales')
print("\nMelted Data (Long Format):\n", melted_data)
# 3. Pivoting Data
# Create a sample DataFrame for pivoting
pivot_data = pd.DataFrame({
'Month': ['Jan', 'Jan', 'Feb', 'Feb', 'Mar', 'Mar'],
'Product': ['Product_A', 'Product_B', 'Product_A', 'Product_B', 'Product_A',
'Product_B'],
'Sales': [100, 90, 150, 80, 130, 120]
})
print("\nPivot Data (Original):\n", pivot_data)
# Pivot the DataFrame to reshape it back to wide format
pivoted_data = pivot_data.pivot(index='Month', columns='Product', values='Sales')
print("\nPivoted Data (Wide Format):\n", pivoted_data)
# 4. Handling Missing Data
# Introduce some missing values
pivoted_data.loc['Feb', 'Product_A'] = np.nan
pivoted_data.loc['Mar', 'Product_B'] = np.nan
print("\nPivoted Data with Missing Values:\n", pivoted_data)
# Fill missing values with the mean of each column
filled_data = pivoted_data.fillna(pivoted_data.mean())
print("\nFilled Data (Missing Values Handled):\n", filled_data)
# 5. Summary Statistics
print("\nSummary Statistics of Filled Data:\n", filled_data.describe())

OUTPUT:
Sales Data 1:
OrderID Product Sales
0 1 Laptop 1200
1 2 Tablet 800
2 3 Smartphone 1500
3 4 Headphones 300

Sales Data 2:
OrderID Product Sales
0 3 Smartphone 1500
1 4 Headphones 300
2 5 Smartwatch 200
3 6 Tablet 900

Merged Data (Inner Join):

OrderID Product_left Sales_left Product_right Sales_right
0 3 Smartphone 1500 Smartphone 1500
1 4 Headphones 300 Headphones 300

Combined Data (Concatenated Vertically):


OrderID Product Sales
0 1 Laptop 1200
1 2 Tablet 800
2 3 Smartphone 1500
3 4 Headphones 300
4 3 Smartphone 1500
5 4 Headphones 300
6 5 Smartwatch 200
7 6 Tablet 900

Reshaping Data (Original):


Month Product_A Product_B
0 Jan 100 90
1 Feb 150 80
2 Mar 130 120

Melted Data (Long Format):


Month Product Sales
0 Jan Product_A 100
1 Feb Product_A 150
2 Mar Product_A 130
3 Jan Product_B 90

4 Feb Product_B 80
5 Mar Product_B 120

Pivot Data (Original):


Month Product Sales
0 Jan Product_A 100
1 Jan Product_B 90
2 Feb Product_A 150
3 Feb Product_B 80
4 Mar Product_A 130
5 Mar Product_B 120

Pivoted Data (Wide Format):


Product Product_A Product_B
Month
Feb 150 80
Jan 100 90
Mar 130 120

Pivoted Data with Missing Values:


Product Product_A Product_B
Month
Feb NaN 80.0
Jan 100.0 90.0
Mar 130.0 NaN

Filled Data (Missing Values Handled):


Product Product_A Product_B
Month
Feb 115.0 80.0
Jan 100.0 90.0
Mar 130.0 85.0

Summary Statistics of Filled Data:


Product Product_A Product_B
count 3.0 3.0
mean 115.0 85.0
std 15.0 5.0
min 100.0 80.0
25% 107.5 82.5
50% 115.0 85.0
75% 122.5 87.5
max 130.0 90.0

2) Program on Data Transformation: String Manipulation, Regular Expressions.

Data transformation is an essential step in preprocessing text data for analysis. The
program demonstrates two critical techniques: string manipulation and regular
expressions (regex).
1. String Manipulation:
String manipulation involves performing operations on text data to clean or reformat
it for easier analysis. Common operations demonstrated include:
Trimming Spaces: Removes leading and trailing spaces for cleaner text.
Changing Case: Converts text to uppercase or lowercase to maintain consistency.
Counting Substrings: Counts occurrences of specific characters or words.
Replacing Text: Replaces specific words or patterns with desired text.
Finding and Splitting: Locates words in a string and splits the text into individual
words.
Checking Prefix/Suffix: Verifies if a string starts or ends with specific content.
These operations are fundamental in cleaning and reformatting raw textual data.
2. Regular Expressions (Regex):
Regex is a powerful tool for pattern matching and text extraction. Key operations
include:
Removing Special Characters: Cleans text by removing unwanted symbols while
retaining meaningful content like emails.
Converting Case: Ensures uniformity by converting text to lowercase.
Replacing Spaces: Replaces multiple spaces with a single space for better readability.
Pattern Matching: Finds specific patterns like words starting with vowels or extracting
emails.
Masking Sensitive Information: Replaces email addresses with placeholders to
anonymize data.
Applications: These techniques are widely used for cleaning, structuring, and
processing textual datasets.

String Manipulation:


# Sample text to work with
text = " Hello, World! Welcome to Python programming. "
# 1. Strip leading and trailing spaces
clean_text = text.strip()
print(f"Original Text: '{text}'")
print(f"Text a er stripping spaces: '{clean_text}'")
# 2. Convert the text to uppercase
upper_text = clean_text.upper()
print(f"\nText in uppercase: '{upper_text}'")
# 3. Convert the text to lowercase
lower_text = clean_text.lower()
print(f"\nText in lowercase: '{lower_text}'")
# 4. Count occurrences of a substring (e.g., "o")
count_o = clean_text.count("o")
print(f"\nNumber of occurrences of 'o': {count_o}")
# 5. Replace a word in the string
replaced_text = clean_text.replace("Python", "Data Science")
print(f"\nText a er replacing 'Python' with 'Data Science': '{replaced_text}'")
# 6. Find the posi on of a word in the string
posi on_world = clean_text.find("World")
print(f"\nPosi on of 'World' in the text: {posi on_world}")
# 7. Split the text into words (by default on spaces)
words = clean_text.split()
print(f"\nList of words in the text: {words}")
# 8. Join the words back into a single string
joined_text = " ".join(words)
print(f"\nText after joining words: '{joined_text}'")
# 9. Check if the text starts with "Hello"
starts_with_hello = clean_text.startswith("Hello")
print(f"\nDoes the text start with 'Hello'? {starts_with_hello}")
# 10. Check if the text ends with a specific word (e.g., "programming.")
ends_with_programming = clean_text.endswith("programming.")
print(f"\nDoes the text end with 'programming.'? {ends_with_programming}")

OUTPUT:
Original Text: ' Hello, World! Welcome to Python programming. '
Text after stripping spaces: 'Hello, World! Welcome to Python programming.'
Text in uppercase: 'HELLO, WORLD! WELCOME TO PYTHON PROGRAMMING.'
Text in lowercase: 'hello, world! welcome to python programming.'
Number of occurrences of 'o': 6
Text after replacing 'Python' with 'Data Science': 'Hello, World! Welcome to Data Science programming.'
Position of 'World' in the text: 7
List of words in the text: ['Hello,', 'World!', 'Welcome', 'to', 'Python', 'programming.']
Text after joining words: 'Hello, World! Welcome to Python programming.'
Does the text start with 'Hello'? True
Does the text end with 'programming.'? True

Regular Expressions:
import re
# Sample text
text = """
John's email is [email protected]. He said, "Python is awesome!!" It's a great language.
Another email: [email protected].
"""
# 1. Remove special characters except for spaces and email-related characters.
# Using regex to remove non-alphabetic characters and non-email symbols
clean_text = re.sub(r"[^a-zA-Z0-9@\.\s]", "", text)
print("Text after removing special characters:")
print(clean_text)
# 2. Convert the text to lowercase
clean_text = clean_text.lower()
print("\nText after converting to lowercase:")
print(clean_text)
# 3. Replace multiple spaces with a single space
clean_text = re.sub(r"\s+", " ", clean_text)
print("\nText after replacing multiple spaces:")
print(clean_text)
# 4. Extract all words starting with a vowel (a, e, i, o, u)
vowel_words = re.findall(r"\b[aeiouAEIOU]\w+", clean_text)
print("\nWords starting with a vowel:")
print(vowel_words)
# 5. Replace email addresses with '[email protected]'
masked_text = re.sub(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[email protected]", clean_text)
print("\nText after replacing emails:")
print(masked_text)

OUTPUT:
Text after removing special characters:
Johns email is email protected. He said Python is awesome Its a great language.
Another email email protected.

Text after converting to lowercase:
johns email is email protected. he said python is awesome its a great language.
another email email protected.

Text after replacing multiple spaces:
johns email is email protected. he said python is awesome its a great language. another email email protected.

Words starting with a vowel:
['email', 'is', 'email', 'is', 'awesome', 'its', 'another', 'email', 'email']

Text after replacing emails:
johns email is email protected. he said python is awesome its a great language. another email email protected.

3) Program on Time series: GroupBy Mechanics to display in data vector, multivariate time series and forecasting formats.

Time series analysis involves working with data collected over time, helping in
understanding patterns and making forecasts. The program demonstrates three key
aspects: GroupBy mechanics, data formats, and forecasting.
1. GroupBy Mechanics:
Time series data can be grouped to summarize and analyze trends over specific
intervals (e.g., months).
The program groups daily data by month using the resample method and calculates
the monthly mean.
This helps identify patterns or trends at a higher granularity, such as seasonal or
monthly variations.
2. Data Formats:
Vector Format: Displays a single variable (e.g., Value_A) as a sequence of values over
time, useful for analyzing one aspect of the dataset.
Multivariate Time Series: Includes multiple variables (e.g., Value_A and Value_B),
allowing for the analysis of relationships between variables over time.
3. Time Series Forecasting:
Uses the Holt-Winters Exponential Smoothing model to predict future values based
on historical data.
The program splits data into training and testing sets, fits the model to the training
data, and forecasts for the test period.
Results are visualized to compare actual values and predictions, aiding in
decision-making. The additive form of the model is sketched below.
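For reference, one common parameterization of additive Holt-Winters smoothing (the seasonal="add" variant used here) updates a level, a trend and a seasonal term, and forecasts h steps ahead (shown for h ≤ m, with season length m = 30 in this program):

\ell_t = \alpha\,(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1})
b_t = \beta\,(\ell_t - \ell_{t-1}) + (1-\beta)\,b_{t-1}
s_t = \gamma\,(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)\,s_{t-m}
\hat{y}_{t+h} = \ell_t + h\,b_t + s_{t+h-m}

statsmodels estimates the smoothing parameters \alpha, \beta, \gamma when .fit() is called.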
Applications:
Time series analysis is widely used in fields like finance, economics, and weather
forecasting.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Create sample time series data
np.random.seed(42)
date_range = pd.date_range(start="2022-01-01", end="2023-01-01", freq="D")
data = pd.DataFrame({
    "Date": date_range,
    "Value_A": np.random.normal(100, 10, len(date_range)),
    "Value_B": np.random.normal(200, 20, len(date_range)),
})
# Set Date as the index
data.set_index("Date", inplace=True)

# GroupBy Mechanics
def groupby_mechanics(data):
    print("\n--- GroupBy Mechanics ---")
    # Group data by month and calculate mean
    grouped = data.resample('ME').mean()
    print(grouped)
    return grouped

# Data Formats: Vector and Multivariate
def data_formats(data):
    print("\n--- Data Formats ---")
    # Display data as vector
    print("\nVector Format:")
    print(data["Value_A"].head())
    # Display multivariate time series
    print("\nMultivariate Time Series:")
    print(data.head())

# Forecasting Example
def time_series_forecasting(data):
    print("\n--- Forecasting ---")
    # Select a single column for forecasting
    ts = data["Value_A"]
    # Train-Test Split
    train = ts[:int(0.8 * len(ts))]
    test = ts[int(0.8 * len(ts)):]
    # Fit the Holt-Winters Exponential Smoothing model
    model = ExponentialSmoothing(train, seasonal="add", seasonal_periods=30).fit()
    # Forecast for the test period
    forecast = model.forecast(len(test))
    # Plot results
    plt.figure(figsize=(12, 6))
    plt.plot(train, label="Train")
    plt.plot(test, label="Test")
    plt.plot(forecast, label="Forecast")
    plt.legend()
    plt.title("Time Series Forecasting")
    plt.show()

# Main function
if __name__ == "__main__":
    print("--- Time Series Data ---")
    print(data.head())
    # Grouping Mechanics
    monthly_data = groupby_mechanics(data)
    # Data Formats
    data_formats(data)
    # Time Series Forecasting
    time_series_forecasting(data)

OUTPUT:
--- Time Series Data ---
Value_A Value_B
Date
2022-01-01 104.967142 204.481850
2022-01-02 98.617357 200.251848
2022-01-03 106.476885 201.953522
2022-01-04 115.230299 184.539804
2022-01-05 97.658466 200.490203

--- GroupBy Mechanics ---


Value_A Value_B
Date
2022-01-31 97.985125 202.137470
2022-02-28 98.568317 204.960833
2022-03-31 100.439383 194.956405
2022-04-30 99.797484 198.429574


2022-05-31 99.161855 199.020262
2022-06-30 102.912924 192.752508
2022-07-31 100.983406 199.844253
2022-08-31 99.784632 201.134556
2022-09-30 99.089296 203.720687
2022-10-31 100.649960 198.150774
2022-11-30 102.325711 199.427682
2022-12-31 99.467543 197.195680
2023-01-31 95.987795 180.432544

--- Data Formats ---

Vector Format:
Date
2022-01-01 104.967142
2022-01-02 98.617357
2022-01-03 106.476885
2022-01-04 115.230299
2022-01-05 97.658466
Name: Value_A, dtype: float64

Multivariate Time Series:


Value_A Value_B
Date
2022-01-01 104.967142 204.481850
2022-01-02 98.617357 200.251848


2022-01-03 106.476885 201.953522
2022-01-04 115.230299 184.539804
2022-01-05 97.658466 200.490203

4) Program to measure central tendency and measures of dispersion: Mean, Median, Mode, Standard Deviation, Variance, Mean deviation and Quartile deviation for a frequency distribution/data.

These measures are essential for understanding the distribution and variability of
data in a systematic way.
1. Central Tendency: These measures help identify the "center" or typical value of a
dataset:
Mean: The average of the data values, showing the overall central value.
Median: The middle value when the data is arranged in order, representing the
midpoint of the dataset.
Mode: The most frequently occurring value in the data, showing the most common
observation.
2. Dispersion: These measures describe how spread out the data is:
Variance: Shows how much the data values differ from the mean on average.
Standard Deviation: The square root of variance, indicating the average distance of
data from the mean.
Mean Deviation: The average of the absolute differences between data values and
the mean.
Quartile Deviation: Focuses on the variability of the middle 50% of the data. The
frequency-weighted formulas used for these measures are summarized below.
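For reference, with data values x_i, frequencies f_i, and N = \sum_i f_i, the program's weighted computations correspond to the standard formulas:

\bar{x} = \frac{\sum_i f_i x_i}{N}, \qquad
\sigma^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{N}, \qquad
\sigma = \sqrt{\sigma^2}

\text{Mean deviation} = \frac{\sum_i f_i\,\lvert x_i - \bar{x} \rvert}{N}, \qquad
\text{Quartile deviation} = \frac{Q_3 - Q_1}{2}

(Note that the program computes Q_1 and Q_3 from the distinct data values with np.percentile, i.e. without frequency weighting.)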
Program Working:
Input: The program takes two inputs: data values and their frequencies.
Processing: It calculates the measures of central tendency (mean, median, mode) and
dispersion (variance, standard deviation, etc.) using Python libraries like NumPy and
pandas.
Output: The program provides all the computed measures, giving insights into the
dataset's characteristics.
Advantages of Computational Statistics:
Efficiency: Automates complex calculations, saving time.
Accuracy: Reduces human error in computations.

# Import necessary libraries
import numpy as np
import pandas as pd

def calculate_statistics(data, frequencies):
    # Create a DataFrame for the frequency distribution
    df = pd.DataFrame({'Value': data, 'Frequency': frequencies})
    # Calculate the total number of observations
    total = df['Frequency'].sum()
    # Calculate mean
    df['Weighted_Value'] = df['Value'] * df['Frequency']
    mean = df['Weighted_Value'].sum() / total
    # Calculate median
    cumulative_frequency = df['Frequency'].cumsum()
    median_index = cumulative_frequency.searchsorted(total / 2)
    median = df['Value'][median_index]
    # Calculate mode
    mode = df['Value'][df['Frequency'].idxmax()]
    # Calculate variance and standard deviation
    variance = np.average((df['Value'] - mean) ** 2, weights=df['Frequency'])
    std_deviation = np.sqrt(variance)
    # Calculate mean deviation
    mean_deviation = np.average(np.abs(df['Value'] - mean), weights=df['Frequency'])
    # Calculate quartile deviation
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    quartile_deviation = (q3 - q1) / 2
    return {
        'Mean': mean,
        'Median': median,
        'Mode': mode,
        'Variance': variance,
        'Standard Deviation': std_deviation,
        'Mean Deviation': mean_deviation,
        'Quartile Deviation': quartile_deviation
    }

# Get user input for data and frequencies
data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): ")
# Convert input strings to lists of integers
data = list(map(int, data_input.split(',')))
frequencies = list(map(int, frequencies_input.split(',')))
# Calculate statistics
statistics = calculate_statistics(data, frequencies)
# Display the results
for stat, value in statistics.items():
    print(f"{stat}: {value:.2f}")

OUTPUT:
Enter the data values separated by commas (e.g., 10, 20, 30): 20, 40, 60
Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): 7, 8, 9
Mean: 41.67
Median: 40.00
Mode: 60.00
Variance: 263.89
Standard Deviation: 16.24
Mean Deviation: 13.75
Quartile Deviation: 10.00

5) Program to perform cross validation for a given dataset to measure Root Mean
Squared Error (RMSE), Mean Absolute Error (MAE) and R² Error using Validation Set,
Leave One Out Cross-Validation (LOOCV) and K-fold Cross-Validation approaches.
Cross-validation is a method to evaluate a model's performance by testing it on
different subsets of data. It ensures that the model generalizes well to unseen data.
The program calculates three key metrics for model evaluation (defined below):
1. Root Mean Squared Error (RMSE): Measures the average prediction error,
emphasizing larger errors.
2. Mean Absolute Error (MAE): Measures the average prediction error without
emphasizing outliers.
3. R² Score: Indicates how well the model explains the variability in the data.
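For n predictions \hat{y}_i against actual values y_i with mean \bar{y}, these metrics take the standard forms:

\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert, \qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}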
Cross-Validation Techniques
1. Validation Set Approach: Splits the data into training (80%) and validation (20%).
Tests the model on the validation set after training.
2. Leave-One-Out Cross-Validation (LOOCV): Uses one sample as the test set and the
rest for training. Repeats this process for all samples.
3. K-Fold Cross-Validation: Divides the data into k equal parts (folds). Trains on k-1 folds
and tests on the remaining fold, repeated k times.
Purpose: The program evaluates a linear regression model using these techniques
and calculates RMSE, MAE, and R² to compare performance. It ensures reliable and
unbiased model evaluation.

# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing

# Load the California housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Function to calculate and display metrics
def display_metrics(y_true, y_pred):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
    print(f"Mean Absolute Error (MAE): {mae:.4f}")
    print(f"R² Score: {r2:.4f}")
    return rmse, mae, r2

# Validation Set Approach
def validation_set_approach(X, y):
    print("Validation Set Approach:")
    # Split the dataset into training (80%) and validation (20%) sets
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Make predictions on the validation set
    y_pred = model.predict(X_val)
    # Display metrics
    display_metrics(y_val, y_pred)

# Leave-One-Out Cross-Validation (LOOCV) Approach
def loocv_approach(X, y):
    print("Leave-One-Out Cross-Validation (LOOCV):")
    loo = LeaveOneOut()
    y_true, y_pred = [], []
    # Loop through each sample using LOOCV
    for train_index, test_index in loo.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Initialize and train the model
        model = LinearRegression()
        model.fit(X_train, y_train)
        # Make prediction for the single test sample
        y_pred.append(model.predict(X_test)[0])
        y_true.append(y_test.iloc[0])
    # Display metrics
    display_metrics(y_true, y_pred)

# K-Fold Cross-Validation Approach
def kfold_approach(X, y, k=5):
    print(f"{k}-Fold Cross-Validation Approach:")
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    y_true, y_pred = [], []
    # Loop through each fold
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Initialize and train the model
        model = LinearRegression()
        model.fit(X_train, y_train)
        # Make predictions on the test set
        y_pred.extend(model.predict(X_test))
        y_true.extend(y_test)
    # Display metrics
    display_metrics(y_true, y_pred)

# Main function to run all approaches
def main():
    print("Cross-Validation for RMSE, MAE, and R²:\n")
    validation_set_approach(X, y)
    print("\n")
    loocv_approach(X, y)
    print("\n")
    kfold_approach(X, y, k=5)  # You can change k for different K-Fold Cross-Validation

# Execute the main function
if __name__ == "__main__":
    main()

OUTPUT:

Cross-Validation for RMSE, MAE, and R²:

Validation Set Approach:
Root Mean Squared Error (RMSE): 0.7456
Mean Absolute Error (MAE): 0.5332
R² Score: 0.5758

Leave-One-Out Cross-Validation (LOOCV):
Root Mean Squared Error (RMSE): 0.7268
Mean Absolute Error (MAE): 0.5317
R² Score: 0.6033

5-Fold Cross-Validation Approach:
Root Mean Squared Error (RMSE): 0.7284
Mean Absolute Error (MAE): 0.5317
R² Score: 0.6015

6) Program to display Normal, Binomial, Poisson and Bernoulli distributions for a given
frequency distribution.
Probability distributions describe how the values of a random variable are
distributed. They help in understanding the behavior of data and are essential in
statistics and data analysis. The program visualizes four key probability distributions
for a given frequency distribution.
Distributions Covered (the standard density/mass functions are given after this list)
1. Normal Distribution:
A continuous distribution forming a bell-shaped curve. It is symmetric about the
mean, and most data points cluster around the mean. Useful for modeling natural
phenomena.
2. Binomial Distribution:
A discrete distribution representing the number of successes in a fixed number of
trials. It depends on two parameters: the number of trials (n) and the probability of
success (p). Common in scenarios like flipping a coin or rolling a die.
3. Poisson Distribution:
A discrete distribution that models the number of events in a fixed interval of time or
space. It is characterized by the average rate (λ) of occurrence. Useful for modeling
rare events like system failures or call arrivals.
4. Bernoulli Distribution:
A discrete distribution representing a single trial with two outcomes: success or
failure. It is defined by the probability of success (p). Used in binary events like yes/no
or true/false.
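For reference, the functions plotted by the program have the standard forms:

Normal PDF: f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
Binomial PMF: P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, \dots, n
Poisson PMF: P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
Bernoulli PMF: P(X = 1) = p, \quad P(X = 0) = 1 - p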
Purpose: Accepts user input for data values and their frequencies.
Visualizes the probability density function (PDF) or probability mass function (PMF)
for each distribution.
Helps users compare how well each distribution fits the data.
Importance: Understanding Data: Helps identify patterns in data.
Modeling Real-World Scenarios: Simulates phenomena like natural variations or rare
events.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, binom, poisson, bernoulli

def get_user_data():
    # Get the frequency distribution input from the user
    data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
    frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): ")
    # Convert the inputs into lists of integers
    data = list(map(int, data_input.split(',')))
    frequencies = list(map(int, frequencies_input.split(',')))
    return data, frequencies

def plot_normal_distribution(data, frequencies):
    # Fit and plot Normal distribution
    mean = np.mean(data)
    std_dev = np.std(data)
    x = np.linspace(min(data), max(data), 100)
    pdf = norm.pdf(x, mean, std_dev)
    plt.plot(x, pdf, 'r-', lw=2, label='Normal Distribution')
    plt.title('Normal Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability Density')
    plt.show()

def plot_binomial_distribution(data, frequencies):
    # Fit and plot Binomial distribution (assuming n is max(data) and p is mean/n)
    n = max(data)
    p = np.mean(data) / n
    x = np.arange(0, n + 1)
    pmf = binom.pmf(x, n, p)
    plt.bar(x, pmf, alpha=0.7, color='b', label='Binomial Distribution')
    plt.title('Binomial Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability')
    plt.show()

def plot_poisson_distribution(data, frequencies):
    # Fit and plot Poisson distribution (lambda is the mean of the data)
    lam = np.mean(data)
    x = np.arange(0, max(data) + 1)
    pmf = poisson.pmf(x, lam)
    plt.bar(x, pmf, alpha=0.7, color='g', label='Poisson Distribution')
    plt.title('Poisson Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability')
    plt.show()

def plot_bernoulli_distribution(data, frequencies):
    # Assuming binary outcome for Bernoulli
    success_prob = np.mean(data) / max(data)
    x = [0, 1]
    pmf = bernoulli.pmf(x, success_prob)
    plt.bar(x, pmf, alpha=0.7, color='purple', label='Bernoulli Distribution')
    plt.title('Bernoulli Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability')
    plt.show()

def analyze_distributions(data, frequencies):
    print("Analyzing Normal Distribution:")
    plot_normal_distribution(data, frequencies)
    print("Analyzing Binomial Distribution:")
    plot_binomial_distribution(data, frequencies)
    print("Analyzing Poisson Distribution:")
    plot_poisson_distribution(data, frequencies)
    print("Analyzing Bernoulli Distribution:")
    plot_bernoulli_distribution(data, frequencies)

# Main program
data, frequencies = get_user_data()
analyze_distributions(data, frequencies)

OUTPUT:
Enter the data values separated by commas (e.g., 10, 20, 30): 10, 30, 50, 70
Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): 1, 2, 3, 4
Analyzing Normal Distribution:
[Plot: Normal distribution PDF]
Analyzing Binomial Distribution:
[Plot: Binomial distribution PMF]
Analyzing Poisson Distribution:
[Plot: Poisson distribution PMF]
Analyzing Bernoulli Distribution:
[Plot: Bernoulli distribution PMF]

7) Program to implement one sample, two sample and paired sample t-tests for
sample data and analyze the results.
T-tests are commonly used to assess whether there is a statistically significant
difference between groups or conditions. These tests help us make inferences about
populations based on sample data. Types of t-tests (the test statistics are given after this list):
1. One-Sample T-Test:
This test compares the mean of a sample to a known value (often a population mean)
to determine if the sample mean is significantly different from this reference value.
For example, in the code, we compare the average exam scores of a group of students to a
population mean of 85. The null hypothesis assumes there is no difference, and the
alternative hypothesis suggests a difference in means.
2. Two-Sample T-Test:
This test is used to compare the means of two independent groups to determine if
they differ significantly. In the code, we compare the scores of two groups (Group A
and Group B). The null hypothesis suggests that there is no difference between the
two groups, while the alternative hypothesis indicates a significant difference.
3. Paired-Sample T-Test:
This test compares the means of two related groups, typically measuring the same
subjects before and after an intervention. In the code, we compare the scores of the
same group of students before and after a treatment. The null hypothesis assumes no
difference between the two sets of scores, while the alternative hypothesis suggests a
significant change.
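For reference, the statistics computed by scipy.stats follow the standard definitions:

One-sample: t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
Two-sample (pooled variance, the default in ttest_ind): t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}
Paired: t = \frac{\bar{d}}{s_d / \sqrt{n}}, where d_i are the per-subject differences

Here \mu_0 is the hypothesized population mean, s and s_d are sample standard deviations, and s_p is the pooled standard deviation of the two groups.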
Results are interpreted as:
T-Statistic: This value tells us how much the sample mean differs from the
hypothesized value (or the mean of the second group in the case of two-sample or paired
tests) in terms of standard error.
P-Value: This value indicates the probability of observing the data if the null
hypothesis were true. If the p-value is smaller than the chosen significance level
(usually 0.05), we reject the null hypothesis and conclude there is a statistically
significant difference.

# Import necessary libraries
import numpy as np
import pandas as pd
from scipy import stats

# Sample data for demonstration
# One-sample test: A group of exam scores
exam_scores = np.array([85, 87, 90, 78, 88, 95, 82, 79, 94, 91])
# Two-sample test: Scores of two different groups
group_A = np.array([85, 89, 88, 90, 93, 85, 84, 79, 90, 87])
group_B = np.array([82, 86, 85, 87, 92, 80, 81, 78, 89, 85])
# Paired-sample test: Before and after treatment scores of the same group
before_treatment = np.array([82, 84, 88, 78, 80, 85, 90, 79, 87, 83])
after_treatment = np.array([85, 87, 89, 81, 83, 88, 92, 82, 89, 86])

# Function to perform one-sample t-test
def one_sample_ttest(data, population_mean):
    t_stat, p_value = stats.ttest_1samp(data, population_mean)
    return t_stat, p_value

# Function to perform two-sample t-test (independent samples)
def two_sample_ttest(group1, group2):
    t_stat, p_value = stats.ttest_ind(group1, group2)
    return t_stat, p_value

# Function to perform paired-sample t-test
def paired_sample_ttest(before, after):
    t_stat, p_value = stats.ttest_rel(before, after)
    return t_stat, p_value

# Analyze results of the t-tests
def analyze_ttest_results(t_stat, p_value, alpha=0.05):
    print(f"T-statistic: {t_stat}")
    print(f"P-value: {p_value}")
    if p_value < alpha:
        print("Result: The null hypothesis is rejected (statistically significant difference).")
    else:
        print("Result: The null hypothesis cannot be rejected (no statistically significant difference).")

# One-sample t-test: Compare exam scores with a population mean (e.g., 85)
print("One-Sample T-Test:")
t_stat, p_value = one_sample_ttest(exam_scores, 85)
analyze_ttest_results(t_stat, p_value)
print()
# Two-sample t-test: Compare the means of two independent groups
print("Two-Sample T-Test:")
t_stat, p_value = two_sample_ttest(group_A, group_B)
analyze_ttest_results(t_stat, p_value)
print()
# Paired-sample t-test: Compare before and after treatment of the same group
print("Paired-Sample T-Test:")
t_stat, p_value = paired_sample_ttest(before_treatment, after_treatment)
analyze_ttest_results(t_stat, p_value)

OUTPUT:
One-Sample T-Test:
T-statistic: 1.0189950494649807
P-value: 0.3348142605778697
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Two-Sample T-Test:
T-statistic: 1.3547090246981803
P-value: 0.19227122007981406
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Paired-Sample T-Test:
T-statistic: -11.758942438532781
P-value: 9.151111215642479e-07
Result: The null hypothesis is rejected (statistically significant difference).

8) Program to implement One-way and Two-way ANOVA tests and analyze the
results.
ANOVA (Analysis of Variance) is a statistical method used to test if there are
significant differences between the means of multiple groups.
1. One-Way ANOVA:
Used when comparing the means of more than two groups based on one factor. It
checks if the group means are significantly different. Null Hypothesis (H₀): All group
means are equal. Alternative Hypothesis (H₁): At least one group mean is different.
2. Two-Way ANOVA:
Used when there are two factors; it tests the individual effects of each factor and
their interaction on the dependent variable. Null Hypothesis (H₀): Neither factor nor
their interaction significantly affects the response. Alternative Hypothesis (H₁): At least
one factor or their interaction significantly affects the response.
Key Results:
F-statistic: Indicates how much the group means differ (see the formula below).
P-value: If less than 0.05, we reject the null hypothesis, suggesting a significant
difference.
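For reference, the one-way ANOVA F-statistic is the ratio of between-group to within-group mean squares, for k groups with group sizes n_j and N total observations:

F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}
  = \frac{\sum_j n_j(\bar{x}_j - \bar{x})^2 / (k - 1)}{\sum_j \sum_i (x_{ij} - \bar{x}_j)^2 / (N - k)}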

import numpy as np
import pandas as pd
from scipy.stats import f_oneway
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Function for One-way ANOVA
def one_way_anova(data, groups, response):
    """
    Perform one-way ANOVA.
    :param data: DataFrame containing the dataset
    :param groups: Column name for grouping variable
    :param response: Column name for response variable
    """
    grouped_data = [group[response].values for _, group in data.groupby(groups)]
    f_stat, p_value = f_oneway(*grouped_data)
    print("\nOne-way ANOVA Results:")
    print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}")
    if p_value < 0.05:
        print("Reject the null hypothesis: Significant difference among group means.")
    else:
        print("Fail to reject the null hypothesis: No significant difference among group means.")

# Function for Two-way ANOVA
def two_way_anova(data, response, factor1, factor2):
    """
    Perform two-way ANOVA.
    :param data: DataFrame containing the dataset
    :param response: Column name for response variable
    :param factor1: Column name for first factor
    :param factor2: Column name for second factor
    """
    formula = f"{response} ~ C({factor1}) + C({factor2}) + C({factor1}):C({factor2})"
    model = ols(formula, data).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)  # Type II ANOVA
    print("\nTwo-way ANOVA Results:")
    print(anova_table)

# Example usage
if __name__ == "__main__":
    # Example dataset for One-way ANOVA
    data_one_way = pd.DataFrame({
        "Group": np.repeat(['A', 'B', 'C'], 10),
        "Score": np.concatenate([
            np.random.normal(loc=50, scale=5, size=10),
            np.random.normal(loc=55, scale=5, size=10),
            np.random.normal(loc=60, scale=5, size=10)
        ])
    })
    # Perform One-way ANOVA
    one_way_anova(data_one_way, groups="Group", response="Score")

    # Example dataset for Two-way ANOVA
    data_two_way = pd.DataFrame({
        "Factor1": np.repeat(['Low', 'Medium', 'High'], 6),
        "Factor2": np.tile(['Type1', 'Type2'], 9),
        "Response": np.concatenate([
            np.random.normal(loc=50, scale=5, size=6),
            np.random.normal(loc=55, scale=5, size=6),
            np.random.normal(loc=60, scale=5, size=6)
        ])
    })
    # Perform Two-way ANOVA
    two_way_anova(data_two_way, response="Response", factor1="Factor1", factor2="Factor2")

OUTPUT:
One-way ANOVA Results:
F-statistic: 10.6055, p-value: 0.0004
Reject the null hypothesis: Significant difference among group means.

Two-way ANOVA Results:
                           sum_sq    df         F    PR(>F)
C(Factor1)             152.062998   2.0  2.245097  0.148502
C(Factor2)              38.519894   1.0  1.137435  0.307183
C(Factor1):C(Factor2)    8.827462   2.0  0.130331  0.879031
Residual               406.386901  12.0       NaN       NaN

9) Program to implement correlation, rank correlation and regression and plot x-y
plots and heat maps of correlation matrices.
Correlation:
Pearson Correlation: Measures the linear relationship between two variables (X and
Y). A value close to 1 means a strong positive relationship, -1 means a strong negative
relationship, and 0 means no linear relationship.
The program calculates this correlation using the corr function in Pandas.
Rank Correlation (Spearman's Rank Correlation):
This measures the strength of a monotonic (ordered) relationship between two
variables, using their ranks rather than actual values.
It can detect non-linear relationships, and values close to 1 or -1 indicate strong
positive or negative relationships.
Linear Regression:
Linear regression fits a straight line to the data, modeling the relationship between a
dependent variable (Y) and an independent variable (X).
The program uses scikit-learn to fit a regression line and calculates the Mean Squared
Error (MSE) to evaluate the fit. The corresponding formulas are given below.
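For reference, the standard definitions are:

Pearson: r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
Spearman (no tied ranks): \rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}, where d_i is the difference between the ranks of x_i and y_i
Least-squares line: \hat{y} = \beta_0 + \beta_1 x, with \text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2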
Visualizations:
X-Y Scatter Plot:
Displays the data points, with a red regression line showing the fitted model.
Heatmap:
Visualizes the correlation matrix, showing the strength of relationships between
variables.
This program helps to understand relationships between variables using correlation,
regression, and visual tools.

# Import required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate sample data (or load your dataset here)
np.random.seed(42)  # For reproducibility
x = np.random.rand(100) * 100  # Random values for x
y = 2.5 * x + np.random.normal(0, 25, 100)  # Linear relation with noise
# Convert data into a DataFrame
data = pd.DataFrame({'X': x, 'Y': y})

# Compute Correlation
pearson_corr = data.corr(method='pearson')  # Pearson Correlation
spearman_corr, _ = spearmanr(data['X'], data['Y'])  # Spearman Rank Correlation

# Linear Regression
X = data['X'].values.reshape(-1, 1)  # Reshape for sklearn
Y = data['Y'].values
model = LinearRegression()
model.fit(X, Y)
Y_pred = model.predict(X)
regression_coeff = model.coef_[0]  # Slope
regression_intercept = model.intercept_  # Intercept
mse = mean_squared_error(Y, Y_pred)

# Print statistical results
print("Pearson Correlation Coefficient Matrix:")
print(pearson_corr)
print("\nSpearman Rank Correlation Coefficient:", spearman_corr)
print("\nLinear Regression Equation: Y = {:.2f}X + {:.2f}".format(regression_coeff, regression_intercept))
print("Mean Squared Error (MSE):", mse)

# Plot X-Y scatter plot with regression line
plt.figure(figsize=(8, 6))
plt.scatter(data['X'], data['Y'], color='blue', label='Data Points')
plt.plot(data['X'], Y_pred, color='red', label='Regression Line')
plt.title('X-Y Scatter Plot with Regression Line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

# Plot heatmap of correlation matrix
plt.figure(figsize=(6, 5))
sns.heatmap(pearson_corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Heatmap of Correlation Matrix')
plt.show()

OUTPUT:
Pearson Correlation Coefficient Matrix:
          X         Y
X  1.000000  0.952966
Y  0.952966  1.000000
Spearman Rank Correlation Coefficient: 0.9519351935193517
Linear Regression Equation: Y = 2.39X + 5.38
Mean Squared Error (MSE): 504.11535247940856
[Plots: X-Y scatter with regression line; heatmap of the correlation matrix]

10) Program to implement PCA for the Wisconsin dataset, visualize and analyze the
results.
This program demonstrates Principal Component Analysis (PCA) on the Wisconsin
Breast Cancer dataset to reduce the dimensionality of the data, visualize the results,
and analyze the explained variance of the components.
Principal Component Analysis (PCA):
PCA is a technique used to reduce the dimensionality of large datasets while
preserving as much information as possible. It transforms the original features into
new, uncorrelated variables called principal components.
The goal is to project the data into fewer dimensions, typically 2 or 3, for easier
visualization while retaining most of the data's variance.
Standardization:
Before applying PCA, the data is standardized using StandardScaler to ensure that
each feature has zero mean and unit variance. This is important because PCA is
sensitive to the scale of the data.
Applying PCA:
PCA is performed to reduce the data to 2 principal components for visualization. The
program then calculates the explained variance ratio, which tells us how much
variance (information) each principal component captures (see the formula below).
Visualization:
PCA Scatter Plot: The program creates a scatter plot of the first two principal
components (PCA1 and PCA2) to visualize how the data points are distributed in the
reduced space. Points are colored based on the target variable (malignant or benign).
Explained Variance: A bar plot shows how much variance each of the first two
principal components explains.
Cumulative Variance: A line plot shows how much cumulative variance is explained as
more components are added.
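For reference, if \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p are the eigenvalues of the covariance matrix of the standardized data, the explained variance ratio of component k and the cumulative variance after the first m components are:

\text{ratio}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}, \qquad
\text{cumulative}_m = \sum_{k=1}^{m} \text{ratio}_k

which is what pca.explained_variance_ratio_ and the np.cumsum call report.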

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load the Wisconsin Breast Cancer dataset
data = load_breast_cancer()
X = data.data  # Features
y = data.target  # Target variable (0 = malignant, 1 = benign)
feature_names = data.feature_names
target_names = data.target_names

# Standardize the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions for visualization
X_pca = pca.fit_transform(X_scaled)
# Get explained variance ratio for each component
explained_variance_ratio = pca.explained_variance_ratio_

# Create a DataFrame for visualization
pca_df = pd.DataFrame(X_pca, columns=['PCA1', 'PCA2'])
pca_df['Target'] = y

# Plot the PCA results
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PCA1', y='PCA2', hue='Target', palette='Set1', alpha=0.8)
plt.title('PCA of Wisconsin Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(target_names)
plt.grid()
plt.show()

# Plot explained variance ratio
plt.figure(figsize=(8, 5))
plt.bar(range(1, 3), explained_variance_ratio, tick_label=['PCA1', 'PCA2'], color='skyblue')
plt.title('Explained Variance Ratio of PCA Components')
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.show()

# Full PCA with all components for analysis
pca_full = PCA()
X_pca_full = pca_full.fit_transform(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
plt.title('Cumulative Explained Variance')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Variance Explained')
plt.grid()
plt.show()

# Print key insights
print("PCA Analysis of Wisconsin Breast Cancer Dataset")
print("-------------------------------------------------")
print(f"Explained Variance (PCA1): {explained_variance_ratio[0]:.4f}")
print(f"Explained Variance (PCA2): {explained_variance_ratio[1]:.4f}")
print("Cumulative Variance Explained by All Components:")
for i, cum_var in enumerate(cumulative_variance, start=1):
    print(f" Component {i}: {cum_var:.4f}")

OUTPUT:
[Plots: PCA scatter of the first two components, explained variance ratio bar chart, cumulative explained variance curve]
PCA Analysis of Wisconsin Breast Cancer Dataset
-------------------------------------------------
Explained Variance (PCA1): 0.4427
Explained Variance (PCA2): 0.1897
Cumulative Variance Explained by All Components:
 Component 1: 0.4427
 Component 2: 0.6324
 Component 3: 0.7264
 Component 4: 0.7924
 Component 5: 0.8473
 Component 6: 0.8876
 Component 7: 0.9101
 Component 8: 0.9260
 Component 9: 0.9399
 Component 10: 0.9516
 Component 11: 0.9614
 Component 12: 0.9701
 Component 13: 0.9781
 Component 14: 0.9834
 Component 15: 0.9865
 Component 16: 0.9892
 Component 17: 0.9911
 Component 18: 0.9929
 Component 19: 0.9945
 Component 20: 0.9956
 Component 21: 0.9966
 Component 22: 0.9975
 Component 23: 0.9983
 Component 24: 0.9989
 Component 25: 0.9994
 Component 26: 0.9997
 Component 27: 0.9999
 Component 28: 1.0000
 Component 29: 1.0000
 Component 30: 1.0000

11) Program to implement the working of linear discriminant analysis using the iris
dataset and visualize the results.
Linear Discriminant Analysis (LDA) is a technique used for dimensionality reduction
and classification. It aims to find the linear combinations of features that best
separate the classes in the dataset. Unlike Principal Component Analysis (PCA), which
maximizes variance, LDA focuses on maximizing class separability.
Key Steps in LDA:
Data Standardization: Before applying LDA, the data is scaled so that each feature has
zero mean and unit variance. This ensures that all features contribute equally to the
analysis.
Compute Discriminants: LDA computes new axes (called discriminants) that maximize
the difference between classes (see the criterion below).
Dimensionality Reduction: LDA reduces the dataset to fewer dimensions while
preserving as much class separation as possible. In this case, we reduce it to 2
dimensions for easier visualization.
Visualization: The transformed data is plotted in a 2D space, showing how well the
classes (species in the Iris dataset) are separated.
Application in the Iris Dataset:
The Iris dataset has 4 features, and LDA reduces it to 2 components for visualization.
LDA is useful in classification tasks, where the goal is to predict the class label of new
data points based on the transformed features.
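For reference, each discriminant direction w is chosen to maximize the Fisher criterion, the ratio of between-class scatter S_B to within-class scatter S_W:

J(w) = \frac{w^{T} S_B\, w}{w^{T} S_W\, w}

At most (number of classes - 1) such directions exist, which is why the 3-class Iris data yields exactly 2 discriminants here.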

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
data = load_iris()
X = data.data  # Features
y = data.target  # Target variable (0, 1, 2)
target_names = data.target_names  # Class names

# Standardize the data (LDA benefits from scaling)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply Linear Discriminant Analysis (LDA)
lda = LinearDiscriminantAnalysis(n_components=2)  # Reduce to 2 components for visualization
X_lda = lda.fit_transform(X_scaled, y)

# Create a DataFrame for LDA-transformed data
lda_df = pd.DataFrame(X_lda, columns=['LDA1', 'LDA2'])
lda_df['Target'] = y

# Plot the LDA results in 2D space
plt.figure(figsize=(8, 6))
sns.scatterplot(data=lda_df, x='LDA1', y='LDA2', hue='Target', palette='Set1', style='Target', s=100)
plt.title('LDA of Iris Dataset')
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Linear Discriminant 2')
plt.legend(title='Class', labels=target_names)
plt.grid()
plt.show()

# Print key insights
print("Linear Discriminant Analysis (LDA) Results")
print("--------------------------------------------------")
print("Explained Variance Ratio by LDA Components:")
for i, ratio in enumerate(lda.explained_variance_ratio_, start=1):
    print(f" LDA{i}: {ratio:.4f}")

OUTPUT:
[Plot: 2D LDA scatter of the three Iris classes, followed by the printed explained variance ratios of LDA1 and LDA2]

12) Program to implement multiple linear regression using the iris dataset, visualize and
analyze the results.
Multiple Linear Regression (MLR) is a technique used to predict a target variable
based on the relationship between multiple input variables. It helps in understanding
how different features affect the outcome.
Key Concepts:
Prediction: MLR creates a model that predicts a target variable using multiple
independent variables.
Training: The model learns from the training data by adjusting coefficients for each
feature.
Evaluation: The model's accuracy is measured using metrics like Mean Squared Error
(MSE) and R-squared (R²).
Application to the Iris Dataset:
In this case, we predict the petal length based on other features like sepal length and
petal width.
The dataset is split into a training set and a test set. The model is trained on the
training set and evaluated on the test set.
Steps:
1. Training: Fit the model using the training data.
2. Prediction: Make predictions on the test data.
3. Evaluation: Use MSE and R² to assess model performance.
4. Visualization: Compare the actual vs predicted values using a plot.

MLR is commonly used when there are multiple factors influencing the outcome and
helps in making predictions based on them. The form of the fitted model is sketched below.
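For reference, the fitted model has the standard multiple-regression form

\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p

where y is the petal length, x_1, \dots, x_p are the remaining Iris features, and the coefficients \beta_j (printed by the program) are estimated by least squares on the training split; MSE and R² are as defined in Program 5.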

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the Iris dataset
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)  # Features
y = X['petal length (cm)']  # Let's predict 'petal length' as the dependent variable
X = X.drop(columns=['petal length (cm)'])  # Remove 'petal length' from independent variables

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply Multiple Linear Regression
model = LinearRegression()
model.fit(X_train, y_train)  # Train the model

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print model performance metrics
print("Multiple Linear Regression Results")
print("----------------------------------")
print(f"Mean Squared Error (MSE): {mse:.4f}")
print(f"R-squared (R²): {r2:.4f}")
print("\nModel Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f" {feature}: {coef:.4f}")
print(f"Intercept: {model.intercept_:.4f}")

# Visualize actual vs predicted values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linewidth=2, linestyle='--')
plt.title('Actual vs Predicted Values (Test Set)')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.grid()
plt.show()

# Pairplot to explore relationships in the dataset
sns.pairplot(pd.DataFrame(data.data, columns=data.feature_names), diag_kind='kde')
plt.suptitle('Pairplot of Iris Dataset Features', y=1.02)
plt.show()

OUTPUT:
Multiple Linear Regression Results
----------------------------------
Mean Squared Error (MSE): 0.1300
R-squared (R²): 0.9603

Model Coefficients:
 sepal length (cm): 0.7228
 sepal width (cm): -0.6358
 petal width (cm): 1.4675
Intercept: -0.2622
[Plots: actual vs predicted scatter for the test set; pairplot of the Iris features]
