CS REPORT
BELAGAVI, KARNATAKA-590018
Laboratory Report on
COMPUTATIONAL STATISTICS LAB
BACHELOR OF TECHNOLOGY
In
Computer Science and Business Systems
Submitted By
USN: 2VX22CB023
VTU Belagavi-590018.
ACADEMIC YEAR 2024-2025
Visvesvaraya Technological University, Belagavi
CERTIFICATE
This is to certify that the candidate bearing USN 2VX22CB023, studying in V semester B.Tech ("Computer Science and Business Systems"), has presented and successfully completed the Laboratory Report titled "COMPUTATIONAL STATISTICS LAB" in the presence of the undersigned examiners, in partial fulfillment of the requirements for the award of the B.Tech. degree of VTU, Belagavi, for the academic year 2024-25.
________________________ _____________________________
Examiner- 1 Examiner- 2
Name: Name:
Data wrangling is a critical process in data analysis, where data is transformed into a structured, usable format. This program demonstrates key operations in data wrangling: combining and merging datasets, reshaping, pivoting, handling missing data, and generating summary statistics.
1. Combining and Merging Datasets
Combining and merging datasets are essential for integrating multiple data sources.
Merging: Combines datasets based on a common key using an inner join, retaining only matching rows. This ensures focused analysis on shared data points.
Concatenation: Stacks datasets vertically, appending new records to create a unified dataset.
2. Reshaping Data with Melt
Reshaping is used to change the layout of a dataset to suit specific analytical needs. The melt operation converts wide-format data into long format, turning columns into rows. This format is ideal for grouping, filtering, and visualizing data across variables.
3. Pivoting Data
Pivoting reverses the melting process, converting long-format data back into wide format. It summarizes data for easier interpretation by using one column as the index and another as the columns. This transformation is particularly useful for summarizing data in a matrix-like format, which is easier to interpret for certain statistical analyses.
4. Handling Missing Data
Missing values are replaced with column means to ensure completeness. This
maintains data integrity for further analysis.
5. Summary Statistics
The program concludes by calculating summary statistics (e.g., mean, standard deviation, min, max) for the filled dataset. Summary statistics provide insights into the dataset's central tendency, dispersion, and overall distribution.
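The listing that follows implements step 1 (merging and concatenation). The remaining steps can be sketched briefly in pandas; the small DataFrame below is an illustrative placeholder, not the report's dataset:

import pandas as pd
import numpy as np

# Wide-format data: one column per product (illustrative values)
wide = pd.DataFrame({
    'Month': ['Jan', 'Feb', 'Mar'],
    'Product_A': [100, 90, np.nan],
    'Product_B': [70, 80, 120],
})

# 2. Melt: wide -> long format, turning the product columns into rows
long_format = pd.melt(wide, id_vars='Month', var_name='Product', value_name='Sales')
print(long_format)

# 3. Pivot: long -> wide again, with Month as the index and Product as the columns
back_to_wide = long_format.pivot(index='Month', columns='Product', values='Sales')
print(back_to_wide)

# 4. Handle missing data: replace NaN values with the column means
filled = back_to_wide.fillna(back_to_wide.mean())
print(filled)

# 5. Summary statistics (count, mean, std, min, quartiles, max) of the filled data
print(filled.describe())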
import pandas as pd

# Create the first sales DataFrame
sales_data_1 = pd.DataFrame({
    'OrderID': [1, 2, 3, 4],
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Headphones'],
    'Sales': [1200, 800, 1500, 300]
})

# Create the second sales DataFrame
sales_data_2 = pd.DataFrame({
    'OrderID': [3, 4, 5, 6],
    'Product': ['Smartphone', 'Headphones', 'Smartwatch', 'Tablet'],
    'Sales': [1500, 300, 200, 900]
})

# Display the DataFrames
print("Sales Data 1:\n", sales_data_1)
print("\nSales Data 2:\n", sales_data_2)

# Merge DataFrames based on 'OrderID' using an inner join
merged_data = pd.merge(sales_data_1, sales_data_2, on='OrderID', how='inner',
                       suffixes=('_left', '_right'))
print("\nMerged Data (Inner Join):\n", merged_data)

# Concatenate the DataFrames vertically
combined_data = pd.concat([sales_data_1, sales_data_2], ignore_index=True)
print("\nCombined Data (Concatenated Vertically):\n", combined_data)
OUTPUT:
Sales Data 1:
OrderID Product Sales
0 1 Laptop 1200
1 2 Tablet 800
2 3 Smartphone 1500
3 4 Headphones 300
Sales Data 2:
OrderID Product Sales
0 3 Smartphone 1500
1 4 Headphones 300
2 5 Smartwatch 200
3 6 Tablet 900
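The string-handling listing that produced the following output is not reproduced on these pages; a minimal sketch using only built-in str methods (the sample text is taken from the output itself, with the amount of surrounding whitespace assumed) could be:

# Sample text with leading and trailing spaces
text = "   Hello, World! Welcome to Python programming.   "
print(f"Original Text: '{text}'")

stripped = text.strip()
print(f"Text after stripping spaces: '{stripped}'")
print(f"Text in uppercase: '{stripped.upper()}'")
print(f"Text in lowercase: '{stripped.lower()}'")
print(f"Number of occurrences of 'o': {stripped.count('o')}")
print(f"Text after replacing 'Python' with 'Data Science': '{stripped.replace('Python', 'Data Science')}'")
print(f"Position of 'World' in the text: {stripped.find('World')}")

words = stripped.split()
print(f"List of words in the text: {words}")
print(f"Text after joining words: '{' '.join(words)}'")
print(f"Does the text start with 'Hello'? {stripped.startswith('Hello')}")
print(f"Does the text end with 'programming.'? {stripped.endswith('programming.')}")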
OUTPUT:
Original Text: ' Hello, World! Welcome to Python programming. '
Text after stripping spaces: 'Hello, World! Welcome to Python programming.'
Text in uppercase: 'HELLO, WORLD! WELCOME TO PYTHON PROGRAMMING.'
Text in lowercase: 'hello, world! welcome to python programming.'
Number of occurrences of 'o': 6
Text after replacing 'Python' with 'Data Science': 'Hello, World! Welcome to Data Science programming.'
Position of 'World' in the text: 7
List of words in the text: ['Hello,', 'World!', 'Welcome', 'to', 'Python', 'programming.']
Text after joining words: 'Hello, World! Welcome to Python programming.'
Does the text start with 'Hello'? True
Does the text end with 'programming.'? True
Regular Expressions:
import re
# Sample text
text = """
John's email is [email protected]. He said, "Python is awesome!!" It's a great
language.
Another email: [email protected].
"""
# 1. Remove special characters except for spaces and email-related characters.
# Using regex to remove non-alphabetic characters and non-email symbols
clean_text = re.sub(r"[^a-zA-Z0-9@\.\s]", "", text)
print("Text after removing special characters:")
print(clean_text)

# 2. Convert the text to lowercase
clean_text = clean_text.lower()
print("\nText after converting to lowercase:")
print(clean_text)

# 3. Replace multiple spaces with a single space
clean_text = re.sub(r"\s+", " ", clean_text)
print("\nText after replacing multiple spaces:")
print(clean_text)

# 4. Extract all words starting with a vowel (a, e, i, o, u)
vowel_words = re.findall(r"\b[aeiouAEIOU]\w+", clean_text)
print("\nWords starting with a vowel:")
print(vowel_words)

# 5. Replace email addresses with '[email protected]'
masked_text = re.sub(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[email protected]", clean_text)
print("\nText after replacing emails:")
print(masked_text)
OUTPUT:
Text after removing special characters:
Johns email is email protected. He said Python is awesome Its a great language.
Another email email protected.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Create sample time series data
np.random.seed(42)
date_range = pd.date_range(start="2022-01-01", end="2023-01-01", freq="D")
data = pd.DataFrame({
    "Date": date_range,
    "Value_A": np.random.normal(100, 10, len(date_range)),
    "Value_B": np.random.normal(200, 20, len(date_range)),
})

# Set Date as the index
data.set_index("Date", inplace=True)

# GroupBy Mechanics
def groupby_mechanics(data):
    print("\n--- GroupBy Mechanics ---")
    # Group data by month and calculate mean
    grouped = data.resample('ME').mean()
    print(grouped)
    return grouped

# Data Formats: Vector and Multivariate
def data_formats(data):
    print("\n--- Data Formats ---")
    # Display data as vector
    print("\nVector Format:")
    print(data["Value_A"].head())
    # Display multivariate time series
    print("\nMultivariate Time Series:")
    print(data.head())

# Forecasting Example
def time_series_forecasting(data):
    print("\n--- Forecasting ---")
    # Select a single column for forecasting
    ts = data["Value_A"]
    # Train-Test Split
    train = ts[:int(0.8 * len(ts))]
    test = ts[int(0.8 * len(ts)):]
    # Fit the Holt-Winters Exponential Smoothing model
    model = ExponentialSmoothing(train, seasonal="add", seasonal_periods=30).fit()
    # Forecast for the test period
    forecast = model.forecast(len(test))
    # Plot results
    plt.figure(figsize=(12, 6))
    plt.plot(train, label="Train")
    plt.plot(test, label="Test")
    plt.plot(forecast, label="Forecast")
    plt.legend()
    plt.title("Time Series Forecasting")
    plt.show()

# Main function
if __name__ == "__main__":
    print("--- Time Series Data ---")
    print(data.head())
    # Grouping Mechanics
    monthly_data = groupby_mechanics(data)
    # Data Formats
    data_formats(data)
    # Time Series Forecasting
    time_series_forecasting(data)
OUTPUT:
--- Time Series Data ---
Value_A Value_B
Date
2022-01-01 104.967142 204.481850
2022-01-02 98.617357 200.251848
2022-01-03 106.476885 201.953522
2022-01-04 115.230299 184.539804
2022-01-05 97.658466 200.490203
Vector Format:
Date
2022-01-01 104.967142
2022-01-02 98.617357
2022-01-03 106.476885
2022-01-04 115.230299
2022-01-05 97.658466
Name: Value_A, dtype: float64
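The next listing, the descriptive-statistics program for a frequency distribution, is shown only from its return statement onward. A reconstruction sketch of the missing function header and the computations leading up to that return (assuming the grouped values are expanded by their frequencies, population formulas for variance and standard deviation, and quartiles taken over the distinct data values, which reproduces the sample output below) could be:

import numpy as np

def calculate_statistics(data, frequencies):
    # Expand the grouped values according to their frequencies
    expanded = np.repeat(data, frequencies)
    mean = np.mean(expanded)
    median = np.median(expanded)
    # Mode: the data value with the highest frequency
    mode = data[int(np.argmax(frequencies))]
    # Population variance and standard deviation of the expanded series
    variance = np.var(expanded)
    std_deviation = np.std(expanded)
    # Mean absolute deviation about the mean
    mean_deviation = np.mean(np.abs(expanded - mean))
    # Quartile deviation over the distinct data values (matches the sample output)
    q1, q3 = np.percentile(data, [25, 75])
    quartile_deviation = (q3 - q1) / 2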
    return {
        'Mean': mean,
        'Median': median,
        'Mode': mode,
        'Variance': variance,
        'Standard Deviation': std_deviation,
        'Mean Deviation': mean_deviation,
        'Quartile Deviation': quartile_deviation
    }

# Get user input for data and frequencies
data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): ")

# Convert input strings to lists of integers
data = list(map(int, data_input.split(',')))
frequencies = list(map(int, frequencies_input.split(',')))

# Calculate statistics
statistics = calculate_statistics(data, frequencies)

# Display the results
for stat, value in statistics.items():
    print(f"{stat}: {value:.2f}")
OUTPUT:
Enter the data values separated by commas (e.g., 10, 20, 30): 20, 40, 60
Enter the corresponding frequencies separated by commas (e.g., 1, 2, 3): 7, 8, 9
Mean: 41.67
Median: 40.00
Mode: 60.00
Variance: 263.89
Standard Deviation: 16.24
Mean Deviation: 13.75
Quartile Deviation: 10.00
5) Program to perform cross validation for a given dataset to measure Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and R2 Error using Validation Set, Leave One Out Cross-Validation (LOOCV) and K-fold Cross-Validation approaches.

Cross-validation is a method to evaluate a model's performance by testing it on different subsets of data. It ensures that the model generalizes well to unseen data. The program calculates three key metrics for model evaluation:
1. Root Mean Squared Error (RMSE): Measures the average prediction error, emphasizing larger errors.
2. Mean Absolute Error (MAE): Measures the average prediction error without emphasizing outliers.
3. R² Score: Indicates how well the model explains the variability in the data.

Cross-Validation Techniques
1. Validation Set Approach: Splits the data into training (80%) and validation (20%). Tests the model on the validation set after training.
2. Leave-One-Out Cross-Validation (LOOCV): Uses one sample as the test set and the rest for training. Repeats this process for all samples.
3. K-Fold Cross-Validation: Divides the data into k equal parts (folds). Trains on k-1 folds and tests on the remaining fold, repeated k times.

Purpose: The program evaluates a linear regression model using these techniques and calculates RMSE, MAE, and R² to compare performance. It ensures reliable and unbiased model evaluation.
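The listing below begins partway through the validation-set approach; the imports, the dataset, and the helper display_metrics that all three approaches call fall on pages not shown here. A minimal sketch of display_metrics using scikit-learn's metric functions (the exact formatting is an assumption) could be:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def display_metrics(y_true, y_pred):
    # Root Mean Squared Error, Mean Absolute Error and R² score
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    print(f"RMSE: {rmse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}\n")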
    model = LinearRegression()
    model.fit(X_train, y_train)
    # Make predictions on the validation set
    y_pred = model.predict(X_val)
    # Display metrics
    display_metrics(y_val, y_pred)

# Leave-One-Out Cross-Validation (LOOCV) Approach
def loocv_approach(X, y):
    print("Leave-One-Out Cross-Validation (LOOCV):")
    loo = LeaveOneOut()
    y_true, y_pred = [], []
    # Loop through each sample using LOOCV
    for train_index, test_index in loo.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Initialize and train the model
        model = LinearRegression()
        model.fit(X_train, y_train)
        # Make prediction for the single test sample
        y_pred.append(model.predict(X_test)[0])
        y_true.append(y_test.iloc[0])
    # Display metrics
    display_metrics(y_true, y_pred)

# K-Fold Cross-Validation Approach
def kfold_approach(X, y, k=5):
    print(f"{k}-Fold Cross-Validation Approach:")
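    # The remainder of this function is cut off in the report; the lines below
    # are a reconstruction sketch using scikit-learn's KFold splitter (assumed
    # to be imported from sklearn.model_selection alongside LeaveOneOut).
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    y_true, y_pred = [], []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        model = LinearRegression()
        model.fit(X_train, y_train)
        y_pred.extend(model.predict(X_test))
        y_true.extend(y_test)
    # Display metrics aggregated over all folds
    display_metrics(y_true, y_pred)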
OUTPUT:
6) Program to display Normal, Binomial, Poisson and Bernoulli distributions for a given frequency distribution.
Probability distributions describe how the values of a random variable are distributed. They help in understanding the behavior of data and are essential in statistics and data analysis. The program visualizes four key probability distributions for a given frequency distribution.

Distributions Covered
1. Normal Distribution:
A continuous distribution forming a bell-shaped curve. It is symmetric about the mean, and most data points cluster around the mean. Useful for modeling natural phenomena.
2. Binomial Distribution:
A discrete distribution representing the number of successes in a fixed number of trials. It depends on two parameters: the number of trials (n) and the probability of success (p). Common in scenarios like flipping a coin or rolling a die.
3. Poisson Distribution:
A discrete distribution that models the number of events in a fixed interval of time or space. It is characterized by the average rate (λ) of occurrence. Useful for modeling rare events like system failures or call arrivals.
4. Bernoulli Distribution:
A discrete distribution representing a single trial with two outcomes: success or failure. It is defined by the probability of success (p). Used in binary events like yes/no or true/false.
Purpose: Accepts user input for data values and their frequencies. Visualizes the probability density function (PDF) or probability mass function (PMF) for each distribution. Helps users compare how well each distribution fits the data.
Importance: Understanding Data: Helps identify patterns in data. Modeling Real-World Scenarios: Simulates phenomena like natural variations or rare events.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, binom, poisson, bernoulli

def get_user_data():
    # Get the frequency distribution input from the user
    data_input = input("Enter the data values separated by commas (e.g., 10, 20, 30): ")
    frequencies_input = input("Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): ")
    # Convert the inputs into lists of integers
    data = list(map(int, data_input.split(',')))
    frequencies = list(map(int, frequencies_input.split(',')))
    return data, frequencies

def plot_normal_distribution(data, frequencies):
    # Fit and plot Normal distribution
    mean = np.mean(data)
    std_dev = np.std(data)
    x = np.linspace(min(data), max(data), 100)
    pdf = norm.pdf(x, mean, std_dev)
    plt.plot(x, pdf, 'r-', lw=2, label='Normal Distribution')
    plt.title('Normal Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability Density')
    plt.show()

def plot_binomial_distribution(data, frequencies):
    # Fit and plot Binomial distribution (assuming n is max(data) and p is mean(data)/n)
    n = max(data)
    p = np.mean(data) / n
    x = np.arange(0, n + 1)
    pmf = binom.pmf(x, n, p)
    plt.bar(x, pmf, alpha=0.7, color='b', label='Binomial Distribution')
    plt.title('Binomial Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability')
    plt.show()

def plot_poisson_distribution(data, frequencies):
    # Fit and plot Poisson distribution (lambda is the mean of the data)
    lam = np.mean(data)
    x = np.arange(0, max(data) + 1)
    pmf = poisson.pmf(x, lam)
    plt.bar(x, pmf, alpha=0.7, color='g', label='Poisson Distribution')
    plt.title('Poisson Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability')
    plt.show()

def plot_bernoulli_distribution(data, frequencies):
    # Assuming a binary outcome for Bernoulli
    success_prob = np.mean(data) / max(data)
    x = [0, 1]
    pmf = bernoulli.pmf(x, success_prob)
    plt.bar(x, pmf, alpha=0.7, color='purple', label='Bernoulli Distribution')
    plt.title('Bernoulli Distribution')
    plt.xlabel('Value')
    plt.ylabel('Probability')
    plt.show()

def analyze_distributions(data, frequencies):
    print("Analyzing Normal Distribution:")
    plot_normal_distribution(data, frequencies)
OUTPUT:
Enter the data values separated by commas (e.g., 10, 20, 30): 10, 30, 50, 70
Enter the corresponding frequencies separated by commas (e.g., 2, 3, 4): 1, 2, 3, 4
7) Program to implement one sample, two sample and paired sample t-tests for sample data and analyze the results.

T-tests are commonly used to assess whether there is a statistically significant difference between groups or conditions. These tests help us make inferences about populations based on sample data. Types of t-tests:
1. One-Sample T-Test:
This test compares the mean of a sample to a known value (often a population mean) to determine if the sample mean is significantly different from this reference value. For example, in the code, we compare the average exam scores of a group of students to a population mean of 85. The null hypothesis assumes there is no difference, and the alternative hypothesis suggests a difference in means.
2. Two-Sample T-Test:
This test is used to compare the means of two independent groups to determine if they differ significantly. In the code, we compare the scores of two groups (Group A and Group B). The null hypothesis suggests that there is no difference between the two groups, while the alternative hypothesis indicates a significant difference.
3. Paired-Sample T-Test:
This test compares the means of two related groups, typically measuring the same subjects before and after an intervention. In the code, we compare the scores of the same group of students before and after a treatment. The null hypothesis assumes no difference between the two sets of scores, while the alternative hypothesis suggests a significant change.
Results are interpreted as follows:
T-Statistic: This value tells us how much the sample mean differs from the hypothesized value (or the mean of the second group in the case of two-sample or paired tests) in terms of standard error.
P-Value: This value indicates the probability of observing the data if the null hypothesis were true. If the p-value is smaller than the chosen significance level (usually 0.05), we reject the null hypothesis and conclude there is a statistically significant difference.
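The code pages for this program are not reproduced here; a minimal sketch using scipy.stats (the sample score arrays are illustrative placeholders, so they will not reproduce the exact figures in the output below) could be:

import numpy as np
from scipy import stats

alpha = 0.05  # significance level

def interpret(p_value):
    if p_value < alpha:
        return "The null hypothesis is rejected (statistically significant difference)."
    return "The null hypothesis cannot be rejected (no statistically significant difference)."

# Illustrative sample data (placeholders, not the report's original values)
scores = np.array([88, 84, 87, 90, 83, 86, 85, 89])   # exam scores of one group
group_a = np.array([85, 88, 90, 86, 84, 87, 89, 83])
group_b = np.array([82, 85, 86, 84, 83, 85, 87, 81])
before = np.array([70, 72, 68, 75, 71, 69, 73, 74])
after = np.array([78, 80, 76, 82, 79, 77, 81, 83])

# One-sample t-test against a population mean of 85
t_stat, p_val = stats.ttest_1samp(scores, popmean=85)
print("One-Sample T-Test:")
print(f"T-statistic: {t_stat}\nP-value: {p_val}\nResult: {interpret(p_val)}\n")

# Two-sample (independent) t-test between Group A and Group B
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print("Two-Sample T-Test:")
print(f"T-statistic: {t_stat}\nP-value: {p_val}\nResult: {interpret(p_val)}\n")

# Paired-sample t-test on before/after scores of the same students
t_stat, p_val = stats.ttest_rel(before, after)
print("Paired-Sample T-Test:")
print(f"T-statistic: {t_stat}\nP-value: {p_val}\nResult: {interpret(p_val)}")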
OUTPUT:
One-Sample T-Test:
T-statistic: 1.0189950494649807
P-value: 0.3348142605778697
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Two-Sample T-Test:
T-statistic: 1.3547090246981803
P-value: 0.19227122007981406
Result: The null hypothesis cannot be rejected (no statistically significant difference).
Paired-Sample T-Test:
T-statistic: -11.758942438532781
P-value: 9.151111215642479e-07
Result: The null hypothesis is rejected (statistically significant difference).
8) Program to implement One-way and Two-way ANOVA tests and analyze the results.

ANOVA (Analysis of Variance) is a statistical method used to test whether there are significant differences between the means of multiple groups.
1. One-Way ANOVA:
Used when comparing the means of more than two groups based on one factor. It checks if the group means are significantly different. Null Hypothesis (H₀): All group means are equal. Alternative Hypothesis (H₁): At least one group mean is different.
2. Two-Way ANOVA:
Used when there are two factors; it tests the individual effect of each factor and their interaction on the dependent variable. Null Hypothesis (H₀): Neither factor nor their interaction significantly affects the response. Alternative Hypothesis (H₁): At least one factor or their interaction significantly affects the response.
Key Results:
F-statistic: Indicates how much the group means differ.
P-value: If less than 0.05, we reject the null hypothesis, suggesting a significant difference.
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Function for One-way ANOVA
def one_way_anova(data, groups, response):
    """
    Perform one-way ANOVA.
    :param data: DataFrame containing the dataset
    :param groups: Column name for grouping variable
    :param response: Column name for response variable
    """
    grouped_data = [group[response].values for _, group in data.groupby(groups)]
    f_stat, p_value = f_oneway(*grouped_data)
    print("\nOne-way ANOVA Results:")
    print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.4f}")
    if p_value < 0.05:
        print("Reject the null hypothesis: Significant difference among group means.")
    else:
        print("Fail to reject the null hypothesis: No significant difference among group means.")

# Function for Two-way ANOVA
def two_way_anova(data, response, factor1, factor2):
    """
    Perform two-way ANOVA.
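    """
    # The rest of this listing is cut off in the report; the lines below are a
    # reconstruction sketch consistent with the two-way ANOVA table in the
    # output: an OLS model with both factors and their interaction, analyzed
    # with statsmodels' anova_lm.
    formula = f"{response} ~ C({factor1}) + C({factor2}) + C({factor1}):C({factor2})"
    model = ols(formula, data=data).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)
    print("\nTwo-way ANOVA Results:")
    print(anova_table)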
OUTPUT:
One-way ANOVA Results:
F-statistic: 10.6055, p-value: 0.0004
Reject the null hypothesis: Significant difference among group means.
Two-way ANOVA Results:
sum_sq df F PR(>F)
C(Factor1) 152.062998 2.0 2.245097 0.148502
C(Factor2) 38.519894 1.0 1.137435 0.307183
C(Factor1):C(Factor2) 8.827462 2.0 0.130331 0.879031
Residual 406.386901 12.0 NaN NaN
9) Program to implement correlation, rank correlation and regression and plot x-y plot and heat maps of correlation matrices.

Correlation:
Pearson Correlation: Measures the linear relationship between two variables (X and Y). A value close to 1 means a strong positive relationship, -1 means a strong negative relationship, and 0 means no linear relationship. The program calculates this correlation using the corr function in Pandas.
Rank Correlation (Spearman's Rank Correlation):
This measures the strength of a monotonic (ordered) relationship between two variables, using their ranks rather than actual values. It can detect non-linear relationships, and values close to 1 or -1 indicate strong positive or negative relationships.
Linear Regression:
Linear regression fits a straight line to the data, modeling the relationship between a dependent variable (Y) and an independent variable (X). The program uses scikit-learn to fit a regression line and calculates the Mean Squared Error (MSE) to evaluate the fit.
Visualizations:
X-Y Scatter Plot: Displays the data points, with a red regression line showing the fitted model.
Heatmap: Visualizes the correlation matrix, showing the strength of relationships between variables.
This program helps to understand relationships between variables using correlation, regression, and visual tools.
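The corresponding code listing is not reproduced on these pages; a minimal sketch of the workflow described above (the synthetic X-Y data is an illustrative placeholder, so the figures will differ from the output below) could be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative data: Y roughly linear in X with added noise (placeholder values)
np.random.seed(0)
x = np.random.uniform(0, 50, 100)
y = 2.5 * x + 5 + np.random.normal(0, 20, 100)
df = pd.DataFrame({'X': x, 'Y': y})

# Pearson correlation matrix and Spearman rank correlation
print("Pearson Correlation Coefficient Matrix:")
print(df.corr())
print("Spearman Rank Correlation Coefficient:", df['X'].corr(df['Y'], method='spearman'))

# Linear regression of Y on X
model = LinearRegression().fit(df[['X']], df['Y'])
y_pred = model.predict(df[['X']])
print(f"Linear Regression Equation: Y = {model.coef_[0]:.2f}X + {model.intercept_:.2f}")
print("Mean Squared Error (MSE):", mean_squared_error(df['Y'], y_pred))

# X-Y scatter plot with the fitted regression line
plt.scatter(df['X'], df['Y'], alpha=0.7)
plt.plot(df['X'], y_pred, color='red', label='Regression line')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()

# Heatmap of the correlation matrix
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix Heatmap')
plt.show()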
OUTPUT:
Pearson Correlation Coefficient Matrix:
          X         Y
X  1.000000  0.952966
Y  0.952966  1.000000
Spearman Rank Correlation Coefficient: 0.9519351935193517
Linear Regression Equation: Y = 2.39X + 5.38
Mean Squared Error (MSE): 504.11535247940856
10) Program to implement PCA for Wisconsin dataset, visualize and analyze the
results.
This program demonstrates Principal Component Analysis (PCA) on the Wisconsin Breast Cancer dataset to reduce the dimensionality of the data, visualize the results, and analyze the explained variance of the components.
Principal Component Analysis (PCA):
PCA is a technique used to reduce the dimensionality of large datasets while preserving as much information as possible. It transforms the original features into new, uncorrelated variables called principal components. The goal is to project the data into fewer dimensions, typically 2 or 3, for easier visualization while retaining most of the data's variance.
Standardization:
Before applying PCA, the data is standardized using StandardScaler to ensure that each feature has zero mean and unit variance. This is important because PCA is sensitive to the scale of the data.
Applying PCA:
PCA is performed to reduce the data to 2 principal components for visualization. The program then calculates the explained variance ratio, which tells us how much variance (information) each principal component captures.
Visualization:
PCA Scatter Plot: The program creates a scatter plot of the first two principal components (PCA1 and PCA2) to visualize how the data points are distributed in the reduced space. Points are colored based on the target variable (malignant or benign).
Explained Variance: A bar plot shows how much variance each of the first two principal components explains.
Cumulative Variance: A line plot shows how much cumulative variance is explained as more components are added.
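The listing below starts at the plotting stage; the earlier steps (loading the dataset, standardizing, and fitting the 2-component PCA) fall on pages not shown here. A minimal sketch that defines the names used below (X_scaled, pca_df, target_names, explained_variance_ratio) could be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Wisconsin Breast Cancer dataset
cancer = load_breast_cancer()
X, y, target_names = cancer.data, cancer.target, cancer.target_names

# Standardize the features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 principal components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_

# Collect the components and the target in one DataFrame for plotting
pca_df = pd.DataFrame(X_pca, columns=['PCA1', 'PCA2'])
pca_df['Target'] = y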
plt.figure(figsize=(8, 6))
sns.scatterplot(data=pca_df, x='PCA1', y='PCA2', hue='Target', palette='Set1', alpha=0.8)
plt.title('PCA of Wisconsin Breast Cancer Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(target_names)
plt.grid()
plt.show()

# Plot explained variance ratio
plt.figure(figsize=(8, 5))
plt.bar(range(1, 3), explained_variance_ratio, tick_label=['PCA1', 'PCA2'], color='skyblue')
plt.title('Explained Variance Ratio of PCA Components')
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.show()

# Full PCA with all components for analysis
pca_full = PCA()
X_pca_full = pca_full.fit_transform(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Plot cumulative explained variance
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--', color='b')
plt.title('Cumulative Explained Variance')
plt.xlabel('Number of Principal Components')
OUTPUT:
11) Program to implement the working of linear discriminant analysis using iris
dataset and visualize the results.
Linear Discriminant Analysis (LDA) is a technique used for dimensionality reduction and classification. It aims to find the linear combinations of features that best separate the classes in the dataset. Unlike Principal Component Analysis (PCA), which maximizes variance, LDA focuses on maximizing class separability.
Key Steps in LDA:
Data Standardization: Before applying LDA, the data is scaled so that each feature has zero mean and unit variance. This ensures that all features contribute equally to the analysis.
Compute Discriminants: LDA computes new axes (called discriminants) that maximize the difference between classes.
Dimensionality Reduction: LDA reduces the dataset to fewer dimensions while preserving as much class separation as possible. In this case, we reduce it to 2 dimensions for easier visualization.
Visualization: The transformed data is plotted in a 2D space, showing how well the classes (species in the Iris dataset) are separated.
Application in the Iris Dataset:
The Iris dataset has 4 features, and LDA reduces it to 2 components for visualization. LDA is useful in classification tasks, where the goal is to predict the class label of new data points based on the transformed features.
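The code pages for this program are not reproduced here; a minimal sketch of the steps described above, using scikit-learn on the Iris dataset, could be:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris dataset (4 features, 3 species)
iris = load_iris()
X, y = iris.data, iris.target

# Standardize features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 linear discriminants that maximize class separability
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

# Visualize the transformed data, colored by species
for label, name in enumerate(iris.target_names):
    plt.scatter(X_lda[y == label, 0], X_lda[y == label, 1], label=name, alpha=0.8)
plt.xlabel('Linear Discriminant 1')
plt.ylabel('Linear Discriminant 2')
plt.title('LDA of Iris Dataset')
plt.legend()
plt.show()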
OUTPUT:
12) Program to implement multiple linear regression using iris dataset, visualize and analyze the results.

Multiple Linear Regression (MLR) is a technique used to predict a target variable based on the relationship between multiple input variables. It helps in understanding how different features affect the outcome.
Key Concepts:
Prediction: MLR creates a model that predicts a target variable using multiple independent variables.
Training: The model learns from the training data by adjusting coefficients for each feature.
Evaluation: The model's accuracy is measured using metrics like Mean Squared Error (MSE) and R-squared (R²).
Application to the Iris Dataset:
In this case, we predict the petal length based on other features like sepal length and petal width. The dataset is split into a training set and a test set. The model is trained on the training set and evaluated on the test set.
Steps:
1. Training: Fit the model using the training data.
2. Prediction: Make predictions on the test data.
3. Evaluation: Use MSE and R² to assess model performance.
4. Visualization: Compare the actual vs predicted values using a plot.
MLR is commonly used when there are multiple factors influencing the outcome and helps in making predictions based on them.
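The code listing is not reproduced here; a minimal sketch of the described workflow (predicting petal length from the other three Iris features with scikit-learn, matching the coefficient names in the output below; the split ratio and random seed are assumptions) could be:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the Iris data and predict petal length from the remaining three features
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df[['sepal length (cm)', 'sepal width (cm)', 'petal width (cm)']]
y = df['petal length (cm)']

# Train/test split, model fitting and prediction
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
print("Multiple Linear Regression Results")
print("----------------------------------")
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred):.4f}")
print(f"R-squared (R²): {r2_score(y_test, y_pred):.4f}")
print("Model Coefficients:")
for name, coef in zip(X.columns, model.coef_):
    print(f"  {name}: {coef:.4f}")
print(f"Intercept: {model.intercept_:.4f}")

# Visualization: actual vs predicted petal length
plt.scatter(y_test, y_pred, alpha=0.8)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Petal Length (cm)')
plt.ylabel('Predicted Petal Length (cm)')
plt.title('Actual vs Predicted Petal Length')
plt.show()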
OUTPUT:
Multiple Linear Regression Results
----------------------------------
Mean Squared Error (MSE): 0.1300
R-squared (R²): 0.9603
Model Coefficients:
sepal length (cm): 0.7228
sepal width (cm): -0.6358
petal width (cm): 1.4675
Intercept: -0.2622