DS Journal-1

VIKAS VIDYA EDUCATION TRUST'S

Lords Universal College


Department of Computer Science

CERTIFICATE

This is to certify that Mr./Ms. ____________________ of ____________________, Uni. Exam No. ___________ (____ Semester) has satisfactorily completed the Practicals in the subject of ____________________ as a part of the B.Sc. in Computer Science Program during the academic year 20____ - 20____.

Place:

Date:

Subject In-charge                    Co-ordinator, Department of Computer Science

Signature of External Examiner


INDEX

Sr No. | Date | Practical | Signature

1 | 23/01/25 | Data Frames and Basic Data Pre-processing
    ● Read data from CSV and JSON files into a data frame.
    ● Perform basic data pre-processing tasks such as handling missing values and outliers.
    ● Manipulate and transform data using functions like filtering, sorting, and grouping.

2 | 30/01/25 | Feature Scaling and Dummification
    ● Apply feature-scaling techniques like standardization and normalization to numerical features.
    ● Perform feature dummification to convert categorical variables into numerical representations.

3 | 30/01/25 | Regression and Its Types
    ● Implement simple linear regression using a dataset.
    ● Explore and interpret the regression model coefficients and goodness-of-fit measures.
    ● Extend the analysis to multiple linear regression and assess the impact of additional predictors.

4 | 06/02/25 | Logistic Regression and Decision Tree
    ● Build a logistic regression model to predict a binary outcome.
    ● Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
    ● Construct a decision tree model and interpret the decision rules for classification.

5 | 06/02/25 | K-Means Clustering
    ● Apply the K-Means algorithm to group similar data points into clusters.
    ● Determine the optimal number of clusters using the elbow method or silhouette analysis.
    ● Visualize the clustering results and analyze the cluster characteristics.

6 | 13/02/25 | Principal Component Analysis (PCA)
    ● Perform PCA on a dataset to reduce dimensionality.
    ● Evaluate the explained variance and select the appropriate number of principal components.
    ● Visualize the data in the reduced-dimensional space.

7 | 20/02/25 | Introduction to Excel
    ● Perform conditional formatting on a dataset using various criteria.
    ● Create a pivot table to analyze and summarize data.
    ● Use the VLOOKUP function to retrieve information from a different worksheet or table.
    ● Perform what-if analysis using Goal Seek to determine input values for a desired output.

8 | 20/02/25 | Hypothesis Testing
    ● Formulate null and alternative hypotheses for a given problem.
    ● Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
    ● Interpret the results and draw conclusions based on the test outcomes.
Practical No. 01: Data Frames and Basic Data Pre-processing

Aim: Read data from CSV and JSON files into a data frame. Perform basic data pre-processing tasks such as handling missing values and outliers. Manipulate and transform data using functions like filtering, sorting, and grouping.
In [1]: import pandas as pd
import numpy as np

# Reading a CSV file into a DataFrame


df=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")

print(df.head()) # Display the first 5 rows of the DataFrame

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species


0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
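The aim also calls for reading JSON data, which the notebook does not show. A minimal sketch, assuming a hypothetical Iris.json file containing the same columns (the file path is a placeholder, not part of the original notebook):

# Reading a JSON file into a DataFrame (hypothetical path; an Iris.json export of the same data is assumed)
df_json = pd.read_json(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.json")
print(df_json.head())  # Same kind of preview as for the CSV DataFrame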

In [2]: # Step 2: Basic Data Exploration


df.head()

Out[2]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

In [3]: df.tail()
Out[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

In [4]: df.describe()

Out[4]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 75.500000 5.843333 3.054000 3.758667 1.198667

std 43.445368 0.828066 0.433594 1.764420 0.763161

min 1.000000 4.300000 2.000000 1.000000 0.100000

25% 38.250000 5.100000 2.800000 1.600000 0.300000

50% 75.500000 5.800000 3.000000 4.350000 1.300000

75% 112.750000 6.400000 3.300000 5.100000 1.800000

max 150.000000 7.900000 4.400000 6.900000 2.500000

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

In [6]: df.shape

Out[6]: (150, 6)

In [7]: df.dtypes
Out[7]: Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object

In [8]: df.columns

Out[8]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',


'Species'],
dtype='object')

In [9]: # Step 3: Checking for Missing Values


# Checking for missing values in each column of the CSV DataFrame
missing_values = df.isnull().sum()
print("\nMissing Values in CSV Data:")
print(missing_values)

Missing Values in CSV Data:


Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

In [10]: # Drop the non-numeric Species column so the numeric steps below (mean imputation, Z-scores) apply cleanly
df = df.drop(columns=['Species'])

In [11]: # Step 4: Handling Missing Values

# Fill missing values in each column with the mean of that column
# (You could also drop missing rows or use other strategies depending on your needs.)
fill = df.fillna(df.mean())
print("\nFilled Missing Values with Mean (CSV Data):")
print(fill.head())

Filled Missing Values with Mean (CSV Data):


Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2

In [12]: # Alternatively, you can drop rows with missing values:


# df_csv_dropped = df_csv.dropna()
# print("\nDropped Rows with Missing Values (CSV Data):")
# print(df_csv_dropped.head())

In [13]: # Step 5: Handling Outliers


# Here we will calculate Z-scores and remove rows where Z-score is greater than 3
z_scores = np.abs((fill - fill.mean()) / fill.std())
z_scores
Out[13]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

0 1.714797 0.897674 1.028611 1.336794 1.308593

1 1.691780 1.139200 0.124540 1.336794 1.308593

2 1.668762 1.380727 0.336720 1.393470 1.308593

3 1.645745 1.501490 0.106090 1.280118 1.308593

4 1.622728 1.018437 1.259242 1.336794 1.308593

... ... ... ... ... ...

145 1.622728 1.034539 0.124540 0.816888 1.443121

146 1.645745 0.551486 1.277692 0.703536 0.918985

147 1.668762 0.793012 0.124540 0.816888 1.050019

148 1.691780 0.430722 0.797981 0.930239 1.443121

149 1.714797 0.068433 0.124540 0.760212 0.787951

150 rows × 5 columns

In [14]: # Remove rows where any Z-score is greater than 3 (outliers)


do = fill[(z_scores < 3).all(axis=1)]
print("\nData After Removing Outliers (CSV Data):")
print(do.head())

Data After Removing Outliers (CSV Data):


Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2

In [15]: # Step 6: Filtering Data (Example: Select rows where a column value is greater than a threshold)
threshold_value = 3 # Example threshold value
filter = do[do['SepalLengthCm'] > threshold_value]
print(f"\nFiltered Data (Rows with column_name > {threshold_value}):")
print(filter)

Filtered Data (Rows with column_name > 3):


Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

[149 rows x 5 columns]


In [16]: # Step 7: Sorting Data (Sorting by a column in descending order)
df_sorted = filter.sort_values(by='SepalWidthCm', ascending=False)
print(df_sorted.head())

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm


33 34 5.5 4.2 1.4 0.2
32 33 5.2 4.1 1.5 0.1
14 15 5.8 4.0 1.2 0.2
16 17 5.4 3.9 1.3 0.4
5 6 5.4 3.9 1.7 0.4

In [17]: # Step 8: Grouping Data (Example: Group by a column and aggregate other columns)
df_grouped = df_sorted.groupby('SepalLengthCm').agg({
    'PetalLengthCm': 'mean',  # Mean of PetalLengthCm for each group
    'PetalWidthCm': 'sum'     # Sum of PetalWidthCm for each group
}).reset_index()  # Reset index to avoid a multi-index
print("\nGrouped Data (Mean and Sum for Each Group):")
print(df_grouped.head())

Grouped Data (Mean and Sum for Each Group):


SepalLengthCm PetalLengthCm PetalWidthCm
0 4.3 1.100000 0.1
1 4.4 1.333333 0.6
2 4.5 1.300000 0.3
3 4.6 1.325000 0.9
4 4.7 1.450000 0.4
Practical No. 02: Feature Scaling and Dummification

Part I: Apply feature-scaling techniques like standardization and normalization to numerical features.
In [1]: import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [2]: df = pd.read_csv(r"C:\Users\DELL\Desktop\wine.csv")
df

Out[2]: Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.pheno

0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.2

1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.2

2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.3

3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.2

4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.3

... ... ... ... ... ... ... ... ...

173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.5

174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.4

175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.4

176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.5

177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.5

178 rows × 14 columns

In [3]: df1 = pd.read_csv(r"C:\Users\DELL\Desktop\wine.csv", usecols=[0, 1, 2], skiprows=1)


df1.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
df1

Original DataFrame:

Out[3]: classlabel Alcohol Malic Acid

0 1 13.20 1.78

1 1 13.16 2.36

2 1 14.37 1.95

3 1 13.24 2.59

4 1 14.20 1.76

... ... ... ...

172 3 13.71 5.65

173 3 13.40 3.91

174 3 13.27 4.28

175 3 13.17 2.59

176 3 14.13 4.10

177 rows × 3 columns

MinMax Scaler
MinMaxScaler rescales each feature so that its minimum maps to 0 and its maximum maps to 1 (or any other specified range). It shrinks the data into the given range, usually 0 to 1, by scaling the feature values without changing the shape of the original distribution.

In [ ]: scaling=MinMaxScaler()
scaled_value=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_value
print("\n Dataframe after MinMax Scaling")
df1

StandardScaler
StandardScaler is a preprocessing technique in scikit-learn that standardizes features by removing the mean and scaling to unit variance, so that each feature ends up with a mean of zero and a standard deviation of one. Putting all features on the same scale prevents any single feature from dominating the learning process because of its larger magnitude.

In [4]: scaling=StandardScaler()
scaled_standardvalue=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_standardvalue
print("\n Dataframe after Standard Scaling")
df1

Dataframe after Standard Scaling


Out[4]: classlabel Alcohol Malic Acid

0 1 0.255824 -0.501624

1 1 0.206229 0.018020

2 1 1.706501 -0.349315

3 1 0.305420 0.224086

4 1 1.495719 -0.519543

... ... ... ...

172 3 0.888171 2.965658

173 3 0.503803 1.406725

174 3 0.342617 1.738222

175 3 0.218628 0.224086

176 3 1.408926 1.576953

177 rows × 3 columns

Part II: Perform feature dummification to convert categorical variables into numerical representations.
In [5]: import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [6]: iris=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")


iris

Out[6]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

In [7]: le=LabelEncoder()
iris['code']=le.fit_transform(iris.Species)
iris

Out[7]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species code

Iris-
0 1 5.1 3.5 1.4 0.2 0
setosa

Iris-
1 2 4.9 3.0 1.4 0.2 0
setosa

Iris-
2 3 4.7 3.2 1.3 0.2 0
setosa

Iris-
3 4 4.6 3.1 1.5 0.2 0
setosa

Iris-
4 5 5.0 3.6 1.4 0.2 0
setosa

... ... ... ... ... ... ... ...

Iris-
145 146 6.7 3.0 5.2 2.3 2
virginica

Iris-
146 147 6.3 2.5 5.0 1.9 2
virginica

Iris-
147 148 6.5 3.0 5.2 2.0 2
virginica

Iris-
148 149 6.2 3.4 5.4 2.3 2
virginica

Iris-
149 150 5.9 3.0 5.1 1.8 2
virginica

150 rows × 7 columns

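LabelEncoder above gives each species an integer code. For dummification in the one-hot sense, a minimal sketch using pandas.get_dummies on the same DataFrame (the prefix and the resulting column names are assumptions based on the species labels):

# One-hot encode Species: each species gets its own 0/1 indicator column
iris_dummies = pd.get_dummies(iris, columns=['Species'], prefix='Species')
print(iris_dummies.head())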
Practical No. 03: Regression and Its Types

Aim: Implement simple linear regression using a dataset. Explore and interpret the regression model coefficients and goodness-of-fit measures.
In [1]: import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

In [2]: df = pd.read_csv(r"C:\Users\DELL\Downloads\fetch_california_housing.csv")
df.head()

Out[2]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitud

0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.2

1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.2

2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.2

3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.2

4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.2

In [3]: df.tail()

Out[3]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Long

20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -

20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -

20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -

20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -

20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -

In [4]: df.shape

Out[4]: (20640, 9)

In [5]: df.size
Out[5]: 185760

In [6]: df.describe()

Out[6]: MedInc HouseAge AveRooms AveBedrms Population AveOccup

count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.00000

mean 3.870671 28.639486 5.429000 1.096675 1425.476744 3.07065

std 1.899822 12.585558 2.474173 0.473911 1132.462122 10.38605

min 0.499900 1.000000 0.846154 0.333333 3.000000 0.69230

25% 2.563400 18.000000 4.440716 1.006079 787.000000 2.42974

50% 3.534800 29.000000 5.229129 1.048780 1166.000000 2.81811

75% 4.743250 37.000000 6.052381 1.099526 1725.000000 3.28226

max 15.000100 52.000000 141.909091 34.066667 35682.000000 1243.33333

In [7]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB

In [8]: df.dtypes

Out[8]: MedInc float64


HouseAge float64
AveRooms float64
AveBedrms float64
Population float64
AveOccup float64
Latitude float64
Longitude float64
MedHouseVal float64
dtype: object

In [9]: #import ssl


#ssl._create_default_https_context = ssl._create_unverified_context

housing = fetch_california_housing()
# Convert to DataFrame
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df.head() # Print first few rows

Out[9]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitud

0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.2

1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.2

2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.2

3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.2

4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.2

In [10]: housing_df['PRICE']=housing.target
X=housing_df[['AveRooms']]
y=housing_df[['PRICE']]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [11]: model=LinearRegression()
model.fit(X_train,y_train)

Out[11]: ▾ LinearRegression

LinearRegression()

In [12]: mse=mean_squared_error(y_test,model.predict(X_test))
r2=r2_score(y_test,model.predict(X_test))

In [13]: print("Mean Squared Error: ", mse)


print("R-squared: ",r2)
print("Intercept: ",model.intercept_)
print("Co-efficient: ",model.coef_)

Mean Squared Error: 1.2923314440807299


R-squared: 0.013795337532284901
Intercept: [1.65476227]
Co-efficient: [[0.07675559]]

Part II: Extend the analysis to multiple linear regression and assess the impact of additional predictors.
In [14]: X = housing_df.drop('PRICE',axis=1)
y = housing_df['PRICE']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [15]: model = LinearRegression()


model.fit(X_train,y_train)
y_pred = model.predict(X_test)
In [16]: mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

In [17]: print("Mean Squared Error:",mse)


print("R-squared:",r2)
print("Intercept:",model.intercept_)
print("Coefficient:",model.coef_)

Mean Squared Error: 0.555891598695244


R-squared: 0.5757877060324511
Intercept: -37.023277706064064
Coefficient: [ 4.48674910e-01 9.72425752e-03 -1.23323343e-01 7.83144907e-01
-2.02962058e-06 -3.52631849e-03 -4.19792487e-01 -4.33708065e-01]
Practical No. 04: Logistic Regression and Decision Tree

Part I: Build a logistic regression model to predict a binary outcome. Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
In [1]: import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

In [2]: df=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")


df

Out[2]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

In [3]: # Keep only two classes
# Note: Species holds string labels ('Iris-setosa', ...), so comparing it with the integer 2
# removes no rows; all 150 rows remain, as the output below shows. To actually keep two
# classes, one would filter by label, e.g. df[df['Species'] != 'Iris-virginica'].

df1 = df[df['Species'] != 2]
df1
Out[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

In [4]: # Keep only two classes (filter out class 2; as noted above, this has no effect because Species holds string labels)

df = df[df['Species'] != 2]

# Define features and target


X = df.drop('Species', axis=1)
y = df['Species']

In [5]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\lin
ear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=
1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Out[5]: ▾ LogisticRegression

LogisticRegression()
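The ConvergenceWarning printed above already suggests the fix: increase max_iter or scale the features. A minimal sketch combining both remedies in a scikit-learn pipeline (the variable name pipe is illustrative, not from the original notebook):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features and give the lbfgs solver more iterations to converge
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)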

In [6]: # Predictions
y_pred_logistic = logistic_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))
Accuracy: 1.0

Classification Report
precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 10


Iris-versicolor 1.00 1.00 1.00 9
Iris-virginica 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Part II: Construct a decision tree model and interpret the decision rules for classification.
In [7]: from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [8]: model = DecisionTreeClassifier()

model.fit(X_train, y_train)
y_pred_tree = model.predict(X_test)
y_pred_tree

Out[8]: array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica',


'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
'Iris-virginica', 'Iris-setosa', 'Iris-setosa'], dtype=object)

In [9]: # Print Decision Tree Metrics


print("\nDecision Tree Metrics")
print("Accuracy: ", accuracy_score(y_test, y_pred_tree))
print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))
Decision Tree Metrics
Accuracy: 1.0

Classification Report
precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 10


Iris-versicolor 1.00 1.00 1.00 9
Iris-virginica 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
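To interpret the decision rules, as the aim asks, the fitted tree can be dumped as text. A minimal sketch using sklearn.tree.export_text (the exact rules depend on the train/test split, so no output is reproduced here):

from sklearn.tree import export_text

# Print the learned decision rules as nested threshold tests on the features
rules = export_text(model, feature_names=list(X.columns))
print(rules)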

Practical No. 05: K-Means Clustering

Aim: Apply the K-Means algorithm to group similar data points into clusters. Determine the optimal number of clusters using the elbow method or silhouette analysis. Visualize the clustering results and analyze the cluster characteristics.
In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

In [2]: # Generate synthetic data


X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Step 1: Elbow Method to find the optimal number of clusters


inertia = []
K_range = range(1, 11)

In [3]: for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

In [4]: # Plot Elbow Curve


plt.plot(K_range, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
In [5]: # Step 2: Apply K-Means with the chosen k (let's pick k=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X)

In [6]: # Step 3: Visualize Clustering Results


plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', edgecolors='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
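The aim also allows silhouette analysis as an alternative to the elbow method. A minimal sketch that scores each candidate k on the same synthetic data (the range 2 to 10 is an assumption; the silhouette score needs at least two clusters):

from sklearn.metrics import silhouette_score

# Higher average silhouette score means better-separated, more cohesive clusters
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"k={k}: silhouette score = {silhouette_score(X, labels):.3f}")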
Practical No. 06: Principal Component Analysis (PCA)

Aim: Perform PCA on a dataset to reduce dimensionality. Evaluate the explained variance and select the appropriate number of principal components. Visualize the data in the reduced-dimensional space.
In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

In [2]: # Load dataset (Iris dataset)


data = load_iris()
X = data.data # Features
y = data.target # Labels

In [3]: # Standardize the data (important for PCA)


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

In [4]: # Evaluate explained variance


explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

In [5]: # Plot explained variance


plt.figure(figsize=(6, 4))
plt.plot(range(1, len(explained_variance) + 1), cumulative_variance, marker='o', linestyle='--')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.grid(True)
plt.show()
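A common way to choose the number of components, consistent with the cumulative-variance curve above, is to keep just enough components to reach a target share of variance. A minimal sketch with a 95% threshold (the threshold value itself is an assumption):

# Smallest number of components whose cumulative explained variance reaches 95%
n_components_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1
print("Components needed for 95% variance:", n_components_95)

# PCA can also pick the count itself when n_components is a float in (0, 1)
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print("Reduced data shape:", X_pca_95.shape)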
In [6]: # Choose first two principal components for visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)

# Scatter plot of the first two principal components


plt.figure(figsize=(6, 4))
plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: 2D Projection of Data')
plt.colorbar(label='Class Label')
plt.grid(True)
plt.show()
