DS Journal-1

This document is a certificate from Lords Universal College certifying the completion of practicals in the B.Sc. in Computer Science Program. It includes an index of practicals covering topics such as data frames, feature scaling, regression analysis, and hypothesis testing, with detailed steps and code examples for each practical. The document serves as a record of the student's hands-on experience with various data analysis techniques and tools.

Uploaded by

akhileshworks593

VIKAS VIDYA EDUCATION TRUST'S

Lords Universal College

Department of Computer Science

CERTIFICATE

This is to certify that Mr./Ms. ________________ of Uni. Exam No. _______ ( ____ Semester) has
satisfactorily completed ____ Practical, in the subject of ________________ as a part of B.Sc. in
Computer Science Program during the academic year 20__ - 20__.

Place:

Date:

Subject In-charge Co-Ordinator,

Department of Computer
Science

Signature of External Examiner

INDEX

Sr No.   Date       Practical                                    Signature

1        23/01/25   Data Frames and Basic Data Pre-processing
● Read data from CSV and JSON files into a data frame.
● Perform basic data pre-processing tasks such as handling missing values and outliers.
● Manipulate and transform data using functions like filtering, sorting, and grouping.

2        30/01/25   Feature Scaling and Dummification
● Apply feature-scaling techniques like standardization and normalization to numerical features.
● Perform feature dummification to convert categorical variables into numerical representations.

3        30/01/25   Regression and Its Types
● Implement simple linear regression using a dataset.
● Explore and interpret the regression model coefficients and goodness-of-fit measures.
● Extend the analysis to multiple linear regression and assess the impact of additional predictors.

4        06/02/25   Logistic Regression and Decision Tree
● Build a logistic regression model to predict a binary outcome.
● Evaluate the model's performance using classification metrics (e.g., accuracy, precision, recall).
● Construct a decision tree model and interpret the decision rules for classification.

5        06/02/25   K-Means Clustering
● Apply the K-Means algorithm to group similar data points into clusters.
● Determine the optimal number of clusters using elbow method or silhouette analysis.
● Visualize the clustering results and analyze the cluster characteristics.

6        13/02/25   Principal Component Analysis (PCA)
● Perform PCA on a dataset to reduce dimensionality.
● Evaluate the explained variance and select the appropriate number of principal components.
● Visualize the data in the reduced-dimensional space.

7        20/02/25   Introduction to Excel
● Perform conditional formatting on a dataset using various criteria.
● Create a pivot table to analyze and summarize data.
● Use VLOOKUP function to retrieve information from a different worksheet or table.
● Perform what-if analysis using Goal Seek to determine input values for desired output.

8        20/02/25   Hypothesis Testing
● Formulate null and alternative hypotheses for a given problem.
● Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi square test).
● Interpret the results and draw conclusions based on the test outcomes.

Practical No: 01 - Data Frames and Basic Data Pre-processing

Aim: Read data from CSV and JSON files into a data frame. Perform basic data pre-processing
tasks such as handling missing values and outliers. Manipulate and transform data using
functions like filtering, sorting, and grouping.
In [1]: import pandas as pd
import numpy as np

# Reading a CSV file into a DataFrame

df=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")

print(df.head()) # Display the first 5 rows of the DataFrame

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
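
The aim also mentions JSON files, while the cells in this practical only read a CSV. A minimal sketch of the JSON case is given here, assuming a small in-memory payload; the JSON text and the name df_json are illustrative and not part of the original journal.

```python
import io
import pandas as pd

# Hypothetical JSON payload (two records modelled on the Iris rows above).
json_text = (
    '[{"Id": 1, "SepalLengthCm": 5.1, "Species": "Iris-setosa"},'
    ' {"Id": 2, "SepalLengthCm": 4.9, "Species": "Iris-setosa"}]'
)

# read_json accepts a path or a file-like object; StringIO stands in for a file on disk.
df_json = pd.read_json(io.StringIO(json_text))
print(df_json.head())
```

With a real file, the StringIO object would be replaced by the file path.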

In [2]: # Step 2: Basic Data Exploration

df.head()

Out[2]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

In [3]: df.tail()
Out[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

In [4]: df.describe()

Out[4]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

count 150.000000 150.000000 150.000000 150.000000 150.000000

mean 75.500000 5.843333 3.054000 3.758667 1.198667

std 43.445368 0.828066 0.433594 1.764420 0.763161

min 1.000000 4.300000 2.000000 1.000000 0.100000

25% 38.250000 5.100000 2.800000 1.600000 0.300000

50% 75.500000 5.800000 3.000000 4.350000 1.300000

75% 112.750000 6.400000 3.300000 5.100000 1.800000

max 150.000000 7.900000 4.400000 6.900000 2.500000

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 150 non-null int64
1 SepalLengthCm 150 non-null float64
2 SepalWidthCm 150 non-null float64
3 PetalLengthCm 150 non-null float64
4 PetalWidthCm 150 non-null float64
5 Species 150 non-null object
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

In [6]: df.shape

Out[6]: (150, 6)

In [7]: df.dtypes
Out[7]: Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object

In [8]: df.columns

Out[8]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',

'Species'],
dtype='object')

In [9]: # Step 3: Checking for Missing Values

# Checking for missing values in each column of the CSV DataFrame
missing_values = df.isnull().sum()
print("\nMissing Values in CSV Data:")
print(missing_values)

Missing Values in CSV Data:

Id 0
SepalLengthCm 0
SepalWidthCm 0
PetalLengthCm 0
PetalWidthCm 0
Species 0
dtype: int64

In [10]: df = df.drop(columns=['Species'])

In [11]: # Step 4: Handling Missing Values

# We will fill missing values in columns with the mean of the column
# (You could also drop missing rows or use other strategies depending on your need
fill= df.fillna(df.mean())
print("\nFilled Missing Values with Mean (CSV Data):")
print(fill.head())

Filled Missing Values with Mean (CSV Data):

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
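
The Iris data has no missing values, so the fill step above changes nothing. The sketch below shows the same calls on a toy frame that does contain gaps; the frame toy and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing entry per column.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, 4.6],
    "SepalWidthCm": [3.5, 3.0, np.nan, 3.1],
})

print(toy.isnull().sum())        # count of missing values per column

filled = toy.fillna(toy.mean())  # replace each NaN with its column mean
dropped = toy.dropna()           # alternative: discard incomplete rows

print(filled)
print(dropped)
```

Filling keeps all four rows, while dropping keeps only the two complete ones; which is preferable depends on how much data can be spared.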

In [12]: # Alternatively, you can drop rows with missing values:

# df_dropped = df.dropna()
# print("\nDropped Rows with Missing Values (CSV Data):")
# print(df_dropped.head())

In [13]: # Step 5: Handling Outliers

# Here we will calculate Z-scores and remove rows where Z-score is greater than 3
z_scores = np.abs((fill - fill.mean()) / fill.std())
z_scores
Out[13]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

0 1.714797 0.897674 1.028611 1.336794 1.308593

1 1.691780 1.139200 0.124540 1.336794 1.308593

2 1.668762 1.380727 0.336720 1.393470 1.308593

3 1.645745 1.501490 0.106090 1.280118 1.308593

4 1.622728 1.018437 1.259242 1.336794 1.308593

... ... ... ... ... ...

145 1.622728 1.034539 0.124540 0.816888 1.443121

146 1.645745 0.551486 1.277692 0.703536 0.918985

147 1.668762 0.793012 0.124540 0.816888 1.050019

148 1.691780 0.430722 0.797981 0.930239 1.443121

149 1.714797 0.068433 0.124540 0.760212 0.787951

150 rows × 5 columns

In [14]: # Remove rows where any Z-score is greater than 3 (outliers)

do = fill[(z_scores < 3).all(axis=1)]
print("\nData After Removing Outliers (CSV Data):")
print(do.head())

Data After Removing Outliers (CSV Data):

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
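
The Z-score rule above is one way to flag outliers. A common alternative, not used in the original journal, is the interquartile range (IQR) rule, sketched here on a made-up series with one planted outlier.

```python
import pandas as pd

# Hypothetical measurements; 9.9 is planted as an obvious outlier.
s = pd.Series([3.0, 3.1, 3.2, 2.9, 3.0, 3.3, 9.9, 3.1])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR below Q1 and above Q3.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = s[(s >= lower) & (s <= upper)]
print(kept.tolist())
```

The IQR rule does not depend on the mean or standard deviation, so a single extreme value has less influence on the fences than it has on Z-scores.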

In [15]: # Step 6: Filtering Data (Example: Select rows where a column value is greater tha
threshold_value = 3 # Example threshold value
filter = do[do['SepalLengthCm'] > threshold_value]
print(f"\nFiltered Data (Rows with column_name > {threshold_value}):")
print(filter)

Filtered Data (Rows with column_name > 3):

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

[149 rows x 5 columns]

In [16]: # Step 7: Sorting Data (Sorting by a column in descending order)
df_sorted = filter.sort_values(by='SepalWidthCm', ascending=False)
print(df_sorted.head())

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

33 34 5.5 4.2 1.4 0.2
32 33 5.2 4.1 1.5 0.1
14 15 5.8 4.0 1.2 0.2
16 17 5.4 3.9 1.3 0.4
5 6 5.4 3.9 1.7 0.4

In [17]: # Step 8: Grouping Data (Example: Group by a column and calculate the mean of anot
df_grouped = df_sorted.groupby('SepalLengthCm').agg({
'PetalLengthCm': 'mean', # Calculate the mean of 'another_column' for each gr
'PetalWidthCm': 'sum' # Calculate the sum of 'yet_another_column' for each gr
}).reset_index() # Reset index to avoid multi-index
print("\nGrouped Data (Mean and Sum for Each Group):")
print(df_grouped.head())

Grouped Data (Mean and Sum for Each Group):

SepalLengthCm PetalLengthCm PetalWidthCm
0 4.3 1.100000 0.1
1 4.4 1.333333 0.6
2 4.5 1.300000 0.3
3 4.6 1.325000 0.9
4 4.7 1.450000 0.4
Practical No: 02 - Feature Scaling and Dummification

Part I: Apply feature-scaling techniques like standardization and normalization to
numerical features
In [1]: import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

In [2]: df = pd.read_csv(r"C:\Users\DELL\Desktop\wine.csv")
df

Out[2]: Wine Alcohol Malic.acid Ash Acl Mg Phenols Flavanoids Nonflavanoid.pheno

0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.2

1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.2

2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.3

3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.2

4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.3

... ... ... ... ... ... ... ... ...

173 3 13.71 5.65 2.45 20.5 95 1.68 0.61 0.5

174 3 13.40 3.91 2.48 23.0 102 1.80 0.75 0.4

175 3 13.27 4.28 2.26 20.0 120 1.59 0.69 0.4

176 3 13.17 2.59 2.37 20.0 120 1.65 0.68 0.5

177 3 14.13 4.10 2.74 24.5 96 2.05 0.76 0.5

178 rows × 14 columns

In [3]: df1 = pd.read_csv(r"C:\Users\DELL\Desktop\wine.csv", usecols=[0, 1, 2], skiprows=1

df1.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
df1

Original DataFrame:

Out[3]: classlabel Alcohol Malic Acid

0 1 13.20 1.78

1 1 13.16 2.36

2 1 14.37 1.95

3 1 13.24 2.59

4 1 14.20 1.76

... ... ... ...

172 3 13.71 5.65

173 3 13.40 3.91

174 3 13.27 4.28

175 3 13.17 2.59

176 3 14.13 4.10

177 rows × 3 columns

MinMax Scaler
Min-max scaling is an alternative way to scale data: the minimum of each feature is mapped
to zero and the maximum to one. MinMaxScaler transforms each feature to a given range,
0 to 1 by default. Because the transformation is linear, it changes the scale of the values
without changing the shape of the original distribution.

In [ ]: scaling=MinMaxScaler()
scaled_value=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_value
print("\n Dataframe after MinMax Scaling")
df1
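
As a rough check of what MinMaxScaler computes, the sketch below applies x' = (x - min) / (max - min) by hand and compares it with the scaler. The array x reuses the first five Alcohol values shown above; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First five Alcohol values from the frame above, as a single-feature column.
x = np.array([[13.20], [13.16], [14.37], [13.24], [14.20]])

# Manual min-max scaling: x' = (x - min) / (max - min), computed per column.
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

scaled = MinMaxScaler().fit_transform(x)
print(np.allclose(manual, scaled))
```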

StandardScaler
StandardScaler is a preprocessing technique in scikit-learn that standardizes features by
removing the mean and scaling to unit variance. Each feature is transformed to have a mean
of zero and a standard deviation of one. This puts all features on the same scale, preventing
any single feature from dominating the learning process due to its larger magnitude.

In [4]: scaling=StandardScaler()
scaled_standardvalue=scaling.fit_transform(df1[['Alcohol','Malic Acid']])
df1[['Alcohol','Malic Acid']]=scaled_standardvalue
print("\n Dataframe after Standard Scaling")
df1

Dataframe after Standard Scaling

Out[4]: classlabel Alcohol Malic Acid

0 1 0.255824 -0.501624

1 1 0.206229 0.018020

2 1 1.706501 -0.349315

3 1 0.305420 0.224086

4 1 1.495719 -0.519543

... ... ... ...

172 3 0.888171 2.965658

173 3 0.503803 1.406725

174 3 0.342617 1.738222

175 3 0.218628 0.224086

176 3 1.408926 1.576953

177 rows × 3 columns
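
Standardization can be checked by hand with z = (x - mean) / std. The sketch below assumes the first five Malic Acid values shown earlier and compares the manual result with StandardScaler; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Malic Acid values from the frame above, as a single-feature column.
x = np.array([[1.78], [2.36], [1.95], [2.59], [1.76]])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is NumPy's default and what StandardScaler uses.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)
print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```

Note that pandas' DataFrame.std() defaults to the sample standard deviation (ddof=1), so Z-scores computed with pandas differ slightly from StandardScaler output on small samples.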

Part II: Perform feature dummification to convert categorical variables into
numerical representations.
In [5]: import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [6]: iris=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")

iris

Out[6]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

In [7]: le=LabelEncoder()
iris['code']=le.fit_transform(iris.Species)
iris

Out[7]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species code

0 1 5.1 3.5 1.4 0.2 Iris-setosa 0

1 2 4.9 3.0 1.4 0.2 Iris-setosa 0

2 3 4.7 3.2 1.3 0.2 Iris-setosa 0

3 4 4.6 3.1 1.5 0.2 Iris-setosa 0

4 5 5.0 3.6 1.4 0.2 Iris-setosa 0

... ... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica 2

146 147 6.3 2.5 5.0 1.9 Iris-virginica 2

147 148 6.5 3.0 5.2 2.0 Iris-virginica 2

148 149 6.2 3.4 5.4 2.3 Iris-virginica 2

149 150 5.9 3.0 5.1 1.8 Iris-virginica 2

150 rows × 7 columns
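
LabelEncoder, as used above, maps each category to a single integer code. Dummification in the one-hot sense creates one 0/1 column per category instead. A minimal sketch with pandas.get_dummies follows; the four-row frame species is a made-up stand-in for the Species column.

```python
import pandas as pd

# Hypothetical stand-in for the Species column of the Iris frame.
species = pd.DataFrame({
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(species, columns=["Species"], dtype=int)
print(dummies)
```

One-hot columns avoid implying an order between categories, which integer codes such as 0, 1, 2 can suggest to some models.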

Practical No: 03 - Regression and Its Types

Aim: To implement simple linear regression using a dataset. Explore and interpret the
regression model coefficients and goodness-of-fit measures.
In [1]: import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,r2_score

In [2]: df = pd.read_csv(r"C:\Users\DELL\Downloads\fetch_california_housing.csv")
df.head()

Out[2]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitud

0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.2

1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.2

2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.2

3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.2

4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.2

In [3]: df.tail()

Out[3]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Long

20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -

20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -

20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -

20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -

20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -

In [4]: df.shape

Out[4]: (20640, 9)

In [5]: df.size
Out[5]: 185760

In [6]: df.describe()

Out[6]: MedInc HouseAge AveRooms AveBedrms Population AveOccup

count 20640.000000 20640.000000 20640.000000 20640.000000 20640.000000 20640.00000

mean 3.870671 28.639486 5.429000 1.096675 1425.476744 3.07065

std 1.899822 12.585558 2.474173 0.473911 1132.462122 10.38605

min 0.499900 1.000000 0.846154 0.333333 3.000000 0.69230

25% 2.563400 18.000000 4.440716 1.006079 787.000000 2.42974

50% 3.534800 29.000000 5.229129 1.048780 1166.000000 2.81811

75% 4.743250 37.000000 6.052381 1.099526 1725.000000 3.28226

max 15.000100 52.000000 141.909091 34.066667 35682.000000 1243.33333

In [7]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
8 MedHouseVal 20640 non-null float64
dtypes: float64(9)
memory usage: 1.4 MB

In [8]: df.dtypes

Out[8]: MedInc float64

HouseAge float64
AveRooms float64
AveBedrms float64
Population float64
AveOccup float64
Latitude float64
Longitude float64
MedHouseVal float64
dtype: object

In [9]: #import ssl

#ssl._create_default_https_context = ssl._create_unverified_context

housing = fetch_california_housing()
# Convert to DataFrame
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df.head() # Print first few rows

Out[9]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitud

0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.2

1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.2

2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.2

3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.2

4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.2

In [10]: housing_df['PRICE']=housing.target
X=housing_df[['AveRooms']]
y=housing_df[['PRICE']]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [11]: model=LinearRegression()
model.fit(X_train,y_train)

Out[11]: ▾ LinearRegression

LinearRegression()

In [12]: mse=mean_squared_error(y_test,model.predict(X_test))
r2=r2_score(y_test,model.predict(X_test))

In [13]: print("Mean Squared Error: ", mse)

print("R-squared: ",r2)
print("Intercept: ",model.intercept_)
print("Co-efficient: ",model.coef_)

Mean Squared Error: 1.2923314440807299

R-squared: 0.013795337532284901
Intercept: [1.65476227]
Co-efficient: [[0.07675559]]
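
To interpret these figures: the coefficient says the predicted PRICE rises by about 0.077 per additional average room, and the R-squared of about 0.014 says AveRooms alone explains roughly 1.4% of the variance in the test set. The sketch below shows how MSE and R-squared are defined, using made-up observed and predicted values rather than the housing data.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only.
y_true = np.array([3.0, 1.5, 2.0, 4.5, 2.5])
y_hat = np.array([2.8, 1.9, 2.1, 3.9, 2.6])

# Mean squared error: average squared residual.
mse = np.mean((y_true - y_hat) ** 2)

# R^2 = 1 - SS_res / SS_tot: share of variance explained by the predictions.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R-squared:", r2)
```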

Part II: Extend the analysis to multiple linear regression and assess the impact of
additional predictors.
In [14]: X = housing_df.drop('PRICE',axis=1)
y = housing_df['PRICE']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [15]: model = LinearRegression()

model.fit(X_train,y_train)
y_pred = model.predict(X_test)
In [16]: mse = mean_squared_error(y_test,y_pred)
r2 = r2_score(y_test,y_pred)

In [17]: print("Mean Squared Error:",mse)

print("R-squared:",r2)
print("Intercept:",model.intercept_)
print("Coefficient:",model.coef_)

Mean Squared Error: 0.555891598695244

R-squared: 0.5757877060324511
Intercept: -37.023277706064064
Coefficient: [ 4.48674910e-01 9.72425752e-03 -1.23323343e-01 7.83144907e-01
-2.02962058e-06 -3.52631849e-03 -4.19792487e-01 -4.33708065e-01]
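
Moving from one predictor to eight raises R-squared from about 0.014 to about 0.576 and lowers the MSE from about 1.29 to about 0.56. Because R-squared tends to rise as predictors are added, one rough way to assess the gain is adjusted R-squared, which penalizes the number of predictors. The sketch below applies the usual formula to the figures printed above; applying it to test-set values is an assumption made for illustration (it is more commonly reported for the fitted data), and the function name is hypothetical.

```python
# Figures copied from the outputs above; n_test assumes the 20% split of 20640 rows.
n_test = 20640 // 5                      # 4128 test rows
r2_simple, p_simple = 0.013795337532284901, 1
r2_multi, p_multi = 0.5757877060324511, 8


def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


print("Simple regression:  ", adjusted_r2(r2_simple, n_test, p_simple))
print("Multiple regression:", adjusted_r2(r2_multi, n_test, p_multi))
```

With several thousand rows and only eight predictors the penalty is small, so the adjusted values stay close to the raw ones.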
Practical No: 04 - Logistic Regression and Decision Tree

Part I: Build a logistic regression model to predict a binary outcome. Evaluate the
model's performance using classification metrics (e.g., accuracy, precision, recall).
In [1]: import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

In [2]: df=pd.read_csv(r"C:\Users\DELL\Desktop\MSC Data science\Data sets\Iris.csv")
df

Out[2]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

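In [3]: # Intended to keep only two classes. Species holds string labels here,
# so comparing against the integer 2 removes no rows (all 150 remain below).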

df1 = df[df['Species'] != 2]
df1
Out[3]: Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species

0 1 5.1 3.5 1.4 0.2 Iris-setosa

1 2 4.9 3.0 1.4 0.2 Iris-setosa

2 3 4.7 3.2 1.3 0.2 Iris-setosa

3 4 4.6 3.1 1.5 0.2 Iris-setosa

4 5 5.0 3.6 1.4 0.2 Iris-setosa

... ... ... ... ... ... ...

145 146 6.7 3.0 5.2 2.3 Iris-virginica

146 147 6.3 2.5 5.0 1.9 Iris-virginica

147 148 6.5 3.0 5.2 2.0 Iris-virginica

148 149 6.2 3.4 5.4 2.3 Iris-virginica

149 150 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 6 columns

In [4]: # Intended to keep only two classes (filter out class 2). Species holds string
# labels, so this comparison leaves all three classes in place.

df = df[df['Species'] != 2]

# Define features and target

X = df.drop('Species', axis=1)
y = df['Species']

In [5]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_st

logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)

C:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Out[5]: ▾ LogisticRegression

LogisticRegression()

In [6]: # Predictions
y_pred_logistic = logistic_model.predict(X_test)

# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))
Accuracy: 1.0

Classification Report
precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 10

Iris-versicolor 1.00 1.00 1.00 9
Iris-virginica 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
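
The report above covers three classes, so the model was fitted as a multiclass problem. A binary version, as the aim describes, could be sketched as follows. This is an illustrative variant, not the journal's code: it assumes scikit-learn's bundled load_iris data (which has integer labels and no Id column), drops class 2, and picks random_state=42, stratify and max_iter=1000 as arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

iris = load_iris()
mask = iris.target != 2                  # drop class 2 (virginica); classes 0 and 1 remain
X_bin, y_bin = iris.data[mask], iris.target[mask]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_bin, y_bin, test_size=0.2, random_state=42, stratify=y_bin
)

# A larger max_iter than the default gives lbfgs more room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Precision and recall are reported for the positive class (label 1, versicolor).
print("Accuracy:", accuracy_score(y_te, y_hat))
print("Precision:", precision_score(y_te, y_hat))
print("Recall:", recall_score(y_te, y_hat))
```

Leaving out an identifier column matters here: in the CSV shown above the Id values run in species order, so a model given Id as a feature can separate the classes without using the measurements.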

Part II: Construct a decision tree model and interpret the decision rules for
classification.
In [7]: from sklearn.tree import DecisionTreeClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_st

In [8]: model = DecisionTreeClassifier()

model.fit(X_train, y_train)
y_pred_tree = model.predict(X_test)
y_pred_tree

Out[8]: array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica',

'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
'Iris-virginica', 'Iris-setosa', 'Iris-setosa'], dtype=object)

In [9]: # Print Decision Tree Metrics

print("\nDecision Tree Metrics")
print("Accuracy: ", accuracy_score(y_test, y_pred_tree))
print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))
Decision Tree Metrics
Accuracy: 1.0

Classification Report
precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 10

Iris-versicolor 1.00 1.00 1.00 9
Iris-virginica 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30
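
The cells above report metrics but do not display the tree's decision rules. One way to inspect them is scikit-learn's export_text, sketched here on the bundled Iris data. The max_depth=2 and random_state=0 settings are assumptions chosen to keep the printed rules short; they are not taken from the journal.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the rule listing readable (assumed setting).
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a threshold test; leaf lines give the predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each path from the top of the listing to a leaf reads as an if/else rule on the feature thresholds.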

Practical No: 05 - K-Means Clustering

Aim: Apply the K-Means algorithm to group similar data points into clusters. Determine the
optimal number of clusters using elbow method or silhouette analysis. Visualize the
clustering results and analyze the cluster characteristics.
In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

In [2]: # Generate synthetic data

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Step 1: Elbow Method to find the optimal number of clusters

inertia = []
K_range = range(1, 11)

In [3]: for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

In [4]: # Plot Elbow Curve

plt.plot(K_range, inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
In [5]: # Step 2: Apply K-Means with the chosen k (let's pick k=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X)

In [6]: # Step 3: Visualize Clustering Results

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', edgecolors='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
s=200, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering')
plt.legend()
plt.show()
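
The aim also names silhouette analysis as a way to choose k, while the cells above use only the elbow method. A minimal sketch of the silhouette approach on the same synthetic data follows; the names scores and best_k are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in the cells above.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 11):                   # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is the candidate choice.
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Silhouette values range from -1 to 1, with higher values indicating tighter, better separated clusters; comparing the result with the elbow plot is a reasonable cross-check.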
Practical No: 06 - Principal Component Analysis (PCA)

Aim: Perform PCA on a dataset to reduce dimensionality. Evaluate the explained variance
and select the appropriate number of principal components. Visualize the data in the
reduced-dimensional space.
In [1]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

In [2]: # Load dataset (Iris dataset)

data = load_iris()
X = data.data # Features
y = data.target # Labels

In [3]: # Standardize the data (important for PCA)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Perform PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

In [4]: # Evaluate explained variance

explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

In [5]: # Plot explained variance

plt.figure(figsize=(6, 4))
plt.plot(range(1, len(explained_variance) + 1), cumulative_variance, marker='o', line
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance vs. Number of Components')
plt.grid(True)
plt.show()
In [6]: # Choose first two principal components for visualization
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)

# Scatter plot of the first two principal components

plt.figure(figsize=(6, 4))
plt.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA: 2D Projection of Data')
plt.colorbar(label='Class Label')
plt.grid(True)
plt.show()
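
The cumulative explained variance can also be used to pick the number of components programmatically. The sketch below keeps the smallest number of components whose cumulative ratio reaches a threshold; the 0.95 cut-off is an assumed value, not one stated in the journal.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then fit PCA with all components, as in the cells above.
X_scaled = StandardScaler().fit_transform(load_iris().data)
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

threshold = 0.95                         # assumed cut-off for retained variance

# argmax returns the first index where the condition holds; +1 converts to a count.
n_components = int(np.argmax(cumulative >= threshold) + 1)
print(cumulative)
print("Components needed:", n_components)
```

scikit-learn's PCA also accepts a float between 0 and 1 for n_components, in which case it selects the number of components needed to reach that share of variance.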

No ratings yet
20011F0008 Samba PRC3
21 pages
Large Engineering Project Risk Analysis
No ratings yet
Large Engineering Project Risk Analysis
9 pages
Machine Learning-Based Breast Cancer Detection
No ratings yet
Machine Learning-Based Breast Cancer Detection
82 pages
Decision Trees
No ratings yet
Decision Trees
11 pages

DS Journal-1

Uploaded by

DS Journal-1

Uploaded by



Aim: Read data from CSV and JSON files

# Reading a CSV file into a DataFrame

print(df.head()) # Display the first 5 rows of the DataFrame

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
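The reading step above can be sketched as follows. This is a minimal illustration, not the journal's original cell: the inline `csv_text` and `json_text` strings are hypothetical stand-ins for the files on disk, and only the column names are taken from the output shown here.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for the CSV and JSON files on disk.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""
json_text = '[{"Id": 1, "Species": "Iris-setosa"}, {"Id": 2, "Species": "Iris-setosa"}]'

# read_csv and read_json accept either a file path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv.head())   # head() shows at most the first 5 rows
print(df_json.head())
```

With real files, the `io.StringIO(...)` argument would simply be replaced by the file path.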

In [2]: # Step 2: Basic Data Exploration

Out[2]:
      Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
0      1            5.1           3.5            1.4           0.2     Iris-setosa
1      2            4.9           3.0            1.4           0.2     Iris-setosa
2      3            4.7           3.2            1.3           0.2     Iris-setosa
3      4            4.6           3.1            1.5           0.2     Iris-setosa
4      5            5.0           3.6            1.4           0.2     Iris-setosa
145  146            6.7           3.0            5.2           2.3  Iris-virginica
146  147            6.3           2.5            5.0           1.9  Iris-virginica
147  148            6.5           3.0            5.2           2.0  Iris-virginica
148  149            6.2           3.4            5.4           2.3  Iris-virginica
149  150            5.9           3.0            5.1           1.8  Iris-virginica

Out[4]:
               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000

Out[8]: Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',

In [9]: # Step 3: Checking for Missing Values

Missing Values in CSV Data:

In [11]: # Step 4: Handling Missing Values

Filled Missing Values with Mean (CSV Data):

In [12]: # Alternatively, you can drop rows with missing values:
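Steps 3 and 4 can be sketched as below. The small DataFrame is hypothetical (the Iris data itself may contain no gaps), and filling with the column mean is applied to numeric columns only, which is an assumption about how the original cell behaved.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with deliberately missing entries.
df = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7, 4.6],
    "SepalWidthCm": [3.5, 3.0, np.nan, 3.1],
    "Species": ["Iris-setosa"] * 4,
})

# Step 3: count missing values per column.
missing_counts = df.isnull().sum()
print("Missing Values in CSV Data:")
print(missing_counts)

# Step 4, option 1: fill numeric gaps with each column's mean.
df_filled = df.fillna(df.mean(numeric_only=True))
print("Filled Missing Values with Mean (CSV Data):")
print(df_filled)

# Step 4, option 2: drop every row that has at least one missing value.
df_dropped = df.dropna()
```

Filling keeps all rows but pulls values toward the mean; dropping keeps only fully observed rows, which may be preferable when few rows are affected.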

In [13]: # Step 5: Handling Outliers

0    1.714797  0.897674  1.028611  1.336794  1.308593
1    1.691780  1.139200  0.124540  1.336794  1.308593
2    1.668762  1.380727  0.336720  1.393470  1.308593
3    1.645745  1.501490  0.106090  1.280118  1.308593
4    1.622728  1.018437  1.259242  1.336794  1.308593
...       ...       ...       ...       ...       ...
145  1.622728  1.034539  0.124540  0.816888  1.443121
146  1.645745  0.551486  1.277692  0.703536  0.918985
147  1.668762  0.793012  0.124540  0.816888  1.050019
148  1.691780  0.430722  0.797981  0.930239  1.443121
149  1.714797  0.068433  0.124540  0.760212  0.787951

150 rows × 5 columns

In [14]: # Remove rows where any Z-score is greater than 3 (outliers)

Data After Removing Outliers (CSV Data):
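The outlier step can be sketched as follows. The data is hypothetical, and the Z-scores are computed directly with pandas rather than with a statistics library; the absolute values in the table above appear consistent with the sample standard deviation (pandas' default, `ddof=1`), but that is an inference from the numbers, not something the journal states.

```python
import pandas as pd

# Hypothetical data: 19 ordinary measurements and one extreme value (50.0).
df = pd.DataFrame({
    "SepalLengthCm": [5.0, 5.2] * 9 + [5.1, 50.0],
    "SepalWidthCm": [3.0, 3.2] * 10,
})

# Absolute Z-score of every numeric cell: |x - mean| / std.
numeric = df.select_dtypes(include="number")
z_scores = ((numeric - numeric.mean()) / numeric.std()).abs()

# Keep a row only if none of its Z-scores exceeds 3.
df_no_outliers = df[(z_scores <= 3).all(axis=1)]

print("Data After Removing Outliers (CSV Data):")
print(df_no_outliers)
```

The threshold of 3 is the one named in the comment above; a stricter or looser cut-off changes only the constant in the comparison.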

Filtered Data (Rows with column_name > 3):

[149 rows x 5 columns]

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm

Grouped Data (Mean and Sum for Each Group):
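Filtering, sorting, and grouping can be sketched together as below. The rows are hypothetical, and `SepalWidthCm` is used only as an illustrative choice for the `column_name` placeholder in the filter label above.

```python
import pandas as pd

# Hypothetical rows using the Iris column names.
df = pd.DataFrame({
    "SepalWidthCm": [3.5, 3.0, 3.2, 2.8, 3.3],
    "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                "Iris-versicolor", "Iris-virginica"],
})

# Filtering: keep rows where the chosen column is greater than 3.
filtered = df[df["SepalWidthCm"] > 3]

# Sorting: order rows by petal length, largest first.
sorted_df = df.sort_values(by="PetalLengthCm", ascending=False)

# Grouping: mean and sum of petal length for each species.
grouped = df.groupby("Species")["PetalLengthCm"].agg(["mean", "sum"])

print("Filtered Data:")
print(filtered)
print("Grouped Data (Mean and Sum for Each Group):")
print(grouped)
```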

Part I: Apply feature-scaling techniques

Out[2]:
     Wine  Alcohol  Malic.acid   Ash   Acl   Mg  Phenols  Flavanoids  Nonflavanoid.pheno
0       1    14.23        1.71  2.43  15.6  127     2.80        3.06                 0.2
1       1    13.20        1.78  2.14  11.2  100     2.65        2.76                 0.2
2       1    13.16        2.36  2.67  18.6  101     2.80        3.24                 0.3
3       1    14.37        1.95  2.50  16.8  113     3.85        3.49                 0.2
4       1    13.24        2.59  2.87  21.0  118     2.80        2.69                 0.3
...   ...      ...         ...   ...   ...  ...      ...         ...                 ...
173     3    13.71        5.65  2.45  20.5   95     1.68        0.61                 0.5
174     3    13.40        3.91  2.48  23.0  102     1.80        0.75                 0.4
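Standardization and normalization can be sketched on two of the columns shown above. This is a minimal illustration using plain pandas arithmetic, which may differ from the tool used in the original practical; only the first five `Alcohol` and `Mg` values from the table are used, and the population standard deviation (`ddof=0`) is an assumption chosen to match the usual definition of standard scaling.

```python
import pandas as pd

# First five Alcohol and Mg values from the wine table above.
df = pd.DataFrame({
    "Alcohol": [14.23, 13.20, 13.16, 14.37, 13.24],
    "Mg": [127, 100, 101, 113, 118],
})

# Standardization: subtract the mean and divide by the standard deviation,
# so each column ends up with mean 0 and unit variance.
standardized = (df - df.mean()) / df.std(ddof=0)

# Normalization (min-max): rescale each column to the range [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())

print(standardized)
print(normalized)
```

Standardization suits methods that assume centred features, while min-max normalization is useful when a bounded range is required; both leave the ordering of values within a column unchanged.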
