
SMT. SUSHILADEVI DESHMUKH COLLEGE OF ARTS,
SCIENCE AND COMMERCE, Airoli, Sector-4, Navi Mumbai-400708

Date:

CERTIFICATE

This is to certify that Mr. Omkar Mahesh Kashid, seat no. 13, of TYBSC CS
Semester VI has completed the practical work in the subject of "Data Science"
during the academic year 2024-25 under the guidance of Asst. Prof. Dnyaneshwar
Deore, being the partial requirement for the fulfilment of the curriculum of the
Degree of Bachelor of Computer Science, University of Mumbai.

Signature of Internal Guide          HOD

Signature of External                Principal

College Seal
INDEX

Sr. No. Name of Practicals Date Signature

1 Introduction to Excel 06/01/2025

2 Data Frames and Basic Data Pre-processing 20/01/2025

3 Feature Scaling and Dummification 27/01/2025

4 Hypothesis Testing 01/02/2025

5 ANOVA (Analysis of Variance) 03/02/2025

6 Regression and Its Types 10/02/2025

7 Logistic Regression and Decision Tree 17/02/2025

8 K-Means Clustering 21/02/2025

9 Principal Component Analysis (PCA) 24/02/2025

10 Data Visualization and Storytelling 01/03/2025

PRACTICAL 01:
Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.

Steps:
Step 1: Go to Conditional Formatting > Highlight Cells Rules > Greater Than.

Step 2: Enter the threshold value, for example 2000, and click OK.

Step 3: In Conditional Formatting, go to Data Bars > Solid Fill.

B. Create a pivot table to analyse and summarize data.
Steps:
Step 1: Select the entire table and go to the Insert tab > PivotChart > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window.

Step 3: Drag the required fields into the Filters, Legend, Axis, and Values boxes below.

C. Use the VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
Step 1: Click on an empty cell and type the following formula (arguments: lookup value, table range, column index, approximate match).
=VLOOKUP(B3, B3:D3, 1, TRUE)

D. Perform what-if analysis using Goal Seek to determine input values for a desired output.
Steps:
Step 1: In the Data tab, go to What-If Analysis > Goal Seek.

Step 2: Fill in the Set cell, To value, and By changing cell fields and click OK.

PRACTICAL 02:
Data Frames and Basic Data Pre-processing
A.2.1 Read data from a CSV file into a pandas DataFrame

import pandas as pd
df = pd.read_csv('Student_Marks.csv')
print("Our dataset ")
print(df)

A.2.2 Read data from a JSON file into a pandas DataFrame
import pandas as pd
data = pd.read_json('dataset.json')
print(data)

B.2.1 Perform basic data pre-processing tasks such as handling missing values and outliers
import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
df.head(10)
print("Dataset after filling NA values with 0: ")
df2 = df.fillna(value=0)
print(df2)

B.2.2 Dropping NA values using dropna


import pandas as pd

df = pd.read_csv('titanic.csv')
print(df)
df.head(10)
print("Dataset after dropping NA values: ")
df.dropna(inplace=True)
print(df)
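The B heading above also mentions outliers, which the fillna and dropna snippets do not address. A minimal sketch using the interquartile-range (IQR) rule is given below; it assumes titanic.csv contains a numeric 'Fare' column.

import pandas as pd

df = pd.read_csv('titanic.csv')

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
q1 = df['Fare'].quantile(0.25)
q3 = df['Fare'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Rows before removing 'Fare' outliers:", len(df))
df_no_outliers = df[(df['Fare'] >= lower) & (df['Fare'] <= upper)]
print("Rows after removing 'Fare' outliers:", len(df_no_outliers))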

C. Manipulate and transform data using functions like filtering, sorting, and grouping

import pandas as pd
iris = pd.read_csv('Iris.csv')

setosa = iris[iris['Species'] == 'setosa']


print("Setosa samples:")

print(setosa.head())

sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)


print("\nSorted iris dataset:")
print(sorted_iris.head())

grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)

PRACTICAL 03:
Feature Scaling and Dummification
A. Apply feature-scaling techniques like standardization and
normalization to numerical features.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']

print("Original DataFrame:")
print(df)

# Min-Max scaling on a copy so the original values stay available for standardization
df_minmax = df.copy()
df_minmax[['Alcohol', 'Malic Acid']] = MinMaxScaler().fit_transform(df[['Alcohol', 'Malic Acid']])

print("\nDataFrame after Min-Max Scaling:")
print(df_minmax)

# Standardization (zero mean, unit variance) applied to the original values
df_standard = df.copy()
df_standard[['Alcohol', 'Malic Acid']] = StandardScaler().fit_transform(df[['Alcohol', 'Malic Acid']])
print("\nDataFrame after Standard Scaling:")
print(df_standard)
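matplotlib is imported above but never used; a small optional sketch comparing the two scaled versions side by side (using the df_minmax and df_standard frames created above) could be:

# Scatter plots of the min-max scaled and standardized features, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(df_minmax['Alcohol'], df_minmax['Malic Acid'])
axes[0].set_title('Min-Max Scaled')
axes[1].scatter(df_standard['Alcohol'], df_standard['Malic Acid'])
axes[1].set_title('Standardized')
for ax in axes:
    ax.set_xlabel('Alcohol')
    ax.set_ylabel('Malic Acid')
plt.tight_layout()
plt.show()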

B. Perform feature Dummification to convert categorical variables into numerical
representations.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
iris = pd.read_csv("Iris.csv")
print(iris)
le = LabelEncoder()
iris['code'] = le.fit_transform(iris['Species'])
print("\nDataset after Label Encoding:")
print(iris)
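Label encoding maps each species to an arbitrary integer. Dummification in the strict sense means one-hot encoding; a minimal sketch using pandas get_dummies on the same Iris dataset (column names assumed as above) is:

dummies = pd.get_dummies(iris['Species'], prefix='Species')
iris_dummified = pd.concat([iris, dummies], axis=1)
print("\nDataset after one-hot encoding (dummification):")
print(iris_dummified.head())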

PRACTICAL 04:
Hypothesis Testing
A. Two-Sample t-Test

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)

t_statistic, p_value = stats.ttest_ind(sample1, sample2)

alpha = 0.05

print("Results of Two-Sample t-test:")


print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')
print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")

plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')

plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)


plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)

plt.title('Distributions of Sample 1 and Sample 2')


plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()),
                                  max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3, label='Critical Region')

plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center', color='black',
         backgroundcolor='white')

plt.show()

if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 1 is significantly higher than that of Sample 2.")
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 2 is significantly higher than that of Sample 1.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")

B. Chi-Square Test

import pandas as pd
import numpy as np
import seaborn as sb
import warnings
from scipy import stats

warnings.filterwarnings('ignore')
df = sb.load_dataset('mpg')

print(df)
print(df['horsepower'].describe())
print(df['model_year'].describe())

bins = [0, 75, 150, 240]


df['horsepower_new'] = pd.cut(df['horsepower'], bins=bins, labels=['l', 'm', 'h'])
c = df['horsepower_new']
print(c)

ybins = [69, 72, 74, 84]


label = ['t1', 't2', 't3']
df['modelyear_new'] = pd.cut(df['model_year'], bins=ybins, labels=label)
newyear = df['modelyear_new']
print(newyear)

df_chi = pd.crosstab(df['horsepower_new'], df['modelyear_new'])


print(df_chi)

chi2, p, dof, expected = stats.chi2_contingency(df_chi)
print(f"Chi-square statistic: {chi2}, p-value: {p}, degrees of freedom: {dof}")
if p < 0.05:
    print("There is sufficient evidence to reject the null hypothesis, indicating a significant association between 'horsepower_new' and 'modelyear_new'.")
else:
    print("Fail to reject the null hypothesis: no significant association between 'horsepower_new' and 'modelyear_new'.")

PRACTICAL 05: ANOVA (Analysis of Variance)

import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Sample data for four groups


group1 = [23, 25, 29, 34, 30]
group2 = [19, 20, 22, 24, 25]
group3 = [15, 18, 20, 21, 17]
group4 = [28, 24, 26, 30, 29]

# Combine data into a DataFrame


data = pd.DataFrame({'value': group1 + group2 + group3 + group4,
'group': ['Group1'] * len(group1) + ['Group2'] * len(group2) +
['Group3'] * len(group3) + ['Group4'] * len(group4)})

# Perform one-way ANOVA


f_statistics, p_value = stats.f_oneway(group1, group2, group3, group4)

print("One-way ANOVA:")
print("F-statistic:", f_statistics)
print("P-value:", p_value)

# Perform Tukey-Kramer post-hoc test


tukey_results = pairwise_tukeyhsd(data['value'], data['group'])

print("\nTukey-Kramer post-hoc test:")


print(tukey_results)
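The F-statistic and p-value are printed but not interpreted. A short addition comparing the p-value with a conventional 0.05 significance level (an assumed threshold) could read:

alpha = 0.05  # conventional significance level (assumption)
if p_value < alpha:
    print("\nReject the null hypothesis: at least one group mean differs significantly.")
else:
    print("\nFail to reject the null hypothesis: no significant difference between group means.")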

PRACTICAL 06:
Regression and its types
A. Linear regression

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)

housing_df['PRICE'] = housing.target
print(housing_df)

X = housing_df[['AveRooms']]
y = housing_df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))
r2 = r2_score(y_test, model.predict(X_test))

print("Mean Squared Error:", mse)


print("R-squared:", r2)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

B. Multiple Linear Regression

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)

housing_df['PRICE'] = housing.target
print(housing_df)

X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

mse = mean_squared_error(y_test, model.predict(X_test))


r2 = r2_score(y_test, model.predict(X_test))

print("Mean Squared Error:", mse)


print("R-squared:", r2)

print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)

PRACTICAL 07:
Logistic Regression and Decision Tree

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
columns=iris['feature_names'] + ['target'])

binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)

print("Logistic Regression Metrics")


print("Accuracy: ", accuracy_score(y_test, y_pred_logistic))

print("Precision:", precision_score(y_test, y_pred_logistic))
print("Recall: ", recall_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))

decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
y_pred_tree = decision_tree_model.predict(X_test)

print("\nDecision Tree Metrics")


print("Accuracy: ", accuracy_score(y_test, y_pred_tree))
print("Precision:", precision_score(y_test, y_pred_tree))
print("Recall: ", recall_score(y_test, y_pred_tree))
print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))
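Confusion matrices are a common companion to accuracy, precision, and recall; a minimal sketch for both models (using sklearn.metrics.confusion_matrix) is:

from sklearn.metrics import confusion_matrix

print("\nConfusion Matrix (Logistic Regression):")
print(confusion_matrix(y_test, y_pred_logistic))
print("\nConfusion Matrix (Decision Tree):")
print(confusion_matrix(y_test, y_pred_tree))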

PRACTICAL 08: K-Means Clustering

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

data = pd.read_csv("wholesale.csv")
data.head()

categorical_features = ['Channel', 'Region']


continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']

data[continuous_features].describe()

for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)

data.head()

mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)

sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')


plt.xlabel('Number of Clusters (k)')
plt.ylabel('Sum of Squared Distances')
plt.title('Elbow Method for Optimal k')
plt.show()
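The elbow plot only suggests a value of k; the clusters themselves are not assigned above. A minimal sketch that fits KMeans with an assumed k (replace it with the elbow point observed in the plot) and attaches the labels to the data:

k = 5  # assumed value; use the elbow point read from the plot above
km = KMeans(n_clusters=k, random_state=42)
labels = km.fit_predict(data_transformed)
data['Cluster'] = labels
print(data['Cluster'].value_counts())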

PRACTICAL 09:
Principal Component Analysis (PCA)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
print(X)
print(y)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
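The explained-variance values computed above are never displayed; printing them shows how much information the leading components retain:

print("Explained variance ratio per component:", explained_variance_ratio)
print("Cumulative explained variance:", cumulative_variance)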

n_components = 2
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', edgecolor='k',
s=100)
plt.colorbar(scatter, label='Target Label')
plt.title('Data in Reduced-Dimensional Space (2 Components)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()

PRACTICAL 10:
Data Visualization and Storytelling

pip install plotly

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
data = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Region': ['North', 'North', 'North', 'South', 'South', 'South', 'East', 'East', 'East', 'West', 'West', 'West'],
    'Month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr'],
    'Sales': [120, 150, 100, 200, 180, 130, 250, 220, 140, 300, 260, 180],
}
df = pd.DataFrame(data)
print(df)

plt.figure(figsize=(8, 5))
product_sales = df.groupby('Product')['Sales'].sum()
sns.barplot(x=product_sales.index, y=product_sales.values)
plt.title('Total Sales by Product')
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.show()

monthly_sales = df.groupby('Month')['Sales'].sum().reindex(['Jan', 'Feb', 'Mar', 'Apr'])
plt.figure(figsize=(8, 5))
sns.lineplot(x=monthly_sales.index, y=monthly_sales.values, marker='o')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()

heatmap_data = df.pivot_table(index='Region', columns='Product', values='Sales',
aggfunc='sum')
plt.figure(figsize=(8, 5))
sns.heatmap(heatmap_data, annot=True, cmap='Blues', fmt='.0f')
plt.title('Sales by Region and Product')
plt.show()

fig = px.bar(df, x='Region', y='Sales', color='Product', title='Sales by Region and Product')


fig.show()
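As an optional storytelling addition, an interactive monthly trend could also be built with Plotly Express (a sketch assuming the same df as above and a recent Plotly version that supports the markers argument):

monthly = df.groupby('Month', sort=False)['Sales'].sum().reset_index()
fig = px.line(monthly, x='Month', y='Sales', markers=True,
              title='Interactive Monthly Sales Trend')
fig.show()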

