SUSHILADEVI
SCIENCE AND COMMERCE, Airoli, Sector-4, Navi Mumbai-400708

Date:

CERTIFICATE

This is to certify that Mr. Omkar Mahesh Kashid, seat no. 13 of TYBSC CS Semester VI, has completed the practical work in the subject of "Data Science" during the academic year 2024-25 under the guidance of Asst. Prof. Dnyaneshwar Deore, being the partial requirement for the fulfilment of the curriculum of the Degree of Bachelor of Computer Science, University of Mumbai.
PRACTICAL 01:
Introduction to Excel
A. Perform conditional formatting on a dataset using various criteria.
Steps:
Step 1: Go to Home > Conditional Formatting > Highlight Cells Rules > Greater Than.
Step 2: Enter the threshold value, for example 2000, and click OK.
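Note: the same "greater than" rule can be reproduced in pandas for comparison; a minimal sketch (the Sales column and threshold are illustrative):

import pandas as pd

# Illustrative data; any numeric column works the same way
df = pd.DataFrame({'Sales': [1500, 2500, 1800, 3200]})

# Highlight cells greater than 2000, mirroring the Excel rule above
styled = df.style.applymap(lambda v: 'background-color: yellow' if v > 2000 else '')
styled.to_html('highlighted.html')  # open the file in a browser to view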
B. Create a pivot table to analyse and summarize data.
Steps:
Step 1: Select the entire table and go to Insert > PivotChart > PivotChart.
Step 2: Select "New Worksheet" in the Create PivotChart window and click OK.
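For reference, the same summary can be produced in pandas with pivot_table; a short sketch with illustrative columns:

import pandas as pd

# Illustrative sales data standing in for the Excel sheet
df = pd.DataFrame({'Region': ['North', 'South', 'North', 'South'],
                   'Sales': [200, 150, 300, 250]})

# Total Sales per Region, as the PivotTable summarizes it
print(pd.pivot_table(df, index='Region', values='Sales', aggfunc='sum'))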
C. Use the VLOOKUP function to retrieve information from a different worksheet or table.
Steps:
Step 1: Click on an empty cell and type the following formula:
=VLOOKUP(B3, B3:D3, 1, TRUE)
Step 2: Fill in the information in the window accordingly and click OK.
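VLOOKUP searches the first column of the table range for the lookup value and returns the entry from the given column index (TRUE allows an approximate match). The same lookup is a left join in pandas; a minimal sketch with hypothetical tables:

import pandas as pd

# Hypothetical lookup: attach product names to order IDs
orders = pd.DataFrame({'ID': [101, 102, 103]})
products = pd.DataFrame({'ID': [101, 102, 103], 'Name': ['Pen', 'Book', 'Bag']})
print(orders.merge(products, on='ID', how='left'))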
PRACTICAL 02:
Data Frames and Basic Data Pre-processing
A. Read data from CSV and JSON files into a DataFrame.
A.1 Reading a CSV file:
import pandas as pd
df = pd.read_csv('Student_Marks.csv')
print("Our dataset ")
print(df)
A.2 Reading a JSON file:
import pandas as pd
data = pd.read_json('dataset.json')
print(data)
B. Handle missing values in the dataset.
df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))
print("Dataset after filling NA values with 0: ")
df2 = df.fillna(value=0)
print(df2)
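Filling every column with 0 is a blunt default; a common refinement is a per-column fill (sketch, assuming the standard Titanic 'Age' column):

# Fill missing ages with the column mean instead of a blanket 0
df3 = df.fillna({'Age': df['Age'].mean()})
print(df3.head(10))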
df = pd.read_csv('titanic.csv')
print(df)
print(df.head(10))
print("Dataset after dropping NA values: ")
df.dropna(inplace=True)
print(df)
C. Manipulate and transform data using functions like filtering, sorting, and grouping.
import pandas as pd
iris = pd.read_csv('Iris.csv')
# Filter rows for a single species (value 'Iris-setosa' as in the standard Iris.csv)
setosa = iris[iris['Species'] == 'Iris-setosa']
print(setosa.head())
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)
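The task heading also mentions sorting; a short sketch, continuing from the loaded iris DataFrame (column name 'SepalLengthCm' assumed from the standard Kaggle Iris.csv):

# Sort by sepal length, largest first
print(iris.sort_values(by='SepalLengthCm', ascending=False).head())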
PRACTICAL 03:
Feature Scaling and Dummification
A. Apply feature-scaling techniques like standardization and
normalization to numerical features.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)
# Normalization: rescale each feature to the [0, 1] range
scaling = MinMaxScaler()
df_minmax = df.copy()
df_minmax[['Alcohol', 'Malic Acid']] = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
print("\nDataFrame after Min-Max Scaling:")
print(df_minmax)
# Standardization: zero mean and unit variance, applied to the original values
scaling = StandardScaler()
df_standard = df.copy()
df_standard[['Alcohol', 'Malic Acid']] = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
print("\nDataFrame after Standard Scaling:")
print(df_standard)
B. Perform feature dummification to convert categorical variables into numerical representations.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
iris = pd.read_csv("Iris.csv")
print(iris)
le = LabelEncoder()
iris['code'] = le.fit_transform(iris['Species'])
print("\nDataset after Label Encoding:")
print(iris)
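LabelEncoder assigns arbitrary ordinal codes; for true dummification (one indicator column per category) pandas provides get_dummies. A minimal sketch on the same DataFrame:

# One-hot encode Species into separate 0/1 indicator columns
dummies = pd.get_dummies(iris['Species'], prefix='Species')
print(pd.concat([iris, dummies], axis=1).head())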
PRACTICAL 04:
A. Hypothesis Testing
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)
t_statistic, p_value = stats.ttest_ind(sample1, sample2)
alpha = 0.05
print("t-statistic:", t_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis: the sample means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
plt.figure(figsize=(10, 6))
plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')
plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')
plt.legend()
plt.title('Histograms of Sample 1 and Sample 2')
plt.show()
B. Chi-Square Test
import pandas as pd
import numpy as np
import seaborn as sb
import warnings
from scipy import stats
warnings.filterwarnings('ignore')
df = sb.load_dataset('mpg')
print(df)
print(df['horsepower'].describe())
print(df['model_year'].describe())
# Bin the continuous columns into categories (bin edges chosen for illustration)
df['horsepower_new'] = pd.cut(df['horsepower'], bins=[45, 100, 160, 230], labels=['low', 'medium', 'high'])
df['modelyear_new'] = pd.cut(df['model_year'], bins=[69, 74, 79, 84], labels=['early', 'mid', 'late'])
# Contingency table for the chi-square test
df_chi = pd.crosstab(df['horsepower_new'], df['modelyear_new'])
print(df_chi)
print(stats.chi2_contingency(df_chi))
print("There is sufficient evidence to reject the null hypothesis, indicating that there is a
significant association between 'horsepower_new' and 'modelyear_new' categories.")
PRACTICAL 05: ANOVA (Analysis of Variance)
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
print("One-way ANOVA:")
print("F-statistic:", f_statistics)
print("P-value:", p_value)
PRACTICAL 06:
Regression and its types
A. Linear regression
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df['PRICE'] = housing.target
print(housing_df)
X = housing_df[['AveRooms']]
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
r2 = r2_score(y_test, model.predict(X_test))
print("Mean Squared Error:", mse)
print("R-squared:", r2)
B. Multiple Linear Regression
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
housing_df['PRICE'] = housing.target
print(housing_df)
X = housing_df.drop('PRICE', axis=1)
y = housing_df['PRICE']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_)
PRACTICAL 07:
Logistic Regression and Decision Tree
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
iris = load_iris()
iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
columns=iris['feature_names'] + ['target'])
binary_df = iris_df[iris_df['target'] != 2]
X = binary_df.drop('target', axis=1)
y = binary_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred_logistic = logistic_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred_logistic))
print("Precision:", precision_score(y_test, y_pred_logistic))
print("Recall: ", recall_score(y_test, y_pred_logistic))
print("\nClassification Report")
print(classification_report(y_test, y_pred_logistic))
decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
y_pred_tree = decision_tree_model.predict(X_test)
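The tree's predictions are computed above but its evaluation printout did not survive extraction; a sketch mirroring the logistic regression metrics:

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))
print("\nClassification Report")
print(classification_report(y_test, y_pred_tree))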
PRACTICAL 08: K-Means Clustering
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
data = pd.read_csv("wholesale.csv")
print(data.head())
# Continuous spend columns (assuming the UCI Wholesale customers dataset layout)
continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']
print(data[continuous_features].describe())
mms = MinMaxScaler()
mms.fit(data)
data_transformed = mms.transform(data)
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)
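The loop above collects the inertia for each k; the elbow plot that normally follows is missing from the listing. A minimal sketch:

# Elbow plot: the bend suggests a good number of clusters
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method for Optimal k')
plt.show()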
PRACTICAL 09:
Principal Component Analysis (PCA)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
print(X)
print(y)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)
print("Explained variance ratio:", explained_variance_ratio)
print("Cumulative variance:", cumulative_variance)
n_components = 2
pca = PCA(n_components=n_components)
X_reduced = pca.fit_transform(X_scaled)
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', edgecolor='k', s=100)
plt.colorbar(scatter, label='Target Label')
plt.title('Data in Reduced-Dimensional Space (2 Components)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
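Instead of fixing n_components at 2, scikit-learn can pick the smallest number of components that reaches a target variance; a short sketch:

# Keep enough components to explain 95% of the variance
pca_95 = PCA(n_components=0.95)
X_95 = pca_95.fit_transform(X_scaled)
print("Components needed for 95% variance:", pca_95.n_components_)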
PRACTICAL 10:
Data Visualization and Storytelling
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
data = {
    'Product': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Region': ['North', 'North', 'North', 'South', 'South', 'South', 'East', 'East', 'East', 'West', 'West', 'West'],
    'Month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Mar', 'Mar', 'Mar', 'Apr', 'Apr', 'Apr'],
    'Sales': [120, 150, 100, 200, 180, 130, 250, 220, 140, 300, 260, 180],
}
df = pd.DataFrame(data)
print(df)
plt.figure(figsize=(8, 5))
product_sales = df.groupby('Product')['Sales'].sum()
sns.barplot(x=product_sales.index, y=product_sales.values)
plt.title('Total Sales by Product')
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.show()
monthly_sales = df.groupby('Month')['Sales'].sum().reindex(['Jan', 'Feb', 'Mar', 'Apr'])
plt.figure(figsize=(8, 5))
sns.lineplot(x=monthly_sales.index, y=monthly_sales.values, marker='o')
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
heatmap_data = df.pivot_table(index='Region', columns='Product', values='Sales', aggfunc='sum')
plt.figure(figsize=(8, 5))
sns.heatmap(heatmap_data, annot=True, cmap='Blues', fmt='.0f')
plt.title('Sales by Region and Product')
plt.show()
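plotly.express is imported at the top of this practical but no interactive chart survives in the listing; a minimal sketch reusing the same df:

# Interactive grouped bar chart of sales by product and region
fig = px.bar(df, x='Product', y='Sales', color='Region', barmode='group',
             title='Sales by Product and Region')
fig.show()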