Data Science Practical with Solutions: BSc CS Sem 6
BSc Computer Science (University of Mumbai)

This document outlines practical exercises for a Data Science course, including data manipulation using Pandas, feature scaling, hypothesis testing, ANOVA, regression techniques, and clustering methods. Each practical includes code snippets and explanations for tasks such as reading data from files, handling missing values, performing statistical tests, and implementing machine learning algorithms.

PRACTICAL NO:-01

Data Frames and Basic Data Pre-processing

A. Read data from CSV and JSON files into a data frame.

(1) # Read data from a csv file

import pandas as pd

df = pd.read_csv('Student_Marks.csv')

print("Our dataset ") print(df)

Output:-

(2) # Reading data from a JSON file

import pandas as pd

data = pd.read_json('dataset.json')

print(data)

Output:-
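
Note that pd.read_json infers the table layout from the JSON's structure, so the result depends on how the file is oriented. A minimal self-contained sketch, using an illustrative record-oriented JSON string (the records and the StringIO wrapper are not part of the original dataset.json):

# Hypothetical record-oriented JSON; newer pandas versions expect
# a path or file-like object, hence the StringIO wrapper
from io import StringIO
import pandas as pd

records = '[{"name": "Asha", "marks": 78}, {"name": "Ravi", "marks": 85}]'
df = pd.read_json(StringIO(records))
print(df)  # two rows, columns 'name' and 'marks'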


B. Perform basic data pre-processing tasks such as handling missing values and
outliers.

Code:-

(1) # Replacing NA values using fillna()

import pandas as pd

df = pd.read_csv('titanic.csv')

print(df.head(10))

print("Dataset after filling NA values with 0 : ")

df2=df.fillna(value=0)

print(df2)

Output:-


(2) # Dropping NA values using dropna()

import pandas as pd

df = pd.read_csv('titanic.csv')

print(df.head(10))

print("Dataset after dropping NA values: ")

df.dropna(inplace = True)

print(df)

Output:-
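
The title of this practical also mentions outliers, which the snippets above do not address. A minimal IQR-based sketch, assuming the same titanic.csv and its numeric 'Age' column:

import pandas as pd

df = pd.read_csv('titanic.csv')

# Values outside 1.5 * IQR of the quartiles are flagged as outliers
q1 = df['Age'].quantile(0.25)
q3 = df['Age'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print("Age outliers:", ((df['Age'] < lower) | (df['Age'] > upper)).sum())
# Keep only in-range rows (note: this also drops rows with missing Age)
df_no_outliers = df[(df['Age'] >= lower) & (df['Age'] <= upper)]
print(df_no_outliers.describe())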


C. Manipulate and transform data using functions like filtering, sorting, and grouping
Code:-
import pandas as pd
# Load iris dataset
iris = pd.read_csv('Iris.csv')

# Filtering data based on a condition
# (in the Kaggle Iris.csv the species label is 'Iris-setosa';
# adjust the string to match your file)
setosa = iris[iris['Species'] == 'Iris-setosa']
print("Setosa samples:")
print(setosa.head())

# Sorting data
sorted_iris = iris.sort_values(by='SepalLengthCm', ascending=False)
print("\nSorted iris dataset:")
print(sorted_iris.head())

# Grouping data
grouped_species = iris.groupby('Species').mean()
print("\nMean measurements for each species:")
print(grouped_species)

Output:-


PRACTICAL NO:-02

Feature Scaling and Dummification


A. Apply feature-scaling techniques like standardization and normalization to numerical
features.
Code:-

# Standardization and normalization


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
print("Original DataFrame:")
print(df)
scaling = MinMaxScaler()
scaled_value = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_value
print("\nDataframe after MinMax Scaling")
print(df)

# Note: the StandardScaler below is fit on the MinMax-scaled values,
# not on the original data
scaling = StandardScaler()
scaled_standardvalue = scaling.fit_transform(df[['Alcohol', 'Malic Acid']])
df[['Alcohol', 'Malic Acid']] = scaled_standardvalue
print("\nDataframe after Standard Scaling")
print(df)
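
A side effect of the sequence above is that the StandardScaler is fit on the already MinMax-scaled values rather than on the raw data. A minimal sketch that scales the original columns independently, under the same file and column assumptions:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.read_csv('wine.csv', header=None, usecols=[0, 1, 2], skiprows=1)
df.columns = ['classlabel', 'Alcohol', 'Malic Acid']
cols = ['Alcohol', 'Malic Acid']

# Work on copies so each scaler sees the original, unscaled values
minmax_df = df.copy()
standard_df = df.copy()
minmax_df[cols] = MinMaxScaler().fit_transform(df[cols])
standard_df[cols] = StandardScaler().fit_transform(df[cols])

print(minmax_df.head())
print(standard_df.head())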


Output:-


B. Perform feature dummification to convert categorical variables into numerical representations.

Code:
import pandas as pd
iris=pd.read_csv("Iris.csv")
print(iris)
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
iris['code']=le.fit_transform(iris.Species)
print(iris)
Output:-
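
Strictly speaking, LabelEncoder performs label encoding (one integer per category); dummification means one indicator column per category, which avoids implying an order between species. A minimal sketch with pandas' get_dummies, assuming the same Iris.csv:

import pandas as pd

iris = pd.read_csv('Iris.csv')

# One 0/1 indicator column per distinct species value
dummies = pd.get_dummies(iris['Species'], prefix='Species')
iris = pd.concat([iris, dummies], axis=1)
print(iris.head())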


PRACTICAL NO:-03

Hypothesis Testing
Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).

# t-test
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Generate two samples for demonstration purposes


np.random.seed(42)
sample1 = np.random.normal(loc=10, scale=2, size=30)
sample2 = np.random.normal(loc=12, scale=2, size=30)

# Perform a two-sample t-test


t_statistic, p_value = stats.ttest_ind(sample1, sample2)

# Set the significance level


alpha = 0.05

print("Results of Two-Sample t-test:")


print(f'T-statistic: {t_statistic}')

print(f'P-value: {p_value}')

print(f"Degrees of Freedom: {len(sample1) + len(sample2) - 2}")

# Plot the distributions

plt.figure(figsize=(10, 6))

plt.hist(sample1, alpha=0.5, label='Sample 1', color='blue')

plt.hist(sample2, alpha=0.5, label='Sample 2', color='orange')


plt.axvline(np.mean(sample1), color='blue', linestyle='dashed', linewidth=2)


plt.axvline(np.mean(sample2), color='orange', linestyle='dashed', linewidth=2)
plt.title('Distributions of Sample 1 and Sample 2')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.legend()

# Highlight the critical region if null hypothesis is rejected

if p_value < alpha:
    critical_region = np.linspace(min(sample1.min(), sample2.min()),
                                  max(sample1.max(), sample2.max()), 1000)
    plt.fill_between(critical_region, 0, 5, color='red', alpha=0.3,
                     label='Critical Region')
    plt.text(11, 5, f'T-statistic: {t_statistic:.2f}', ha='center', va='center',
             color='black', backgroundcolor='white')

# Show the plot

plt.show()

# Draw Conclusions

if p_value < alpha:
    if np.mean(sample1) > np.mean(sample2):
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 1 is significantly higher than that of Sample 2.")
    else:
        print("Conclusion: There is significant evidence to reject the null hypothesis.")
        print("Interpretation: The mean of Sample 2 is significantly higher than that of Sample 1.")
else:
    print("Conclusion: Fail to reject the null hypothesis.")
    print("Interpretation: There is not enough evidence to claim a significant difference between the means.")


Output:-


# chi-square test

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
from scipy import stats

warnings.filterwarnings('ignore')

df = sb.load_dataset('mpg')
print(df)
print(df['horsepower'].describe())
print(df['model_year'].describe())

# Bin horsepower into low / medium / high categories
bins = [0, 75, 150, 240]
df['horsepower_new'] = pd.cut(df['horsepower'], bins=bins, labels=['l', 'm', 'h'])
c = df['horsepower_new']
print(c)

# Bin model year into three periods
ybins = [69, 72, 74, 84]
label = ['t1', 't2', 't3']
df['modelyear_new'] = pd.cut(df['model_year'], bins=ybins, labels=label)
newyear = df['modelyear_new']
print(newyear)

# Contingency table of the two binned variables
df_chi = pd.crosstab(df['horsepower_new'], df['modelyear_new'])
print(df_chi)

print(stats.chi2_contingency(df_chi))
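
chi2_contingency returns the test statistic, the p-value, the degrees of freedom, and the expected frequencies; unpacking them makes the conclusion explicit. A short sketch continuing from the df_chi table above (the 0.05 threshold is the conventional choice, not from the original):

chi2, p, dof, expected = stats.chi2_contingency(df_chi)
print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
if p < 0.05:
    print("Reject the null hypothesis: the binned variables appear associated.")
else:
    print("Fail to reject the null hypothesis: no evidence of association.")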


Output:

PRACTICAL NO:-04

ANOVA (Analysis of Variance)

Perform one-way ANOVA to compare means across multiple groups.


Conduct post-hoc tests to identify significant differences between group means.

import pandas as pd

import scipy.stats as stats

from statsmodels.stats.multicomp import pairwise_tukeyhsd

group1 = [23, 25, 29, 34, 30]

group2 = [19, 20, 22, 24, 25]

group3 = [15, 18, 20, 21, 17]

group4 = [28, 24, 26, 30, 29]

all_data = group1 + group2 + group3 + group4
group_labels = (['Group1'] * len(group1) + ['Group2'] * len(group2) +
                ['Group3'] * len(group3) + ['Group4'] * len(group4))

f_statistics, p_value = stats.f_oneway(group1, group2, group3, group4)

print("one-way ANOVA:")

print("F-statistics:", f_statistics)

print("p-value", p_value)

tukey_results = pairwise_tukeyhsd(all_data, group_labels)

print("\nTukey-Kramer post-hoc test:")

print(tukey_results)
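
One-way ANOVA assumes roughly equal variances across groups, which is worth checking before trusting the F-test. A minimal sketch using Levene's test from the same scipy.stats module:

# Levene's test: null hypothesis is that all groups have equal variances
levene_stat, levene_p = stats.levene(group1, group2, group3, group4)
print("Levene statistic:", levene_stat)
print("p-value:", levene_p)
if levene_p < 0.05:
    print("Variances differ; interpret the ANOVA result with caution.")
else:
    print("No evidence of unequal variances.")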

Output:-


PRACTICAL NO:-05
Regression and its Types.

import numpy as np

import pandas as pd

from sklearn.datasets import fetch_california_housing

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()

housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)

print(housing_df)

housing_df['PRICE'] = housing.target

X = housing_df[['AveRooms']]

y = housing_df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)

print("R-squared:", r2)

print("Intercept:", model.intercept_)

print("CoefÏcient:", model.coef_)


# Multiple Linear Regression

X = housing_df.drop('PRICE',axis=1)

y = housing_df['PRICE']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

model = LinearRegression()

model.fit(X_train,y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test,y_pred)

r2 = r2_score(y_test,y_pred)

print("Mean Squared Error:",mse)

print("R-squared:",r2)

print("Intercept:",model.intercept_)

print("CoefÏcient:",model.coef_)

PRACTICAL NO:-06


Logistic Regression and Decision Tree

import numpy as np

import pandas as pd

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report

# Load the Iris dataset and create a binary classification problem

iris = load_iris()

iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

binary_df = iris_df[iris_df['target'] != 2]

X = binary_df.drop('target', axis=1)

y = binary_df['target']

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model and evaluate its performance

logistic_model = LogisticRegression()

logistic_model.fit(X_train, y_train)

y_pred_logistic = logistic_model.predict(X_test)

print("Logistic Regression Metrics")

print("Accuracy: ", accuracy_score(y_test, y_pred_logistic))

print("Precision:", precision_score(y_test, y_pred_logistic))

print("Recall: ", recall_score(y_test, y_pred_logistic))

print("\nClassification Report")


print(classification_report(y_test, y_pred_logistic))

# Train a decision tree model and evaluate its performance

decision_tree_model = DecisionTreeClassifier()

decision_tree_model.fit(X_train, y_train)

y_pred_tree = decision_tree_model.predict(X_test)

print("\nDecision Tree Metrics")

print("Accuracy: ", accuracy_score(y_test, y_pred_tree))

print("Precision:", precision_score(y_test, y_pred_tree))

print("Recall: ", recall_score(y_test, y_pred_tree))

print("\nClassification Report")

print(classification_report(y_test, y_pred_tree))
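
To inspect what the decision tree actually learned, sklearn can draw the fitted tree directly. An optional sketch using sklearn.tree.plot_tree on the model trained above (class names follow the iris convention that target 0 is setosa and 1 is versicolor):

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(10, 6))
plot_tree(decision_tree_model, feature_names=iris['feature_names'],
          class_names=['setosa', 'versicolor'], filled=True)
plt.show()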

Output:-

PRACTICAL NO:-07


K-Means clustering

import pandas as pd

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import KMeans

import matplotlib.pyplot as plt

data = pd.read_csv("C:\\Users\Reape\Downloads\wholesale\wholesale.csv") data.head()

categorical_features = ['Channel', 'Region']

continuous_features = ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicassen']

print(data[continuous_features].describe())

# One-hot encode the categorical columns before clustering
for col in categorical_features:
    dummies = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, dummies], axis=1)
    data.drop(col, axis=1, inplace=True)

print(data.head())

mms = MinMaxScaler()

mms.fit(data)

data_transformed = mms.transform(data)

sum_of_squared_distances = []

K = range(1, 15)

for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    sum_of_squared_distances.append(km.inertia_)

plt.plot(K, sum_of_squared_distances, 'bx-')

plt.xlabel('k')

plt.ylabel('sum_of_squared_distances')

Downloaded by Pawan Ramji Gupta ([email protected])


lOMoARcPSD|49569690

plt.title('Elbow Method for optimal k')

plt.show()
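
The elbow plot only suggests a value of k; the clustering itself still has to be fit. A short sketch assuming the elbow lands around k = 5 (read the actual value off your plot):

# Fit the final model with the chosen k and attach cluster labels
km = KMeans(n_clusters=5, n_init=10, random_state=42)
data['cluster'] = km.fit_predict(data_transformed)
print(data['cluster'].value_counts())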


PRACTICAL NO:-08

Principal Component Analysis (PCA)

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

from sklearn.preprocessing import StandardScaler

from sklearn.decomposition import PCA

iris = load_iris()

iris_df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])

X = iris_df.drop('target', axis=1)
y = iris_df['target']

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

pca = PCA()

X_pca = pca.fit_transform(X_scaled)

explained_variance_ratio = pca.explained_variance_ratio_

plt.figure(figsize=(8, 6))

plt.plot(np.cumsum(explained_variance_ratio), marker='o', linestyle='--')

plt.title('Explained Variance Ratio')

plt.xlabel('Number of Principal Components')

plt.ylabel('Cumulative Explained Variance Ratio')

plt.grid(True)

plt.show()


cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

n_components = np.argmax(cumulative_variance_ratio >= 0.95) + 1

print(f"Number of principal components to explain 95% variance: {n_components}")

pca = PCA(n_components=n_components)

X_reduced = pca.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap='viridis', s=50, alpha=0.5)

plt.title('Data in Reduced-dimensional Space')

plt.xlabel('Principal Component 1')

plt.ylabel('Principal Component 2')

plt.colorbar(label='Target')

plt.show()
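
Each principal component is a weighted combination of the original features; the weights (loadings) in pca.components_ show which measurements drive each axis. A small sketch continuing from the reduced PCA fitted above:

# Rows are components, columns are the original iris features
loadings = pd.DataFrame(pca.components_,
                        columns=iris['feature_names'],
                        index=[f'PC{i+1}' for i in range(pca.n_components_)])
print(loadings)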

Output:-


PRACTICAL NO:-09


Data Visualization and Storytelling

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns
import numpy as np

# Generate random data

np.random.seed(42) # Set a seed for reproducibility

# Create a DataFrame with random data

data = pd.DataFrame({

'variable1': np.random.normal(0, 1, 1000),

'variable2': np.random.normal(2, 2, 1000) + 0.5 * np.random.normal(0, 1, 1000),

'variable3': np.random.normal(-1, 1.5, 1000),

'category': pd.Series(np.random.choice(['A', 'B', 'C', 'D'], size=1000, p=[0.4, 0.3, 0.2, 0.1]),
dtype='category') })

# Create a scatter plot to visualize the relationship between two variables

plt.figure(figsize=(10, 6))

plt.scatter(data['variable1'], data['variable2'], alpha=0.5)

plt.title('Relationship between Variable 1 and Variable 2', fontsize=16)

plt.xlabel('Variable 1', fontsize=14)

plt.ylabel('Variable 2', fontsize=14)

plt.show()

# Create a bar chart to visualize the distribution of a categorical variable

plt.figure(figsize=(10, 6))
sns.countplot(x='category', data=data)

plt.title('Distribution of Categories', fontsize=16)
plt.xlabel('Category', fontsize=14)

plt.ylabel('Count', fontsize=14)

plt.xticks(rotation=45)

plt.show()


# Create a heatmap to visualize the correlation between numerical variables

plt.figure(figsize=(10, 8))
numerical_cols = ['variable1', 'variable2', 'variable3']


sns.heatmap(data[numerical_cols].corr(), annot=True, cmap='coolwarm')

plt.title('Correlation Heatmap', fontsize=16)

plt.show()

Output:-


# Data Storytelling


print("Title: Exploring the Relationship between Variable 1 and Variable 2")

print("\nThe scatter plot (Figure 1) shows the relationship between Variable 1 and
Variable 2. We can observe a positive correlation, indicating that as Variable 1 increases,
Variable 2 tends to increase as well. However, there is a considerable amount of scatter,
suggesting that other factors may influence this relationship.")

print("\nScatter Plot")

print("Figure 1: Scatter Plot of Variable 1 and Variable 2")

print("\nTo better understand the distribution of the categorical variable 'category', we


created a bar chart (Figure 2). The chart reveals that Category A has the highest
frequency, followed by Category B, Category C, and Category D. This information could
be useful for further analysis or decision-making processes.")

print("\nBar Chart")

print("Figure 2: Distribution of Categories")

print("\nAdditionally, we explored the correlation between numerical variables using a


heatmap (Figure 3). The heatmap shows that Variable 1 and Variable 2 have a strong
positive correlation, confirming the observation from the scatter plot. However, we can
also see that Variable 3 has a moderate negative correlation with both Variable 1 and
Variable 2, suggesting that it may have an opposing effect on the relationship between
the first two variables.")

print("\nHeatmap")

print("Figure 3: Correlation Heatmap")

print("\nIn summary, the visualizations and analysis provide insights into the
relationships between variables, the distribution of categories, and the correlations
between numerical variables. These findings can be used to inform further analysis,
decision-making, or to generate new hypotheses for investigation.")

Output:-
