
DELHI TECHNOLOGICAL UNIVERSITY

IT-205
PRACTICAL FILE

DATA SCIENCE AND VISUALIZATION

Submitted By: Vansh, 23/IT/169 (Group-2), 3rd Semester
Submitted To: Dr Abhishek Verma

Delhi Technological University
INDEX

S.No | Experiment | Date | Sign
1 | Familiarize with Python software (Fibonacci numbers, sorting a list of numbers) | 22-08-2024 |
2 | Write a program to load a dataset from the UCI repository into Python workspace and print its dimensions. Also, load the target or class variable and print its dimensions. | 29-08-2024 |
3 | Write a program to clean the data by removing noisy data or outliers and solving missing value problems. | 05-09-2024 |
4 | Write a program to explore different data visualisation techniques. | 12-09-2024 |
5 | Write a program to perform statistical analysis of the data in a given dataset (mean, variance, standard deviation, median, mode). | 19-09-2024 |
6 | Write a program to perform a classification experiment on a dataset and its target or class variable (Naïve Bayes, Random Forest). | 26-09-2024 |
7 | Write a program to perform a regression experiment on a dataset (linear regression). | 03-10-2024 |
8 | Write a program to perform a clustering experiment on a dataset (K-means, Hierarchical agglomerative clustering). | 24-10-2024 |
9 | Write a program to perform time series analysis for a given dataset. | 24-10-2024 |
10 | Write a program to perform association rule mining for a given dataset. | 31-10-2024 |



PROGRAM 1
Objective: To familiarize with Python software (Fibonacci numbers, sorting a list of
numbers)

Code:
def fibonacci(n):
    # Generate the first n Fibonacci numbers.
    fib_sequence = [0, 1]
    while len(fib_sequence) < n:
        fib_sequence.append(fib_sequence[-1] + fib_sequence[-2])
    return fib_sequence[:n]  # Slicing handles n < 2 correctly.

n = 10
print(f"First {n} Fibonacci numbers: {fibonacci(n)}")

def bubble_sort(arr):
    # Repeatedly swap adjacent out-of-order elements; the largest
    # remaining value "bubbles" to the end of the list on each pass.
    n = len(arr)
    for i in range(n):
        for j in range(0, n - i - 1):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return arr

arr = [64, 34, 25, 12, 22, 11, 90]
sorted_arr = bubble_sort(arr)
print("Sorted array:", sorted_arr)

Output:



PROGRAM 2
Objective: Write a program to load a dataset from the UCI repository into the Python workspace and print its dimensions. Also, load the target or class variable and print its dimensions.

Code:
from ucimlrepo import fetch_ucirepo

# Fetch the Rice (Cammeo and Osmancik) dataset by its UCI repository id.
rice_dataset = fetch_ucirepo(id=545)

# Features and targets are returned as pandas DataFrames.
data = rice_dataset.data.features
target = rice_dataset.data.targets

print("Data dimensions:", data.shape)
print("Target dimensions:", target.shape)

Output:



PROGRAM 3
Objective: Write a program to clean the data by removing noisy data or outliers and solving missing value problems.

Code:
import pandas as pd
from ucimlrepo import fetch_ucirepo

rice_cammeo_and_osmancik = fetch_ucirepo(id=545)
X = rice_cammeo_and_osmancik.data.features
y = rice_cammeo_and_osmancik.data.targets

df = X.copy()
df['Target'] = y

# Fill missing values in the feature columns with each column's median.
df = df.fillna(df.drop(columns=['Target']).median())

numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

def remove_outliers_iqr(df, numeric_columns):
    # Keep only rows whose numeric values fall within 1.5 * IQR of the quartiles.
    Q1 = df[numeric_columns].quantile(0.25)
    Q3 = df[numeric_columns].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df_cleaned = df[~((df[numeric_columns] < lower_bound) |
                      (df[numeric_columns] > upper_bound)).any(axis=1)]
    return df_cleaned

df_cleaned = remove_outliers_iqr(df, numeric_columns)

print(f"\nOriginal dataset size: {df.shape[0]}")
print(f"Cleaned dataset size: {df_cleaned.shape[0]}")
Output:



PROGRAM 4
Objective: Write a program to explore different data visualization techniques.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Iris dataset directly from the UCI repository.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
dataset = pd.read_csv(url, names=column_names)

# Scatter plot: relationship between two features, colored by class.
plt.figure(figsize=(8, 6))
sns.scatterplot(x='sepal_length', y='sepal_width', hue='class', data=dataset)
plt.title("Scatter Plot of Sepal Length vs Sepal Width")
plt.show()

# Histogram: distribution of a single feature.
plt.figure(figsize=(8, 6))
dataset['sepal_length'].hist(bins=20)
plt.title("Histogram of Sepal Length")
plt.xlabel("Sepal Length")
plt.ylabel("Frequency")
plt.show()

# Box plot: per-class spread and outliers of a feature.
plt.figure(figsize=(8, 6))
sns.boxplot(x='class', y='sepal_length', data=dataset)
plt.title("Box Plot of Sepal Length by Class")
plt.show()

# Pair plot: pairwise feature relationships across all classes.
sns.pairplot(dataset, hue='class')
plt.show()
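A correlation heatmap is another technique worth exploring; a minimal sketch assuming the dataset DataFrame from above:

# Heatmap of pairwise feature correlations (class column excluded).
plt.figure(figsize=(6, 5))
sns.heatmap(dataset.drop(columns=['class']).corr(), annot=True, cmap='coolwarm')
plt.title("Feature Correlation Heatmap")
plt.show()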
Output:

PROGRAM 5
Objective: Write a program to perform statistical analysis of the data in a given dataset
(mean, variance, standard deviation, median, mode).
Code:
import pandas as pd
from ucimlrepo import fetch_ucirepo

rice_cammeo_and_osmancik = fetch_ucirepo(id=545)
X = rice_cammeo_and_osmancik.data.features
y = rice_cammeo_and_osmancik.data.targets

df = pd.DataFrame(X)
df['Target'] = y

def statistical_analysis(dataframe, feature):
    # Collect the basic descriptive statistics for one feature column.
    analysis = {}
    analysis['Mean'] = dataframe[feature].mean()
    analysis['Variance'] = dataframe[feature].var()
    analysis['Standard Deviation'] = dataframe[feature].std()
    analysis['Median'] = dataframe[feature].median()
    analysis['Mode'] = dataframe[feature].mode()[0]
    return analysis

feature_to_analyze = 'Area'
stats = statistical_analysis(df, feature_to_analyze)
print(f'Statistical Analysis for {feature_to_analyze}:')
for stat, value in stats.items():
    print(f'{stat}: {value}')
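For a quick cross-check, pandas computes most of these in one call; a short sketch assuming the df and feature_to_analyze from above (note that describe() does not report mode or variance):

# One-line summary of all numeric columns, plus the same stats for one feature.
print(df.describe())
print(df[feature_to_analyze].agg(['mean', 'var', 'std', 'median']))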
Output:



PROGRAM 6
Objective: Write a program to perform a classification experiment on a dataset and its target or class variable (Naïve Bayes, Random Forest).
Code:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_iris
import numpy as np

def load_data():
    data = load_iris()
    X = data.data
    y = data.target
    return X, y

def naive_bayes_classification(X, y):
    # Gaussian Naive Bayes evaluated with 5-fold cross-validation.
    nb_model = GaussianNB()
    scores = cross_val_score(nb_model, X, y, cv=5)

    print("\nNaive Bayes Classifier Results (5-Fold CV):")
    print(f"Mean Accuracy: {np.mean(scores)}")
    print("Accuracy per Fold:", scores)

    # Hold-out split for a detailed report on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    nb_model.fit(X_train, y_train)
    y_pred = nb_model.predict(X_test)

    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

def random_forest_classification(X, y):
    # Initialize the Random Forest classifier
    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(rf_model, X, y, cv=5)  # 5-fold cross-validation

    print("\nRandom Forest Classifier Results (5-Fold CV):")
    print(f"Mean Accuracy: {np.mean(scores)}")
    print("Accuracy per Fold:", scores)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    rf_model.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)

    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

def perform_classification_experiment():
    X, y = load_data()
    naive_bayes_classification(X, y)
    random_forest_classification(X, y)

perform_classification_experiment()
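As a follow-up, Random Forests expose per-feature importances; a minimal standalone sketch (variable names here are illustrative), refitting on the full iris data:

# Per-feature importances from a Random Forest fitted on all iris samples.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(iris.data, iris.target)
for name, importance in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {importance:.3f}")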

Output:



PROGRAM 7

Objective: Write a program to perform a regression experiment on a dataset (linear regression).


Code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.datasets import fetch_california_housing

# Load the California Housing regression dataset.
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R^2 Score:", r2)
Output:



PROGRAM 8

Objective: Write a program to perform a clustering experiment on a dataset (K-means, Hierarchical agglomerative clustering).

Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage

# Load and standardize the iris features (clustering is distance-based).
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K-means with 3 clusters (n_init set explicitly for consistent behaviour
# across scikit-learn versions).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Hierarchical agglomerative clustering with Ward linkage.
hac = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
hac_labels = hac.fit_predict(X_scaled)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=kmeans_labels, cmap='viridis', marker='o')
plt.title("K-means Clustering")
plt.xlabel(data.feature_names[0])
plt.ylabel(data.feature_names[1])

plt.subplot(1, 2, 2)
Z = linkage(X_scaled, method='ward')
dendrogram(Z)
plt.title("Hierarchical Agglomerative Clustering (Dendrogram)")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
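To compare the two clusterings quantitatively, the silhouette score is a common choice; a minimal sketch assuming X_scaled, kmeans_labels, and hac_labels from above:

# Silhouette score ranges from -1 to 1; higher means tighter, better-separated clusters.
from sklearn.metrics import silhouette_score

print("K-means silhouette:", silhouette_score(X_scaled, kmeans_labels))
print("HAC silhouette:", silhouette_score(X_scaled, hac_labels))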
Output:



PROGRAM 9

Objective: Write a program to perform time series analysis for a given dataset.
Code:
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Step 1: Load the Household Power Consumption dataset from the UCI repository
# (infer_datetime_format is omitted; it is deprecated in recent pandas).
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip'
data = pd.read_csv(url, sep=';', parse_dates={'DateTime': ['Date', 'Time']},
                   na_values=['?'], low_memory=False)

# Convert to a datetime index and clean the data
data.set_index('DateTime', inplace=True)
data = data[['Global_active_power']].astype(float)
data.dropna(inplace=True)

# Resample to daily frequency; interpolate any fully missing days, since
# decomposition and forecasting below fail on NaN values
data = data.resample('D').mean()
data = data.interpolate()

# Step 2: Plot the time series data
plt.figure(figsize=(12, 6))
plt.plot(data, label="Global Active Power")
plt.title("Global Active Power Consumption")
plt.xlabel("Date")
plt.ylabel("Power (kW)")
plt.legend()
plt.show()

# Step 3: Decompose the time series to observe trend and seasonality
decomposition = seasonal_decompose(data, model='additive', period=365)  # assumes yearly seasonality
decomposition.plot()
plt.show()

# Step 4: Forecast using Exponential Smoothing
model = ExponentialSmoothing(data, trend="add", seasonal="add", seasonal_periods=365)
model_fit = model.fit()

# Forecast the next 30 days
forecast = model_fit.forecast(steps=30)

# Plot the forecasted values
plt.figure(figsize=(12, 6))
plt.plot(data, label="Historical Data")
plt.plot(forecast, label="Forecast", color="red")
plt.title("Forecast of Global Active Power Consumption")
plt.xlabel("Date")
plt.ylabel("Power (kW)")
plt.legend()
plt.show()
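A common preliminary step in time series analysis is a stationarity check with the augmented Dickey-Fuller test; a minimal sketch assuming the daily data from above:

# ADF test: a p-value below 0.05 suggests the series is stationary.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(data['Global_active_power'].dropna())
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")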



Output:



PROGRAM 10

Objective: Write a program to perform association rule mining for a given dataset.
Code:
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Sample transactional dataset in one-hot form (or replace with your own dataset)
data = {'Milk':   [1, 1, 0, 1, 0],
        'Bread':  [1, 0, 1, 1, 1],
        'Butter': [0, 1, 1, 0, 1],
        'Cheese': [1, 0, 1, 1, 0],
        'Eggs':   [0, 1, 1, 0, 1]}

# Convert the dictionary into a DataFrame (boolean dtype avoids a mlxtend warning)
df = pd.DataFrame(data).astype(bool)

# Step 1: Generate frequent itemsets using the FP-Growth algorithm
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)

# Step 2: Generate association rules filtered by lift
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)

# Display the results
print("Frequent Itemsets using FP-Growth:")
print(frequent_itemsets)
print("\nAssociation Rules:")
print(rules)
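Real transaction data usually arrives as item lists rather than a one-hot table; mlxtend's TransactionEncoder handles the conversion. A minimal sketch with hypothetical transactions:

# Convert hypothetical raw transactions to the one-hot format fpgrowth expects.
from mlxtend.preprocessing import TransactionEncoder

transactions = [['Milk', 'Bread', 'Cheese'],
                ['Milk', 'Butter', 'Eggs'],
                ['Bread', 'Butter', 'Cheese', 'Eggs']]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
print(onehot)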
Output:
