ENGINEERING
BIRLA INSTITUTE OF TECHNOLOGY,
MESRA – 835 215
NAME: Harshit Vinayak
ROLL NO.: MT/AI/10011/24
BRANCH: AI & ML
COURSE: M. TECH
YEAR: 2024-2026
SUBJECT: DATA ANALYTICS LAB
import numpy as np
import pandas as pd
data = pd.read_csv('haberman.csv')
data
30 64 1 1.1
0 30 62 3 1
1 30 65 0 1
2 31 59 2 1
3 31 65 4 1
4 33 58 10 1
.. .. .. .. ...
300 75 62 1 1
301 76 67 0 1
302 77 65 3 1
303 78 65 1 2
304 83 58 2 2
print(data.columns)
plt.subplot(3, 3, 1)
sns.kdeplot(data['Age'], color='blue', fill=True, label='PDF')
plt.title('PDF of Age')
plt.xlabel('Age')
plt.subplot(3, 3, 2)
sns.ecdfplot(data['Age'], color='blue', label='CDF')
plt.title('CDF of Age')
plt.xlabel('Age')
plt.subplot(3, 3, 5)
sns.ecdfplot(data['Year_of_Treatment'], color='orange', label='CDF')
plt.title('CDF of Year of Treatment')
plt.xlabel('Year of Treatment')
# PDF and CDF for Lymph Nodes
plt.subplot(3, 3, 7)
sns.kdeplot(data['Lymph_Nodes'], color='green', fill=True,
label='PDF')
plt.title('PDF of Lymph Nodes')
plt.xlabel('Number of Lymph Nodes')
plt.subplot(3, 3, 8)
sns.ecdfplot(data['Lymph_Nodes'], color='green', label='CDF')
plt.title('CDF of Lymph Nodes')
plt.xlabel('Number of Lymph Nodes')
plt.tight_layout()
plt.show()
# Violin plots for Age, Year of Treatment, and Lymph Nodes grouped by Survival Status
plt.figure(figsize=(15, 10))
plt.subplot(3, 1, 1)
# hue repeats x so that palette applies without the seaborn FutureWarning
sns.violinplot(x='Survival_Status', y='Age', hue='Survival_Status', data=data,
               palette="muted", legend=False)
plt.title('Violin Plot: Age vs Survival Status')
plt.subplot(3, 1, 2)
sns.violinplot(x='Survival_Status', y='Year_of_Treatment', hue='Survival_Status', data=data,
               palette="muted", legend=False)
plt.title('Violin Plot: Year of Treatment vs Survival Status')
plt.subplot(3, 1, 3)
sns.violinplot(x='Survival_Status', y='Lymph_Nodes', hue='Survival_Status', data=data,
               palette="muted", legend=False)
plt.title('Violin Plot: Lymph Nodes vs Survival Status')
plt.tight_layout()
plt.show()
1. The age of patients ranges from 30 to 83 years. The distribution is slightly right-skewed, with the majority of patients aged 40–60.
2. The KDE plot confirms a peak around the 50s, indicating that most patients are middle-aged.
3. Patients who survived (status = 1) tend to have a slightly wider age range, but the median age for both survival statuses is similar.
4. Patients treated in later years (closer to 1969) have slightly better survival outcomes. Earlier treatment years are associated with more cases of non-survival.
5. Patients with fewer positive lymph nodes (closer to 0) have higher survival rates. Non-survivors tend to have significantly more positive lymph nodes.
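The observations above can be checked numerically by summarising each feature per survival status. The sketch below assumes the column names used in the plotting code (`Age`, `Year_of_Treatment`, `Lymph_Nodes`, `Survival_Status`); the six-row frame is a hypothetical stand-in for `haberman.csv`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the Haberman data (illustrative values only).
sample = pd.DataFrame({
    "Age": [30, 34, 45, 52, 61, 70],
    "Year_of_Treatment": [64, 60, 66, 59, 62, 58],
    "Lymph_Nodes": [1, 0, 2, 13, 9, 22],
    "Survival_Status": [1, 1, 1, 2, 2, 2],  # 1 = survived, 2 = did not survive
})

# Median and mean of every feature, split by survival status.
summary = sample.groupby("Survival_Status").agg(["median", "mean"])
print(summary)

# Median number of positive lymph nodes per group.
median_nodes = sample.groupby("Survival_Status")["Lymph_Nodes"].median()
print(median_nodes)
```

Running the same `groupby` on the real `data` frame would put numbers behind statements such as "non-survivors tend to have more positive lymph nodes".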
26/01/2025, 19:00 LabAssignment1_2
/var/folders/qd/gbss_8pd7wb7t_wz5wqdjsl40000gn/T/ipykernel_49080/1857958201.py:1: DtypeWarning: Columns (6,11,12,17) have mixed types. Specify dtype option on import or set low_memory=False.
df=pd.read_csv("/Users/user/Documents/BIT Mesra /M.tech/2nd Sem/DA LAB/ca_san_francisco_2020_04_01.csv")
In [146… df.head(100)
file:///Users/user/Downloads/LabAssignment1_2.html 1/14
Out[147… 19911540
In [148… missing_data=df.isnull().sum()
print(missing_data)
raw_row_number 0
date 0
time 35
location 43
lat 1697
lng 1697
district 52187
subject_age 58888
subject_race 0
subject_sex 0
type 2
arrest_made 1
citation_issued 2
warning_issued 0
outcome 15682
contraband_found 851689
search_conducted 0
search_vehicle 1
search_basis 851688
reason_for_stop 2212
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
In [149… dataTypes=df.dtypes
dataTypes
In [150… df_cleaned=df.dropna(subset=['time','location','type','arrest_made','cita
df_cleaned.isna().sum()
Out[150… raw_row_number 0
date 0
time 0
location 0
lat 1654
lng 1654
district 52181
subject_age 58881
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 15681
contraband_found 851610
search_conducted 0
search_vehicle 0
search_basis 851609
reason_for_stop 2212
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
In [151… df_cleaned.size
Out[151… 19909692
In [152… sum_age=df_cleaned['subject_age'].sum()
age_count=df_cleaned['subject_age'].notna().sum()
age_mean=sum_age/age_count
df_cleaned.loc[ : ,'subject_age']=df_cleaned['subject_age'].fillna(age_me
df_cleaned.isnull().sum()
Out[152… raw_row_number 0
date 0
time 0
location 0
lat 1654
lng 1654
district 52181
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 15681
contraband_found 851610
search_conducted 0
search_vehicle 0
search_basis 851609
reason_for_stop 2212
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
In [153… df_cleaned['subject_age'].head(100)
Out[153… 0 37.818918
1 37.818918
2 37.818918
3 37.818918
4 37.818918
...
96 37.818918
97 37.818918
98 37.818918
99 37.818918
100 37.818918
Name: subject_age, Length: 100, dtype: float64
In [154… value_counts_reason_stop=df_cleaned['reason_for_stop'].value_counts()
max_value=value_counts_reason_stop.idxmax()
max_value
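The value stored in `max_value` is the most frequent category of `reason_for_stop`. A minimal sketch of how such a mode can be used to fill missing entries follows; the small frame is hypothetical and the fill step itself is an assumption, since that cell is not visible in this export.

```python
import pandas as pd

# Hypothetical miniature of the categorical column (not the real data).
toy = pd.DataFrame({
    "reason_for_stop": [
        "MOVING VIOLATION",
        None,
        "MOVING VIOLATION",
        "MECHANICAL OR NON-MOVING VIOLATION (V.C.)",
        None,
    ],
})

# idxmax on value_counts returns the most frequent category (the mode);
# value_counts ignores missing values by default.
most_frequent = toy["reason_for_stop"].value_counts().idxmax()

# Replace missing entries with that category.
toy["reason_for_stop"] = toy["reason_for_stop"].fillna(most_frequent)

print(most_frequent)
print(toy["reason_for_stop"].isnull().sum())
```

Mode imputation keeps the column categorical, but it also inflates the share of the dominant category, which matters for columns with many missing values.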
In [156… df_cleaned.isnull().sum()
Out[156… raw_row_number 0
date 0
time 0
location 0
lat 1654
lng 1654
district 52181
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 15681
contraband_found 851610
search_conducted 0
search_vehicle 0
search_basis 851609
reason_for_stop 0
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
In [157… value_counts_search_basis=df_cleaned['search_basis'].value_counts()
max_value=value_counts_search_basis.idxmax()
max_value
df_cleaned.loc[:, 'search_basis'] = df_cleaned['search_basis'].fillna(max
df_cleaned.isnull().sum()
Out[157… raw_row_number 0
date 0
time 0
location 0
lat 1654
lng 1654
district 52181
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 15681
contraband_found 851610
search_conducted 0
search_vehicle 0
search_basis 0
reason_for_stop 0
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
lat lng
0 37.773004 -122.445873
1 37.780898 -122.468586
2 37.786919 -122.426718
3 37.746380 -122.392005
4 37.786348 -122.440003
In [159… value_counts_district=df_cleaned['district'].value_counts()
max_value=value_counts_district.idxmax()
max_value
df_cleaned.loc[:, 'district'] = df_cleaned['district'].fillna(max_value)
df_cleaned.isnull().sum()
Out[159… raw_row_number 0
date 0
time 0
location 0
lat 0
lng 0
district 0
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 15681
contraband_found 851610
search_conducted 0
search_vehicle 0
search_basis 0
reason_for_stop 0
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
In [160… value_counts_outcome=df_cleaned['outcome'].value_counts()
max_value=value_counts_outcome.idxmax()
max_value
df_cleaned.loc[:, 'outcome'] = df_cleaned['outcome'].fillna(max_value)
df_cleaned.isnull().sum()
Out[160… raw_row_number 0
date 0
time 0
location 0
lat 0
lng 0
district 0
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 0
contraband_found 851610
search_conducted 0
search_vehicle 0
search_basis 0
reason_for_stop 0
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
In [161… value_counts_outcome=df_cleaned['contraband_found'].value_counts()
max_value=value_counts_outcome.idxmax()
max_value
df_cleaned.loc[:, 'contraband_found'] = df_cleaned['contraband_found'].fi
df_cleaned.isnull().sum()
/var/folders/qd/gbss_8pd7wb7t_wz5wqdjsl40000gn/T/ipykernel_49080/2125443559.py:4: FutureWarning: Downcasting object dtype arrays on .fillna, .ffill, .bfill is deprecated and will change in a future version. Call result.infer_objects(copy=False) instead. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_cleaned.loc[:, 'contraband_found'] = df_cleaned['contraband_found'].fillna(max_value)
Out[161… raw_row_number 0
date 0
time 0
location 0
lat 0
lng 0
district 0
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 0
contraband_found 0
search_conducted 0
search_vehicle 0
search_basis 0
reason_for_stop 0
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
In [162… df_cleaned.isnull().sum()
Out[162… raw_row_number 0
date 0
time 0
location 0
lat 0
lng 0
district 0
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 0
contraband_found 0
search_conducted 0
search_vehicle 0
search_basis 0
reason_for_stop 0
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
Out[163… raw_row_number 0
date 0
time 0
location 0
lat 0
lng 0
district 0
subject_age 0
subject_race 0
subject_sex 0
type 0
arrest_made 0
citation_issued 0
warning_issued 0
outcome 0
contraband_found 0
search_conducted 0
search_vehicle 0
search_basis 0
reason_for_stop 0
raw_search_vehicle_description 0
raw_result_of_contact_description 0
dtype: int64
raw_row_number 904986
date 3469
time 1440
location 312994
lat 64380
lng 54321
district 13
subject_age 92
subject_race 5
subject_sex 2
type 1
arrest_made 2
citation_issued 2
warning_issued 2
outcome 3
contraband_found 2
search_conducted 2
search_vehicle 2
search_basis 3
reason_for_stop 26
raw_search_vehicle_description 36
raw_result_of_contact_description 28
dtype: int64
In [166… print(df_cleaned)
reason_for_stop \
0 MECHANICAL OR NON-MOVING VIOLATION (V.C.)
1 MECHANICAL OR NON-MOVING VIOLATION (V.C.)
2 MECHANICAL OR NON-MOVING VIOLATION (V.C.)
3 MECHANICAL OR NON-MOVING VIOLATION (V.C.)
4 MECHANICAL OR NON-MOVING VIOLATION (V.C.)
... ...
905065 MOVING VIOLATION
905066 MOVING VIOLATION
905067 MOVING VIOLATION
905068 MECHANICAL OR NON-MOVING VIOLATION (V.C.)
905069 MOVING VIOLATION
raw_search_vehicle_description raw_result_of_contact_description
0 NO SEARCH WARNING
1 NO SEARCH CITATION
2 NO SEARCH CITATION
3 NO SEARCH WARNING
4 NO SEARCH CITATION
... ... ...
905065 NO SEARCH WARNING
905066 NO SEARCH CITATION
905067 NO SEARCH CITATION
905068 NO SEARCH WARNING
905069 NO SEARCH CITATION
In [167… df_cleaned.head()
data_selected = df_cleaned[['subject_age']]
plt.figure(figsize=(10, 6))
sns.boxplot(data=data_selected)
plt.title('Boxplot for Age')
plt.show()
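In [169… Q1 = df_cleaned['subject_age'].quantile(0.25)
Q3 = df_cleaned['subject_age'].quantile(0.75)
IQR = Q3 - Q1
# Conventional 1.5 * IQR fences (assumed; the filtering lines are missing from this export)
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_no_outliers_IQR = df_cleaned[df_cleaned['subject_age'].between(lower_bound, upper_bound)]
df_no_outliers_IQR.head()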
In [112… df_cleaned.head(100)
In [115… df_cleaned.head(100)
DAI 10th feb-2.ipynb - Colab 22/04/25, 2:04 PM
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
https://colab.research.google.com/drive/1-gyxeTqLrtCjkruehj_bJHheyfmes_Mk?authuser=4 Page 1 of 3
plt.tight_layout()
plt.show()
[Figure: "Data after LDA" scatter plot; y-axis: LDA Component 2; legend: setosa, versicolor, virginica]
mtai1000124_lab3_lda.ipynb - Colab 22/04/25, 2:05 PM
3
y_pred_lda = clf_lda.predict(X_test)
lda_accuracy = accuracy_score(y_test, y_pred_lda)
print(f"LDA + Random Forest Accuracy: {lda_accuracy:.4f}")
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
y_pred_pca = clf_pca.predict(X_test_pca)
pca_accuracy = accuracy_score(y_test_pca, y_pred_pca)
components = [1, 2, 3]
accuracy_scores = []
for n in components:
    pca = PCA(n_components=n)
    X_pca = pca.fit_transform(X)
    y_pred_pca = clf_pca.predict(X_test_pca)
    accuracy = accuracy_score(y_test_pca, y_pred_pca)
    accuracy_scores.append(accuracy)
    print(f"Random Forest Accuracy using PCA ({n} components): {accuracy:.4f}")
https://colab.research.google.com/drive/1oyorelwwd-isKIFFyfN_pEQbng9rtaNH?authuser=4 Page 3 of 5
plt.figure(figsize=(8, 5))
plt.plot(components, accuracy_scores, marker='o', linestyle='-', color='b', label='
plt.xlabel("Number of PCA Components")
plt.ylabel("Accuracy")
plt.title("Random Forest Accuracy with Different PCA Components")
plt.xticks(components)
plt.legend()
plt.grid()
plt.show()
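The PCA cells above survive only as fragments: the train/test split and the classifier fit are not visible. The sketch below is one way the full sweep could look, under stated assumptions: the Iris data, the 70/30 stratified split, the seed, and `n_estimators=100` are all placeholders, and PCA is fitted on the training split only, which differs from the `pca.fit_transform(X)` call in the fragment.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Assumed split and seed; the notebook's own values are not visible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

components = [1, 2, 3]
accuracy_scores = []
for n in components:
    # Fit PCA on the training split only, then project both splits.
    pca = PCA(n_components=n)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    # Train a fresh Random Forest on the reduced features.
    clf_pca = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_pca.fit(X_train_pca, y_train)

    accuracy = accuracy_score(y_test, clf_pca.predict(X_test_pca))
    accuracy_scores.append(accuracy)
    print(f"Random Forest Accuracy using PCA ({n} components): {accuracy:.4f}")
```

Fitting PCA inside the loop on the training data keeps information from the test split out of the projection, so the reported accuracy is a fairer estimate.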
mtai1000124_lab4_dt.ipynb - Colab 22/04/25, 2:11 PM
MTAI1001124
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
import seaborn as sns
https://colab.research.google.com/drive/1K26cQ8LbOluN15LfmX9gM…lTILk?authuser=4#scrollTo=281e8b17-9e43-4ece-bec6-61ed545fec25 Page 1 of 6
df = pd.read_csv("CarPrice_Assignment.csv")
missing_values = df.isnull().sum()
df = df.dropna()
median_price = df['price'].median()
df['price_category'] = (df['price'] > median_price).astype(int) # 1 = high, 0 = lo
df = pd.get_dummies(df, drop_first=True)
X = df.drop(columns=['price_category'])
y = df['price_category']
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Create a heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=["Low Price
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix Heatmap")
plt.show()
Accuracy: 0.9756
Classification Report:
              precision    recall  f1-score   support
    accuracy                           0.98        41
   macro avg       0.97      0.98      0.98        41
weighted avg       0.98      0.98      0.98        41
plt.figure(figsize=(20, 10))
plot_tree(clf_tuned, feature_names=X.columns, class_names=["Low Price", "High Price
plt.show()
y_pred_tuned = clf_tuned.predict(X_test)
accuracy_tuned = accuracy_score(y_test, y_pred_tuned)
print(f"Tuned Model Accuracy: {accuracy_tuned:.4f}")
Sensitivity: 1.0000
Specificity: 0.9565
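Sensitivity and specificity follow directly from the four cells of a binary confusion matrix. The sketch below uses counts that are consistent with the reported figures for 41 test samples; they are inferred, not printed in this export, and treating "High Price" (label 1) as the positive class is also an assumption.

```python
# Counts inferred from the reported metrics (41 test samples); they are an
# assumption, because the confusion matrix itself is not printed in the text.
tn, fp, fn, tp = 22, 1, 0, 18

sensitivity = tp / (tp + fn)  # true positive rate (recall of the positive class)
specificity = tn / (tn + fp)  # true negative rate (recall of the negative class)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Accuracy: {accuracy:.4f}")
```

With scikit-learn, the same four counts can be read from a binary confusion matrix via `confusion_matrix(y_test, y_pred).ravel()`, which returns them in the order `tn, fp, fn, tp`.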
mtai1000124_lab5_linear_regression.ipynb - Colab 22/04/25, 2:15 PM
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import PolynomialFeatures
df = pd.read_csv('advertising.csv')
df
https://colab.research.google.com/drive/1CZRj6qmJjeM_aOf9TUU9DWgSRAmGcTDN?authuser=4 Page 1 of 9
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
y_pred = model.predict(X_test)
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, y_test, color='orange', alpha=0.6, label='Actual vs Predicted')
plt.plot([y_pred.min(), y_pred.max()], [y_test.min(), y_test.max()], 'b--', lw=2, l
plt.xlabel('Predicted Sales')
plt.ylabel('Actual Sales')
plt.title('Predicted vs Actual Sales')
plt.legend()
plt.show()
cv_scores_individual = {}
for feature in ['TV', 'Radio', 'Newspaper']:
    X_single = df[[feature]]
    scores = cross_val_score(LinearRegression(), X_single, y, cv=5, scoring='r2')
    cv_scores_individual[feature] = scores.mean()
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(df['TV'], df['Sales'], color='green', alpha=0.6)
plt.xlabel('TV Advertising')
plt.ylabel('Sales')
plt.title('TV vs Sales')
plt.subplot(1, 3, 2)
plt.scatter(df['Radio'], df['Sales'], color='black', alpha=0.6)
plt.xlabel('Radio Advertising')
plt.ylabel('Sales')
plt.title('Radio vs Sales')
plt.subplot(1, 3, 3)
plt.scatter(df['Newspaper'], df['Sales'], color='purple', alpha=0.6)
plt.xlabel('Newspaper Advertising')
plt.ylabel('Sales')
plt.title('Newspaper vs Sales')
plt.tight_layout()
plt.show()
results = {}
cv_results = {}
model = LinearRegression()
model.fit(X_train_poly, y_train)
y_train_pred = model.predict(X_train_poly)
y_test_pred = model.predict(X_test_poly)
results[degree] = {
    'Train_MSE': mse_train,
    'Test_MSE': mse_test,
    'Train_MAE': mae_train,
    'Test_MAE': mae_test,
    'Train_RMSE': rmse_train,
    'Test_RMSE': rmse_test
}
results_df = pd.DataFrame(results).T
cv_df = pd.DataFrame(cv_results).T
print("Error Metrics:")
print(results_df.round(3))
print("\nCross-Validation MSE:")
print(cv_df.round(3))
Error Metrics:
   Train_MSE  Test_MSE  Train_MAE  Test_MAE  Train_RMSE  Test_RMSE
1      2.676     2.908      1.234     1.275       1.636      1.705
2      1.908     1.443      1.057     0.903       1.381      1.201
3      1.698     1.812      0.985     0.935       1.303      1.346

Cross-Validation MSE:
   CV_MSE_Mean  CV_MSE_Std
1        2.842       1.061
2        1.994       0.769
3        2.186       0.846
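In the tables above, degree 2 gives both the lowest test MSE (1.443) and the lowest mean cross-validation MSE (1.994). The loop that produced these numbers is only partly visible, so the sketch below shows one way such a comparison can be assembled. The synthetic data, the 80/20 split, the seed, and the reduced set of metrics are assumptions; they stand in for `advertising.csv` and will not reproduce the figures above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the advertising data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
noise = rng.normal(0, 1, 200)
y = 3 + 0.05 * X[:, 0] + 0.1 * X[:, 1] + 0.001 * X[:, 0] * X[:, 1] + noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
cv_results = {}
for degree in [1, 2, 3]:
    # Expand the features to the requested polynomial degree.
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    y_test_pred = model.predict(X_test_poly)
    mse_train = mean_squared_error(y_train, model.predict(X_train_poly))
    mse_test = mean_squared_error(y_test, y_test_pred)

    results[degree] = {
        "Train_MSE": mse_train,
        "Test_MSE": mse_test,
        "Test_MAE": mean_absolute_error(y_test, y_test_pred),
        "Test_RMSE": np.sqrt(mse_test),
    }

    # scikit-learn reports negated MSE for this scorer, so flip the sign back.
    cv_mse = -cross_val_score(
        LinearRegression(), poly.fit_transform(X), y,
        cv=5, scoring="neg_mean_squared_error",
    )
    cv_results[degree] = {"CV_MSE_Mean": cv_mse.mean(), "CV_MSE_Std": cv_mse.std()}

# Pick the degree with the lowest mean cross-validation MSE.
best_degree = min(cv_results, key=lambda d: cv_results[d]["CV_MSE_Mean"])
print(best_degree)
```

Choosing the degree by cross-validation MSE rather than training MSE guards against picking a higher degree that only fits the training split better.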
plt.figure(figsize=(8, 5))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()
DAI 10march.ipynb - Colab 22/04/25, 2:22 PM
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 200 non-null int64
1 Gender 200 non-null object
2 Age 200 non-null int64
3 Annual Income (k$) 200 non-null int64
4 Spending Score (1-100) 200 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
https://colab.research.google.com/drive/1Nib4LrAI_OxUYLC6w_3PYhGRNunuP8uX?authuser=4 Page 1 of 15
sse = []
for i in range(1, 10):
    km = KMeans(n_clusters = i)
    km.fit(df[['Annual Income (k$)', 'Spending Score (1-100)']])
    sse.append(km.inertia_)
sse
[269981.28,
181363.59595959593,
106348.37306211119,
73679.78903948834,
44448.45544793371,
37233.814510710006,
30259.65720728547,
25028.02047526941,
21830.041978049438]
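The elbow can also be read off the printed SSE values without a plot. The sketch below compares each drop in SSE with the next one and picks the k where the improvement falls off most sharply. This ratio rule is one simple heuristic chosen for illustration, not a standard definition of the elbow, and because `KMeans` was run without a fixed `random_state`, the SSE values can vary between runs.

```python
# SSE values copied from the printed list above, for k = 1 .. 9.
sse = [269981.28, 181363.59595959593, 106348.37306211119,
       73679.78903948834, 44448.45544793371, 37233.814510710006,
       30259.65720728547, 25028.02047526941, 21830.041978049438]

# drops[i] is the SSE reduction gained by going from k = i + 1 to k = i + 2.
drops = [a - b for a, b in zip(sse, sse[1:])]

# Ratio of each drop to the following one; a large ratio marks a sharp bend.
ratios = [d / nxt for d, nxt in zip(drops, drops[1:])]

# ratios[i] compares the drop into k = i + 2 with the drop out of it.
elbow_k = ratios.index(max(ratios)) + 2
print(elbow_k)  # → 5
```

For these values the largest ratio occurs at k = 5, which agrees with the five clusters assumed later in the notebook.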
# Finding the optimal number of clusters using the Elbow Method (SSE)
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
silhouette_scores = []
silhouette_values = {}
for i in range(2, 11): # Silhouette Score is valid for clusters >= 2 as not valid
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    cluster_labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    silhouette_scores.append(silhouette_avg)
    silhouette_values[i] = silhouette_avg
# Applying K-means clustering with optimal clusters (from Elbow Method, assume 5)
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
df['Cluster'] = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
km = KMeans(n_clusters = 4)
predicted = km.fit_predict(df[['Annual Income (k$)', 'Spending Score (1-100)']])
predicted
array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
2, 3], dtype=int32)
df['Cluster'] = predicted
df
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  Cluster
0           1    Male   19                  15                      39        0
1           2    Male   21                  15                      81        1
2           3  Female   20                  16                       6        0
3           4  Female   23                  16                      77        1
4           5  Female   31                  17                      40        0
df1 = df[df.Cluster==0]
df2 = df[df.Cluster==1]
df3 = df[df.Cluster==2]
df4 = df[df.Cluster==3]
df5 = df[df.Cluster==4]
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],
color='purple',marker='*',label='centroid')
<matplotlib.legend.Legend at 0x175b5d1f0>
2) Fuzzy C means
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import skfuzzy as fuzz # Fuzzy C-Means library
display(df.head())
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
# Assigning cluster labels based on maximum membership
cluster_labels = np.argmax(u, axis=0)
df['Fuzzy Cluster'] = cluster_labels
# Print membership values for the first 10 customers
print("Fuzzy Membership Values for the first 10 customers:")
for i in range(10):
    print(f"Customer {i+1}: {u[:, i]}")
# Categorizing customers based on clusters
categories = {
    0: "Frugal",    # Low Income, Low Spending
    1: "Careless",  # Low Income, High Spending
    2: "Sensible",  # Moderate Income, Moderate Spending
    3: "Lavish",    # High Income, High Spending
    4: "Cautious"   # High Income, Low Spending
}
df['Category'] = df['Fuzzy Cluster'].map(categories)
# Marking centroids
plt.scatter(cntr[:, 0], cntr[:, 1], s=300, c='black', marker='X', label='Centroids'
plt.show()
    Annual Income (k$)  Spending Score (1-100)  Fuzzy Cluster  Category
0                   15                      39              4  Cautious
1                   15                      81              3    Lavish
2                   16                       6              4  Cautious
3                   16                      77              3    Lavish
4                   17                      40              4  Cautious
95                  60                      52              1  Careless
96                  60                      47              1  Careless
97                  60                      50              1  Careless
98                  61                      42              1  Careless
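In the table above, customers with an annual income of 15 k$ fall in fuzzy cluster 4, which the `categories` dictionary describes as high income. Cluster indices returned by fuzzy c-means are arbitrary, so a fixed index-to-name map can disagree with its own comments. One possible remedy, sketched here under assumptions, is to derive the names from the centroid coordinates. The function `name_clusters`, the median-based rule, and the example centroids are hypothetical; in the notebook the centroids would come from `cntr`.

```python
import numpy as np

def name_clusters(centers):
    """Map each cluster index to a label based on its centroid position.

    centers: array of shape (n_clusters, 2) holding (income, spending).
    The thresholds are the medians of the centroid coordinates (assumed rule).
    """
    centers = np.asarray(centers, dtype=float)
    income_mid = np.median(centers[:, 0])
    spend_mid = np.median(centers[:, 1])

    names = {}
    for idx, (income, spend) in enumerate(centers):
        low_income, high_income = income < income_mid, income > income_mid
        low_spend, high_spend = spend < spend_mid, spend > spend_mid
        if low_income and low_spend:
            names[idx] = "Frugal"
        elif low_income and high_spend:
            names[idx] = "Careless"
        elif high_income and high_spend:
            names[idx] = "Lavish"
        elif high_income and low_spend:
            names[idx] = "Cautious"
        else:
            names[idx] = "Sensible"  # centroid sits at a median coordinate
    return names

# Hypothetical centroids in arbitrary index order (income k$, spending score).
example_centers = [[55, 50], [87, 18], [26, 21], [86, 82], [26, 79]]
categories = name_clusters(example_centers)
print(categories)
```

The resulting dictionary can be passed to `df['Fuzzy Cluster'].map(...)` in place of the hand-written one, so the labels follow the centroids whatever index order the algorithm returns.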
DAI 17th march.ipynb - Colab 22/04/25, 2:31 PM
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial import distance_matrix
from sklearn.cluster import AgglomerativeClustering
# Load dataset
df = pd.read_csv('Mall_Customers.csv.csv')
display(df.head())
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40
# Hierarchical Clustering
num_clusters = 5 # Determined from the dendrogram
hc = AgglomerativeClustering(n_clusters=num_clusters, affinity='euclidean', linkage
y_hc = hc.fit_predict(X)
plt.legend()
plt.show()
/opt/anaconda3/lib/python3.12/site-packages/sklearn/cluster/_agglomerative.py:1
warnings.warn(
dendrogram(linked)
plt.title("Dendrogram for Hierarchical Clustering")
plt.xlabel("Customers")
plt.ylabel("Euclidean Distance")
plt.show()
Proximity Matrix:
[[ 0. 42. 33.01514804 ... 116.38728453 123.79418403
129.69194269]
[ 42. 0. 75.00666637 ... 111.22050171 137.3062271
122.01639234]
[ 33.01514804 75.00666637 0. ... 129.32130528 121.59358536
143.42245291]
...
[116.38728453 111.22050171 129.32130528 ... 0. 57.07013229
14.2126704 ]
[123.79418403 137.3062271 121.59358536 ... 57.07013229 0.
65. ]
[129.69194269 122.01639234 143.42245291 ... 14.2126704 65.
0. ]]
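The leading entries of the proximity matrix appear consistent with Euclidean distances between (annual income, spending score) pairs from the `head()` output. That feature choice is an inference from the numbers, not something the notebook states. A minimal NumPy sketch for the first three customers:

```python
import numpy as np

# First three customers from the head() output, taken as
# (annual income, spending score). The feature choice is an assumption.
points = np.array([[15, 39], [15, 81], [16, 6]], dtype=float)

# Broadcasting builds every pairwise difference: shape (3, 3, 2).
diff = points[:, None, :] - points[None, :, :]

# Euclidean distance = square root of the summed squared differences.
proximity = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(proximity, 8))
```

The result is symmetric with a zero diagonal, since the distance from a point to itself is zero and distance does not depend on direction.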
mtai1000124_lab8.ipynb - Colab 22/04/25, 2:36 PM
LAB 8
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.model_selection import train_test_split
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
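As a rough illustration of what standardization does, the sketch below reproduces the column-wise formula (x - mean) / std with plain NumPy on invented data. `X_demo` is hypothetical and is not the iris feature matrix.

```python
import numpy as np

# Hypothetical 3-sample, 2-feature matrix (not the iris data).
X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Column-wise mean and population standard deviation (ddof=0).
mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)

# Standardize: every feature ends up with mean 0 and unit variance.
X_std = (X_demo - mean) / std

print(np.round(X_std, 4))
```

Each column is rescaled using statistics computed over all samples, so features on different scales become comparable.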
normalizer_l1 = Normalizer(norm='l1')
X_train_l1 = normalizer_l1.fit_transform(X_train)
X_test_l1 = normalizer_l1.transform(X_test)
normalizer_l2 = Normalizer(norm='l2')
X_train_l2 = normalizer_l2.fit_transform(X_train)
X_test_l2 = normalizer_l2.transform(X_test)
print("Original Scaled Features (First 3 Samples):\n", X_train[:3])
print("\nL1 Normalized Features (First 3 Samples):\n", X_train_l1[:3])
print("\nL2 Normalized Features (First 3 Samples):\n", X_train_l2[:3])
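Unlike standardization, L1 and L2 normalization rescale each sample (row) on its own, so nothing is learned from the training set. A small NumPy sketch on made-up rows shows the difference between the two norms; `rows` is hypothetical data.

```python
import numpy as np

# Hypothetical samples; each row is normalized independently.
rows = np.array([[3.0, 4.0],
                 [1.0, 1.0]])

# L1: divide each row by the sum of its absolute values.
l1 = rows / np.abs(rows).sum(axis=1, keepdims=True)

# L2: divide each row by its Euclidean length.
l2 = rows / np.sqrt((rows ** 2).sum(axis=1, keepdims=True))

print("L1:\n", l1)
print("L2:\n", l2)
```

After L1 normalization the absolute values in each row sum to 1; after L2 normalization each row has unit Euclidean length.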
mtai1000124_lab9.ipynb - Colab 22/04/25, 2:39 PM
Lab9
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV
1
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
feature_names = diabetes.feature_names
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_sta
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
lasso = Lasso(alpha=1.0)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
2
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
# Compare coefficients
print("\nLasso Coefficients:\n", lasso.coef_)
print("\nRidge Coefficients:\n", ridge.coef_)
Lasso Coefficients:
[ 0. -0. 413.43184792 34.83051518 0.
0. -0. 0. 258.15289363 0. ]
Ridge Coefficients:
[ 45.36737726 -76.66608563 291.33883165 198.99581745 -0.53030959
-28.57704987 -144.51190505 119.26006559 230.22160832 112.14983004]
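The output above shows Lasso setting most coefficients to exactly zero while Ridge keeps all of them non-zero. One way to see why is the idealized case of orthonormal features, where the L1 penalty acts as a soft threshold and the L2 penalty as a uniform rescaling. The sketch below is an illustration under that assumption only; the helper names, the toy coefficients, and the penalty values are hypothetical and do not map one-to-one onto scikit-learn's `alpha`.

```python
import numpy as np

def soft_threshold(beta, t):
    """L1-style shrinkage: move toward zero by t, clipping at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def ridge_shrink(beta, lam):
    """L2-style shrinkage: scale every coefficient by 1 / (1 + lam)."""
    return beta / (1.0 + lam)

# Hypothetical unpenalized coefficients (not the diabetes fit).
ols = np.array([3.0, -0.5, 0.2, -2.0])

lasso_like = soft_threshold(ols, 1.0)
ridge_like = ridge_shrink(ols, 1.0)

print("lasso-like:", lasso_like)
print("ridge-like:", ridge_like)
```

Small coefficients fall below the threshold and vanish under the L1-style rule, whereas the L2-style rule only makes every coefficient smaller.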
3
best_lasso = lasso_cv.best_estimator_
y_pred_best_lasso = best_lasso.predict(X_test)
best_ridge = ridge_cv.best_estimator_
y_pred_best_ridge = best_ridge.predict(X_test)
results = ridge_cv.cv_results_
plt.plot(alphas, results['mean_test_score'])
plt.xlabel("Alpha")
plt.ylabel("Cross-validated R²")
plt.title("Ridge Alpha vs R²")
plt.grid(True)
plt.show()
4
models = {
"Linear Regression": (lr, y_pred_lr),
"Lasso (best alpha)": (best_lasso, y_pred_best_lasso),
"Ridge (best alpha)": (best_ridge, y_pred_best_ridge)
}
Linear Regression
R² score: 0.4526027629719197
RMSE: 53.853445836765914
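For reference, the two reported metrics can be computed directly from their definitions. The helper `r2_and_rmse` and the toy values below are hypothetical, used only to make the formulas concrete.

```python
import numpy as np

def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / y_true.size)
    return r2, rmse

# Toy targets and predictions, invented for illustration.
r2, rmse = r2_and_rmse([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(f"R² score: {r2:.3f}")
print(f"RMSE: {rmse:.3f}")
```

R² is unitless and compares the model against always predicting the mean, while RMSE is expressed in the units of the target.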
2/24/25, 12:16 PM MTAI1001124(harshit.2) - Jupyter Notebook
In [2]: df.head()
Out[2]:
   BOROUGH  NEIGHBORHOOD  BUILDING CLASS CATEGORY  TAX CLASS AT PRESENT   BLOCK   LOT  EASEMENT  BUILDING CLASS AT PRESENT
0        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  2907.0  24.0       NaN  A
1        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3030.0  69.0       NaN  A
2        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3046.0  10.0       NaN  A
3        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3046.0  27.0       NaN  A
4        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3046.0  40.0       NaN  A
5 rows × 21 columns
2/24/25, 12:15 PM MTAI1001124(Harshit)
In [2]:
import pandas as pd
Out[2]:
   BOROUGH  NEIGHBORHOOD  BUILDING CLASS CATEGORY  TAX CLASS AT PRESENT   BLOCK   LOT  EASEMENT  BUILDING CLASS AT PRESENT  AD
0        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  2907.0  24.0       NaN                         A1  409 A
1        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3030.0  69.0       NaN                         A1  444 A
2        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3046.0  10.0       NaN                         A1  WASHIN A
3        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3046.0  27.0       NaN                         A1  WASHIN A
4        2      BATHGATE  01 ONE FAMILY DWELLINGS                     1  3046.0  40.0       NaN                         A1  BAT A
5 rows × 21 columns
In [3]:
# Trim whitespace from column names
df.columns = df.columns.str.strip()
# Convert 'SALE PRICE', 'LAND SQUARE FEET', and 'GROSS SQUARE FEET' to numeric, hand
df['SALE PRICE'] = pd.to_numeric(df['SALE PRICE'].str.replace(',', ''), errors='coer
df['LAND SQUARE FEET'] = pd.to_numeric(df['LAND SQUARE FEET'].str.replace(',', ''),
df['GROSS SQUARE FEET'] = pd.to_numeric(df['GROSS SQUARE FEET'].str.replace(',', '')
# Remove sale price outliers: Transactions with SALE PRICE = 0 are likely invalid
df = df[df['SALE PRICE'] > 0]
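The cleaning step above relies on `errors='coerce'` turning unparseable strings into `NaN`, which the `> 0` filter then drops along with zero-priced rows. A minimal sketch on invented values (the strings in `raw` are hypothetical, not rows from the sales file):

```python
import pandas as pd

# Hypothetical raw price strings, including a placeholder and a zero.
raw = pd.Series(["1,250,000", " - ", "0", "475000"])

# Strip thousands separators, then coerce anything unparseable to NaN.
prices = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# NaN > 0 evaluates to False, so placeholders and zero prices both drop out.
valid = prices[prices > 0]

print(prices)
print(valid)
```

Passing `regex=False` treats the comma as a literal character, which avoids any dependence on the default regex behaviour of `str.replace`.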
# Display the mean sale price per ZIP code
print(mean_sale_price_by_zip)
ZIP CODE
10451.0 1.501295e+06
10452.0 1.250472e+06
10453.0 1.721677e+06
10454.0 2.554382e+06
10455.0 8.571722e+05
10456.0 9.594280e+05
10457.0 1.703850e+06
10458.0 1.346083e+06
10459.0 9.214199e+05
10460.0 1.122764e+06
10461.0 1.015786e+06
10462.0 5.525803e+05
10463.0 1.726541e+06
10464.0 7.024356e+05
10465.0 7.157930e+05
10466.0 6.505253e+05
10467.0 8.504955e+05
10468.0 1.300199e+06
10469.0 8.046570e+05
10470.0 8.942567e+05
10471.0 1.279964e+06
10472.0 8.675365e+05
10473.0 9.937852e+05
10474.0 2.406104e+06
10475.0 5.628368e+05
Name: SALE PRICE, dtype: float64
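The cell that builds `mean_sale_price_by_zip` is not shown in the export. A result with this shape (index named `ZIP CODE`, series named `SALE PRICE`) would typically come from a group-by followed by a mean, sketched here on a hypothetical three-row frame. This is an assumed reconstruction, not the notebook's actual code.

```python
import pandas as pd

# Hypothetical sales rows; the column names follow the printed output.
sales = pd.DataFrame({
    "ZIP CODE": [10451.0, 10451.0, 10452.0],
    "SALE PRICE": [100000.0, 300000.0, 250000.0],
})

# Group rows by ZIP code and average the sale price within each group.
mean_by_zip = sales.groupby("ZIP CODE")["SALE PRICE"].mean()

print(mean_by_zip)
```

Because the mean is sensitive to a few very large sales, comparing it with the per-ZIP median can be a useful sanity check.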
In [5]:
# Plot mean sale price per ZIP code
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
mean_sale_price_by_zip.plot(kind='bar', color='skyblue', edgecolor='black')
plt.xlabel("ZIP Code")
plt.ylabel("Mean Sale Price ($)")
plt.title("Mean Sale Price by ZIP Code")
plt.xticks(rotation=90) # Rotate labels for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
SET B manager salary
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import pandas as pd
df.head(10)
Timestamp How old are you? What industry do you work in?
\
0 4/27/2021 11:02:10 25-34 Education (Higher Education)
Job title \
0 Research and Instruction Librarian
1 Change & Internal Communications Manager
2 Marketing Specialist
3 Program Manager
4 Accounting Manager
5 Scholarly Publishing Librarian
6 Publishing Assistant
7 Librarian
8 Systems Analyst
9 Senior Accountant
If your job title needs additional context, please clarify here: \
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 High school, FT
8 Data developer/ETL Developer
9 NaN
1 54,600
2 34,000
3 62,000
4 60,000
5 62,000
6 33,000
7 50,000
8 112,000
9 45,000
1 4000.0
2 NaN
3 3000.0
4 7000.0
5 NaN
6 2000.0
7 NaN
8 10000.0
9 0.0
If you're in the U.S., what state do you work in? What city do you
work in? \
0 Massachusetts
Boston
1 NaN
Cambridge
2 Tennessee
Chattanooga
3 Wisconsin
Milwaukee
4 South Carolina
Greenville
5 New Hampshire
Hanover
6 South Carolina
Columbia
7 Arizona
Yuma
8 Missouri
St. Louis
9 Florida
Palm Coast
1 8 - 10 years
2 2 - 4 years
3 8 - 10 years
4 8 - 10 years
5 8 - 10 years
6 2 - 4 years
7 5-7 years
8 21 - 30 years
9 21 - 30 years
1 5-7 years
2 2 - 4 years
3 5-7 years
4 5-7 years
5 2 - 4 years
6 2 - 4 years
7 5-7 years
8 21 - 30 years
9 21 - 30 years
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:5039:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
Data Preprocessing
data
C:\Users\BIT~1.L5-\AppData\Local\Temp/ipykernel_8184/3036456837.py:2:
FutureWarning: The default value of regex will change from True to
False in a future version. In addition, single character regular
expressions will *not* be treated as literal strings when regex=True.
data['salary'] = data['salary'].str.replace(',',
'').str.replace('$', '').str.strip()
C:\Users\BIT~1.L5-\AppData\Local\Temp/ipykernel_8184/3036456837.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
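Both warnings above can usually be avoided: taking an explicit `.copy()` of the filtered frame addresses the `SettingWithCopyWarning`, and passing `regex=False` makes `'$'` a literal character, which addresses the `FutureWarning`. The sketch below assumes the subset was produced by filtering on job title; that filter and the toy rows are hypothetical, since the original filtering cell is not shown.

```python
import pandas as pd

# Hypothetical survey rows; the real filter used in the lab is not shown.
survey = pd.DataFrame({
    "job": ["Program Manager", "Librarian", "Accounting Manager"],
    "salary": ["$62,000", "50,000", " 60,000 "],
})

# .copy() yields an independent frame, so later assignments are unambiguous.
managers = survey[survey["job"].str.contains("Manager")].copy()

# regex=False treats ',' and '$' literally instead of as regex patterns.
managers["salary"] = (
    managers["salary"]
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .str.strip()
    .astype(float)
)

print(managers)
```

The original `survey` frame is left untouched, which makes it easier to re-run the cleaning cell without side effects.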
C:\Users\BIT~1.L5-\AppData\Local\Temp/ipykernel_8184/4290375551.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
data['age'] = data['age'].apply(convert_experience)
data['exp_field'] = data['exp_field'].apply(convert_experience)
data['exp_overall'] = data['exp_overall'].apply(convert_experience)
C:\Users\BIT~1.L5-\AppData\Local\Temp/ipykernel_8184/3789075842.py:16:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
C:\ProgramData\Anaconda3\lib\site-packages\pandas\util\
_decorators.py:311: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\
backend_agg.py:240: RuntimeWarning: Glyph 128202 missing from current
font.
font.set_text(s, 0.0, flags=flags)
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\
backend_agg.py:203: RuntimeWarning: Glyph 128202 missing from current
font.
font.set_text(s, 0, flags=flags)
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\
backend_agg.py:240: RuntimeWarning: Glyph 127891 missing from current
font.
font.set_text(s, 0.0, flags=flags)
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\
backend_agg.py:203: RuntimeWarning: Glyph 127891 missing from current
font.
font.set_text(s, 0, flags=flags)
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\
backend_agg.py:240: RuntimeWarning: Glyph 129489 missing from current
font.
font.set_text(s, 0.0, flags=flags)
C:\ProgramData\Anaconda3\lib\site-packages\matplotlib\backends\
backend_agg.py:203: RuntimeWarning: Glyph 129489 missing from current
font.
font.set_text(s, 0, flags=flags)
df.isnull().sum()
Timestamp    0
How old are you?    0
What industry do you work in?    72
Job title    0
If your job title needs additional context, please clarify here:    20708
What is your annual salary? (You'll indicate the currency in a later question. If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.)    0
How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Please only include monetary compensation here, not the value of benefits.    7253
Please indicate the currency    0
If "Other," please indicate the currency here:    27743
If your income needs additional context, please provide it here:    24906
What country do you work in?    0
If you're in the U.S., what state do you work in?    4981
What city do you work in?    75
How many years of professional work experience do you have overall?    0
How many years of professional work experience do you have in your field?    0
What is your highest level of education completed?    213
What is your gender?    166
What is your race? (Choose all that apply.)    168
dtype: int64
PREDICTION
#One-Hot Encode categorical columns
encoded_data = pd.get_dummies(data, columns=['industry', 'country',
'education'], drop_first=True)
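To make the effect of `drop_first=True` concrete, here is a small sketch on an invented frame. The `toy` data is hypothetical; only the `pd.get_dummies` call mirrors the notebook.

```python
import pandas as pd

# Hypothetical rows with one categorical column and one numeric column.
toy = pd.DataFrame({
    "education": ["College", "Masters", "PhD", "College"],
    "salary": [50000, 62000, 90000, 45000],
})

# drop_first=True omits the first category ("College" here), which becomes
# the implicit baseline and avoids perfectly collinear dummy columns.
encoded = pd.get_dummies(toy, columns=["education"], drop_first=True)

print(encoded)
```

A row with all dummy columns equal to zero therefore represents the dropped baseline category, which matters when interpreting linear regression coefficients.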
#Train-Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)
LinearRegression()
#Make predictions
y_pred = model.predict(X_test)
y_pred