DS Problem Statements and Codes
# Turn categorical variables into quantitative variables using one-hot encoding
df_encoded = pd.get_dummies(df)
# Print the updated data frame with one-hot encoded categorical variables
print(df_encoded)
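As a tiny self-contained illustration of what get_dummies does, here is a sketch on a made-up DataFrame (the 'color' column and its values are hypothetical):
import pandas as pd
# Hypothetical toy DataFrame with one categorical column, 'color'
toy_df = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})
print(pd.get_dummies(toy_df))
# Produces one indicator column per category: color_blue, color_green, color_red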
# Both scalers come from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler, RobustScaler
# StandardScaler: standardize features to zero mean and unit variance
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df_imputed.columns)
# RobustScaler: scale using the median and interquartile range, less sensitive to outliers
scaler = RobustScaler()
df_robust_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df_imputed.columns)
# Print the results
print("Dataset after dropping rows with missing values:")
print(df_dropped)
print("Dataset after imputing missing values:")
print(df_imputed)
print("Dataset after removing outliers (Z-Score method):")
print(df_no_outliers_zscore)
print("Dataset after removing outliers (Tukey's fences method):")
print(df_no_outliers_tukey)
print("Dataset after applying z-score data normalization (StandardScaler):")
print(df_standardized)
print("Dataset after applying z-score data normalization (RobustScaler):")
print(df_robust_scaled)
6. Perform the following operations using Python on the given dataset.
• Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any 2 suitable techniques to deal with them.
• Identify outliers if any using any 2 techniques
• Apply min max data normalization technique
Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import zscore
scaler = MinMaxScaler()
df_minmax_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df_imputed.columns)
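The fragment above applies MinMaxScaler to df_imputed, which is not defined in this excerpt. A minimal sketch of the full pipeline for problem 6, assuming the data has already been loaded into a DataFrame df (the column handling is an assumption):
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import zscore
# Assumption: df is the given dataset, e.g. loaded with pd.read_csv(...)
numeric_cols = df.select_dtypes(include=np.number).columns
# Missing-value technique 1: drop rows with missing values
df_dropped = df.dropna()
# Missing-value technique 2: impute numeric columns with the column mean
df_imputed = df.copy()
df_imputed[numeric_cols] = df_imputed[numeric_cols].fillna(df_imputed[numeric_cols].mean())
# Outlier technique 1: z-score method (|z| > 3)
z_scores = np.abs(zscore(df_imputed[numeric_cols]))
df_no_outliers_zscore = df_imputed[(z_scores < 3).all(axis=1)]
# Outlier technique 2: Tukey's fences (1.5 * IQR beyond Q1 and Q3)
q1 = df_imputed[numeric_cols].quantile(0.25)
q3 = df_imputed[numeric_cols].quantile(0.75)
iqr = q3 - q1
mask = ~((df_imputed[numeric_cols] < (q1 - 1.5 * iqr)) |
         (df_imputed[numeric_cols] > (q3 + 1.5 * iqr))).any(axis=1)
df_no_outliers_tukey = df_imputed[mask]
# Min-max normalization of the imputed numeric data
scaler = MinMaxScaler()
df_minmax_scaled = pd.DataFrame(scaler.fit_transform(df_imputed[numeric_cols]),
                                columns=numeric_cols)
print(df_minmax_scaled.head())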
# Display mean, median, minimum, maximum, and standard deviation for the given dataset
summary_statistics = df.describe()
print("Summary Statistics for the Dataset:")
print(summary_statistics)
# Display mean, median, minimum, maximum, and standard deviation for a given dataset
# with numeric variables grouped by a categorical variable
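No grouped-statistics code follows the comment above in this excerpt. A minimal sketch, where 'category_col' stands in for whatever categorical column the dataset actually has:
# Assumption: 'category_col' is the categorical column; replace with the real column name
numeric_cols = df.select_dtypes(include='number').columns
grouped_stats = df.groupby('category_col')[numeric_cols].agg(['mean', 'median', 'min', 'max', 'std'])
print("Summary statistics grouped by the categorical variable:")
print(grouped_stats)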
8.
• Use the inbuilt dataset 'titanic', which contains information about the passengers who boarded the
ill-fated Titanic. Use the Seaborn library to see if we can find any patterns in the data.
• Write a code to check how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
Code:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the inbuilt titanic dataset
titanic_data = sns.load_dataset('titanic')
# Plot box plot for age distribution with respect to gender and survival status
sns.boxplot(data=titanic_data, x='sex', y='age', hue='survived')
plt.title("Age Distribution by Gender and Survival Status")
plt.xlabel("Gender")
plt.ylabel("Age")
plt.tight_layout()
plt.show()
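The problem also asks for a histogram of the ticket fare. A minimal sketch, assuming seaborn 0.11+ (which provides histplot):
# Plot the distribution of the ticket fare ('fare' column) as a histogram
sns.histplot(data=titanic_data, x='fare', bins=40)
plt.title("Distribution of Ticket Fare")
plt.xlabel("Fare")
plt.ylabel("Number of Passengers")
plt.tight_layout()
plt.show()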
# Alternatively, outliers can be identified programmatically using statistical methods
# such as the z-score method or Tukey's fences method, as shown earlier.
numeric_df = df.select_dtypes(include=np.number).dropna()
z_scores = np.abs(zscore(numeric_df))
outliers = (z_scores > 3).any(axis=1)
outlier_rows = numeric_df[outliers]
print("Outlier Rows:")
print(outlier_rows)
12. Create a Linear Regression Model using Python/R to predict home prices using the Boston Housing
Dataset. Evaluate the performance of your model.
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
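Only the imports appear above. A minimal sketch of the rest of problem 12; note that load_boston was deprecated and removed in scikit-learn 1.2, so this assumes an older scikit-learn version (with a newer version, the data would need to come from another source such as OpenML). The split ratio and random_state are assumptions:
# Load the Boston Housing dataset (requires scikit-learn < 1.2)
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate performance with mean squared error and R^2
y_pred = model.predict(X_test)
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))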
13. Create a logistic regression model to perform classification on the given dataset. Compute a
confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall for the dataset.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
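Again only the imports are given. A minimal sketch of the rest, using the breast cancer dataset named in the import; the split ratio, random_state, and max_iter value are assumptions:
# Load the breast cancer dataset (binary classification)
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the logistic regression model
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Confusion matrix: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Error rate:", 1 - accuracy)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))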
14. Create a Naïve Bayes classification model using Python on the given dataset. Compute a confusion
matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, and Recall for the dataset.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
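A minimal sketch for problem 14. The iris dataset named in the import is multiclass, so TP/FP/TN/FN are reported per class and precision/recall/F1 use macro averaging; these choices are assumptions, not mandated by the problem statement:
import numpy as np
# Load the iris dataset (three classes)
data = load_iris()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the Gaussian Naive Bayes model
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Confusion matrix for the three classes
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Per-class TP, FP, FN, TN derived from the multiclass confusion matrix
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp
fn = cm.sum(axis=1) - tp
tn = cm.sum() - (tp + fp + fn)
print("TP per class:", tp, "FP per class:", fp, "TN per class:", tn, "FN per class:", fn)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Error rate:", 1 - accuracy)
print("Precision (macro):", precision_score(y_test, y_pred, average='macro'))
print("Recall (macro):", recall_score(y_test, y_pred, average='macro'))
print("F1 score (macro):", f1_score(y_test, y_pred, average='macro'))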
# Text preprocessing with NLTK (requires the punkt, averaged_perceptron_tagger,
# and wordnet resources, available via nltk.download)
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
# Text to be preprocessed
text = ('Hello Everyone!, Welcome to my blog post on Medium. '
        'We are studying Natural Language Processing.')
# Tokenization
tokens = word_tokenize(text)
# POS Tagging
pos_tags = pos_tag(tokens)
# Lemmatization: map Penn Treebank tags to WordNet POS tags
# ('J' -> adjective, 'N' -> noun, 'V' -> verb, 'R' -> adverb), defaulting to noun
lemmatizer = WordNetLemmatizer()
tag_map = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}
lemmas = [lemmatizer.lemmatize(word, pos=tag_map.get(tag[0], 'n')) for word, tag in pos_tags]
print(lemmas)