DS Problem Statements and Codes

1. Perform the following operations using Python on given data set.

• Load the Dataset into pandas data frame


• Display information about missing values in the data
• Display initial statistics.
• Check the dimensions of the data frame.
• Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor,
and logical) of the variables in the data set. If variables are not in the correct data type, apply proper
type conversions.
• Turn categorical variables into quantitative variables using one hot encoding.
CODE:
import pandas as pd

# Load the dataset into a pandas data frame


df = pd.read_csv('dataset.csv')

# Display information about missing values in the data


print(df.isnull().sum())

# Display initial statistics


print(df.describe())

# Check the dimensions of the data frame


print(df.shape)

# Summarize the types of variables


print(df.dtypes)

# Convert variables to the correct data types if necessary


# For example, if a column named 'age' should be numeric instead of object
df['age'] = pd.to_numeric(df['age'])

# Turn categorical variables into quantitative variables using one hot encoding
df_encoded = pd.get_dummies(df)

# Print the encoded data frame


print(df_encoded)
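A variation worth noting (a sketch, assuming the frame has object or category columns): get_dummies can be restricted to the categorical columns explicitly, and drop_first=True drops one redundant dummy per variable.

# One-hot encode only the categorical columns; drop_first removes one redundant dummy per variable
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
df_onehot = pd.get_dummies(df, columns=list(categorical_cols), drop_first=True)
print(df_onehot.head())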
2. Perform the following operations using Python on given data set.
• Load the Dataset into pandas data frame.
• Display information about missing values in the data
• Display initial statistics.
• Check the dimensions of the data frame.
• Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor,
and logical) of the variables in the data set. If variables are not in the correct data type, apply proper
type conversions.
• Turn categorical variables into quantitative variables using label encoder
Code:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the dataset into a pandas data frame


df = pd.read_csv('dataset.csv')

# Display information about missing values in the data


print(df.isnull().sum())

# Display initial statistics


print(df.describe())

# Check the dimensions of the data frame


print(df.shape)

# Summarize the types of variables


print(df.dtypes)

# Convert variables to the correct data types if necessary


# For example, if a column named 'age' should be numeric instead of object
df['age'] = pd.to_numeric(df['age'])

# Turn categorical variables into quantitative variables using label encoder


le = LabelEncoder()

# Loop through each column in the data frame


for column in df.columns:
    # Check if the column is categorical (object or string type)
    if df[column].dtype == 'object':
        df[column] = le.fit_transform(df[column].astype(str))

# Print the updated data frame with label encoded categorical variables
print(df)
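Note that LabelEncoder is primarily intended for encoding target labels. As a hedged alternative for feature columns (applied to the frame before the label-encoding loop above), pandas categorical codes give the same kind of integer encoding:

# Alternative: encode each object column with pandas categorical codes (one integer per category)
df_cat_codes = pd.read_csv('dataset.csv')
for column in df_cat_codes.select_dtypes(include='object').columns:
    df_cat_codes[column] = df_cat_codes[column].astype('category').cat.codes
print(df_cat_codes.head())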

3. Perform the following operations using Python for given dataset.


• Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any 2 suitable techniques to deal with them.
• Identify outliers using any 2 techniques
• If there are outliers, deal with them using any technique
Code:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler

# Load the dataset into a pandas data frame


df = pd.read_csv('dataset.csv')

# Scan all variables for missing values and inconsistencies

# Check for missing values


missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

# Deal with missing values

# Technique 1: Drop rows with missing values


df_dropped = df.dropna()

# Technique 2: Impute missing values with mean or median


df_imputed = df.fillna(df.median(numeric_only=True))  # numeric_only avoids errors if text columns are present

# Identify outliers using two techniques

# Technique 1: Z-Score method


z_scores = np.abs((df_imputed - df_imputed.mean()) / df_imputed.std())
outliers_zscore = (z_scores > 3).any(axis=1)

# Technique 2: Isolation Forest


scaler = RobustScaler()
df_scaled = scaler.fit_transform(df_imputed)
clf = IsolationForest(contamination=0.05)
clf.fit(df_scaled)
outliers_isolation = clf.predict(df_scaled) == -1

# Deal with outliers

# Technique 1: Remove outliers using Z-Score method


df_no_outliers_zscore = df_imputed[~outliers_zscore]

# Technique 2: Remove outliers using Isolation Forest


df_no_outliers_isolation = df_imputed[~outliers_isolation]

# Print the results


print("Dataset after dropping rows with missing values:")
print(df_dropped)
print("Dataset after imputing missing values:")
print(df_imputed)
print("Dataset after removing outliers (Z-Score method):")
print(df_no_outliers_zscore)
print("Dataset after removing outliers (Isolation Forest):")
print(df_no_outliers_isolation)
4. Perform the following operations using Python for given dataset.
• Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any 2 suitable techniques to deal with them.
• Identify outliers using any box plot
• If there are outliers, deal with them using any technique.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a pandas data frame


df = pd.read_csv('dataset.csv')

# Scan all variables for missing values and inconsistencies

# Check for missing values


missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

# Deal with missing values

# Technique 1: Drop rows with missing values


df_dropped = df.dropna()

# Technique 2: Impute missing values with mean or median


df_imputed = df.fillna(df.median(numeric_only=True))  # numeric_only avoids errors if text columns are present

# Identify outliers using box plot

# Plot box plots for all numerical variables


plt.figure(figsize=(10, 6))
sns.boxplot(data=df_imputed)
plt.title("Box Plot of Numerical Variables")
plt.xticks(rotation=45)
plt.show()
# Deal with outliers

# Technique 1: Remove outliers using Interquartile Range (IQR) method


Q1 = df_imputed.quantile(0.25)
Q3 = df_imputed.quantile(0.75)
IQR = Q3 - Q1
df_no_outliers_iqr = df_imputed[~((df_imputed < (Q1 - 1.5 * IQR)) | (df_imputed > (Q3 + 1.5 * IQR))).any(axis=1)]

# Technique 2: Replace outliers with median


df_no_outliers_median = df_imputed.copy()
for column in df_no_outliers_median.columns:
    if df_no_outliers_median[column].dtype != 'object':
        median = df_no_outliers_median[column].median()
        df_no_outliers_median[column] = np.where(
            (df_no_outliers_median[column] < (Q1[column] - 1.5 * IQR[column])) |
            (df_no_outliers_median[column] > (Q3[column] + 1.5 * IQR[column])),
            median, df_no_outliers_median[column]
        )

# Print the results


print("Dataset after dropping rows with missing values:")
print(df_dropped)
print("Dataset after imputing missing values:")
print(df_imputed)
print("Dataset after removing outliers (IQR method):")
print(df_no_outliers_iqr)
print("Dataset after replacing outliers with median:")
print(df_no_outliers_median)
5. Perform the following operations using Python for given dataset.
• Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any 2 suitable techniques to deal with them.
• Identify outliers if any using any 2 techniques
• Apply z-score data normalization technique
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Load the dataset into a pandas data frame


df = pd.read_csv('dataset.csv')

# Scan all variables for missing values and inconsistencies

# Check for missing values


missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

# Deal with missing values

# Technique 1: Drop rows with missing values


df_dropped = df.dropna()

# Technique 2: Impute missing values with mean or median


df_imputed = df.fillna(df.median(numeric_only=True))  # numeric_only avoids errors if text columns are present

# Identify outliers using two techniques

# Technique 1: Z-Score method


z_scores = np.abs((df_imputed - df_imputed.mean()) / df_imputed.std())
outliers_zscore = (z_scores > 3).any(axis=1)

# Technique 2: Tukey's fences method


Q1 = df_imputed.quantile(0.25)
Q3 = df_imputed.quantile(0.75)
IQR = Q3 - Q1
outliers_tukey = ((df_imputed < (Q1 - 1.5 * IQR)) | (df_imputed > (Q3 + 1.5 * IQR))).any(axis=1)

# Deal with outliers

# Technique 1: Remove outliers using Z-Score method


df_no_outliers_zscore = df_imputed[~outliers_zscore]

# Technique 2: Remove outliers using Tukey's fences method


df_no_outliers_tukey = df_imputed[~outliers_tukey]

# Apply z-score data normalization technique

# StandardScaler
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df_imputed.columns)

# RobustScaler
scaler = RobustScaler()
df_robust_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df_imputed.columns)
# Print the results
print("Dataset after dropping rows with missing values:")
print(df_dropped)
print("Dataset after imputing missing values:")
print(df_imputed)
print("Dataset after removing outliers (Z-Score method):")
print(df_no_outliers_zscore)
print("Dataset after removing outliers (Tukey's fences method):")
print(df_no_outliers_tukey)
print("Dataset after applying z-score data normalization (StandardScaler):")
print(df_standardized)
print("Dataset after applying z-score data normalization (RobustScaler):")
print(df_robust_scaled)
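For reference, z-score normalization rescales each value as x' = (x - mean) / std. A minimal pandas-only sketch (assuming the columns of df_imputed are numeric) is:

# Manual z-score normalization (pandas uses the sample standard deviation, so values can differ slightly from StandardScaler)
df_zscore_manual = (df_imputed - df_imputed.mean()) / df_imputed.std()
print(df_zscore_manual.head())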
6. Perform the following operations using Python for given dataset.
• Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any 2 suitable techniques to deal with them.
• Identify outliers if any using any 2 techniques
• Apply min max data normalization technique
Code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import zscore

# Load the dataset into a pandas data frame


df = pd.read_csv('dataset.csv')

# Scan all variables for missing values and inconsistencies

# Check for missing values


missing_values = df.isnull().sum()
print("Missing Values:")
print(missing_values)

# Deal with missing values

# Technique 1: Drop rows with missing values


df_dropped = df.dropna()

# Technique 2: Impute missing values with mean or median


df_imputed = df.fillna(df.median(numeric_only=True))  # numeric_only avoids errors if text columns are present

# Identify outliers using two techniques

# Technique 1: Z-Score method


z_scores = np.abs(zscore(df_imputed))
outliers_zscore = (z_scores > 3).any(axis=1)
# Technique 2: Tukey's fences method
Q1 = df_imputed.quantile(0.25)
Q3 = df_imputed.quantile(0.75)
IQR = Q3 - Q1
outliers_tukey = ((df_imputed < (Q1 - 1.5 * IQR)) | (df_imputed > (Q3 + 1.5 * IQR))).any(axis=1)

# Deal with outliers

# Technique 1: Remove outliers using Z-Score method


df_no_outliers_zscore = df_imputed[~outliers_zscore]

# Technique 2: Remove outliers using Tukey's fences method


df_no_outliers_tukey = df_imputed[~outliers_tukey]

# Apply min-max data normalization technique

scaler = MinMaxScaler()
df_minmax_scaled = pd.DataFrame(scaler.fit_transform(df_imputed), columns=df_imputed.columns)

# Print the results


print("Dataset after dropping rows with missing values:")
print(df_dropped)
print("Dataset after imputing missing values:")
print(df_imputed)
print("Dataset after removing outliers (Z-Score method):")
print(df_no_outliers_zscore)
print("Dataset after removing outliers (Tukey's fences method):")
print(df_no_outliers_tukey)
print("Dataset after applying min-max data normalization:")
print(df_minmax_scaled)
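For reference, min-max normalization rescales each value as x' = (x - min) / (max - min), mapping every column into the [0, 1] range. An equivalent pandas-only sketch (assuming numeric columns) is:

# Manual min-max normalization to the [0, 1] range
df_minmax_manual = (df_imputed - df_imputed.min()) / (df_imputed.max() - df_imputed.min())
print(df_minmax_manual.head())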
7. Perform the following operations on given dataset
• Display mean, median, minimum, maximum, standard deviation for a given dataset
• Display mean, median, minimum, maximum, standard deviation for a given dataset with numeric
variables grouped by one of the qualitative (categorical) variable. For example, if your categorical
variable is age groups and quantitative variable is income, then provide summary statistics of income
grouped by the age groups
• Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any 2 suitable techniques to deal with them.
• Use the Seaborn library to see if we can find any patterns in the data.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a pandas data frame


df = pd.read_csv('dataset.csv')

# Display mean, median, minimum, maximum, and standard deviation for the given dataset
summary_statistics = df.describe()
print("Summary Statistics for the Dataset:")
print(summary_statistics)

# Display mean, median, minimum, maximum, and standard deviation for numeric variables grouped by a categorical variable

# Example: Group by 'age groups' and calculate summary statistics of 'income'


summary_statistics_grouped = df.groupby('age groups')['income'].describe()
print("\nSummary Statistics of Income Grouped by Age Groups:")
print(summary_statistics_grouped)

# Scan all variables for missing values and inconsistencies

# Check for missing values


missing_values = df.isnull().sum()
print("\nMissing Values:")
print(missing_values)
# Deal with missing values

# Technique 1: Drop rows with missing values


df_dropped = df.dropna()

# Technique 2: Impute missing values with mean or median


df_imputed = df.fillna(df.median(numeric_only=True))  # numeric_only avoids errors if text columns are present

# Use the Seaborn library to visualize patterns in the data


sns.pairplot(df)
plt.show()
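Beyond the pair plot, a correlation heatmap of the numeric columns is another common way to look for patterns (a sketch, assuming the dataset has at least two numeric columns):

# Correlation heatmap of the numeric columns
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()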

8.
• Use the inbuilt dataset 'titanic' which contains information about the passengers who boarded the
unfortunate Titanic ship. Use the Seaborn library to see if we can find any patterns in the data.
• Write a code to check how the price of the ticket (column name: 'fare') for each passenger is
distributed by plotting a histogram.
Code:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset from Seaborn


titanic_data = sns.load_dataset('titanic')

# Plot a histogram of the ticket prices


sns.histplot(data=titanic_data, x='fare', kde=True)
plt.title("Distribution of Ticket Prices")
plt.xlabel("Fare")
plt.ylabel("Count")
plt.show()
9.
• Use the inbuilt dataset 'titanic' and plot a box plot for distribution of age with respect to each gender
along with the information about whether they survived or not. (Column names : 'sex' and 'age')
• Write observations on the inference from the above statistics.
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset from Seaborn


titanic_data = sns.load_dataset('titanic')

# Plot box plot for age distribution with respect to gender and survival status
sns.boxplot(data=titanic_data, x='sex', y='age', hue='survived')
plt.title("Age Distribution by Gender and Survival Status")
plt.xlabel("Gender")
plt.ylabel("Age")
plt.show()
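Typical observations from this plot: the median age of male passengers is slightly higher than that of female passengers; among males, survivors tend to be younger than non-survivors, suggesting children were given priority; for females, the age distributions of survivors and non-survivors look broadly similar; a few elderly passengers appear as outliers above the upper whiskers.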

10. Perform the following operations on given dataset


• List down the features and their types
• Create a box plot for each feature in the dataset.
• Compare distributions and identify outliers.
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a pandas data frame


df = pd.read_csv('your_dataset.csv')

# List down the features and their types


feature_types = df.dtypes
print("Features and their Types:")
print(feature_types)

# Create a box plot for each feature in the dataset


fig, axes = plt.subplots(nrows=len(df.columns), figsize=(8, 6 * len(df.columns)))

for i, column in enumerate(df.columns):
    ax = axes[i]
    ax.boxplot(df[column])
    ax.set_title(f"Box Plot of {column}")
    ax.set_ylabel(column)

plt.tight_layout()
plt.show()

# Compare distributions and identify outliers


# You can visually inspect the box plots created above to compare the distributions of each feature.
# Outliers can be identified as individual points beyond the whiskers of the box plots.

# Alternatively, you can programmatically identify outliers using various statistical methods,
# such as the z-score method or Tukey's fences method, as mentioned in the previous responses.

# For example, to identify outliers using the z-score method:


from scipy.stats import zscore
import numpy as np

# Compute absolute z-scores on the numeric columns only
z_scores = np.abs(zscore(df.select_dtypes(include='number')))
outliers = (z_scores > 3).any(axis=1)
outlier_rows = df[outliers]
print("Outlier Rows:")
print(outlier_rows)
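As a complementary check (a sketch, assuming the dataset has numeric columns), per-column outlier counts can be computed with Tukey's fences to back up the visual comparison of the box plots:

# Count outliers per numeric column using Tukey's fences (values beyond 1.5 * IQR from the quartiles)
numeric_df = df.select_dtypes(include='number')
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1
outlier_counts = ((numeric_df < (Q1 - 1.5 * IQR)) | (numeric_df > (Q3 + 1.5 * IQR))).sum()
print("Outliers per column:")
print(outlier_counts)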

12. Create a Linear Regression Model using Python/R to predict home prices using Boston Housing
Dataset. Find the performance of your model.
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the Boston Housing Dataset


boston = load_boston()  # note: load_boston was removed in scikit-learn 1.2; see the alternative loading sketch after this example
df = pd.DataFrame(boston.data, columns=boston.feature_names)
target = pd.Series(boston.target, name='MEDV')

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=42)

# Create a linear regression model


model = LinearRegression()

# Fit the model on the training data


model.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = model.predict(X_test)

# Evaluate the performance of the model


mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the performance metrics


print("Mean Squared Error (MSE):", mse)
print("R-squared (R2):", r2)

13. Create a logistic regression model to perform classification on given dataset. Compute
Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Load the Breast Cancer Dataset


breast_cancer = load_breast_cancer()
df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
target = pd.Series(breast_cancer.target, name='target')

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=42)

# Create a logistic regression model


model = LogisticRegression(max_iter=10000)  # a higher max_iter avoids convergence warnings on this dataset

# Fit the model on the training data


model.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = model.predict(X_test)

# Compute the confusion matrix


confusion = confusion_matrix(y_test, y_pred)

# Extract TP, FP, TN, FN from the confusion matrix


TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]

# Compute performance metrics


accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the confusion matrix and performance metrics


print("Confusion Matrix:")
print(confusion)
print("\nTrue Positive (TP):", TP)
print("False Positive (FP):", FP)
print("True Negative (TN):", TN)
print("False Negative (FN):", FN)
print("\nAccuracy:", accuracy)
print("Error Rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

14. Create a Naïve Bayes classification model using Python on given dataset. Compute Confusion
matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Load the Iris Dataset


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Series(iris.target, name='target')

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=42)

# Create a Naïve Bayes model (Gaussian Naïve Bayes in this case)


model = GaussianNB()

# Fit the model on the training data


model.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = model.predict(X_test)
# Compute the confusion matrix
confusion = confusion_matrix(y_test, y_pred)

# Extract TP, FP, TN, FN from the confusion matrix
# (this 2x2 indexing assumes a binary problem; the Iris data has 3 classes, so these values only
# cover the first two classes; see the per-class sketch after this example)
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]

# Compute performance metrics


accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the confusion matrix and performance metrics


print("Confusion Matrix:")
print(confusion)
print("\nTrue Positive (TP):", TP)
print("False Positive (FP):", FP)
print("True Negative (TN):", TN)
print("False Negative (FN):", FN)
print("\nAccuracy:", accuracy)
print("Error Rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
15. Create a Bernoulli Naïve Bayes classification model using Python on given dataset. Compute
Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall on the given dataset.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Load the Iris Dataset


iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
target = pd.Series(iris.target, name='target')

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2, random_state=42)

# Create a Bernoulli Naïve Bayes model
# (note: BernoulliNB binarizes features at 0 by default, and the Iris measurements are all positive,
#  so the binarized features carry little information on this dataset)
model = BernoulliNB()

# Fit the model on the training data


model.fit(X_train, y_train)

# Make predictions on the testing data


y_pred = model.predict(X_test)

# Compute the confusion matrix


confusion = confusion_matrix(y_test, y_pred)

# Extract TP, FP, TN, FN from the confusion matrix


TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
TP = confusion[1, 1]
# Compute performance metrics
accuracy = accuracy_score(y_test, y_pred)
error_rate = 1 - accuracy
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

# Print the confusion matrix and performance metrics


print("Confusion Matrix:")
print(confusion)
print("\nTrue Positive (TP):", TP)
print("False Positive (FP):", FP)
print("True Negative (TN):", TN)
print("False Negative (FN):", FN)
print("\nAccuracy:", accuracy)
print("Error Rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
16. For given text apply following preprocessing methods:
• Tokenization
• POS Tagging
• Lemmatization
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

# Download the required NLTK resources (only needed on the first run)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

# Text to be preprocessed
text = 'Hello Everyone!, Welcome to my blog post on Medium. We are studying Natural Language Processing.'

# Tokenization
tokens = word_tokenize(text)

# POS Tagging
pos_tags = pos_tag(tokens)

# Lemmatization (map Penn Treebank tags to WordNet POS tags: J -> a, N -> n, V -> v, R -> r)
lemmatizer = WordNetLemmatizer()
tag_map = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}
lemmas = [lemmatizer.lemmatize(word, pos=tag_map.get(tag[0], 'n')) for word, tag in pos_tags]

# Print the results


print("Tokenization:")
print(tokens)
print("\nPOS Tagging:")
print(pos_tags)
print("\nLemmatization:")
print(lemmas)
