Practical Questions: Unit 1 and 2

The document outlines a series of data analysis tasks, including Titanic Survival dataset analysis, dummy variable creation, untidy to tidy data transformation, Winsorization method, and missing value imputation. It also covers exploratory data analysis (EDA) and feature engineering on an insurance charges dataset, as well as linear regression model fitting and visualization. Each task specifies the steps to be performed, including data loading, visualization, statistical analysis, and model evaluation.

Uploaded by

zaheerkkd1312

Questions

1. Titanic Survival Dataset Analysis:
o Load the Titanic Survival dataset.
o Display 5 sample observations from the dataset.
o Check and display the dataset information.
o Count the number of survivors based on gender (sex-wise survival count).
o Check if there are any null values in the dataset.
o Plot a count plot showing the survival count passenger-wise.
o Create a strip plot of Age vs. Sex with hue as Survival status. Identify the key factors influencing survival.
o Plot a pie chart for survival status and identify the percentage of survivors.
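One way to approach the steps above is sketched here. The question does not say where the data comes from, so this assumes seaborn's bundled `titanic` dataset and its column names (`survived`, `sex`, `age`). The helper names `load_titanic`, `survival_summary` and `plot_survival` are illustrative. "Passenger-wise" is read here as a plain count of survivors vs. non-survivors, which is also an assumption.

```python
import pandas as pd


def load_titanic():
    """Load the data. ASSUMPTION: seaborn's bundled 'titanic' dataset
    (downloaded on first use); swap in pd.read_csv(...) for a local file."""
    import seaborn as sns  # imported lazily so the tabular helper needs only pandas
    return sns.load_dataset("titanic")


def survival_summary(df):
    """Return the sex-wise survival table, null counts per column,
    and the percentage of passengers who survived."""
    by_sex = pd.crosstab(df["sex"], df["survived"])  # rows: sex, columns: 0/1
    nulls = df.isnull().sum()
    pct_survived = df["survived"].mean() * 100
    return by_sex, nulls, pct_survived


def plot_survival(df):
    """Draw the count plot, strip plot and pie chart side by side."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    sns.countplot(data=df, x="survived", ax=axes[0])
    sns.stripplot(data=df, x="sex", y="age", hue="survived", ax=axes[1])
    df["survived"].value_counts().plot.pie(autopct="%1.1f%%", ax=axes[2])
    axes[2].set_ylabel("")
    return fig
```

With `df = load_titanic()`, the first three steps are `print(df.sample(5))` and `df.info()`. `survival_summary(df)` covers the counts, the null check and the survivor percentage.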
2. Dummy Variable Creation:
o Create a DataFrame in Python using the following dataset:

Item    Color
Item1   Red
Item2   Green
Item3   Blue
Item4   Red
Item5   Green

o Generate dummy variables for the Color column.
o Display the resulting DataFrame with the dummy variables.
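A minimal sketch of this question using `pd.get_dummies`. The variable names `items` and `dummies` are illustrative, and `dtype=int` is a choice made here to get 0/1 values instead of booleans.

```python
import pandas as pd

# The Item/Color table from the question
items = pd.DataFrame({
    "Item": ["Item1", "Item2", "Item3", "Item4", "Item5"],
    "Color": ["Red", "Green", "Blue", "Red", "Green"],
})

# One indicator column per distinct colour; the Item column is kept as-is
dummies = pd.get_dummies(items, columns=["Color"], dtype=int)
print(dummies)
```

The result keeps `Item` and adds `Color_Blue`, `Color_Green` and `Color_Red`.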
3. Untidy to Tidy Data Transformation:
o Consider the following dataset:
Country  Year  Population  GDP
USA      2010  308         14992
USA      2011  311         15543
Canada   2010  34          1536
Canada   2011  35          1601

o Convert this untidy dataset into a tidy format such that the variable names (Population and GDP) appear in one column and their respective values are listed in another column.
o Display the tidy dataset.
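The reshaping described above can be sketched with `DataFrame.melt`. The output column names `Variable` and `Value` are assumptions, since the question does not name them.

```python
import pandas as pd

# The wide (untidy) table from the question
wide = pd.DataFrame({
    "Country": ["USA", "USA", "Canada", "Canada"],
    "Year": [2010, 2011, 2010, 2011],
    "Population": [308, 311, 34, 35],
    "GDP": [14992, 15543, 1536, 1601],
})

# Keep Country and Year as identifiers; stack Population and GDP
# into a name column ("Variable") and a value column ("Value")
tidy = wide.melt(
    id_vars=["Country", "Year"],
    value_vars=["Population", "GDP"],
    var_name="Variable",
    value_name="Value",
)
print(tidy)
```

Each of the 4 original rows contributes 2 rows, so the tidy table has 8 rows.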
4. Winsorization Method:
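o For the given data [10, 15, 20, 25, 100, 150, 200], replace the outliers with the 5th and 95th percentile values using the Winsorization method (values below the 5th percentile are raised to it, and values above the 95th percentile are lowered to it).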
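A minimal sketch of percentile-based Winsorization with NumPy. It assumes the percentiles are computed with NumPy's default linear interpolation. Other conventions exist (for example, `scipy.stats.mstats.winsorize` replaces extremes with the nearest remaining data value), so results can differ.

```python
import numpy as np

data = np.array([10, 15, 20, 25, 100, 150, 200], dtype=float)

# 5th and 95th percentiles (NumPy's default linear interpolation)
lower, upper = np.percentile(data, [5, 95])

# Cap everything below `lower` and above `upper` at those limits
winsorized = np.clip(data, lower, upper)

print(lower, upper)   # 11.5 185.0
print(winsorized)
```

Under this convention only the two extreme values change: 10 becomes 11.5 and 200 becomes 185.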
5. Missing Value Imputation:
o For the given dataset:

Feature 1  Feature 2
5          12
7          NaN
3          8
NaN        15
8          6
10         9
6          NaN
NaN        5
9          11

o Replace the missing values with the mean of their respective columns.
o Replace the missing values with the median of their respective columns.
o Replace the missing values using the K-Nearest Neighbors (KNN) imputation method.
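The three imputation strategies can be sketched as follows. The KNN part uses scikit-learn's `KNNImputer`. The question does not specify the number of neighbours, so `n_neighbors=2` is an assumption chosen for this small table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# The two-feature table from the question
df = pd.DataFrame({
    "Feature 1": [5, 7, 3, np.nan, 8, 10, 6, np.nan, 9],
    "Feature 2": [12, np.nan, 8, 15, 6, 9, np.nan, 5, 11],
})

# Mean imputation: each NaN takes its own column's mean
mean_imputed = df.fillna(df.mean())

# Median imputation: each NaN takes its own column's median
median_imputed = df.fillna(df.median())

# KNN imputation: each NaN takes the average of that feature over the
# k nearest rows (k=2 is an ASSUMPTION, not given in the question)
knn = KNNImputer(n_neighbors=2)
knn_imputed = pd.DataFrame(knn.fit_transform(df), columns=df.columns)

print(mean_imputed, median_imputed, knn_imputed, sep="\n\n")
```

Observed values are left unchanged in all three results; only the NaN cells are filled.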

Question: Exploratory Data Analysis (EDA) and Feature Engineering

a) Load the dataset Insurance Charges Prediction.csv into a DataFrame and perform the following:
o Display the first 5 rows of the dataset.
o Display the dataset information.
o Provide the statistical summary of the dataset for numerical features.
o Provide the statistical summary of the dataset for categorical features.
b) Perform Univariate Analysis:
o Plot histograms for all numerical columns in the dataset.
c) Perform Bivariate Analysis:
o Visualize the distribution of charges based on:
▪ Gender (sex) using a boxplot.
▪ Region (region) using a boxplot.
▪ Smoking status (smoker) using a boxplot.
o Plot a count plot to show the distribution of smoker status with hue as sex.
o Create scatter plots for:
▪ Age vs. Charges.
▪ BMI vs. Charges.
d) Perform Correlation Analysis:
o Filter out the numerical columns.
o Calculate and display the correlation matrix for numerical variables.
o Visualize the correlation matrix using a heatmap.
e) Filter the categorical variables:
o Extract only categorical features.
o Display the names of the categorical columns.
f) Perform Feature Engineering:
o Use the pd.get_dummies function to encode the categorical variables into dummy variables.
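Parts a) to f) can be sketched as three helpers. The column names `sex`, `region` and `smoker` come from the question. The lowercase names `age`, `bmi` and `charges` are assumptions about the CSV header, and the function names `summarize`, `encode` and `plot_eda` are illustrative.

```python
import pandas as pd


def summarize(df):
    """Parts a), d), e): summaries, correlation matrix, categorical names."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return {
        "numeric_summary": numeric.describe(),
        "categorical_summary": categorical.describe(),
        "correlation": numeric.corr(),
        "categorical_columns": list(categorical.columns),
    }


def encode(df):
    """Part f): replace every categorical column with 0/1 dummy columns."""
    cat_cols = list(df.select_dtypes(exclude="number").columns)
    return pd.get_dummies(df, columns=cat_cols, dtype=int)


def plot_eda(df):
    """Parts b), c), d): histograms, boxplots, count plot, scatter plots, heatmap.
    ASSUMES columns named age, bmi, charges, sex, region, smoker."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    import seaborn as sns

    numeric = df.select_dtypes(include="number")
    numeric.hist(figsize=(10, 6))
    for col in ["sex", "region", "smoker"]:
        plt.figure()
        sns.boxplot(data=df, x=col, y="charges")
    plt.figure()
    sns.countplot(data=df, x="smoker", hue="sex")
    for col in ["age", "bmi"]:
        plt.figure()
        sns.scatterplot(data=df, x=col, y="charges")
    plt.figure()
    sns.heatmap(numeric.corr(), annot=True, cmap="coolwarm")
    plt.show()
```

Typical use would be `df = pd.read_csv("Insurance Charges Prediction.csv")`, then `df.head()` and `df.info()`, followed by `summarize(df)`, `plot_eda(df)` and `encode(df)`.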

Question: Linear Regression Model and Visualization

a) Given the following dataset:

x y
5 5
15 20
25 14
35 32
45 22
55 38

b) Plot a scatter plot to visualize the relationship between x and y.
c) Fit a Linear Regression model using x and y.
o Calculate the coefficient of determination (R²) to evaluate the goodness of fit.
o Display the intercept and coefficient of the linear regression model.
d) Predict the dependent variable (y) for the following new values of the independent variable (x): 8, 15, and 35.
e) Plot the original data points and overlay the fitted regression line.