vertopal.com_homework1

The document outlines a data analysis project using the Titanic dataset, including loading the data, displaying basic statistics, and visualizing distributions. Key findings include the average age and fare, survival rates, and correlations between variables. Additionally, it discusses outlier detection and distribution analysis for age and fare, highlighting their skewness.

Uploaded by

mannkanit

import kagglehub

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Submission by Kanit Mann

Part 1
1.1
• a) Load the dataset and display the first 5 rows.
import os

path = kagglehub.dataset_download("yasserh/titanic-dataset")
print("Path to dataset files:", path)
# Read the CSV from the downloaded dataset directory rather than the working directory
df = pd.read_csv(os.path.join(path, 'Titanic-Dataset.csv'))
print(df.head(5))

Path to dataset files: /Users/kanitmann/.cache/kagglehub/datasets/yasserh/titanic-dataset/versions/1
   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

• b) Display the total number of rows and columns.


print(f'Number of rows in dataset: {df.shape[0]}, number of columns: {df.shape[1]}')

Number of rows in dataset: 891, number of columns: 12

• c) List all the column names and their data types.


print(df.dtypes)

PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

1.2
• a) Generate summary statistics (mean, median, standard deviation, min, and max) for the
following columns:
– Age
– Fare
age_mean = "{:.2f}".format(df['Age'].mean())
age_max = df['Age'].max()
age_min = df['Age'].min()
age_median = df['Age'].median()
age_std = df['Age'].std()

fare_mean = "{:.2f}".format(df['Fare'].mean())
fare_max = df['Fare'].max()
fare_min = df['Fare'].min()
fare_median = df['Fare'].median()
fare_std = df['Fare'].std()

print(f'Age Mean: {age_mean} \tAge Max: {age_max}\tAge Min: {age_min}\nAge Median: {age_median}\tAge Standard Deviation: {age_std}')
print(f'\nFare Mean: {fare_mean}\tFare Max: {fare_max}\tFare Min: {fare_min}\nFare Median: {fare_median}\tFare Standard Deviation: {fare_std}')

Age Mean: 29.70 Age Max: 80.0 Age Min: 0.42
Age Median: 28.0 Age Standard Deviation: 14.526497332334042

Fare Mean: 32.20 Fare Max: 512.3292 Fare Min: 0.0
Fare Median: 14.4542 Fare Standard Deviation: 49.6934285971809
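
The separate mean, median, std, min and max calls can also be collapsed into a single `agg` call. The sketch below is only an illustration: `toy` is a made-up stand-in for the Titanic `df`, and its values are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic DataFrame; values are made up.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.93, 8.05, 53.10],
})

# One call returns a statistics-by-column table; missing ages are skipped.
summary = toy[["Age", "Fare"]].agg(["mean", "median", "std", "min", "max"])

# Range = max - min, computed for both columns at once.
value_range = summary.loc["max"] - summary.loc["min"]

print(summary.round(2))
print(value_range)
```

Applied to the real data, the same two statements would use `df` in place of `toy`.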

• b) What is the range of the Age and Fare columns? (Range = Max - Min)
age_range = age_max - age_min
fare_range = fare_max - fare_min

print(f'Range of age is: {age_range}\nRange of fare is: {fare_range}')

Range of age is: 79.58


Range of fare is: 512.3292

• c) Identify which class (Pclass) had the highest average Fare. Provide the mean Fare for
each class.
mean_fare_by_class = df.groupby("Pclass")["Fare"].mean()
highest_fare_class = mean_fare_by_class.idxmax()

print("Mean Fare for each class:")


print(mean_fare_by_class)
print(f"\nClass with the highest average Fare: {highest_fare_class}")

Mean Fare for each class:


Pclass
1 84.154687
2 20.662183
3 13.675550
Name: Fare, dtype: float64

Class with the highest average Fare: 1

1.3:
• a) Count the number of passengers in each Pclass (1st, 2nd, 3rd).
pclass_counts = df["Pclass"].value_counts()

• b) What percentage of passengers survived?


survival_rate = df["Survived"].mean() * 100

• c) Create a cross-tabulation of Survived vs Sex. What percentage of males and females survived?
survival_table = pd.crosstab(df["Sex"], df["Survived"], margins=True)

survival_percentage = survival_table.div(survival_table["All"],
axis=0) * 100

print("a) Passenger count in each class:")


print(pclass_counts)

print("\nb) Percentage of passengers who survived:")


print(f"{survival_rate:.2f}%")

print("\nc) Cross-tabulation of Survived vs. Sex:")


print(survival_table)

print("\nSurvival Percentage by Sex:")


print(survival_percentage[1])

a) Passenger count in each class:


Pclass
3 491
1 216
2 184
Name: count, dtype: int64

b) Percentage of passengers who survived:


38.38%

c) Cross-tabulation of Survived vs. Sex:


Survived 0 1 All
Sex
female 81 233 314
male 468 109 577
All 549 342 891

Survival Percentage by Sex:


Sex
female 74.203822
male 18.890815
All 38.383838
Name: 1, dtype: float64
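
Dividing the table by its "All" column works, but `pd.crosstab` can also produce row percentages directly through its `normalize` argument. A minimal sketch on a made-up miniature passenger list (the eight rows in `toy` are hypothetical, not Titanic records):

```python
import pandas as pd

# Hypothetical miniature passenger list; values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# normalize="index" divides each row by its own total, giving per-sex rates.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index") * 100
print(rates.round(2))
```

The column labelled 1 then holds the survival percentage for each sex, with no separate division step.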

Part 2:
2.4
• a) Calculate the correlation coefficient between Age and Fare. Is the correlation positive,
negative, or negligible?
• b) Calculate the correlation between Pclass and Fare. Interpret the result—do higher
classes pay more on average?
correlation = df["Age"].corr(df["Fare"])
# Sign-only labelling: "negligible" is reported only when the coefficient is exactly 0
correlation_type = ("positive" if correlation > 0 else "negative" if correlation < 0 else "negligible")
print(f"Correlation Coefficient between Age and Fare: {correlation:.2f}")
print(f"The correlation is {correlation_type}.")

Correlation Coefficient between Age and Fare: 0.10


The correlation is positive.

correlation = df["Pclass"].corr(df["Fare"])
if correlation < 0:
    interpretation = "negative correlation, meaning higher classes (lower Pclass values) pay more on average."
elif correlation > 0:
    interpretation = "positive correlation, meaning lower classes (higher Pclass values) pay more on average."
else:
    interpretation = "no correlation."

print(f"Correlation Coefficient between Pclass and Fare: {correlation:.2f}")
print(f"Interpretation: {interpretation}")

Correlation Coefficient between Pclass and Fare: -0.55
Interpretation: negative correlation, meaning higher classes (lower Pclass values) pay more on average.
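
A check on the sign alone can only report "negligible" when the coefficient is exactly zero, which practically never happens with real data. One possible refinement is to treat small magnitudes as negligible. The helper below is a sketch: the name `describe_correlation` and the 0.2 cutoff are illustrative assumptions, not an established standard.

```python
def describe_correlation(r: float, negligible_below: float = 0.2) -> str:
    """Label a Pearson coefficient by sign and rough strength.

    The 0.2 cutoff is an illustrative assumption, not a standard.
    """
    if abs(r) < negligible_below:
        return "negligible"
    return "positive" if r > 0 else "negative"


print(describe_correlation(0.10))   # Age vs Fare coefficient reported above
print(describe_correlation(-0.55))  # Pclass vs Fare coefficient reported above
```

Under this assumed cutoff, the Age and Fare relationship would be described as negligible rather than positive, while the Pclass and Fare relationship stays clearly negative.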

2.5: Outlier Detection:


• a) Identify outliers in the Fare column using the Interquartile Range (IQR) method. How
many outliers are detected?
• b) Visualize the Fare column using a boxplot. Describe any visible outliers and how they
might affect the analysis.
Q1 = df["Fare"].quantile(0.25)
Q3 = df["Fare"].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR


upper_bound = Q3 + 1.5 * IQR

outliers = df[(df["Fare"] < lower_bound) | (df["Fare"] > upper_bound)]

num_outliers = outliers.shape[0]

print(f"Lower Bound: {lower_bound:.2f}, Upper Bound: {upper_bound:.2f}")
print(f"Number of outliers detected: {num_outliers}")
print("\nOutliers detected in 'Fare' column:")
print(outliers)
Lower Bound: -26.72, Upper Bound: 65.63
Number of outliers detected: 116

Outliers detected in 'Fare' column:
     PassengerId  Survived  Pclass  \
1              2         1       1
27            28         0       1
31            32         1       1
34            35         0       1
52            53         1       1
..           ...       ...     ...
846          847         0       3
849          850         1       1
856          857         1       1
863          864         0       3
879          880         1       1

                                                  Name     Sex   Age  SibSp  \
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
27                      Fortune, Mr. Charles Alexander    male  19.0      3
31      Spencer, Mrs. William Augustus (Marie Eugenie)  female   NaN      1
34                             Meyer, Mr. Edgar Joseph    male  28.0      1
52            Harper, Mrs. Henry Sleeper (Myna Haxtun)  female  49.0      1
..                                                 ...     ...   ...    ...
846                           Sage, Mr. Douglas Bullen    male   NaN      8
849       Goldenberg, Mrs. Samuel L (Edwiga Grabowska)  female   NaN      1
856         Wick, Mrs. George Dennick (Mary Hitchcock)  female  45.0      1
863                  Sage, Miss. Dorothy Edith "Dolly"  female   NaN      8
879      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0      0

     Parch    Ticket      Fare        Cabin Embarked
1        0  PC 17599   71.2833          C85        C
27       2     19950  263.0000  C23 C25 C27        S
31       0  PC 17569  146.5208          B78        C
34       0  PC 17604   82.1708          NaN        C
52       0  PC 17572   76.7292          D33        C
..     ...       ...       ...          ...      ...
846      2  CA. 2343   69.5500          NaN        S
849      0     17453   89.1042          C92        C
856      1     36928  164.8667          NaN        S
863      2  CA. 2343   69.5500          NaN        S
879      1     11767   83.1583          C50        C

[116 rows x 12 columns]

plt.figure(figsize=(8, 5))
sns.boxplot(x=df["Fare"])
plt.title("Boxplot of Fare Column")
plt.xlabel("Fare")
plt.show()
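
The IQR rule used above can be wrapped in a small reusable function. This is a sketch under stated assumptions: `iqr_outlier_mask` is a hypothetical helper name, and the `fares` array is made up rather than drawn from the dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up fares: mostly small, with one very large value.
fares = np.array([7.25, 7.9, 8.05, 13.0, 14.5, 26.0, 31.0, 512.33])
mask = iqr_outlier_mask(fares)
print(fares[mask])
```

In the Titanic results above, the lower bound is negative (-26.72) while the minimum fare is 0.0, so all 116 flagged rows lie above the upper bound of 65.63.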

2.6: Distribution Analysis:


• a) Plot a histogram of the Age column. Describe the shape of the distribution (e.g.,
normal, skewed).
• b) Plot the distribution of Fare and check for skewness. Is the data skewed to the left or
right?
plt.figure(figsize=(8, 5))
sns.histplot(df["Age"], bins=10, kde=True, color="blue")
plt.title("Histogram of Age")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

• Based on the graph, the Age distribution is close to normal but slightly right-skewed: the longer tail toward older ages indicates positive skewness.
plt.figure(figsize=(8, 5))
sns.histplot(df["Fare"], bins=10, kde=True, color="green")
plt.title("Histogram of Fare")
plt.xlabel("Fare")
plt.ylabel("Frequency")
plt.show()
age_skewness = df["Age"].skew()
fare_skewness = df["Fare"].skew()

print(f"Skewness of Age: {age_skewness:.2f}")
print(f"Skewness of Fare: {fare_skewness:.2f}")

# Sign-only labelling: "approximately symmetric" is reported only for a skewness of exactly 0
if age_skewness > 0:
    age_interpretation = "positively skewed (right-skewed)"
elif age_skewness < 0:
    age_interpretation = "negatively skewed (left-skewed)"
else:
    age_interpretation = "approximately symmetric"

if fare_skewness > 0:
    fare_interpretation = "positively skewed (right-skewed)"
elif fare_skewness < 0:
    fare_interpretation = "negatively skewed (left-skewed)"
else:
    fare_interpretation = "approximately symmetric"

print(f"\nThe Age distribution is {age_interpretation}.")
print(f"The Fare distribution is {fare_interpretation}.")

Skewness of Age: 0.39


Skewness of Fare: 4.79
The Age distribution is positively skewed (right-skewed).
The Fare distribution is positively skewed (right-skewed).
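
To make the skewness figures less of a black box, the coefficient can be computed by hand from the second and third central moments. The sketch below implements the bias-adjusted Fisher-Pearson form, which, as far as I know, is the variant that pandas' `skew()` reports. The function name `sample_skewness` and both sample lists are assumptions made for illustration.

```python
import numpy as np

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness: sqrt(n(n-1))/(n-2) * m3 / m2**1.5."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]          # ignore missing values
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)         # second central moment
    m3 = np.mean(d ** 3)         # third central moment
    return np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 40]   # made-up data with one large value
print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A symmetric sample gives zero, and a single large value on the right pushes the coefficient positive, which mirrors the pattern seen for Fare.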

Questions from book:


• 2.2 Suppose that the data for analysis includes the attribute age. The age values for
the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25,
25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
– (a) What is the mean of the data? What is the median?
– (b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
– (c) What is the midrange of the data?
– (d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the
data?
– (e) Give the five-number summary of the data.
– (f) Show a boxplot of the data.
– (g) How is a quantile–quantile plot different from a quantile plot?
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# (a) Mean and Median


mean_age = np.mean(ages)
median_age = np.median(ages)
print(f"Mean: {mean_age}")
print(f"Median: {median_age}")

Mean: 29.962962962962962
Median: 25.0

# (b) Mode and Modality


from scipy import stats

mode_age = stats.mode(ages)
print(mode_age)
print(f"Mode: {mode_age[0]} with count: {mode_age[1]}")

ModeResult(mode=np.int64(25), count=np.int64(4))
Mode: 25 with count: 4

The ages 25 and 35 each occur four times, so the data has two modes, but stats.mode() returns only the smallest of the most frequent values. The data is therefore bimodal.
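
When a sample can have more than one mode, the standard library's `statistics.multimode` reports all of them. A short sketch on the same age list:

```python
from statistics import multimode

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# multimode returns every value tied for the highest count,
# in the order each first appears in the data.
modes = multimode(ages)
print(modes)  # [25, 35]
```

The length of the returned list gives the modality directly: two entries means bimodal.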

# (c) Midrange
midrange_age = (min(ages) + max(ages)) / 2
print(f"Midrange: {midrange_age}")

Midrange: 41.5
# (d) First Quartile (Q1) and Third Quartile (Q3)
Q1 = np.percentile(ages, 25)
Q3 = np.percentile(ages, 75)

print(f"First Quartile (Q1): {Q1}")


print(f"Third Quartile (Q3): {Q3}")

First Quartile (Q1): 20.5


Third Quartile (Q3): 35.0
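
Quartiles depend on the interpolation convention, which is why the question asks for them only roughly. The sketch below compares NumPy's default with the "exclusive" method of `statistics.quantiles`; the choice of these two conventions is mine, for illustration only.

```python
import numpy as np
from statistics import quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# NumPy's default: linear interpolation at position p * (n - 1).
q1_np, q3_np = np.percentile(ages, [25, 75])

# statistics.quantiles with method="exclusive": position p * (n + 1),
# which here lands exactly on the 7th, 14th and 21st sorted values.
q1_ex, median_ex, q3_ex = quantiles(ages, n=4, method="exclusive")

print(q1_np, q3_np)
print(q1_ex, q3_ex)
```

With 27 values the two conventions agree on Q3 (35) and differ slightly on Q1 (20.5 versus 20), so either is an acceptable rough answer.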

# (e) Five-number summary


five_number_summary = (min(ages), Q1, median_age, Q3, max(ages))
print(f"Five-number summary: {five_number_summary}")

Five-number summary: (13, np.float64(20.5), np.float64(25.0), np.float64(35.0), 70)

# (f) Boxplot
plt.boxplot(ages, vert=False)
plt.title('Boxplot of Ages')
plt.xlabel('Age')
plt.show()

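• g) How is a quantile–quantile plot different from a quantile plot?
– Quantile Plot: Displays the sorted values x_i of a single attribute against the fraction f_i = (i - 0.5)/N of the data at or below each value, so it shows the distribution of one variable.
– Quantile-Quantile (Q-Q) Plot: Plots the quantiles of one distribution against the corresponding quantiles of another (a second data set or a theoretical distribution), so it shows whether the two distributions differ in location or shape.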
• 2.3 Suppose that the values for a given set of data are grouped into intervals. The
intervals and corresponding frequencies are as follows:

age      frequency
1–5      200
6–15     450
16–20    300
21–50    1500
51–80    700
81–110   44

Compute an approximate median value for the data.

intervals = [(1, 5), (6, 15), (16, 20), (21, 50), (51, 80), (81, 110)]
frequencies = [200, 450, 300, 1500, 700, 44]

N = sum(frequencies)

cumulative_frequency = 0
median_class_index = -1

for i, freq in enumerate(frequencies):
    cumulative_frequency += freq
    if cumulative_frequency >= (N/2):
        median_class_index = i
        break

L = intervals[median_class_index][0]
F = cumulative_frequency - frequencies[median_class_index]
f = frequencies[median_class_index]
# The class 21–50 covers 30 integer ages, so the width is upper - lower + 1
w = intervals[median_class_index][1] - intervals[median_class_index][0] + 1

median = L + (((N/2) - F) / f) * w

print(f"Approximate Median: {median:.2f}")

Approximate Median: 33.94
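
The interpolated median L + ((N/2 - F)/f)·w is sensitive to how the lower boundary L of the median class is defined. The sketch below evaluates the formula for three common conventions, keeping the width at 30 (the number of integer ages in 21–50). The helper name `grouped_median` is hypothetical.

```python
def grouped_median(lower, width, n_total, cum_before, freq):
    """Interpolated median: L + ((N/2 - F) / f) * w."""
    return lower + ((n_total / 2 - cum_before) / freq) * width

N = 200 + 450 + 300 + 1500 + 700 + 44   # 3194 observations
F = 200 + 450 + 300                      # 950 observations below the 21-50 class
f = 1500                                 # frequency of the 21-50 class

# The class 21-50 spans 30 integer ages; only the lower boundary differs.
for lower in (20.0, 20.5, 21.0):
    print(lower, round(grouped_median(lower, 30, N, F, f), 2))
```

The three results differ by at most one year, so any of them is a reasonable approximate median as long as the convention is stated.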

• 2.4 Suppose that a hospital tested the age and body fat data for 18 randomly selected
adults with the following results:

age   23    23    27    27    39    41    47    49    50
%fat  9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2

age   52    54    54    56    57    58    58    60    61
%fat  34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7

• (a) Calculate the mean, median, and standard deviation of age and %fat.
import numpy as np

ages = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56,
57, 58, 58, 60, 61])
body_fat = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2,
31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

mean_age = np.mean(ages)
mean_fat = np.mean(body_fat)

median_age = np.median(ages)
median_fat = np.median(body_fat)

std_age = np.std(ages, ddof=1)


std_fat = np.std(body_fat, ddof=1)

print(f"Mean Age: {mean_age:.2f}, Mean %Fat: {mean_fat:.2f}")
print(f"Median Age: {median_age}, Median %Fat: {median_fat}")
print(f"Standard Deviation Age: {std_age:.2f}, Standard Deviation %Fat: {std_fat:.2f}")

Mean Age: 46.44, Mean %Fat: 28.78


Median Age: 51.0, Median %Fat: 30.7
Standard Deviation Age: 13.22, Standard Deviation %Fat: 9.25

• (b) Draw the boxplots for age and %fat.


import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.boxplot(ages, vert=False)
plt.title('Boxplot of Age')
plt.xlabel('Age')

plt.subplot(1, 2, 2)
plt.boxplot(body_fat, vert=False)
plt.title('Boxplot of %Fat')
plt.xlabel('%Fat')

plt.tight_layout()
plt.show()
• (c) Draw a scatter plot and a q-q plot based on these two variables.
plt.figure(figsize=(10, 5))
plt.scatter(ages, body_fat, color='blue')
plt.title('Scatter Plot of Age vs %Fat')
plt.xlabel('Age')
plt.ylabel('%Fat')
plt.grid(True)
plt.show()

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
stats.probplot(ages, dist="norm", plot=plt)
plt.title('Q-Q Plot of Age')

plt.subplot(1, 2, 2)
stats.probplot(body_fat, dist="norm", plot=plt)
plt.title('Q-Q Plot of %Fat')

plt.tight_layout()
plt.show()
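
The two probability plots above compare each variable with a normal distribution. The question can also be read as asking for a single q-q plot of one variable against the other, which pairs the quantiles of age with the quantiles of %fat. A minimal sketch of that reading, assuming the same 18 observations:

```python
import numpy as np

ages = np.array([23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56,
                 57, 58, 58, 60, 61])
body_fat = np.array([9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2,
                     31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7])

# With equal sample sizes, the i-th smallest age and the i-th smallest %fat
# sit at the same quantile level f_i = (i - 0.5) / N.
age_q = np.sort(ages)
fat_q = np.sort(body_fat)
levels = (np.arange(1, ages.size + 1) - 0.5) / ages.size

for f_i, a, b in zip(levels, age_q, fat_q):
    print(f"{f_i:.3f}  age={a}  %fat={b}")
```

Passing `age_q` and `fat_q` to `plt.scatter` would draw the q-q plot itself. Unlike the scatter plot, the pairs here are matched by rank, not by person.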
• 2.8 It is important to define or select similarity measures in data analysis. However, there
is no commonly accepted subjective similarity measure. Results can vary depending on
the similarity measures used. Nonetheless, seemingly different similarity measures may
be equivalent after some transformation. Suppose we have the following 2-D data set:

      A1    A2
x1    1.5   1.7
x2    2     1.9
x3    1.6   1.8
x4    1.2   1.5
x5    1.5   1.0

• (a) Consider the data as 2-D data points. Given a new data point, x = (1.4,1.6) as a query,
rank the database points based on similarity with the query using Euclidean distance,
Manhattan distance, supremum distance, and cosine similarity.
import numpy as np
from scipy.spatial import distance

data_points = np.array([
[1.5, 1.7],
[2.0, 1.9],
[1.6, 1.8],
[1.2, 1.5],
[1.5, 1.0]
])

query_point = np.array([1.4, 1.6])


euclidean_distances = np.linalg.norm(data_points - query_point,
axis=1)
manhattan_distances = np.sum(np.abs(data_points - query_point),
axis=1)
supremum_distances = np.max(np.abs(data_points - query_point), axis=1)
cosine_similarities = np.array([1 - distance.cosine(query_point, dp)
for dp in data_points])

euclidean_rank = np.argsort(euclidean_distances)
manhattan_rank = np.argsort(manhattan_distances)
supremum_rank = np.argsort(supremum_distances)
cosine_rank = np.argsort(-cosine_similarities)

print("Euclidean Distance Ranking:", euclidean_rank + 1)


print("Manhattan Distance Ranking:", manhattan_rank + 1)
print("Supremum Distance Ranking:", supremum_rank + 1)
print("Cosine Similarity Ranking:", cosine_rank + 1)

Euclidean Distance Ranking: [1 4 3 5 2]


Manhattan Distance Ranking: [1 4 3 5 2]
Supremum Distance Ranking: [1 4 3 2 5]
Cosine Similarity Ranking: [1 3 4 2 5]

• (b) Normalize the data set to make the norm of each data point equal to 1. Use Euclidean
distance on the transformed data to rank the data points.
normalized_data_points = np.array([dp / np.linalg.norm(dp) for dp in
data_points])
normalized_query_point = query_point / np.linalg.norm(query_point)
normalized_euclidean_distances = np.linalg.norm(normalized_data_points
- normalized_query_point, axis=1)
normalized_euclidean_rank = np.argsort(normalized_euclidean_distances)

print("Normalized Euclidean Distance Ranking:",


normalized_euclidean_rank + 1)

Normalized Euclidean Distance Ranking: [1 3 4 2 5]
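
The normalized Euclidean ranking matches the cosine ranking, and this is not a coincidence: for unit vectors u and v, ||u - v||² = 2 - 2·cos(θ), so a smaller distance always means a larger cosine. The sketch below checks the identity numerically on the same data; the variable names are chosen for this illustration.

```python
import numpy as np

data_points = np.array([[1.5, 1.7], [2.0, 1.9], [1.6, 1.8], [1.2, 1.5], [1.5, 1.0]])
query_point = np.array([1.4, 1.6])

# Scale every vector to unit length.
unit_points = data_points / np.linalg.norm(data_points, axis=1, keepdims=True)
unit_query = query_point / np.linalg.norm(query_point)

cosine = unit_points @ unit_query                       # cos(theta) per point
sq_dist = np.sum((unit_points - unit_query) ** 2, axis=1)

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(theta)
print(np.allclose(sq_dist, 2 - 2 * cosine))
print(np.argsort(sq_dist) + 1, np.argsort(-cosine) + 1)
```

This is the sense in which two seemingly different similarity measures become equivalent after a transformation.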
