
Test Questions and Analysis

Test questions

Uploaded by

otieni.reagan
Copyright
© © All Rights Reserved

10/16/24, 5:52 PM cardio_covid_data_analysis

Analysis to answer questions based on cardio Dataset

Import necessary libraries, import csv file, cleaning and formatting the
dataset

In [1]: import pandas as pd # for data manipulation


import numpy as np # for numerical calculations
import matplotlib.pyplot as plt # for plotting (optional)

i. Import the csv file

In [3]: # Load the dataset


file_path = r'C:\Users\JOHN M\Downloads\Cardio data\cardio_base.csv'
cardio_data = pd.read_csv(file_path)

ii. Check top rows to ensure file loaded correctly

In [4]: # Display the first few rows


cardio_data.head()

Out[4]: id age gender height weight ap_hi ap_lo cholesterol smoke

0 0 18393 2 168 62.0 110 80 1 0

1 1 20228 1 156 85.0 140 90 3 0

2 2 18857 1 165 64.0 130 70 3 0

3 3 17623 2 169 82.0 150 100 1 0

4 4 17474 1 156 56.0 100 60 1 0

iii. Format columns, transform data

In [5]: # Convert age from days to years and round down


cardio_data['age_years'] = (cardio_data['age'] / 365).astype(int)
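
As a quick sanity check of the conversion, the sketch below applies the same rounding-down idea to the five ages (in days) shown in the Out[4] table. It uses integer division, which gives the same result as (age / 365).astype(int) for non-negative ages; dividing by 365 ignores leap years, so ages close to a birthday may be off by one.

```python
import pandas as pd

# Ages in days copied from the first five rows of the Out[4] table.
ages_days = pd.Series([18393, 20228, 18857, 17623, 17474])

# Floor division by 365 drops the fractional year, matching
# (age / 365).astype(int) for non-negative values.
ages_years = ages_days // 365

print(ages_years.tolist())
```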

In [7]: # Group by age in years and calculate the mean weight


age_weight_group = cardio_data.groupby('age_years')['weight'].mean()

Question 1: Find the Age Groups with Highest and Lowest Average Weights

In [8]: # Find age group with the highest and lowest average weight
max_weight_age_group = age_weight_group.idxmax()
min_weight_age_group = age_weight_group.idxmin()

# Find the corresponding average weights


max_avg_weight = age_weight_group.max()
min_avg_weight = age_weight_group.min()


# Calculate the difference


weight_difference = max_avg_weight - min_avg_weight

print(f"The age group with the highest average weight is {max_weight_age_group} years.")
print(f"The age group with the lowest average weight is {min_weight_age_group} years.")
print(f"The difference in average weight is {weight_difference:.2f} kg.")

The age group with the highest average weight is 63 years.


The age group with the lowest average weight is 30 years.
The difference in average weight is 16.87 kg.

i. Calculate the difference in percentage

In [9]: # Calculate the percentage difference


percentage_difference = ((max_avg_weight - min_avg_weight) / min_avg_weight) * 100

# Display the result
print(f"The difference in average weight is {weight_difference:.2f} kg.")
print(f"The percentage difference between the highest and lowest average weight is {percentage_difference:.2f}%.")

The difference in average weight is 16.87 kg.


The percentage difference between the highest and lowest average weight is 28.60%.

Question 2: Do people over 50 have higher cholesterol levels than the rest?

In [10]: # Create a new column to classify people as over 50 or not
cardio_data['age_over_50'] = cardio_data['age_years'] > 50

# Group the data by this new column and calculate the mean cholesterol level
cholesterol_comparison = cardio_data.groupby('age_over_50')['cholesterol'].mean()

# Extract the cholesterol levels for both groups


cholesterol_over_50 = cholesterol_comparison[True]
cholesterol_under_or_equal_50 = cholesterol_comparison[False]

# Display the results


print(f"Average cholesterol for people over 50: {cholesterol_over_50:.2f}")
print(f"Average cholesterol for people 50 or younger: {cholesterol_under_or_equal_50:.2f}")

# Compare the two
if cholesterol_over_50 > cholesterol_under_or_equal_50:
    print("People over 50 have higher average cholesterol than those 50 or younger.")
else:
    print("People over 50 do not have higher average cholesterol than those 50 or younger.")

Average cholesterol for people over 50: 1.43


Average cholesterol for people 50 or younger: 1.25
People over 50 have higher average cholesterol than those 50 or younger.

-> Convert the difference to percentage

In [12]: # Create a new column to classify people as over 50 or not


cardio_data['age_over_50'] = cardio_data['age_years'] > 50

# Group the data by this new column and calculate the mean cholesterol level
cholesterol_comparison = cardio_data.groupby('age_over_50')['cholesterol'].mean()

# Extract the cholesterol levels for both groups


cholesterol_over_50 = cholesterol_comparison[True]
cholesterol_under_or_equal_50 = cholesterol_comparison[False]

# Calculate the percentage difference in cholesterol


cholesterol_percentage_difference = (cholesterol_over_50 - cholesterol_under_or_equal_50) / cholesterol_under_or_equal_50 * 100

# Display the results
print(f"Average cholesterol for people over 50: {cholesterol_over_50:.2f}")
print(f"Average cholesterol for people 50 or younger: {cholesterol_under_or_equal_50:.2f}")
print(f"The percentage difference in cholesterol between the two groups is {cholesterol_percentage_difference:.2f}%")

# Compare the two
if cholesterol_over_50 > cholesterol_under_or_equal_50:
    print("People over 50 have higher average cholesterol than those 50 or younger.")
else:
    print("People over 50 do not have higher average cholesterol than those 50 or younger.")

Average cholesterol for people over 50: 1.43


Average cholesterol for people 50 or younger: 1.25
The percentage difference in cholesterol between the two groups is 14.69%
People over 50 have higher average cholesterol than those 50 or younger.

Question 3: Are men more likely to smoke than women based on gender IDs?

In [15]: # Load the dataset (ensure the correct path)
import pandas as pd

cardio_data = pd.read_csv(r'C:\Users\JOHN M\Downloads\Cardio data\cardio_base.cs

# Group by gender and calculate the percentage of smokers


smoking_comparison = cardio_data.groupby('gender')['smoke'].mean() * 100

# Display results
print(f"Percentage of smokers among men: {smoking_comparison.get(1, 0):.2f}%")
print(f"Percentage of smokers among women: {smoking_comparison.get(2, 0):.2f}%")

# Compare the two


if smoking_comparison.get(1, 0) > smoking_comparison.get(2, 0):
    print("Men are more likely to be smokers than women.")
else:
    print("Women are more likely to be smokers than men.")

Percentage of smokers among men: 1.79%


Percentage of smokers among women: 21.89%
Women are more likely to be smokers than men.
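
The cell above treats gender ID 1 as men and ID 2 as women, but nothing in the data shown states which ID is which. One heuristic check, sketched here on hypothetical toy rows (the heights reuse the Out[4] rows plus one made-up row, and the smoke values are invented), is to compare the mean height per ID, on the assumption that men are taller on average. The name likely_men_id is illustrative only.

```python
import pandas as pd

# Hypothetical toy rows; the real check would run on cardio_data itself.
toy = pd.DataFrame({
    "gender": [1, 1, 1, 2, 2, 2],
    "height": [156, 165, 156, 168, 169, 172],
    "smoke":  [0, 0, 0, 1, 0, 1],
})

# Mean height and smoking rate (in percent) per gender ID.
summary = toy.groupby("gender").agg(
    mean_height=("height", "mean"),
    smoke_pct=("smoke", lambda s: s.mean() * 100),
)
print(summary)

# The ID with the larger mean height is the likelier candidate for "men".
# This is a heuristic, not something the dataset itself states.
likely_men_id = summary["mean_height"].idxmax()
print(likely_men_id)
```

If the ID picked this way differs from the one labelled "men" in the print statements, the two labels in the output would need to be swapped.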

Question 4: How tall are the tallest 1% of people?


In [16]: # Load the dataset (ensure the correct path)
cardio_data = pd.read_csv(r'C:\Users\JOHN M\Downloads\Cardio data\cardio_base.cs

# Calculate the height threshold for the tallest 1%


height_threshold = cardio_data['height'].quantile(0.99)

# Display the result


print(f"The height of the tallest 1% of people is: {height_threshold:.2f} cm")


The height of the tallest 1% of people is: 184.00 cm
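
The value returned by quantile(0.99) is a cut-off: roughly 1% of the heights lie at or above it. A minimal sketch on hypothetical heights (100 made-up values, not the dataset) shows how the threshold relates to the rows it selects:

```python
import pandas as pd

# Hypothetical heights in cm: 101, 102, ..., 200 (100 values).
heights = pd.Series(range(101, 201))

# 99th percentile, using pandas' default linear interpolation.
threshold = heights.quantile(0.99)

# Rows at or above the threshold form the tallest ~1%.
tallest = heights[heights >= threshold]
share_pct = len(tallest) / len(heights) * 100

print(threshold, tallest.tolist(), share_pct)
```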

Question 5: Which 2 features have the highest Spearman rank correlation?

In [18]: # Calculate the Spearman correlation matrix
spearman_corr = cardio_data.corr(method='spearman')

# Stack the correlation matrix to find pairs of features and their correlation values
correlation_pairs = spearman_corr.stack()

# Remove self-correlations
correlation_pairs = correlation_pairs[correlation_pairs.index.get_level_values(0) != correlation_pairs.index.get_level_values(1)]

# Get the maximum correlation and its corresponding feature pair
max_corr = correlation_pairs.idxmax() # Get the index of the max correlation
max_value = correlation_pairs.max() # Get the max correlation value

# Display the results
print(f"The two features with the highest Spearman rank correlation are: {max_corr[0]} and {max_corr[1]} with a correlation of {max_value:.2f}")

The two features with the highest Spearman rank correlation are: ap_hi and ap_lo with a correlation of 0.74
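
A correlation matrix is symmetric, so every pair appears twice in the stacked result. An alternative, sketched below on hypothetical toy values (only the column names come from the dataset), keeps just the entries above the diagonal, which removes both the self-correlations and the mirrored duplicates in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the cardio columns.
toy = pd.DataFrame({
    "ap_hi":  [110, 140, 130, 150, 100, 120],
    "ap_lo":  [80, 90, 70, 100, 60, 85],
    "weight": [62.0, 85.0, 64.0, 82.0, 56.0, 90.0],
})

corr = toy.corr(method="spearman")

# True strictly above the diagonal, False elsewhere.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN and are dropped, leaving one entry per pair.
pairs = corr.where(mask).stack().dropna()

best_pair = pairs.idxmax()
best_value = pairs.max()
print(best_pair, round(best_value, 2))
```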

Question 6: What percentage of people are more than 2 standard deviations away from the average height?

In [19]: # Calculate the mean and standard deviation of height
mean_height = cardio_data['height'].mean()
std_dev_height = cardio_data['height'].std()

# Calculate the thresholds for more than 2 standard deviations from the mean
lower_threshold = mean_height - 2 * std_dev_height
upper_threshold = mean_height + 2 * std_dev_height

# Count the number of people outside these thresholds


count_outside = cardio_data[(cardio_data['height'] < lower_threshold) | (cardio_data['height'] > upper_threshold)].shape[0]

# Calculate the total number of people


total_count = cardio_data.shape[0]

# Calculate the percentage


percentage_outside = (count_outside / total_count) * 100

# Display the result


print(f"Percentage of people more than 2 standard deviations away from the average height: {percentage_outside:.2f}%")

Percentage of people more than 2 standard deviations away from the average height: 3.34%
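
The same calculation can be written more compactly with z-scores. This is a sketch on hypothetical heights (not the dataset); it relies on the fact that the mean of a boolean mask is the fraction of True values, and on pandas' std() using the sample standard deviation, as in the cell above.

```python
import pandas as pd

# Hypothetical heights in cm, not taken from the dataset.
heights = pd.Series([150, 160, 162, 165, 165, 168, 170, 172, 175, 250])

# Distance from the mean, measured in sample standard deviations.
z_scores = (heights - heights.mean()) / heights.std()

# Fraction of values more than 2 standard deviations away, as a percentage.
pct_outside = (z_scores.abs() > 2).mean() * 100

print(f"{pct_outside:.2f}%")
```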

Question 7: What percentage of the population over 50 years old consumes alcohol?
Also use the cardio alco.csv and merge the datasets on ID. Ignore those
persons for whom we have no alcohol consumption information!


In [34]: # Load the cardio data


cardio_data = pd.read_csv(r'C:\Users\JOHN M\Downloads\Cardio data\cardio_base.cs

# Load the alcohol data using the correct separator


cardio_alco_data = pd.read_csv(r'C:\Users\JOHN M\Downloads\Cardio data\cardio_al

# Display the first few rows and columns to confirm it loaded correctly
print("Cardio Data Head:")
print(cardio_data.head())
print("Cardio Alco Data Head:")
print(cardio_alco_data.head())

# Rename the column to 'alco_id'


cardio_alco_data.rename(columns={'id': 'alco_id'}, inplace=True)

# Check the columns of both DataFrames to confirm the names


print("Cardio Data Columns:")
print(cardio_data.columns)
print("Cardio Alco Data Columns:")
print(cardio_alco_data.columns)

# Merge the datasets on 'id' and 'alco_id'


merged_data = pd.merge(cardio_data, cardio_alco_data, left_on='id', right_on='alco_id', how='inner')

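# Filter for individuals over 50 years old.
# 'age' is stored in days, so convert it to whole years before comparing with 50.
# (Comparing the raw day counts with 50 keeps every row, which is what the
# 5.34% printed below reflects.)
over_50_data = merged_data[(merged_data['age'] / 365).astype(int) > 50]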

# Calculate the percentage of individuals who consume alcohol (alco = 1)


alcohol_consumption_percentage = (over_50_data['alco'] == 1).mean() * 100

# Display the result


print(f"Percentage of people over 50 who consume alcohol: {alcohol_consumption_p

Cardio Data Head:


id age gender height weight ap_hi ap_lo cholesterol smoke
0 0 18393 2 168 62.0 110 80 1 0
1 1 20228 1 156 85.0 140 90 3 0
2 2 18857 1 165 64.0 130 70 3 0
3 3 17623 2 169 82.0 150 100 1 0
4 4 17474 1 156 56.0 100 60 1 0
Cardio Alco Data Head:
id alco
0 44 0
1 45 0
2 46 0
3 47 0
4 49 0
Cardio Data Columns:
Index(['id', 'age', 'gender', 'height', 'weight', 'ap_hi', 'ap_lo',
'cholesterol', 'smoke'],
dtype='object')
Cardio Alco Data Columns:
Index(['alco_id', 'alco'], dtype='object')
Percentage of people over 50 who consume alcohol: 5.34%
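
Two details matter for this question: an inner join is what drops people without alcohol information, and age in cardio_base.csv is stored in days (see the Out[4] table), so it has to be converted before it is compared with 50. A minimal sketch on hypothetical toy frames (the ids, ages, and alco values are invented):

```python
import pandas as pd

# Hypothetical toy frames; the column names follow the cell above.
base = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [18393, 20228, 14000, 21000],  # age in days
})
alco = pd.DataFrame({"id": [1, 2, 3], "alco": [1, 0, 0]})

# An inner join keeps only ids present in both frames, so the person
# without alcohol information (id 0 here) is dropped.
merged = base.merge(alco, on="id", how="inner")

# Convert days to whole years before comparing with 50.
over_50 = merged[merged["age"] // 365 > 50]

pct_alco = (over_50["alco"] == 1).mean() * 100
print(f"{pct_alco:.2f}%")
```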

Question 8: Which of the following is true with 95% confidence?

1. Smokers have higher cholesterol level than non smokers

2. Smokers weigh less than non smokers

3. Men have higher blood pressure than women

4. Smokers have higher blood pressure than non-smokers

To determine which of the statements is true with 95% confidence, I will conduct statistical tests (like
t-tests or ANOVA) on the relevant data to compare the groups for each statement. Below is an outline of
how I will approach verifying these statements statistically:

(i) Smokers have higher cholesterol levels than non-smokers:


Test: Compare the mean cholesterol levels between smokers and non-smokers using a t-test.

Hypothesis: Null hypothesis (H0): There is no difference in cholesterol levels. Alternative hypothesis
(H1): Smokers have higher cholesterol levels.

Confidence Interval: Calculate the 95% confidence interval for the difference in means.

(ii) Smokers weigh less than non-smokers:


Test: Compare the mean weights between smokers and non-smokers using a t-test.

Hypothesis: H0: There is no difference in weight. H1: Smokers weigh less than non-smokers.

Confidence Interval: Calculate the 95% confidence interval for the difference in means.

(iii) Men have higher blood pressure than women:


Test: Compare the mean blood pressure readings between men and women using a t-test.

Hypothesis: H0: There is no difference in blood pressure. H1: Men have higher blood pressure than
women.

Confidence Interval: Calculate the 95% confidence interval for the difference in means.

(iv) Smokers have higher blood pressure than non-smokers:


Test: Compare the mean blood pressure readings between smokers and non-smokers using a t-test.

Hypothesis: H0: There is no difference in blood pressure. H1: Smokers have higher blood pressure than
non-smokers.

Confidence Interval: Calculate the 95% confidence interval for the difference in means.

In [36]: import pandas as pd


import scipy.stats as stats

# Load the dataset


data = pd.read_csv(r'C:\Users\JOHN M\Downloads\Cardio data\cardio_base.csv')

# Split the rows into smokers and non-smokers


smokers = data[data['smoke'] == 1]
non_smokers = data[data['smoke'] == 0]

# Hypothesis 1: Smokers have higher cholesterol level than non-smokers


smokers_cholesterol = smokers['cholesterol']
non_smokers_cholesterol = non_smokers['cholesterol']
t_stat_chol, p_value_chol = stats.ttest_ind(smokers_cholesterol, non_smokers_cholesterol)
confidence_chol = p_value_chol < 0.05

# Hypothesis 2: Smokers weigh less than non-smokers


smokers_weight = smokers['weight']
non_smokers_weight = non_smokers['weight']
t_stat_weight, p_value_weight = stats.ttest_ind(smokers_weight, non_smokers_weight)
confidence_weight = p_value_weight < 0.05

# Hypothesis 3: Men have higher blood pressure than women


men_bp = data[data['gender'] == 1]['ap_hi']
women_bp = data[data['gender'] == 2]['ap_hi']
t_stat_bp, p_value_bp = stats.ttest_ind(men_bp, women_bp)
confidence_bp = p_value_bp < 0.05

# Hypothesis 4: Smokers have higher blood pressure than non-smokers


smokers_bp = smokers['ap_hi']
non_smokers_bp = non_smokers['ap_hi']
t_stat_smoker_bp, p_value_smoker_bp = stats.ttest_ind(smokers_bp, non_smokers_bp)
confidence_smoker_bp = p_value_smoker_bp < 0.05

# Output the results


print("Hypothesis 1: Smokers have higher cholesterol level than non-smokers:", confidence_chol)
print("Hypothesis 2: Smokers weigh less than non-smokers:", confidence_weight)
print("Hypothesis 3: Men have higher blood pressure than women:", confidence_bp)
print("Hypothesis 4: Smokers have higher blood pressure than non-smokers:", confidence_smoker_bp)

Hypothesis 1: Smokers have higher cholesterol level than non-smokers: True


Hypothesis 2: Smokers weigh less than non-smokers: True
Hypothesis 3: Men have higher blood pressure than women: False
Hypothesis 4: Smokers have higher blood pressure than non-smokers: False
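
By default stats.ttest_ind returns a two-sided p-value, so p < 0.05 on its own indicates that the two means differ, not which group is higher. Since each statement above is directional, a one-sided test may be closer to what is being asked. The sketch below uses synthetic samples (not the cardio data) and assumes SciPy 1.6 or later, where the alternative parameter is available; equal_var=False selects Welch's t-test, a choice made here rather than one taken from the original cell.

```python
import numpy as np
from scipy import stats

# Synthetic samples (not the cardio data): group_a is drawn with a higher mean.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=130.0, scale=15.0, size=500)
group_b = rng.normal(loc=125.0, scale=15.0, size=500)

# alternative="greater" tests H1: mean(group_a) > mean(group_b).
# equal_var=False selects Welch's t-test (no equal-variance assumption).
t_stat, p_one_sided = stats.ttest_ind(
    group_a, group_b, equal_var=False, alternative="greater"
)

# The reversed direction should not be supported by the same data.
_, p_reversed = stats.ttest_ind(
    group_a, group_b, equal_var=False, alternative="less"
)

print(p_one_sided < 0.05, p_reversed < 0.05)
```

Applied to the four statements, the direction of each alternative would follow the wording of the statement, for example "less" for "smokers weigh less than non-smokers".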

Additional Dataset: Covid_data.csv (contains covid infection data)

Question 9: When did the difference in the total number of confirmed cases between Italy and Germany become more than 10 000?

In [37]: # Load the COVID-19 dataset
# Update the path to your dataset file
covid_data = pd.read_csv(r'C:\Users\JOHN M\Downloads\Cardio data\covid_data.csv'

# Display the first few rows to confirm it loaded correctly


print(covid_data.head())

location date new_cases new_deaths population \


0 Afghanistan 2019-12-31 0 0 38928341.0
1 Afghanistan 2020-01-01 0 0 38928341.0
2 Afghanistan 2020-01-02 0 0 38928341.0
3 Afghanistan 2020-01-03 0 0 38928341.0
4 Afghanistan 2020-01-04 0 0 38928341.0

aged_65_older_percent gdp_per_capita hospital_beds_per_thousand


0 2.581 1803.987 0.5
1 2.581 1803.987 0.5
2 2.581 1803.987 0.5
3 2.581 1803.987 0.5
4 2.581 1803.987 0.5

In [41]: # Filter for Italy and Germany


italy_data = covid_data[covid_data['location'] == 'Italy'].copy()
germany_data = covid_data[covid_data['location'] == 'Germany'].copy()

# Convert the date column to datetime format


italy_data['date'] = pd.to_datetime(italy_data['date'])
germany_data['date'] = pd.to_datetime(germany_data['date'])


# Set the date as the index


italy_data.set_index('date', inplace=True)
germany_data.set_index('date', inplace=True)

# Calculate cumulative cases for Italy


italy_data['total_cases'] = italy_data['new_cases'].cumsum()

# Calculate cumulative cases for Germany


germany_data['total_cases'] = germany_data['new_cases'].cumsum()

# Merge the two datasets on the date index


merged_data = pd.merge(italy_data[['total_cases']], germany_data[['total_cases']],
                       left_index=True, right_index=True, suffixes=('_Italy', '_Germany'))

# Calculate the difference in total confirmed cases


merged_data['case_difference'] = merged_data['total_cases_Italy'] - merged_data['total_cases_Germany']

# Find the first date where the difference exceeded 10,000


date_exceeding_10000 = merged_data[merged_data['case_difference'] > 10000].index.min()

# Display the result (index.min() returns NaT when no row qualifies)
if pd.notna(date_exceeding_10000):
    print(f"The difference in the total number of confirmed cases between Italy and Germany became more than 10,000 on: {date_exceeding_10000.date()}")
else:
    print("The difference never exceeded 10,000 cases.")

The difference in the total number of confirmed cases between Italy and Germany became more than 10,000 on: 2020-03-12
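
The same comparison can be done without building two separate frames by pivoting the data so that each country becomes a column. This is a sketch on hypothetical daily counts (the numbers are invented; only the column names follow covid_data.csv), and first_date is an illustrative name.

```python
import pandas as pd

# Hypothetical daily counts; only the column names follow covid_data.csv.
toy = pd.DataFrame({
    "location": ["Italy"] * 3 + ["Germany"] * 3,
    "date": pd.to_datetime(["2020-03-10", "2020-03-11", "2020-03-12"] * 2),
    "new_cases": [6000, 5000, 4000, 1000, 1500, 2000],
})

# One row per date, one column per country, then running totals per column.
totals = toy.pivot(index="date", columns="location", values="new_cases").cumsum()

# Difference in cumulative cases, Italy minus Germany.
diff = totals["Italy"] - totals["Germany"]

# First date on which the difference exceeds 10,000, or None if it never does.
exceeding = diff[diff > 10000]
first_date = exceeding.index[0] if not exceeding.empty else None

print(first_date)
```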

Question 10: Look at the cumulative number of confirmed
cases in Italy between 2020-02-28 and 2020-03-20. Fit an
exponential function (Ae^(Bx)) to this set to express
cumulative cases as a function of days passed, by
minimizing squared loss.
What is the difference between the exponential curve and the total
number of real cases on 2020-03-20?

In [42]: import matplotlib.pyplot as plt


from scipy.optimize import curve_fit

# Filter for Italy and the specific date range


italy_data = covid_data[covid_data['location'] == 'Italy']
italy_data['date'] = pd.to_datetime(italy_data['date'])
italy_data = italy_data[(italy_data['date'] >= '2020-02-28') & (italy_data['date'] <= '2020-03-20')]

# Calculate cumulative cases


italy_data['total_cases'] = italy_data['new_cases'].cumsum()

# Prepare x and y for curve fitting


x_data = (italy_data['date'] - italy_data['date'].min()).dt.days.values # Days
y_data = italy_data['total_cases'].values

# Define the exponential function


def exponential_func(x, A, B):
    return A * np.exp(B * x)

# Fit the exponential curve


params, covariance = curve_fit(exponential_func, x_data, y_data)

# Calculate the predicted cumulative cases for 2020-03-20 (x = days from start)
days_from_start = (pd.to_datetime('2020-03-20') - italy_data['date'].min()).days
predicted_cases = exponential_func(days_from_start, *params)

# Get the actual cases on 2020-03-20


actual_cases_on_2020_03_20 = italy_data[italy_data['date'] == '2020-03-20']['total_cases'].iloc[0]

# Calculate the difference


difference = predicted_cases - actual_cases_on_2020_03_20

# Output the results


print(f"Predicted cumulative cases on 2020-03-20: {predicted_cases:.2f}")
print(f"Actual cumulative cases on 2020-03-20: {actual_cases_on_2020_03_20}")
print(f"Difference: {difference:.2f}")

# Plotting the data and the fitted curve for visualization


plt.figure(figsize=(10, 6))
plt.scatter(x_data, y_data, label='Actual Cases', color='blue')
plt.plot(x_data, exponential_func(x_data, *params), label='Fitted Exponential Curve')
plt.title('Cumulative COVID-19 Cases in Italy (Exponential Fit)')
plt.xlabel('Days since 2020-02-28')
plt.ylabel('Cumulative Cases')
plt.xticks(ticks=np.arange(0, max(x_data)+1, 2), labels=[f'{(italy_data["date"].
plt.legend()
plt.grid()
plt.tight_layout()
plt.show()

C:\Users\JOHN M\AppData\Local\Temp\ipykernel_13712\2161348383.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  italy_data['date'] = pd.to_datetime(italy_data['date'])
Predicted cumulative cases on 2020-03-20: 42346.69
Actual cumulative cases on 2020-03-20: 40635
Difference: 1711.69
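
In the cell above, cumsum() runs after the rows have been restricted to the date window, so total_cases counts only cases reported from 2020-02-28 onward. Whether that is the intended "cumulative number of confirmed cases" depends on how the question is read. The sketch below, on hypothetical daily counts (not values from covid_data.csv), shows how the order of the two steps changes the totals.

```python
import pandas as pd

# Hypothetical daily counts, not taken from covid_data.csv.
daily = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-02-26", "2020-02-27", "2020-02-28", "2020-02-29"]
    ),
    "new_cases": [100, 300, 250, 240],
})
in_window = daily["date"] >= "2020-02-28"

# Cumulate first, then filter: totals include cases before the window.
total_then_filter = daily["new_cases"].cumsum()[in_window]

# Filter first, then cumulate: counting restarts at the window's first day.
filter_then_total = daily.loc[in_window, "new_cases"].cumsum()

print(total_then_filter.tolist())
print(filter_then_total.tolist())
```

The two series differ by a constant (the cases reported before the window), which shifts both the fitted parameters and the "actual" value the curve is compared against.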
