assigment_3_EDA

The document presents an exploratory data analysis assignment focused on the Obesity Classification dataset, detailing univariate statistics for various features including age, height, weight, and BMI. Key statistics such as mean, median, mode, standard deviation, and skewness are provided for each variable, along with histograms and stem-and-leaf plots for age and BMI distributions. The analysis highlights similarities and differences in data distribution, noting peaks in age groups and a bimodal distribution in BMI.

Uploaded by

Ikram

A231

STIS3033-EXPLORATORY DATA ANALYSIS

ASSIGNMENT 3

PREPARED FOR: PROF FARHAN BIN MOHD HOSSEIN

NAME                                     MATRIC NO.
MUHAMMAD LUQMAN HAKIM BIN ABDUL ANNAS    291590
MUHAMMAD IKRAM BIN MD ASRAN              291666

1. Univariate Analysis
a. Perform descriptive statistics for all variables.

The descriptive statistics for all variables in the Obesity Classification dataset were
produced in Excel. Below is the analysis for each feature in the dataset:

Variable: id

Mean (56.05): The average value of the "id" variable is 56.05, which indicates the central
tendency of the dataset.

Standard Error (3.07): This measures the accuracy with which the sample mean represents
the population mean. A standard error of 3.07 suggests moderate variability in the sample mean.

Median (56.5): The median value is 56.5, which indicates the middle value of the dataset
when arranged in ascending order. It is very close to the mean, suggesting a relatively
symmetrical distribution.

Mode (1): The mode is 1, meaning that 1 is the most frequently occurring value in the
dataset. A mode is not an outlier; because "id" is an identifier that should be unique, a
repeated value may point to duplicated records and is worth checking.

Standard Deviation (31.92): This measures the spread of the dataset around the mean. A
standard deviation of 31.92 indicates substantial variability in the data.

Sample Variance (1018.75): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 1018.75 reinforces the substantial variability in the
data.

Kurtosis (-1.19): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-1.19 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers than
a normal distribution.

Skewness (-0.03): Skewness measures the asymmetry of the data distribution. A skewness
close to 0 (-0.03) suggests the data is symmetrical.

Range (109): The range calculated as the difference between the maximum and minimum
values (110 - 1), is 109. This indicates the spread of the dataset.

Minimum (1): The smallest value in the dataset is 1.

Maximum (110): The largest value in the dataset is 110.

Sum (6053): The sum of all values in the dataset is 6053.

Count (108): The dataset contains 108 observations.
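
The statistics above are linked by simple identities, which gives a quick way to sanity-check the Excel output. The sketch below only plugs in the values reported for "id"; the variable names are chosen here for illustration and are not part of the assignment.

```python
import math

# Values reported above for the "id" variable
count = 108
total = 6053
std_dev = 31.92
minimum, maximum = 1, 110

mean = total / count                      # mean = sum / count
std_error = std_dev / math.sqrt(count)    # standard error = s / sqrt(n)
variance = std_dev ** 2                   # sample variance = s squared
value_range = maximum - minimum           # range = max - min

print(f"mean={mean:.2f}, SE={std_error:.2f}, variance={variance:.2f}, range={value_range}")
```

The variance computed this way differs slightly from the reported 1018.75 because it starts from the standard deviation already rounded to two decimals.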

Variable: Age

Mean (46.56): The average age in the dataset is approximately 46.56 years, indicating the
central tendency of the age distribution.

Standard Error (2.38): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.38 suggests moderate variability in the sample mean.

Median (42.5): The median age is 42.5 years, which is the middle value of the dataset when
arranged in ascending order. It is slightly lower than the mean, suggesting a slight right skew in
the data.

Mode (16): The most frequently occurring age in the dataset is 16 years. This indicates a peak
or a high frequency at this particular age.

Standard Deviation (24.72): This measures the spread of the ages around the mean. A
standard deviation of 24.72 years indicates substantial variability in the ages.

Sample Variance (611.11): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 611.11 reinforces the substantial variability in the
age data.

Kurtosis (0.05): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of 0.05
indicates a distribution very close to normal in terms of tailedness.

Skewness (0.82): Skewness measures the asymmetry of the data distribution. A skewness of
0.82 indicates a moderate right skew: most ages sit toward the lower end of the scale, and a
longer tail stretches toward the higher ages.

Range (101): The range, calculated as the difference between the maximum and minimum
values (112 - 11), is 101 years. This indicates a wide spread of ages in the dataset.

Minimum (11): The youngest age in the dataset is 11 years.

Maximum (112): The oldest age in the dataset is 112 years.

Sum (5028): The sum of all ages in the dataset is 5028 years.

Count (108): The dataset contains 108 observations.


Variable: Height

Mean (166.57): The average height in the dataset is approximately 166.57 cm, indicating the
central tendency of the height distribution.

Standard Error (2.68): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.68 suggests moderate variability in the sample mean.

Median (175): The median height is 175 cm, which is the middle value of the dataset when
arranged in ascending order. It is higher than the mean, suggesting a slight left skew in the data.

Mode (160): The most frequently occurring height in the dataset is 160 cm. This indicates a
peak or a high frequency at this particular height.

Standard Deviation (27.87): This measures the spread of the heights around the mean. A
standard deviation of 27.87 cm indicates substantial variability in the heights.

Sample Variance (776.94): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 776.94 reinforces the substantial variability in the
height data.

Kurtosis (-1.15): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-1.15 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers than
a normal distribution.

Skewness (-0.11): Skewness measures the asymmetry of the data distribution. A skewness of
-0.11 suggests the data is fairly symmetrical with a slight left skew.

Range (90): The range, calculated as the difference between the maximum and minimum
values (210 - 120), is 90 cm. This indicates a wide spread of heights in the dataset.

Minimum (120): The shortest height in the dataset is 120 cm.

Maximum (210): The tallest height in the dataset is 210 cm.

Sum (17990): The sum of all heights in the dataset is 17990 cm.

Count (108): The dataset contains 108 observations.

Variable: Weight

Mean (59.49): The average weight in the dataset is approximately 59.49 kg, indicating the
central tendency of the weight distribution.

Standard Error (2.78): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.78 suggests moderate variability in the sample mean.

Median (55): The median weight is 55 kg, which is the middle value of the dataset when
arranged in ascending order. It is slightly lower than the mean, suggesting a right skew in the
data.

Mode (75): The most frequently occurring weight in the dataset is 75 kg. This indicates a
peak or a high frequency at this particular weight.

Standard Deviation (28.86): This measures the spread of the weights around the mean. A
standard deviation of 28.86 kg indicates substantial variability in the weights.

Sample Variance (832.68): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 832.68 reinforces the substantial variability in the
weight data.

Kurtosis (-0.96): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-0.96 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers than
a normal distribution.

Skewness (0.20): Skewness measures the asymmetry of the data distribution. A skewness of
0.20 suggests the data is fairly symmetrical with a slight right skew.

Range (110): The range, calculated as the difference between the maximum and minimum
values (120 - 10), is 110 kg. This indicates a wide spread of weights in the dataset.

Minimum (10): The lowest weight in the dataset is 10 kg.

Maximum (120): The highest weight in the dataset is 120 kg.

Sum (6425): The sum of all weights in the dataset is 6425 kg.

Count (108): The dataset contains 108 observations.


Variable: BMI

Mean (20.55): The average BMI in the dataset is approximately 20.55, indicating the central
tendency of the BMI distribution.

Standard Error (0.73): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 0.73 suggests low variability in the sample mean.

Median (21.2): The median BMI is 21.2, which is the middle value of the dataset when
arranged in ascending order. It is slightly higher than the mean, suggesting a slight left skew in
the data.

Mode (16.7): The most frequently occurring BMI in the dataset is 16.7. This indicates a peak
or a high frequency at this particular BMI value.

Standard Deviation (7.58): This measures the spread of the BMIs around the mean. A
standard deviation of 7.58 indicates moderate variability in the BMIs.

Sample Variance (57.51): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 57.51 reinforces the moderate variability in the
BMI data.

Kurtosis (-0.36): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-0.36 indicates a mildly platykurtic distribution, meaning the data has slightly lighter tails and
fewer outliers than a normal distribution.

Skewness (-0.28): Skewness measures the asymmetry of the data distribution. A skewness of
-0.28 suggests the data is fairly symmetrical with a slight left skew.

Range (33.3): The range, calculated as the difference between the maximum and minimum
values (37.2 - 3.9), is 33.3. This indicates a wide spread of BMIs in the dataset.

Minimum (3.9): The lowest BMI in the dataset is 3.9.

Maximum (37.2): The highest BMI in the dataset is 37.2.

Sum (2219): The sum of all BMIs in the dataset is 2219.

Count (108): The dataset contains 108 observations.
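
For readers who want to reproduce these figures outside Excel, the following is a minimal sketch using only the Python standard library. The function name `describe` and the sample list are illustrative assumptions. The skewness and kurtosis lines use the bias-corrected sample formulas that, to the best of my knowledge, match Excel's SKEW and KURT functions, so kurtosis here is excess kurtosis (0 for a normal distribution). At least four observations with some variation are needed.

```python
import math
import statistics


def describe(values):
    """Return Excel-style descriptive statistics for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)          # sample standard deviation (n - 1)
    z = [(x - mean) / s for x in values]  # standardized values

    # Bias-corrected sample skewness and excess kurtosis
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

    return {
        "mean": mean,
        "standard_error": s / math.sqrt(n),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std_dev": s,
        "variance": statistics.variance(values),
        "kurtosis": kurt,
        "skewness": skew,
        "range": max(values) - min(values),
        "min": min(values),
        "max": max(values),
        "sum": sum(values),
        "count": n,
    }


# Small made-up sample, not taken from the dataset
sample = [2, 4, 4, 5, 7, 9]
for name, value in describe(sample).items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Passing a column of the dataset as a list, for example the Age values, should give figures comparable to the ones listed above.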


b. Construct histograms and stem-and-leaf plots for the Age and BMI variables. Analyze insights,
similarities, and differences in data distribution from both analyses.

import matplotlib.pyplot as plt

# df is assumed to be the Obesity Classification dataset loaded as a pandas DataFrame

# One figure with two stacked panels: Age on top, BMI below
fig, axes = plt.subplots(2, 1, figsize=(10, 8))

# Histogram for Age
axes[0].hist(df['Age'], bins=15, color='blue', edgecolor='black')
axes[0].set_title('Histogram of Age')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')

# Histogram for BMI
axes[1].hist(df['BMI'], bins=15, color='green', edgecolor='black')
axes[1].set_title('Histogram of BMI')
axes[1].set_xlabel('BMI')
axes[1].set_ylabel('Frequency')

# Adjust layout so titles and axis labels do not overlap
plt.tight_layout()

# Show the plot
plt.show()
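
The histogram code uses bins=15, which splits the range of each variable into 15 equal-width intervals. The sketch below is a plain-Python illustration of that counting rule, assuming the usual convention that every bin is closed on the left and only the last bin includes its right edge. The function name `histogram_counts` and the sample values are made up for the example.

```python
def histogram_counts(values, bins):
    """Count values in equal-width bins spanning min(values) to max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right so the maximum is included
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


# Age spans 11 to 112 in the descriptive statistics, so 15 bins are about 6.7 years wide
age_bin_width = (112 - 11) / 15
print(f"Age bin width with 15 bins: {age_bin_width:.2f} years")

# Made-up values, only to show the counting
edges, counts = histogram_counts([11, 12, 20, 35, 36, 37, 80, 112], bins=5)
print([round(e, 1) for e in edges], counts)
```

Because the Age bins are roughly 6.7 years wide, they do not line up with the ten-year stems of a stem-and-leaf plot, so peaks can appear at slightly different places in the two displays.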

Stem-and-leaf plot for Age (key: 10 | 1 = 11 years):

 10 | 1234566778899
 20 | 0112233445566778899
 30 | 00112233455677889
 40 | 0012233455677889
 50 | 0012233455677889
 60 | 001234578
 70 | 02378
 80 | 2378
 90 | 2378
100 | 2378
110 | 2

Stem-and-leaf plot for BMI (key: 10 | 6.7 = 16.7):

 0 | 3.9 3.9 5.6 5.6 5.6 5.6 8.3 8.3 8.3 8.3
10 | 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.3 3.3 3.3 3.3 3.3 3.3 3.3 3.3 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 6.7 8.7 8.7 8.7 8.7 8.7
20 | 0.0 0.0 0.0 0.0 0.0 0.0 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 2.5 2.5 2.5 2.5 2.7 2.7 2.7 2.7 2.7 2.7 2.7 3.4 3.4 4.2 4.2 4.2 4.2 4.2 5.0 5.0 5.0 5.0 5.3 5.3 5.3 6.1 6.1 6.1 6.1 6.1 7.0 7.0 7.0 7.0 7.3 7.3 7.3 7.5 7.5 7.5 8.9 8.9 8.9 9.1
30 | 0.8 1.2 1.2 1.2 4.2 4.2 4.2 7.2 7.2
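
No code is given for the stem-and-leaf plots, so the following is one possible way to build them, offered as a sketch rather than the method actually used in the assignment. It handles whole numbers such as Age, takes the tens as the stem and the units digit as the leaf, and labels stems 1, 2, 3 instead of 10, 20, 30. The function name `stem_and_leaf` and the sample ages are made up for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return stem-and-leaf rows for whole numbers: stem = tens, leaf = units digit."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):   # keep empty stems visible
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


# Made-up ages for illustration, not the assignment data
ages = [11, 16, 16, 23, 25, 31, 38, 38, 42, 57, 60, 112]
print("\n".join(stem_and_leaf(ages)))
```

Keeping empty stems in the output preserves the shape of the distribution, which is what makes the plot comparable to a histogram.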

Histogram of the Age distribution:

• The histogram depicts the frequency distribution of ages.

• The younger to middle-aged groups (20-60 years) have the highest frequencies.

• The frequency decreases markedly for people above the age of 60.

Peaks in the Age histogram:

• Peaks occur around ages 20-30 and 50-60.

• The distribution is more even between 30 and 60.

• Ages above 70 occur with low frequency.

Stem-and-leaf plot of the Age distribution:

• The stem-and-leaf plot gives a detailed picture of individual age values.

• The data is organized into tens (e.g., 10-19, 20-29, 30-39).

• Each leaf represents a single data point within its tens group.

Peaks in the Age stem-and-leaf plot:

• Similar peaks are evident in the 20-29 and 50-59 age groups.

• The plot shows the actual age values, providing a detailed picture of the distribution.

• The concentration of values in the 20-29 and 50-59 age groups corresponds to the
histogram's peaks.

Similarities:
Peaks in Age Distribution:

• Both the histogram and the stem-and-leaf plot show higher frequencies in the 20-30 and
50-60 age groups.

• The distribution is less dense beyond 60 years in both representations.

General Trend:

• Both plots indicate a relatively even distribution of ages from 20 to 60, with fewer
individuals in older age groups.

Differences:
Level of Detail:

• The histogram provides a summary perspective by grouping age ranges. It is easy to see
overarching trends and patterns in age distribution.

• The stem-and-leaf plot provides a thorough view of individual age values within each
group. This graphic shows the specific ages and their frequencies.

Shape and distribution of BMI in the histogram:

• The histogram indicates a bimodal distribution with two peaks.

• The first peak occurs in the 5-10 BMI range.

• The second peak occurs in the 20-25 BMI range.

• Frequencies drop noticeably between BMI values of 10 and 15.

• The upper tail thins out above a BMI of 30. The reported skewness (-0.28), however,
indicates that the distribution as a whole leans slightly to the left rather than to the right.

Frequency:

• The highest frequency (14) is observed in the 25-30 BMI range.

• Lower frequencies are seen at the extremes of the BMI scale (below 10 and above 30).

Shape and distribution of BMI in the stem-and-leaf plot:

• The stem-and-leaf plot shows a detailed breakdown of specific BMI values.

• Similar to the histogram, the data reveals concentrations of values in the 5-10 and
20-25 BMI ranges.

• Repeated leaves within a stem show exactly where values cluster, for example the
mode of 16.7 in the 10 stem and the median of 21.2 in the 20 stem.

Detailed Frequency:

• The stem-and-leaf plot displays exact counts of individual BMI values, which
account for the histogram's peaks.

• For example, the 0 stem holds repeated values of 3.9, 5.6, and 8.3, which make up
the histogram's bars in the lower BMI range.

• Similarly, values in the 20 stem such as 22.5, 22.7, 23.4, and 24.2 correspond to
the greater frequencies in the 20-25 BMI range seen in the histogram.

Similarities
• Both the histogram and the stem-and-leaf plot show a bimodal distribution with
peaks at 5-10 and 20-25 BMI levels.

• Both show comparatively few values in the upper BMI levels (>30).

Differences
• The histogram gives a visual overview using bins, allowing you to immediately
determine the overall shape and distribution.

• The stem-and-leaf plot provides precise data points, allowing for accurate
identification of particular BMI values and frequencies.

• The histogram displays aggregated frequencies, whereas the stem-and-leaf plot shows
each value separately, so no detail is lost to binning.

Conclusion
The histograms and stem-and-leaf plots for BMI and Age complement each other. For BMI, both
plots show a bimodal distribution with peaks at 5-10 and 20-25, the highest single frequency in
the 25-30 range, and comparatively few values above 30. For Age, both plots show peaks around
the 20-30 and 50-60 age ranges, as well as a noticeable fall in frequency beyond 60. The
histogram provides a rapid visual summary of each distribution, while the stem-and-leaf plot
preserves the individual values and their exact counts, so together they give a fuller picture of
the dataset.

2. Multivariate Analysis
a. Construct cross-tabulation tables for the following variables:
BMI is first categorized into ranges to make the cross-tabulation more insightful. The BMI
categories used are:
• Underweight: BMI < 18.5
• Normal weight: 18.5 ≤ BMI < 25
• Overweight: 25 ≤ BMI < 30
• Obese: BMI ≥ 30
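
When BMI is cut into ranges, the ranges should be contiguous so that no value is left between two categories. The sketch below is one way to express that, assuming cut points at 18.5, 25, and 30 with each range closed on the left. The names `CUT_POINTS`, `LABELS`, and `bmi_category` are illustrative and do not come from the assignment.

```python
from bisect import bisect_right

# Upper-exclusive cut points: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
CUT_POINTS = [18.5, 25, 30]
LABELS = ["Underweight", "Normal Weight", "Overweight", "Obese"]


def bmi_category(bmi):
    """Map a BMI value to its category; every value falls in exactly one range."""
    return LABELS[bisect_right(CUT_POINTS, bmi)]


# Boundary values are the cases most likely to be misclassified
for bmi in (18.4, 18.5, 24.95, 25, 29.95, 30):
    print(bmi, bmi_category(bmi))
```

Checking the boundary values explicitly is a cheap way to confirm that a value such as 24.95 or 29.95 does not fall through to the wrong category.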

I. Gender and BMI

Gender    Underweight    Normal weight    Overweight    Obese
Female    24             20               8             0
Male      12             17               18            9

import pandas as pd

# Categorize BMI
# Upper bounds of 25 and 30 keep the ranges contiguous (no gap at 24.9-25 or 29.9-30)
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal Weight'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['BMI_Category'] = df['BMI'].apply(categorize_bmi)

# Perform cross-tabulation of Gender and BMI categories
cross_tab = pd.crosstab(df['Gender'], df['BMI_Category'], margins=True)

print(cross_tab)

• Female Distribution: A higher proportion of females are underweight, while fewer are overweight
and none are obese.

• Male Distribution: More males fall into the overweight and obese categories compared to
females, with a relatively balanced distribution across the other BMI categories.

II. Age and Label


Label         Normal Weight  Obese  Overweight  Underweight
Age_Category
18-30                     9      0           1           15
31-45                     8      1           5           11
46-60                     4      6           6            8
<18                       2      0           0            7
>60                       6      5           8            6

# Categorize Age
def categorize_age(age):
    if age < 18:
        return '<18'
    elif age <= 30:
        return '18-30'
    elif age <= 45:
        return '31-45'
    elif age <= 60:
        return '46-60'
    else:
        return '>60'

df['Age_Category'] = df['Age'].apply(categorize_age)

# Convert the Label column to numeric values


label_mapping = {
'Underweight': 1,
'Normal Weight': 2,
'Overweight': 3,
'Obese': 4
}
df['Label_numeric'] = df['Label'].map(label_mapping)

# Perform cross-tabulation of Age category and Label


cross_tab = pd.crosstab(df['Age_Category'], df['Label'])
print(cross_tab)

• Younger age groups (especially 18-30) tend to have higher proportions of underweight
individuals.

• Middle-aged groups (31-45 and 46-60) show a more balanced distribution across normal weight,
overweight, and obese categories.

• Older age groups (>60) have higher proportions of overweight and obese individuals.

• The <18 age group is mostly underweight, with very few individuals in the
other categories.

III. Age and BMI


BMI Category  Normal Weight  Obese  Overweight  Underweight
Age Category
18-30                    10      0           3           12
31-45                    13      1           5            6
46-60                     4      4           8            8
<18                       3      0           0            6
>60                       7      4          10            4

# Function to categorize BMI (contiguous ranges: <18.5, <25, <30, 30 and above)
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal Weight'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

# Function to categorize age
def categorize_age(age):
    if age < 18:
        return '<18'
    elif age <= 30:
        return '18-30'
    elif age <= 45:
        return '31-45'
    elif age <= 60:
        return '46-60'
    else:
        return '>60'

# Apply the functions to create new columns


df['BMI Category'] = df['BMI'].apply(categorize_bmi)
df['Age Category'] = df['Age'].apply(categorize_age)

# Create a crosstab to summarize the data


crosstab = pd.crosstab(df['Age Category'], df['BMI Category'])

# Display the crosstab


print(crosstab)

• Younger age groups (particularly 18-30) have a larger prevalence of underweight people.

• Middle-aged groups (31-45 and 46-60) have a more even mix of normal weight,
overweight, and obese categories.

• Individuals over the age of 60 are more likely to be overweight or obese, indicating
possible health hazards connected with aging.

• The <18 age group consists of underweight and normal weight individuals only;
none are overweight or obese.
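
Statements about proportions are easier to verify when each row of counts is converted to percentages of its own total. The sketch below does this for the Age and BMI cross-tabulation using the counts printed above. The helper name `row_percentages` is an illustrative assumption.

```python
# Counts copied from the Age and BMI cross-tabulation above
columns = ["Normal Weight", "Obese", "Overweight", "Underweight"]
counts = {
    "<18":   [3, 0, 0, 6],
    "18-30": [10, 0, 3, 12],
    "31-45": [13, 1, 5, 6],
    "46-60": [4, 4, 8, 8],
    ">60":   [7, 4, 10, 4],
}


def row_percentages(table):
    """Convert each row of counts to percentages of that row's total."""
    return {
        group: [round(100 * c / sum(row), 1) for c in row]
        for group, row in table.items()
    }


percentages = row_percentages(counts)
print(f"{'Age':<6}" + "".join(f"{c:>15}" for c in columns))
for group, row in percentages.items():
    print(f"{group:<6}" + "".join(f"{p:>15.1f}" for p in row))
```

With pandas, the same row-wise view should be obtainable directly by passing normalize='index' to pd.crosstab.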
Correlation Results

Variable Pair     Pearson Correlation    Interpretation (relationship
                  Coefficient (r)        strength & direction)
Age and Weight    0.4651                 moderate positive relationship
Gender and BMI    0.3423                 weak positive relationship
Age and Label     0.4524                 moderate positive relationship
Height and BMI    0.3543                 weak positive relationship

Age and Weight

import pandas as pd

# Drop non-numeric columns to calculate the Pearson correlation matrix
df_numeric = df.select_dtypes(include=[float, int])

# Calculate the Pearson correlation matrix


correlation_matrix = df_numeric.corr(method='pearson')

# Extract the Pearson correlation coefficient between Age and Weight


pearson_corr_age_weight = correlation_matrix.loc['Age', 'Weight']

print(f"Pearson correlation coefficient between Age and Weight: {pearson_corr_age_weight:.3f}")

Gender and BMI


gender_mapping = {
'Male': 1,
'Female': 0
}
df['Gender_numeric'] = df['Gender'].map(gender_mapping)

# Calculate the Pearson correlation coefficient between Gender and BMI


pearson_corr_gender_bmi = df['Gender_numeric'].corr(df['BMI'])

print(f"Pearson correlation coefficient between Gender and BMI: {pearson_corr_gender_bmi:.3f}")


Age and Label

label_mapping = {
'Underweight': 1,
'Normal Weight': 2,
'Overweight': 3,
'Obese': 4
}
df['Label_numeric'] = df['Label'].map(label_mapping)

# Calculate the Pearson correlation coefficient between Age and Label


pearson_corr_age_label = df['Age'].corr(df['Label_numeric'])

print(f"Pearson correlation coefficient between Age and Label: {pearson_corr_age_label:.3f}")

Height and BMI


# Calculate the Pearson correlation coefficient between Height and BMI
correlation_height_bmi = df[['Height', 'BMI']].corr().iloc[0, 1]

# Create a results table


results_correlation = pd.DataFrame({
'Variable 1': ['Height'],
'Variable 2': ['BMI'],
'Pearson Correlation Coefficient': [correlation_height_bmi]
})

# Display the results


print(results_correlation)

Interpretation:
1. Age and Weight (r = 0.4651):
o Strength: Moderate
o Direction: Positive
o Interpretation: As age increases, weight tends to increase moderately.
2. Gender and BMI (r = 0.3423):
o Strength: Weak
o Direction: Positive
o Interpretation: BMI is weakly associated with gender. With Male coded as 1 and Female
as 0, the positive sign means that males tend to have slightly higher BMI values.
3. Age and Label (r = 0.4524):
o Strength: Moderate
o Direction: Positive
o Interpretation: There is a moderate tendency for the BMI category (Label) to increase
with age: older individuals tend to fall into higher BMI categories such as Overweight
and Obese.
4. Height and BMI (r = 0.3543):
o Strength: Weak
o Direction: Positive
o Interpretation: As height increases, BMI tends to increase weakly.
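
To make the coefficient itself less of a black box, here is a small sketch that computes Pearson's r from its definition and attaches a strength label. The cut-offs of 0.4 and 0.7 are an assumption chosen only so that the labels agree with the ones used in this report; they are not a universal standard. The function names and the sample lists are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def describe_strength(r):
    """Label |r| using illustrative cut-offs (an assumption, not a standard)."""
    size = abs(r)
    strength = "weak" if size < 0.4 else "moderate" if size < 0.7 else "strong"
    direction = "positive" if r > 0 else "negative" if r < 0 else "no"
    return f"{strength} {direction} relationship"


# Coefficients reported in the correlation results
for pair, r in [("Age and Weight", 0.4651), ("Gender and BMI", 0.3423),
                ("Age and Label", 0.4524), ("Height and BMI", 0.3543)]:
    print(f"{pair}: r = {r:.4f} -> {describe_strength(r)}")

# Made-up data to exercise pearson_r
print(round(pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]), 4))
```

Because Gender and Label are categorical variables coded as numbers, their coefficients depend on the coding chosen and are best read as rough indicators of association.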

Conclusion:
According to the Pearson correlation coefficients derived for the variable pairs, age and
weight show a moderate positive correlation (r = 0.4651), suggesting that weight tends to
increase with age. Similarly, there is a moderate positive association between age and BMI
category (r = 0.4524), indicating that older people tend to fall into higher BMI categories such
as overweight and obese. Gender and BMI have a weak positive correlation (r = 0.3423),
indicating only a weak relationship between BMI and gender. Finally, height and BMI show a
weak positive correlation (r = 0.3543), indicating that BMI tends to rise only slightly as height
increases.

c) Create a scatter plot for one of the variable pairs from the list:
• Age and Weight

Coding:
import matplotlib.pyplot as plt
import seaborn as sns

# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Weight', data=df)
plt.title('Scatterplot of Age and Weight')
plt.xlabel('Age')
plt.ylabel('Weight')
plt.show()

• Gender and BMI


Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Gender', y='BMI', data=df)
plt.title('Scatterplot of Gender and BMI')
plt.xlabel('Gender')
plt.ylabel('BMI')
plt.show()
• Age and Label

Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Label', y='Age', data=df)
plt.title('Scatterplot of Age and Label')
plt.xlabel('Label')
plt.ylabel('Age')
plt.show()
• Height and BMI

Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Height', y='BMI', data=df)
plt.title('Scatterplot of Height and BMI')
plt.xlabel('Height')
plt.ylabel('BMI')
plt.show()

Provide an interpretation of the scatter plot (e.g., the nature of the relationship, any
visible patterns, or anomalies). Compare the result with the relationship found in the
Pearson correlation (2b).

Interpretation

Age and Weight
• The scatter plot shows a positive pattern: weight generally increases as age increases.
• The data points are widely dispersed, so weight varies considerably at any given age.
• Beyond 60 years of age, there is greater weight dispersion.
• There are outliers, particularly in the older age range (70-100 years), where the weights
range widely from 40 to more than 100.
• Both the scatter plot and the Pearson correlation value suggest a moderate positive
relationship between age and weight.

Gender and BMI
• No clear pattern between gender and BMI.
• The correlation is weak and positive (0.342), but it is not visually evident in the scatter plot.
• No anomalies or outliers; data points are consistent within each gender category.

Age and Label
• Data points are dispersed within each weight category, indicating variability in age across
all categories.
• No clear trend is visible in the scatter plot.
• A broad range of ages is represented in each weight group, suggesting that age alone does
not determine the weight category.
• The scatter plot does not show a clear linear trend between age and weight category,
although the Pearson correlation suggests a moderate relationship.

Height and BMI
• The correlation value of 0.354 indicates a weak relationship: although the relationship is
positive, height alone is not highly predictive of BMI.
• Points are scattered without a clear linear pattern, implying that factors other than height
may influence BMI.
• No significant outliers or anomalies are detected, indicating a consistent relationship
across the dataset.

d) Create box plots for:

• The BMI attribute (overall).

Coding:
# Boxplot of BMI
plt.figure(figsize=(10, 6))
sns.boxplot(y='BMI', data=df)
plt.title('Boxplot of BMI Overall')
plt.ylabel('BMI')
plt.show()
Interpretation:
This box plot shows that the majority of BMI values in the dataset are concentrated between 18 and 25,
with the median value being around 22. The BMI values range from a minimum of 3.9 to a
maximum of 37.2, indicating the variability in BMI across the dataset. The median line lies near the
middle of the box, which shows that the central half of the BMI values is spread fairly evenly on
either side of the median.

• BMI by target class (e.g., Normal Weight, Overweight, Underweight, Obese).

Coding:
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Label', y='BMI', data=df)
plt.title('Boxplots of BMI Based On Target Class')
plt.xlabel('Label')
plt.ylabel('BMI')
plt.show()

Interpretation:
• For Normal Weight, the median BMI is around 2 and lies near the middle of the box, indicating
consistent BMI values within this group.
• For Overweight, the median is around 27 and the larger part of the box lies below it, so the
values in the lower half of this class are more spread out than those in the upper half.
• For Underweight, the median is around 18 and lies near the middle of the box, with a wider
IQR, showing more variability in BMI values.
• For Obese, the median is around 31 and the larger part of the box lies above it, so the
values in the upper half of this class are more spread out than those in the lower half.
• Across all target classes, no outliers are detected in the BMI values.
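
The statement that no outliers are detected depends on the rule the box plot uses. The sketch below applies the common 1.5 x IQR rule, which I believe is also the default whisker setting in seaborn and matplotlib, to a short made-up list. The function name `box_plot_summary` and the sample values are assumptions for illustration only.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, fences from the 1.5 x IQR rule, and any points beyond the fences."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower_fence": lower_fence, "upper_fence": upper_fence,
            "outliers": outliers}


# Made-up BMI values for illustration, not the assignment data
bmi_values = [16.7, 18.7, 20.0, 21.2, 22.5, 24.2, 26.1, 27.0, 45.0]
print(box_plot_summary(bmi_values))
```

Running the same function on the BMI values of each Label class would show whether any point falls outside the fences, which is the numerical counterpart of reading outliers off the plot.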
