assigment_3_EDA
ASSIGNMENT 3
NAME NO MATRIC
MUHAMMAD LUQMAN HAKIM BIN ABDUL ANNAS 291590
MUHAMMAD IKRAM BIN MD ASRAN 291666
1. Univariate Analysis
a. Perform descriptive statistics for all variables.
This section presents the feature statistics for all variables in the Obesity Classification dataset,
computed in Excel. Below is the analysis for each feature in the dataset:
Mean (56.05): The average value of the "id" variable is 56.05, which indicates the central
tendency of the dataset.
Standard Error (3.07): This measures the accuracy with which the sample mean represents
the population mean. A standard error of 3.07 suggests moderate variability in the sample mean.
Median (56.5): The median value is 56.5, which indicates the middle value of the dataset
when arranged in ascending order. It is very close to the mean, suggesting a relatively
symmetrical distribution.
Mode (1): The mode is 1, meaning that 1 is the most frequently occurring value in the
dataset. Because "id" is an identifier rather than a measurement, a repeated value is more likely
to reflect a duplicated record than an outlier or a meaningful pattern.
Standard Deviation (31.92): This measures the spread of the dataset around the mean. A
standard deviation of 31.92 indicates substantial variability in the data.
Sample Variance (1018.75): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 1018.75 reinforces the substantial variability in the
data.
Kurtosis (-1.19): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-1.19 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers than
a normal distribution.
Skewness (-0.03): Skewness measures the asymmetry of the data distribution. A skewness
close to 0 (-0.03) suggests the data is symmetrical.
Range (109): The range calculated as the difference between the maximum and minimum
values (110 - 1), is 109. This indicates the spread of the dataset.
Mean (46.56): The average age in the dataset is approximately 46.56 years, indicating the
central tendency of the age distribution.
Standard Error (2.38): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.38 suggests moderate variability in the sample mean.
Median (42.5): The median age is 42.5 years, which is the middle value of the dataset when
arranged in ascending order. It is slightly lower than the mean, suggesting a slight right skew in
the data.
Mode (16): The most frequently occurring age in the dataset is 16 years. This indicates a peak
or a high frequency at this particular age.
Standard Deviation (24.72): This measures the spread of the ages around the mean. A
standard deviation of 24.72 years indicates substantial variability in the ages.
Sample Variance (611.11): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 611.11 reinforces the substantial variability in the
age data.
Kurtosis (0.05): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of 0.05
indicates a distribution very close to normal in terms of tailedness.
Skewness (0.82): Skewness measures the asymmetry of the data distribution. A skewness of
0.82 indicates a moderate right skew, meaning most ages sit at the lower end of the scale, with a
longer tail toward the higher ages.
Range (101): The range, calculated as the difference between the maximum and minimum
values (112 - 11), is 101 years. This indicates a wide spread of ages in the dataset.
Sum (5028): The sum of all ages in the dataset is 5028 years.
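The Age figures above can be cross-checked against each other with a little arithmetic. The sketch below is only a consistency check. It assumes the sample size can be recovered as Sum / Mean (the count is not listed above) and that the standard error is the standard deviation divided by the square root of n.

```python
import math

# Values reported above for Age
age_sum = 5028
age_mean = 46.56
age_sd = 24.72

# Assumption: sample size recovered as Sum / Mean (not reported directly)
n = round(age_sum / age_mean)

# Standard error of the mean = s / sqrt(n)
standard_error = age_sd / math.sqrt(n)

# Variance is the square of the standard deviation
variance = age_sd ** 2

print(n, round(standard_error, 2), round(variance, 2))
```

Under these assumptions the recovered standard error rounds to the reported 2.38, and the squared standard deviation is close to the reported 611.11. The small gap comes from the standard deviation being rounded to two decimals.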
Standard Error (2.68): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.68 suggests moderate variability in the sample mean.
Median (175): The median height is 175 cm, which is the middle value of the dataset when
arranged in ascending order. It is higher than the mean, suggesting a slight left skew in the data.
Mode (160): The most frequently occurring height in the dataset is 160 cm. This indicates a
peak or a high frequency at this particular height.
Standard Deviation (27.87): This measures the spread of the heights around the mean. A
standard deviation of 27.87 cm indicates substantial variability in the heights.
Sample Variance (776.94): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 776.94 reinforces the substantial variability in the
height data.
Kurtosis (-1.15): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-1.15 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers than
a normal distribution.
Skewness (-0.11): Skewness measures the asymmetry of the data distribution. A skewness of
-0.11 suggests the data is fairly symmetrical with a slight left skew.
Range (90): The range, calculated as the difference between the maximum and minimum
values (210 - 120), is 90 cm. This indicates a wide spread of heights in the dataset.
Minimum (120): The shortest height in the dataset is 120 cm.
Sum (17990): The sum of all heights in the dataset is 17990 cm.
Mean (59.49): The average weight in the dataset is approximately 59.49 kg, indicating the
central tendency of the weight distribution.
Standard Error (2.78): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.78 suggests moderate variability in the sample mean.
Median (55): The median weight is 55 kg, which is the middle value of the dataset when
arranged in ascending order. It is slightly lower than the mean, suggesting a right skew in the
data.
Mode (75): The most frequently occurring weight in the dataset is 75 kg. This indicates a
peak or a high frequency at this particular weight.
Standard Deviation (28.86): This measures the spread of the weights around the mean. A
standard deviation of 28.86 kg indicates substantial variability in the weights.
Sample Variance (832.68): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 832.68 reinforces the substantial variability in the
weight data.
Kurtosis (-0.96): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-0.96 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers than
a normal distribution.
Skewness (0.20): Skewness measures the asymmetry of the data distribution. A skewness of
0.20 suggests the data is fairly symmetrical with a slight right skew.
Range (110): The range, calculated as the difference between the maximum and minimum
values (120 - 10), is 110 kg. This indicates a wide spread of weights in the dataset.
Sum (6425): The sum of all weights in the dataset is 6425 kg.
Standard Error (0.73): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 0.73 suggests low variability in the sample mean.
Median (21.2): The median BMI is 21.2, which is the middle value of the dataset when
arranged in ascending order. It is slightly higher than the mean, suggesting a slight left skew in
the data.
Mode (16.7): The most frequently occurring BMI in the dataset is 16.7. This indicates a peak
or a high frequency at this particular BMI value.
Standard Deviation (7.58): This measures the spread of the BMIs around the mean. A
standard deviation of 7.58 indicates moderate variability in the BMIs.
Sample Variance (57.51): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 57.51 reinforces the moderate variability in the
BMI data.
Kurtosis (-0.36): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-0.36 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers than
a normal distribution.
Skewness (-0.28): Skewness measures the asymmetry of the data distribution. A skewness of
-0.28 suggests the data is fairly symmetrical with a slight left skew.
Range (33.3): The range, calculated as the difference between the maximum and minimum
values (37.2 - 3.9), is 33.3. This indicates a wide spread of BMIs in the dataset.
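The same set of measures can be reproduced outside Excel. The following is a minimal sketch using pandas. The function name `describe_excel_style` and the toy input are hypothetical, and it assumes that pandas' sample-adjusted `skew()` and `kurt()` follow the same definitions as the Excel output discussed above.

```python
import pandas as pd

def describe_excel_style(values):
    """Return the summary measures discussed above for one numeric column."""
    s = pd.Series(values, dtype="float64")
    return {
        "mean": s.mean(),
        "standard_error": s.sem(),      # sample std / sqrt(n)
        "median": s.median(),
        "mode": s.mode().iloc[0],       # smallest value if there are ties
        "standard_deviation": s.std(),  # sample std (ddof=1)
        "sample_variance": s.var(),     # sample variance (ddof=1)
        "kurtosis": s.kurt(),           # excess kurtosis, near 0 for a normal shape
        "skewness": s.skew(),
        "range": s.max() - s.min(),
        "minimum": s.min(),
        "maximum": s.max(),
        "sum": s.sum(),
        "count": int(s.count()),
    }

# Hypothetical toy column, only to show the call
summary = describe_excel_style([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(f"{name}: {value}")
```

With the real data, the function would be called once per numeric column, for example on `df['Age']` or `df['BMI']`.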
# Adjust layout
plt.tight_layout()
Stem-and-leaf plot for Age:
10 1234566778899
20 0112233445566778899
30 00112233455677889
40 0012233455677889
50 0012233455677889
60 001234578
70 02378
80 2378
90 2378
100 2378
110 2
stem-and-leaf plot for BMI:
0.0 3.9 3.9 5.6 5.6 5.6 5.6 8.3 8.3 8.3 8.3
10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.3 3.3 3.3 3.3 3.3 3.3 3.3 3.3 6.69 6.69 6.69 6.69 6.69 6.69
6.69 6.69 6.69 6.69 6.69 8.7 8.7 8.7 8.7 8.7
20.0 0.0 0.0 0.0 0.0 0.0 0.0 1.19 1.19 1.19 1.19 1.19 1.19 1.19 1.19 2.5 2.5 2.5 2.5 2.69 2.69
2.69 2.69 2.69 2.69 2.69 3.39 3.39 4.19 4.19 4.19 4.19 4.19 5.0 5.0 5.0 5.0 5.3 5.3 5.3
6.1 6.1 6.1 6.1 6.1 7.0 7.0 7.0 7.0 7.3 7.3 7.3 7.5 7.5 7.5 8.89 8.89 8.89 9.1
30.0 0.8 1.19 1.19 1.19 4.2 4.2 4.2 7.2 7.2
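A stem-and-leaf display like the Age plot can be built with a few lines of Python. This is a minimal sketch for integer values, not necessarily the code used to produce the output above. The function name `stem_and_leaf` and the sample ages are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_unit=10):
    """Group integer values by stem (tens) and collect the leaves (units)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem = (value // stem_unit) * stem_unit
        stems[stem].append(value % stem_unit)
    return dict(stems)

# Hypothetical ages, only for illustration
sample_ages = [11, 12, 13, 20, 21, 21, 35, 112]
plot = stem_and_leaf(sample_ages)
for stem in sorted(plot):
    leaves = "".join(str(leaf) for leaf in plot[stem])
    print(f"{stem:>3} {leaves}")
```

For decimal data such as BMI, rounding each leaf (for example `round(value - stem, 1)`) would avoid leaves like 6.69, which appear to be floating-point truncations of 6.7 (the reported BMI mode is 16.7).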
The frequency decreases significantly for people above the age of 60.
Each leaf represents a single data point from the tens group.
The graphic depicts the actual age values, providing a detailed picture of the distribution.
The concentration of values in the age categories 20-29 and 50-59 corresponds to the
histogram's maximum.
Similarities:
Peaks in Age Distribution:
Both the histogram and the stem-and-leaf plot show higher frequencies in the 20-30 and
50-60 age groups.
General Trend:
Both plots indicate a relatively even distribution of ages from 20 to 60, with fewer
individuals in older age groups.
Differences:
Level of Detail:
• The histogram provides a summary perspective by grouping age ranges. It is easy to see
overarching trends and patterns in age distribution.
• The stem-and-leaf plot provides a thorough view of individual age values within each
group. This graphic shows the specific ages and their frequencies.
The distribution is skewed to the right, with a long tail extending to higher BMI levels
(more than 30).
Frequency:
Similar to the histogram, the data reveals a larger concentration of values between
5-10 and 20-25 BMI.
Detailed Frequency:
The stem-and-leaf plot displays accurate counts of individual BMI values, which
confirm the histogram's peaks.
For example, inside the "0" stem, the leaf values include several occurrences of 3.9,
5.6, and 8.3, contributing to the histogram's bars in the lower BMI range.
Similarly, leaves in the "20" stem, such as 2.5, 2.69, 3.39, and 4.19, correspond to
the greater frequencies in the 20-25 BMI range seen in the histogram.
Similarities
Both the histogram and the stem-and-leaf plot show a bimodal distribution with
peaks at 5-10 and 20-25 BMI levels.
They both have a right-skewed distribution, with fewer values in the upper BMI
levels (>30).
Differences
The histogram gives a visual overview using bins, allowing you to immediately
determine the overall shape and distribution.
The stem-and-leaf plot provides precise data points, allowing for accurate
identification of particular BMI values and frequencies.
The histogram displays aggregated frequencies, but the stem-and-leaf plot depicts
each value separately, offering more clarity.
Conclusion
The histograms and stem-and-leaf plots for BMI and Age give complementary views of the
data distributions. Both the histogram and the stem-and-leaf plot for BMI show a bimodal
distribution with peaks at 5-10 and 20-25 BMI, as well as a right skew, with fewer values in
the upper BMI levels (>30). Similarly, the histogram and stem-and-leaf plot for Age
indicate higher frequencies in the 20-30 and 50-60 age ranges,
as well as a noticeable fall in frequency for those beyond 60. The histogram provides a rapid
visual summary of the distributions, while the stem-and-leaf plot provides detailed, accurate counts
of individual values, which together give a fuller understanding of the dataset.
2. Multivariate Analysis
a. Construct cross-tabulation tables for the following variables:
BMI is categorized into ranges to make the cross-tabulation more insightful. Common BMI
categories are:
Underweight: BMI < 18.5
Normal weight: 18.5 ≤ BMI < 25
Overweight: 25 ≤ BMI < 30
Obese: BMI ≥ 30
Gender    Underweight    Normal Weight    Overweight    Obese
Female    24             20               8             0
Male      12             17               18            9
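# Categorize BMI (df is the DataFrame loaded earlier)
def categorize_bmi(bmi):
    # Upper bounds only, so values such as 24.95 cannot fall between categories
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal Weight'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['BMI_Category'] = df['BMI'].apply(categorize_bmi)

# Cross-tabulate Gender against BMI category
# (the 'Gender' column name is assumed from the dataset)
cross_tab = pd.crosstab(df['Gender'], df['BMI_Category'])
print(cross_tab)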
Female Distribution: A higher proportion of females are underweight, while fewer are overweight
or obese.
Male Distribution: More males fall into the overweight and obese categories compared to
females, with a relatively balanced distribution across the other BMI categories.
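Statements about "higher proportions" are easier to verify as row percentages than as raw counts. The sketch below rebuilds the Gender by BMI category counts reported above, assuming the columns appear in the order Underweight, Normal Weight, Overweight, Obese (the order in which the categories are listed), and converts each row to percentages.

```python
import pandas as pd

# Counts taken from the Gender x BMI category cross-tabulation above
# (column order is an assumption, see the lead-in)
counts = pd.DataFrame(
    {
        "Underweight": [24, 12],
        "Normal Weight": [20, 17],
        "Overweight": [8, 18],
        "Obese": [0, 9],
    },
    index=["Female", "Male"],
)

# Row percentages: share of each gender falling in each BMI category
row_pct = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(row_pct)
```

On the original DataFrame, `pd.crosstab(..., normalize='index')` should give the same shares directly as fractions.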
# Categorize Age
def categorize_age(age):
    # Upper bounds only, so non-integer ages cannot fall between categories
    if age < 18:
        return '<18'
    elif age <= 30:
        return '18-30'
    elif age <= 45:
        return '31-45'
    elif age <= 60:
        return '46-60'
    else:
        return '>60'

df['Age_Category'] = df['Age'].apply(categorize_age)
Younger age groups (particularly 18-30) have a higher proportion of underweight individuals.
Middle-aged groups (31-45 and 46-60) show a more balanced distribution across the normal weight,
overweight, and obese categories.
Individuals over the age of 60 are more likely to be overweight or obese, indicating
possible health hazards connected with aging.
The <18 age group is concentrated in the underweight category, with relatively few
overweight or obese individuals.
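The age bands used above can also be produced with `pd.cut`, which keeps the categories in a fixed order for the cross-tabulation. This is a sketch of an alternative to an `apply`-based function, not the report's own method. The ages are hypothetical, and the bin edges are chosen so that integer ages land in the same bands as described above.

```python
import pandas as pd

# Hypothetical ages, only to illustrate the binning
ages = pd.Series([11, 17, 18, 30, 31, 45, 46, 60, 61, 112])

# Left-closed bins: [0, 18), [18, 31), [31, 46), [46, 61), [61, inf)
bins = [0, 18, 31, 46, 61, float("inf")]
labels = ["<18", "18-30", "31-45", "46-60", ">60"]

age_category = pd.cut(ages, bins=bins, labels=labels, right=False)

# sort=False keeps the bands in their defined order
print(age_category.value_counts(sort=False))
```

The resulting column could then be passed to `pd.crosstab` together with the BMI category column.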
Correlation Results
Variable Pair     Pearson Correlation Coefficient (r)    Interpretation (relationship strength & direction)
Age and Weight    0.4651                                 moderate positive relationship
Gender and BMI    0.3423                                 weak positive relationship
Age and Label     0.4524                                 moderate positive relationship
Height and BMI    0.3543                                 weak positive relationship
Interpretation:
1. Age and Weight (r = 0.4651):
o Strength: Moderate
o Direction: Positive
o Interpretation: As age increases, weight tends to increase moderately.
2. Gender and BMI (r = 0.3423):
o Strength: Weak
o Direction: Positive
o Interpretation: There is a weak tendency for BMI to be associated with gender.
3. Age and Label (r = 0.4524):
o Strength: Moderate
o Direction: positive
o Interpretation: There is a moderate tendency for the BMI category (Label) to increase as
age increases: as individuals get older, they tend to move towards higher BMI
categories such as Overweight and Obese.
4. Height and BMI (r = 0.3543):
o Strength: Weak
o Direction: Positive
o Interpretation: As height increases, BMI tends to increase weakly.
Conclusion:
According to the Pearson correlation coefficients derived for the four pairs of
variables, age and weight show a moderate positive correlation (r = 0.4651), suggesting that
weight tends to increase with age. Similarly, there is a moderate positive correlation
between age and BMI category (r = 0.4524), indicating that older people tend to fall into higher
BMI categories such as overweight and obese. Gender and BMI have a weak positive correlation
(r = 0.3423), indicating only a weak relationship between BMI and gender. Finally, height and BMI
show a weak positive correlation (r = 0.3543), indicating that BMI tends to rise only slightly as
height increases.
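The coefficient behind these statements can be written out directly. The sketch below implements the Pearson formula in plain Python and maps the reported values to strength labels. The `strength` cut-offs (0.4 and 0.7) are hypothetical, chosen only to match the labels used above. The report does not state how Gender and Label were converted to numbers, so any numeric encoding for them is also an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Hypothetical cut-offs chosen to match the labels used above."""
    size = abs(r)
    if size < 0.4:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"

# Coefficients reported in the correlation results
reported = [("Age and Weight", 0.4651), ("Gender and BMI", 0.3423),
            ("Age and Label", 0.4524), ("Height and BMI", 0.3543)]
for pair, r in reported:
    direction = "positive" if r > 0 else "negative"
    print(f"{pair}: {strength(r)} {direction}")

# Hypothetical data, only to exercise the function
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```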
c) Create a scatter plot for one of the variable pairs from the list:
• Age and Weight
Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Weight', data=df)
plt.title('Scatterplot of Age and Weight')
plt.xlabel('Age')
plt.ylabel('Weight')
plt.show()
• Age and Label
Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Label', y='Age', data=df)
plt.title('Scatterplot of Age and Label')
plt.xlabel('Label')
plt.ylabel('Age')
plt.show()
• Height and BMI
Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Height', y='BMI', data=df)
plt.title('Scatterplot of Height and BMI')
plt.xlabel('Height')
plt.ylabel('BMI')
plt.show()
Provide an interpretation of the scatter plot (e.g., the nature of the relationship, any
visible patterns, or anomalies). Compare the result with the relationship found in the
Pearson correlation (2b).
Interpretation
Types of Attributes: Age and Weight
Description:
• The scatter plot shows a positive pattern: weight generally increases as age increases.
• The data points are widely dispersed, so weight varies considerably at any given age.
• Beyond 60 years of age, there is greater weight dispersion.
• There are outliers, particularly in the older age range (70-100 years), where the
weights range widely from 40 kg to more than 100 kg.
• Both the scatter plot and the Pearson correlation value suggest a moderate
positive relationship between age and weight.
Types of Attributes: Gender and BMI
Description:
• There is no clear pattern between gender and BMI.
• The weak positive correlation (0.342) is not visually evident in the scatter plot.
• There are no anomalies or outliers; the data points are consistent within each
gender category.
Coding:
# Boxplot of BMI
plt.figure(figsize=(10, 6))
sns.boxplot(y='BMI', data=df)
plt.title('Boxplot of BMI Overall')
plt.ylabel('BMI')
plt.show()
Interpretation:
This box plot shows that the majority of BMI values in the dataset are concentrated between 18 and 25,
with a median of around 22. The BMI values range from a minimum of about 5 to a
maximum of about 35, indicating the variability in BMI across the dataset. The median line sits
near the middle of the box, which suggests that the middle half of the BMI values is spread fairly
evenly on either side of the median.
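How centrally the median sits inside the box can be expressed as a number. The sketch below uses Bowley's quartile skewness, (Q3 + Q1 - 2 × median) / (Q3 - Q1), which is 0 when the median is exactly in the middle of the box. This measure is not part of the original analysis, and the sample values are hypothetical.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley's quartile skewness: position of the median inside the box."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Hypothetical BMI-like values, only for illustration
centred = [18, 20, 22, 24, 26]   # median in the middle of the box
lopsided = [1, 2, 3, 10, 20]     # median close to the lower quartile

print(quartile_skewness(centred))
print(quartile_skewness(lopsided))
```

A value near 0 supports the reading that the box is balanced around the median, while a clearly positive or negative value indicates that one half of the box is stretched.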
Coding:
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Label', y='BMI', data=df)
plt.title('Boxplots of BMI Based On Target Class')
plt.xlabel('Label')
plt.ylabel('BMI')
plt.show()
Interpretation:
For Normal Weight, the median BMI is around 2, and it sits near the middle of the box,
indicating consistent BMI values within this group.
For Overweight, the median is around 27, and most of the data is concentrated below the
median for this class.
For Underweight, the median is around 18 and sits near the middle of the box, with a wider
IQR, showing more variability in BMI values.
For Obese, the median is around 31, and most of the data is concentrated above the
median for this class.
Across all target classes, the boxplots show no outliers in BMI.
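The "no outliers" reading can be checked numerically with the rule that seaborn's boxplot uses by default for its whiskers: points more than 1.5 × IQR beyond the quartiles are drawn as outliers. The sketch below counts such points per class. The rows are hypothetical, with one extreme value added on purpose, and the column names follow the plotting code above.

```python
import pandas as pd

def count_outliers(values):
    """Count points outside the 1.5 * IQR fences used for boxplot whiskers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical rows, only for illustration (60.0 is a deliberate extreme value)
demo = pd.DataFrame({
    "Label": ["Normal Weight"] * 5 + ["Obese"] * 5,
    "BMI": [20.0, 21.2, 22.5, 22.7, 24.2, 30.8, 31.2, 31.2, 34.2, 60.0],
})

outliers_per_class = demo.groupby("Label")["BMI"].apply(count_outliers)
print(outliers_per_class)
```

Applied to the real DataFrame, a count of zero for every class would agree with the boxplots.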