assigment_3_EDA

The document presents an exploratory data analysis assignment focused on the Obesity Classification dataset, detailing univariate statistics for various features including age, height, weight, and BMI. Key statistics such as mean, median, mode, standard deviation, and skewness are provided for each variable, along with histograms and stem-and-leaf plots for age and BMI distributions. The analysis highlights similarities and differences in data distribution, noting peaks in age groups and a bimodal distribution in BMI.


A231

STIS3033-EXPLORATORY DATA ANALYSIS

ASSIGNMENT 3

PREPARED FOR: PROF FARHAN BIN MOHD HOSSEIN

NAME NO MATRIC
MUHAMMAD LUQMAN HAKIM BIN ABDUL ANNAS 291590
MUHAMMAD IKRAM BIN MD ASRAN 291666
1. Univariate Analysis
a. Perform descriptive statistics for all variables.

The feature statistics for all variables in the Obesity Classification dataset were computed in
Excel. The analysis for each feature in the dataset is given below:

ID

Mean (56.05): The average value of the "id" variable is 56.05, which indicates the central
tendency of the variable.

Standard Error (3.07): This measures the accuracy with which the sample mean represents
the population mean. A standard error of 3.07 suggests moderate variability in the sample mean.
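
The standard error quoted above can be checked from the standard deviation and the count reported for the same variable, since the standard error of the mean is the sample standard deviation divided by the square root of the number of observations. A minimal sketch; the function name `standard_error` is only illustrative.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return std_dev / math.sqrt(n)

# Values reported in this section for the "id" variable
se_id = standard_error(31.92, 108)
print(round(se_id, 2))  # 3.07
```

The same check applies to the other variables, for example 24.72 and 108 for age.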

Median (56.5): The median value is 56.5, which indicates the middle value of the dataset
when arranged in ascending order. It is very close to the mean, suggesting a relatively
symmetrical distribution.

Mode (1): The reported mode is 1, which is also the minimum value. Because "id" is a record
identifier, a repeated value would point to duplicate records rather than to a pattern in the
data, so the mode carries little analytical meaning here.

Standard Deviation (31.92): This measures the spread of the dataset around the mean. A
standard deviation of 31.92 indicates substantial variability in the data.

Sample Variance (1018.75): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 1018.75 reinforces the substantial variability in the
data.

Kurtosis (-1.19): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-1.19 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers
than a normal distribution.

Skewness (-0.03): Skewness measures the asymmetry of the data distribution. A skewness
close to 0 (-0.03) suggests the data is symmetrical.

Range (109): The range calculated as the difference between the maximum and minimum
values (110 - 1), is 109. This indicates the spread of the dataset.

Minimum (1): The smallest value in the dataset is 1.

Maximum (110): The largest value in the dataset is 110.

Sum (6053): The sum of all values in the dataset is 6053.

Count (108): The dataset contains 108 observations.
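
The statistics listed above can also be reproduced outside Excel. The sketch below uses pandas and is only an assumption about how one might do it, not the method used for the assignment; the helper name `excel_style_summary` and the sample values are made up for illustration. To the best of my understanding, `Series.skew()` and `Series.kurt()` use the same bias-corrected sample formulas as Excel's SKEW and KURT.

```python
import pandas as pd

def excel_style_summary(s: pd.Series) -> pd.Series:
    """Summary statistics in the same order as the list above."""
    return pd.Series({
        "Mean": s.mean(),
        "Standard Error": s.sem(),
        "Median": s.median(),
        "Mode": s.mode().iloc[0],       # smallest value if several tie
        "Standard Deviation": s.std(),  # sample (n - 1) version
        "Sample Variance": s.var(),
        "Kurtosis": s.kurt(),           # excess kurtosis (normal = 0)
        "Skewness": s.skew(),
        "Range": s.max() - s.min(),
        "Minimum": s.min(),
        "Maximum": s.max(),
        "Sum": s.sum(),
        "Count": s.count(),
    })

# Hypothetical values, for illustration only
sample = pd.Series([1, 2, 2, 3, 4, 5, 9])
summary = excel_style_summary(sample)
print(summary.round(2))
```

For a column with no repeated values, `s.mode()` returns every value, so `.iloc[0]` is simply the minimum. This is worth keeping in mind when reading the mode of an identifier column.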

Age

Mean (46.56): The average age in the dataset is approximately 46.56 years, indicating the
central tendency of the age distribution.

Standard Error (2.38): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.38 suggests moderate variability in the sample mean.

Median (42.5): The median age is 42.5 years, which is the middle value of the dataset when
arranged in ascending order. It is slightly lower than the mean, suggesting a slight right skew in
the data.

Mode (16): The most frequently occurring age in the dataset is 16 years. This indicates a peak
or a high frequency at this particular age.

Standard Deviation (24.72): This measures the spread of the ages around the mean. A
standard deviation of 24.72 years indicates substantial variability in the ages.

Sample Variance (611.11): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 611.11 reinforces the substantial variability in the
age data.

Kurtosis (0.05): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of 0.05
indicates a distribution very close to normal in terms of tailedness.

Skewness (0.82): Skewness measures the asymmetry of the data distribution. A skewness of
0.82 indicates a moderate right skew: most ages sit toward the lower end of the scale, with a
longer tail extending toward the higher ages.

Range (101): The range, calculated as the difference between the maximum and minimum
values (112 - 11), is 101 years. This indicates a wide spread of ages in the dataset.

Minimum (11): The youngest age in the dataset is 11 years.

Maximum (112): The oldest age in the dataset is 112 years.

Sum (5028): The sum of all ages in the dataset is 5028 years.

Count (108): The dataset contains 108 observations.

Height

Mean (166.57): The average height in the dataset is approximately 166.57 cm, indicating the
central tendency of the height distribution.

Standard Error (2.68): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.68 suggests moderate variability in the sample mean.

Median (175): The median height is 175 cm, which is the middle value of the dataset when
arranged in ascending order. It is higher than the mean, suggesting a slight left skew in the data.

Mode (160): The most frequently occurring height in the dataset is 160 cm. This indicates a
peak or a high frequency at this particular height.

Standard Deviation (27.87): This measures the spread of the heights around the mean. A
standard deviation of 27.87 cm indicates substantial variability in the heights.

Sample Variance (776.94): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 776.94 reinforces the substantial variability in the
height data.

Kurtosis (-1.15): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-1.15 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers
than a normal distribution.

Skewness (-0.11): Skewness measures the asymmetry of the data distribution. A skewness of
-0.11 suggests the data is fairly symmetrical with a slight left skew.

Range (90): The range, calculated as the difference between the maximum and minimum
values (210 - 120), is 90 cm. This indicates a wide spread of heights in the dataset.

Minimum (120): The shortest height in the dataset is 120 cm.

Maximum (210): The tallest height in the dataset is 210 cm.

Sum (17990): The sum of all heights in the dataset is 17990 cm.

Count (108): The dataset contains 108 observations.

Weight

Mean (59.49): The average weight in the dataset is approximately 59.49 kg, indicating the
central tendency of the weight distribution.

Standard Error (2.78): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 2.78 suggests moderate variability in the sample mean.

Median (55): The median weight is 55 kg, which is the middle value of the dataset when
arranged in ascending order. It is slightly lower than the mean, suggesting a right skew in the
data.

Mode (75): The most frequently occurring weight in the dataset is 75 kg. This indicates a
peak or a high frequency at this particular weight.

Standard Deviation (28.86): This measures the spread of the weights around the mean. A
standard deviation of 28.86 kg indicates substantial variability in the weights.

Sample Variance (832.68): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 832.68 reinforces the substantial variability in the
weight data.

Kurtosis (-0.96): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-0.96 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers
than a normal distribution.

Skewness (0.20): Skewness measures the asymmetry of the data distribution. A skewness of
0.20 suggests the data is fairly symmetrical with a slight right skew.

Range (110): The range, calculated as the difference between the maximum and minimum
values (120 - 10), is 110 kg. This indicates a wide spread of weights in the dataset.

Minimum (10): The lowest weight in the dataset is 10 kg.

Maximum (120): The highest weight in the dataset is 120 kg.

Sum (6425): The sum of all weights in the dataset is 6425 kg.

Count (108): The dataset contains 108 observations.

BMI

Mean (20.55): The average BMI in the dataset is approximately 20.55, indicating the central
tendency of the BMI distribution.

Standard Error (0.73): This measures the precision of the sample mean as an estimate of the
population mean. A standard error of 0.73 suggests low variability in the sample mean.

Median (21.2): The median BMI is 21.2, which is the middle value of the dataset when
arranged in ascending order. It is slightly higher than the mean, suggesting a slight left skew in
the data.

Mode (16.7): The most frequently occurring BMI in the dataset is 16.7. This indicates a peak
or a high frequency at this particular BMI value.

Standard Deviation (7.58): This measures the spread of the BMIs around the mean. A
standard deviation of 7.58 indicates moderate variability in the BMIs.

Sample Variance (57.51): Variance is the square of the standard deviation and provides
another measure of data spread. A variance of 57.51 reinforces the moderate variability in the
BMI data.

Kurtosis (-0.36): Kurtosis measures the "tailedness" of the data distribution. A kurtosis of
-0.36 indicates a platykurtic distribution, meaning the data has lighter tails and fewer outliers
than a normal distribution.

Skewness (-0.28): Skewness measures the asymmetry of the data distribution. A skewness of
-0.28 suggests the data is fairly symmetrical with a slight left skew.

Range (33.3): The range, calculated as the difference between the maximum and minimum
values (37.2 - 3.9), is 33.3. This indicates a wide spread of BMIs in the dataset.

Minimum (3.9): The lowest BMI in the dataset is 3.9.

Maximum (37.2): The highest BMI in the dataset is 37.2.

Sum (2219): The sum of all BMIs in the dataset is 2219.

Count (108): The dataset contains 108 observations.

b. Construct histograms and stem-and-leaf plots for the Age and BMI variables. Analyze insights,
similarities, and differences in data distribution from both analyses.

# Plotting the histograms for Age and BMI, one panel each
# (df is the Obesity Classification dataset, assumed to be loaded already)
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 1, figsize=(10, 8))

# Histogram for Age (top panel)
axes[0].hist(df['Age'], bins=15, color='blue', edgecolor='black')
axes[0].set_title('Histogram of Age')
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Frequency')

# Histogram for BMI (bottom panel)
axes[1].hist(df['BMI'], bins=15, color='green', edgecolor='black')
axes[1].set_title('Histogram of BMI')
axes[1].set_xlabel('BMI')
axes[1].set_ylabel('Frequency')

# Adjust layout so the two panels do not overlap
plt.tight_layout()

# Show the plot
plt.show()
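
To read the bar heights as numbers rather than from the plot, the same equal-width grouping can be computed with `numpy.histogram`. This is a minimal sketch; the `ages` list is a small made-up sample, not the assignment data, and `bin_counts` is only an illustrative helper name.

```python
import numpy as np

def bin_counts(values, bins=15):
    """Return (counts, edges) for equal-width bins, as a histogram would use."""
    counts, edges = np.histogram(values, bins=bins)
    return counts, edges

# Hypothetical sample, for illustration only
ages = [11, 16, 16, 23, 25, 31, 38, 42, 47, 55, 58, 63, 72, 88, 112]
counts, edges = bin_counts(ages, bins=5)
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {c}")
```

With the real data, passing `df['Age']` and `bins=15` would give the counts behind each bar of the Age histogram.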

Stem-and-leaf plot for Age (stem = tens, leaf = units; e.g. 20 | 3 is 23):

 10 | 1234566778899
 20 | 0112233445566778899
 30 | 00112233455677889
 40 | 0012233455677889
 50 | 0012233455677889
 60 | 001234578
 70 | 02378
 80 | 2378
 90 | 2378
100 | 2378
110 | 2

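
A stem-and-leaf plot of this kind can be built with a few lines of plain Python. The sketch below is one possible way to do it, not necessarily how the plot above was produced; the `ages` list is made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integer values by tens; return {stem: 'leaves'} in ascending order."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 23 -> stem 2, leaf 3
        groups[stem * 10].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in groups.items()}

# Hypothetical ages, for illustration only
ages = [11, 12, 16, 16, 20, 21, 21, 35, 112]
plot = stem_and_leaf(ages)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {leaves}")
```

The length of each row of leaves is the frequency of that tens group, which is why the plot can be read like a sideways histogram.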
Stem-and-leaf plot for BMI (stem = tens, leaf = units and first decimal; e.g. 10 | 6.7 is 16.7):

 0 | 3.9 3.9 5.6 5.6 5.6 5.6 8.3 8.3 8.3 8.3
10 | 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.3 3.3 3.3 3.3 3.3 3.3 3.3 3.3 6.7 6.7 6.7 6.7 6.7 6.7
     6.7 6.7 6.7 6.7 6.7 8.7 8.7 8.7 8.7 8.7
20 | 0.0 0.0 0.0 0.0 0.0 0.0 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 2.5 2.5 2.5 2.5 2.7 2.7
     2.7 2.7 2.7 2.7 2.7 3.4 3.4 4.2 4.2 4.2 4.2 4.2 5.0 5.0 5.0 5.0 5.3 5.3 5.3
     6.1 6.1 6.1 6.1 6.1 7.0 7.0 7.0 7.0 7.3 7.3 7.3 7.5 7.5 7.5 8.9 8.9 8.9 9.1
30 | 0.8 1.2 1.2 1.2 4.2 4.2 4.2 7.2 7.2
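
When a decimal value such as 16.7 is split into a stem and a leaf by plain subtraction, binary floating point can leave a leaf that is slightly below the intended value, which may then be printed as something like 6.69 instead of 6.7. The sketch below shows the effect and one way to avoid it with the standard `decimal` module; `split_bmi` is a hypothetical helper, and whether this is what happened in the plot above is an assumption.

```python
from decimal import Decimal

def split_bmi(bmi):
    """Split a BMI value into (tens stem, leaf) without float artefacts."""
    d = Decimal(str(bmi))      # exact decimal form of the printed value
    stem = (d // 10) * 10      # 16.7 -> 10
    leaf = d - stem            # 16.7 -> 6.7, computed exactly
    return int(stem), float(leaf)

print(16.7 - 10)        # plain float subtraction: not exactly 6.7
print(split_bmi(16.7))  # (10, 6.7)
```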

Histogram of age distribution:

• The histogram depicts the frequency distribution of ages.

• The younger to middle-aged groups (20-60 years) have higher frequencies.

• The frequency decreases markedly for people above the age of 60.

Histogram of Age Groups:

• Peaks occur around ages 20-30 and 50-60.

• The distribution is more even between 30 and 60.

• Frequencies are lowest for individuals beyond the age of 70.

Age distribution using a stem-and-leaf plot:

• The stem-and-leaf plot gives a detailed picture of individual age values.

• The data is organized into tens (e.g., 10-19, 20-29, 30-39).

• Each leaf represents a single data point within its tens group.

Age group using a stem-and-leaf plot:

• The 20-29 stem is the longest (19 leaves), matching the histogram's first peak.

• The 50-59 stem (16 leaves) is level with the 40-49 stem (16) and just below the 30-39 stem
(17), so the histogram's second peak is less pronounced in this view.

• The plot lists the actual age values, providing a detailed picture of the distribution.

Similarities:
Peaks in Age Distribution:

• Both the histogram and the stem-and-leaf plot show higher frequencies between ages 20 and
60, led by the 20-30 age group.

• The distribution is less dense beyond 60 years in both representations.

General Trend:

• Both plots indicate a relatively even distribution of ages from 20 to 60, with fewer
individuals in older age groups.

Differences:
Level of Detail:

• The histogram provides a summary perspective by grouping age ranges. It is easy to see
overarching trends and patterns in age distribution.

• The stem-and-leaf plot provides a thorough view of individual age values within each
group. This graphic shows the specific ages and their frequencies.

BMI shape and distribution for histogram

Shape and Distribution:

• The histogram indicates a bimodal distribution with two peaks.

• The first peak occurs in the 5-10 BMI range.

• The second peak occurs in the 20-25 BMI range.

• The frequency drops noticeably for BMIs between 10 and 15.

• A tail extends toward higher BMI levels (more than 30), although the skewness of -0.28
reported in part 1(a) indicates that the distribution as a whole leans slightly to the left.

Frequency:

• The highest frequency (14) is observed in the 25-30 BMI range.

• Lower frequencies are seen at the extremes of the BMI range (below 10 and above 30).

Shape and distribution for Stem-and-Leaf Plot Analysis

• The stem-and-leaf plot gives a detailed breakdown of specific BMI values.

• Similar to the histogram, the data shows concentrations of values in the 5-10 and 20-25
BMI ranges.

• Within the stems, values cluster around leaves of 6, 8, and 9, supporting the histogram's
bimodal character.

Detailed Frequency:

• The stem-and-leaf plot displays exact counts of individual BMI values, which confirm the
histogram's peaks.

• For example, the "0" stem contains repeated occurrences of 3.9, 5.6, and 8.3, which make up
the histogram's bars in the lower BMI range.

• Similarly, values in the "20" stem, such as 22.5, 22.7, 23.4, and 24.2, correspond to the
greater frequencies in the 20-25 BMI range seen in the histogram.

Similarities
• Both the histogram and the stem-and-leaf plot show a bimodal distribution with peaks at
5-10 and 20-25 BMI levels.

• Both show a tail of relatively few values at the upper BMI levels (>30).

Differences
• The histogram gives a visual overview using bins, allowing you to immediately determine
the overall shape and distribution.

• The stem-and-leaf plot provides precise data points, allowing for accurate identification of
particular BMI values and frequencies.

• The histogram displays aggregated frequencies, while the stem-and-leaf plot shows each
value separately, offering more detail.

Conclusion
The histograms and stem-and-leaf plots for BMI and Age provide further insight into the
data distributions. Both plots for BMI show a bimodal distribution with peaks at 5-10 and 20-25
BMI, a tail of relatively few values above 30, and a high frequency in the 25-30 range.
Similarly, both plots for Age show higher frequencies around the 20-30 and 50-60 age ranges,
as well as a noticeable fall in frequency for those beyond 60. The histogram provides a rapid
visual summary of the distributions, while the stem-and-leaf plot provides exact counts of
individual values; together they give a fuller understanding of the dataset.

2. Multivariate Analysis
a. Construct cross-tabulation tables for the following variables:
BMI is categorized into ranges to make the cross-tabulation more insightful. The BMI
categories used are:
• Underweight: BMI < 18.5
• Normal weight: 18.5 ≤ BMI < 25
• Overweight: 25 ≤ BMI < 30
• Obese: BMI ≥ 30

I. Gender and BMI

Gender    Underweight    Normal weight    Overweight    Obese
Female    24             20               8             0
Male      12             17               18            9

import pandas as pd

# Categorize BMI (each value falls into exactly one range)
def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal Weight'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

df['BMI_Category'] = df['BMI'].apply(categorize_bmi)

# Perform cross-tabulation of Gender and BMI categories
# (margins=True adds row and column totals)
cross_tab = pd.crosstab(df['Gender'], df['BMI_Category'], margins=True)

print(cross_tab)
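
An alternative to a hand-written chain of comparisons is `pandas.cut`, where neighbouring categories share a bin edge so that every value belongs to exactly one interval. This is a sketch under assumptions: the names `BMI_EDGES`, `BMI_LABELS` and `categorize_bmi_series` are illustrative, and the demo values are made up to sit on or near the boundaries.

```python
import numpy as np
import pandas as pd

# Shared edges; right=False makes each interval closed on the left,
# e.g. [18.5, 25) for Normal Weight.
BMI_EDGES = [-np.inf, 18.5, 25, 30, np.inf]
BMI_LABELS = ['Underweight', 'Normal Weight', 'Overweight', 'Obese']

def categorize_bmi_series(bmi: pd.Series) -> pd.Series:
    """Assign each BMI value to one of the four categories."""
    return pd.cut(bmi, bins=BMI_EDGES, labels=BMI_LABELS, right=False)

# Hypothetical values chosen to sit on or near the boundaries
demo = pd.Series([18.4, 18.5, 24.95, 25.0, 29.95, 30.0])
print(categorize_bmi_series(demo).tolist())
```

A boundary case such as 24.95 is a useful check for any categorization rule, because it shows whether values between two stated cut-offs are handled as intended.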

• Female Distribution: A higher proportion of females are underweight (24 of 52), while fewer
are overweight (8) and none are obese.

• Male Distribution: More males fall into the overweight (18) and obese (9) categories compared
to females, with a relatively balanced distribution across the other BMI categories.
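
Because the two genders have different totals, statements about proportions are easier to check when each row is divided by its own total. The sketch below does this for the counts in the Gender and BMI table above; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the Gender and BMI table above
counts = pd.DataFrame(
    {'Underweight': [24, 12], 'Normal weight': [20, 17],
     'Overweight': [8, 18], 'Obese': [0, 9]},
    index=['Female', 'Male'],
)

# Divide each row by its total to get within-gender proportions
row_share = counts.div(counts.sum(axis=1), axis=0)
print((row_share * 100).round(1))
```

With the raw data frame, `pd.crosstab(df['Gender'], df['BMI_Category'], normalize='index')` should give the same proportions directly.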

II. Age and Label

Label         Normal Weight  Obese  Overweight  Underweight
Age_Category
18-30                     9      0           1           15
31-45                     8      1           5           11
46-60                     4      6           6            8
<18                       2      0           0            7
>60                       6      5           8            6

# Categorize Age
def categorize_age(age):
    if age < 18:
        return '<18'
    elif age <= 30:
        return '18-30'
    elif age <= 45:
        return '31-45'
    elif age <= 60:
        return '46-60'
    else:
        return '>60'

df['Age_Category'] = df['Age'].apply(categorize_age)

# Convert the Label column to numeric values
# (not needed for the cross-tabulation itself)
label_mapping = {
    'Underweight': 1,
    'Normal Weight': 2,
    'Overweight': 3,
    'Obese': 4
}
df['Label_numeric'] = df['Label'].map(label_mapping)

# Perform cross-tabulation of Age category and Label
cross_tab = pd.crosstab(df['Age_Category'], df['Label'])
print(cross_tab)

• Younger age groups (especially 18-30) tend to have higher proportions of underweight
individuals.

• Middle-aged groups (31-45 and 46-60) show a more balanced distribution across normal weight,
overweight, and obese categories.

• Older age groups (>60) have higher proportions of overweight and obese individuals.

• The <18 age group shows significant numbers in the underweight category but very few in the
other categories.
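
In the table above the age groups appear in alphabetical order, so "<18" comes after "46-60". If the rows should follow age order instead, the cross-tabulation can be reindexed. A minimal sketch with made-up rows; `AGE_ORDER` is an illustrative name.

```python
import pandas as pd

AGE_ORDER = ['<18', '18-30', '31-45', '46-60', '>60']

# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    'Age_Category': ['>60', '<18', '18-30', '46-60', '31-45', '<18'],
    'Label': ['Obese', 'Underweight', 'Underweight',
              'Overweight', 'Normal Weight', 'Normal Weight'],
})

# Build the table, then put the rows in age order
table = pd.crosstab(demo['Age_Category'], demo['Label']).reindex(AGE_ORDER)
print(table)
```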

III. Age and BMI

BMI Category  Normal Weight  Obese  Overweight  Underweight
Age Category
18-30                    10      0           3           12
31-45                    13      1           5            6
46-60                     4      4           8            8
<18                       3      0           0            6
>60                       7      4          10            4

# Function to categorize BMI (each value falls into exactly one range)

def categorize_bmi(bmi):
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal Weight'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

# Function to categorize age

def categorize_age(age):
    if age < 18:
        return '<18'
    elif age <= 30:
        return '18-30'
    elif age <= 45:
        return '31-45'
    elif age <= 60:
        return '46-60'
    else:
        return '>60'

# Apply the functions to create new columns

df['BMI Category'] = df['BMI'].apply(categorize_bmi)
df['Age Category'] = df['Age'].apply(categorize_age)

# Create a crosstab to summarize the data

crosstab = pd.crosstab(df['Age Category'], df['BMI Category'])

# Display the crosstab

print(crosstab)

• Younger age groups (particularly 18-30) have a larger prevalence of underweight people.

• Middle-aged groups (31-45 and 46-60) have a more equal mix of normal weight,
overweight, and obese categories.

• Individuals over the age of 60 are more likely to be overweight or obese, indicating
possible health hazards connected with aging.

• The <18 age group consists entirely of underweight and normal weight individuals;
none are overweight or obese.

Correlation Results

Variable Pair     Pearson Correlation Coefficient (r)   Interpretation (relationship strength & direction)
Age and Weight    0.4651                                moderate positive relationship
Gender and BMI    0.3423                                weak positive relationship
Age and Label     0.4524                                moderate positive relationship
Height and BMI    0.3543                                weak positive relationship
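
For reference, the Pearson coefficient is the sum of the products of the deviations from the two means, divided by the square root of the product of the two sums of squared deviations. The sketch below writes this out with NumPy; the function name `pearson_r` and the sample values are made up for illustration and are not the assignment data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r from deviations around the two means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical ages and weights, for illustration only
age = [15, 22, 30, 41, 55, 63, 70]
weight = [40, 55, 52, 70, 66, 85, 78]
r = pearson_r(age, weight)
print(round(r, 3))
```

The result should agree with `numpy.corrcoef` and with the pandas `corr` method used in this section.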

Age and Weight

# Drop non-numeric columns to calculate the Pearson correlation matrix
df_numeric = df.select_dtypes(include='number')

# Calculate the Pearson correlation matrix

correlation_matrix = df_numeric.corr(method='pearson')

# Extract the Pearson correlation coefficient between Age and Weight

pearson_corr_age_weight = correlation_matrix.loc['Age', 'Weight']

print(f"Pearson correlation coefficient between Age and Weight: {pearson_corr_age_weight:.3f}")

Gender and BMI

# Encode Gender as 0/1 so it can be correlated with BMI
gender_mapping = {
    'Male': 1,
    'Female': 0
}
df['Gender_numeric'] = df['Gender'].map(gender_mapping)

# Calculate the Pearson correlation coefficient between Gender and BMI

pearson_corr_gender_bmi = df['Gender_numeric'].corr(df['BMI'])

print(f"Pearson correlation coefficient between Gender and BMI: {pearson_corr_gender_bmi:.3f}")
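
When one variable is coded as 0/1, the sign of the Pearson coefficient depends only on which group receives the 1. The sketch below illustrates this with made-up rows: swapping the coding flips the sign and leaves the size unchanged.

```python
import pandas as pd

# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'BMI': [27.0, 18.7, 30.8, 21.2, 25.3, 16.7],
})

male_is_1 = demo['Gender'].map({'Male': 1, 'Female': 0})
female_is_1 = demo['Gender'].map({'Male': 0, 'Female': 1})

r_a = male_is_1.corr(demo['BMI'])
r_b = female_is_1.corr(demo['BMI'])
print(round(r_a, 3), round(r_b, 3))  # same magnitude, opposite sign
```

A positive coefficient under the Male = 1 coding therefore means that the male group has the higher mean BMI in the sample.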

Age and Label

# Encode the ordered Label categories as 1 to 4
label_mapping = {
    'Underweight': 1,
    'Normal Weight': 2,
    'Overweight': 3,
    'Obese': 4
}
df['Label_numeric'] = df['Label'].map(label_mapping)

# Calculate the Pearson correlation coefficient between Age and Label

pearson_corr_age_label = df['Age'].corr(df['Label_numeric'])

print(f"Pearson correlation coefficient between Age and Label: {pearson_corr_age_label:.3f}")
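
Since Label is an ordered category coded 1 to 4, a rank-based coefficient is sometimes reported alongside Pearson's r as a cross-check. This is a swapped-in technique, not part of the assignment: Spearman's rho, computed here as the Pearson correlation of the ranks. The rows are made up for illustration.

```python
import pandas as pd

def spearman_rho(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rho: Pearson correlation of the ranks (ties get average ranks)."""
    return float(x.rank().corr(y.rank()))

# Hypothetical rows, for illustration only
age = pd.Series([14, 22, 35, 41, 58, 66, 73])
label_numeric = pd.Series([1, 1, 2, 2, 3, 4, 3])
print(round(spearman_rho(age, label_numeric), 3))
```

If the two coefficients are close, the choice of 1 to 4 as equally spaced codes is unlikely to be driving the result.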

Height and BMI

# Calculate the Pearson correlation coefficient between Height and BMI
# (iloc[0, 1] picks the off-diagonal entry of the 2 x 2 correlation matrix)
correlation_height_bmi = df[['Height', 'BMI']].corr().iloc[0, 1]

# Create a results table

results_correlation = pd.DataFrame({
    'Variable 1': ['Height'],
    'Variable 2': ['BMI'],
    'Pearson Correlation Coefficient': [correlation_height_bmi]
})

# Display the results

print(results_correlation)

Interpretation:
1. Age and Weight (r = 0.4651):
o Strength: Moderate
o Direction: Positive
o Interpretation: As age increases, weight tends to increase moderately.
2. Gender and BMI (r = 0.3423):
o Strength: Weak
o Direction: Positive
o Interpretation: With Male coded as 1 and Female as 0, the positive sign means that males
tend to have a somewhat higher BMI than females, but the association is weak.
3. Age and Label (r = 0.4524):
o Strength: Moderate
o Direction: Positive
o Interpretation: There is a moderate tendency for the BMI category (Label) to increase as
age increases; that is, as individuals get older, they tend to move towards higher BMI
categories such as Overweight and Obese.
4. Height and BMI (r = 0.3543):
o Strength: Weak
o Direction: Positive
o Interpretation: As height increases, BMI tends to increase weakly.
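
The strength labels above can be applied consistently with a small helper. The cut-offs used here (below 0.4 weak, below 0.7 moderate, otherwise strong) are an assumption chosen to match the labels in this section; other textbooks use different thresholds.

```python
def describe_r(r, weak_below=0.4, strong_from=0.7):
    """Label a Pearson r by strength and direction (cut-offs are assumptions)."""
    size = abs(r)
    if size < weak_below:
        strength = 'weak'
    elif size < strong_from:
        strength = 'moderate'
    else:
        strength = 'strong'
    direction = 'positive' if r > 0 else 'negative' if r < 0 else 'no'
    return f'{strength} {direction} relationship'

# Coefficients reported in the Correlation Results table
for pair, r in [('Age and Weight', 0.4651), ('Gender and BMI', 0.3423),
                ('Age and Label', 0.4524), ('Height and BMI', 0.3543)]:
    print(f'{pair}: r = {r} -> {describe_r(r)}')
```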

Conclusion:
According to the Pearson correlation coefficients derived for the various pairings of
variables, age and weight show a moderate positive relationship (r = 0.4651), suggesting that
weight tends to increase with age. Similarly, there is a moderate positive association
between age and BMI category (r = 0.4524), indicating that older people tend to fall into higher
BMI categories such as overweight and obese. Gender and BMI have a weak positive correlation
(r = 0.3423), indicating only a weak relationship between BMI and gender. Finally, height and
BMI show a weak positive relationship (r = 0.3543), indicating that BMI tends to rise only
slightly as height increases.
c) Create a scatter plot for one of the variable pairs from the list:
• Age and Weight

Coding:
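# Scatterplot (imports are needed once, before the first plot)
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Weight', data=df)
plt.title('Scatterplot of Age and Weight')
plt.xlabel('Age')
plt.ylabel('Weight')
plt.show()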

• Gender and BMI

Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Gender', y='BMI', data=df)
plt.title('Scatterplot of Gender and BMI')
plt.xlabel('Gender')
plt.ylabel('BMI')
plt.show()
• Age and Label

Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Label', y='Age', data=df)
plt.title('Scatterplot of Age and Label')
plt.xlabel('Label')
plt.ylabel('Age')
plt.show()
• Height and BMI

Coding:
# Scatterplot
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Height', y='BMI', data=df)
plt.title('Scatterplot of Height and BMI')
plt.xlabel('Height')
plt.ylabel('BMI')
plt.show()
Provide an interpretation of the scatter plot (e.g., the nature of the relationship, any
visible patterns, or anomalies). Compare the result with the relationship found in the
Pearson correlation (2b).

Interpretation
Types of Attributes and Description

Age and Weight
• Positive pattern shown in the scatter plot, where weight generally increases as age increases.
• The wide dispersion of the data points shows that weight varies considerably at any given age.
• Beyond 60 years of age, there is greater weight dispersion.
• There are outliers, particularly in the older age categories (70-100 years), where the weights
range widely from 40 to over 100.
• Both the scatter plot and the Pearson correlation value suggest a moderate positive
relationship between age and weight.

Gender and BMI
• No clear pattern between gender and BMI.
• Weak positive correlation (0.342), but not visually evident in the scatter plot.
• No anomalies or outliers; consistent data points within each gender category.

Age and Label
• Data points are dispersed within each weight category, indicating variability in age across
all categories.
• No clear trend is visible in the scatter plot.
• A broad range of ages is represented in each weight group, suggesting that age is not a
determining factor in weight categorization.
• The scatter plot does not demonstrate a clear linear trend between age and weight category;
however, the Pearson correlation suggests a moderate relationship.

Height and BMI
• The correlation value of 0.354 indicates a weak relationship: while there is a positive
relationship, height alone is not highly predictive of BMI.
• Points are scattered without a clear linear pattern, implying that factors other than height
may influence BMI.
• No significant outliers or anomalies are detected, indicating a consistent relationship
across the dataset.
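
One way to compare a scatter plot with its Pearson coefficient is to fit a straight trend line through the points: the slope of the least-squares line equals r multiplied by the ratio of the two standard deviations, so the slope and r always share the same sign. A minimal sketch with made-up values; `trend_line` is an illustrative helper, not part of the assignment code.

```python
import numpy as np

def trend_line(x, y):
    """Least-squares line y = slope * x + intercept through the points."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)

# Hypothetical ages and weights, for illustration only
age = np.array([15, 22, 30, 41, 55, 63, 70], dtype=float)
weight = np.array([40, 55, 52, 70, 66, 85, 78], dtype=float)

slope, intercept = trend_line(age, weight)
r = np.corrcoef(age, weight)[0, 1]
# The slope equals r scaled by the ratio of the two standard deviations
print(round(slope, 3), round(r * weight.std() / age.std(), 3))
```

Drawing this line over the scatter plot (for example with `plt.plot(age, slope * age + intercept)`) makes a weak or moderate trend easier to see among widely dispersed points.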

d) Create box plots for:

• The BMI attribute (overall).

Coding:
# Boxplot of BMI
plt.figure(figsize=(10, 6))
sns.boxplot(y='BMI', data=df)
plt.title('Boxplot of BMI Overall')
plt.ylabel('BMI')
plt.show()
Interpretation:
This box plot shows that the middle half of the BMI values in the dataset is concentrated between
about 17 and 26, with the median around 21. The BMI values range from a minimum of about 4 to a
maximum of about 37, consistent with the minimum (3.9) and maximum (37.2) reported in part 1(a),
indicating the variability in BMI across the dataset. The median line sits close to the middle of
the box, which suggests that the central BMI values are spread fairly evenly on either side of
the median.
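
The quantities a box plot draws (quartiles, median, and the 1.5 x IQR fences used to flag outliers) can be computed directly and compared with what is read off the plot. The sketch below uses the BMI values transcribed from the stem-and-leaf plot in 1(b), assuming each leaf is rounded to one decimal place (for example 6.69 read as 6.7); that rounding is my assumption.

```python
import numpy as np

# (value, count) pairs transcribed from the BMI stem-and-leaf plot in 1(b),
# assuming each leaf is rounded to one decimal (e.g. 6.69 read as 6.7)
BMI_COUNTS = [
    (3.9, 2), (5.6, 4), (8.3, 4), (10.0, 7), (13.3, 8), (16.7, 11), (18.7, 5),
    (20.0, 6), (21.2, 8), (22.5, 4), (22.7, 7), (23.4, 2), (24.2, 5), (25.0, 4),
    (25.3, 3), (26.1, 5), (27.0, 4), (27.3, 3), (27.5, 3), (28.9, 3), (29.1, 1),
    (30.8, 1), (31.2, 3), (34.2, 3), (37.2, 2),
]
bmi = np.repeat([v for v, _ in BMI_COUNTS], [c for _, c in BMI_COUNTS])

q1, median, q3 = np.percentile(bmi, [25, 50, 75])
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = bmi[(bmi < low_fence) | (bmi > high_fence)]
print(len(bmi), q1, median, q3, len(outliers))
```

Points outside the two fences are the ones a box plot with the usual 1.5 x IQR whiskers would draw as individual outlier markers.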

• BMI by target class (e.g., Normal Weight, Overweight, Underweight, Obese).

Coding:
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Label', y='BMI', data=df)
plt.title('Boxplots of BMI Based On Target Class')
plt.xlabel('Label')
plt.ylabel('BMI')
plt.show()

Interpretation:
• For Normal Weight, the median BMI is in the low 20s and sits near the middle of the box,
indicating consistent BMI values within this group.
• For Overweight, the median is around 27 and sits toward the top of the box, so more of the box
lies below the median.
• For Underweight, the median is around 18 and sits near the middle of the box; the wider IQR
shows more variability in BMI values.
• For Obese, the median is around 31 and sits toward the bottom of the box, so more of the box
lies above the median.
• No outliers are detected in any of the target classes.
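
The statement that no class has outliers can be checked numerically by applying the 1.5 x IQR rule within each class. The sketch below is one way to do this with pandas; the rows are made up for illustration (with 35.0 planted as an outlier), and `iqr_outliers` is a hypothetical helper name.

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying beyond 1.5 * IQR from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    spread = 1.5 * (q3 - q1)
    return values[(values < q1 - spread) | (values > q3 + spread)]

# Hypothetical rows, for illustration only; 35.0 is planted as an outlier
demo = pd.DataFrame({
    'Label': ['Underweight'] * 5 + ['Normal Weight'] * 6,
    'BMI': [13.3, 16.7, 16.7, 18.0, 10.0,
            21.2, 22.5, 22.7, 23.4, 24.2, 35.0],
})

# Count the outliers separately within each target class
outlier_counts = demo.groupby('Label')['BMI'].apply(lambda s: len(iqr_outliers(s)))
print(outlier_counts)
```

Applied to the assignment data, a count of zero for every class would support the last point above.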

Chapter 4 Sampling Distributions PDF
74 pages
Performance Task #4 - Data Visualization (MMW)
No ratings yet
Performance Task #4 - Data Visualization (MMW)
2 pages
Students Attitudes Toward Blended Teaching Among Students of The University of Calcutta
No ratings yet
Students Attitudes Toward Blended Teaching Among Students of The University of Calcutta
8 pages
Evaluating The Role of Artificial Intelligence in Advancing
No ratings yet
Evaluating The Role of Artificial Intelligence in Advancing
13 pages
Full download (Ebook) Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later, Volume 17 by T. Fomby, R. Carter Hill, Thomas B. Fomby ISBN 9780080547428, 9780762310753, 0762310758, 0080547427 pdf docx
100% (2)
Full download (Ebook) Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later, Volume 17 by T. Fomby, R. Carter Hill, Thomas B. Fomby ISBN 9780080547428, 9780762310753, 0762310758, 0080547427 pdf docx
77 pages
Unit 4 Quality Control
No ratings yet
Unit 4 Quality Control
32 pages
Estimation and Hypothesis
100% (1)
Estimation and Hypothesis
32 pages
A Bass Diffusion Model Analysis - Understanding Alternative Fuel V PDF
No ratings yet
A Bass Diffusion Model Analysis - Understanding Alternative Fuel V PDF
53 pages
Sol 08
No ratings yet
Sol 08
16 pages
Worksheet Econometrics I
100% (2)
Worksheet Econometrics I
6 pages
Statistics 1
100% (1)
Statistics 1
44 pages
E. Summary of Findings, Conclusion, and Recomendations
No ratings yet
E. Summary of Findings, Conclusion, and Recomendations
3 pages
Case Study Focsa Cristina
No ratings yet
Case Study Focsa Cristina
20 pages
Tooth Size Discrepancies in An Orthodontic Population
No ratings yet
Tooth Size Discrepancies in An Orthodontic Population
7 pages
Hilda Noor Farikha - 7101418106 UTS Ekomet
No ratings yet
Hilda Noor Farikha - 7101418106 UTS Ekomet
8 pages
Grup (1 Caz/2 Martor) Sex (F/M) Predispozitie Genetica (0 Nu/1 Da) Inaltime (CM) Greutate (KG) Colesterol (MG/DL) HDL (MG/DL)
No ratings yet
Grup (1 Caz/2 Martor) Sex (F/M) Predispozitie Genetica (0 Nu/1 Da) Inaltime (CM) Greutate (KG) Colesterol (MG/DL) HDL (MG/DL)
12 pages


Minimum (1): The smallest "id" value is 1.

Maximum (110): The largest "id" value is 110.

Sum (6053): The sum of all "id" values is 6053.

Count (108): The "id" variable has 108 observations.

Minimum (11): The youngest age in the dataset is 11 years.

Maximum (112): The oldest age in the dataset is 112 years.

Count (108): There are 108 age observations.

Maximum (210): The tallest height in the dataset is 210 cm.

Count (108): There are 108 height observations.

Minimum (10): The lowest weight in the dataset is 10 kg.

Maximum (120): The highest weight in the dataset is 120 kg.

Count (108): There are 108 weight observations.

Minimum (3.9): The lowest BMI in the dataset is 3.9.

Maximum (37.2): The highest BMI in the dataset is 37.2.

Sum (2219): The sum of all BMIs in the dataset is 2219.

Count (108): There are 108 BMI observations.
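The measures listed for each variable can be reproduced with a few lines of Python. The sketch below uses only the standard library; the `describe` helper and the `sample_bmi` values are hypothetical and are not the assignment's data or code. As a consistency check on the reported figures, the mean is the sum divided by the count, so for "id" 6053 / 108 ≈ 56.05.

```python
import math
import statistics


def describe(values):
    """Return the summary measures reported for one numeric column."""
    n = len(values)
    stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "mean": statistics.mean(values),
        "standard_error": stdev / math.sqrt(n),  # standard error of the mean
        "median": statistics.median(values),
        "mode": statistics.mode(values),  # most frequent value
        "std_dev": stdev,
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


# Hypothetical values for illustration only, not the assignment's data.
sample_bmi = [18.5, 22.0, 22.0, 27.5, 31.0]
summary = describe(sample_bmi)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The standard error here is the sample standard deviation divided by the square root of the count, which is the usual definition reported by spreadsheet descriptive-statistics tools.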

# Plotting the histograms for Age

# Histogram for Age

# Show the plot

# Show the plot

Stem-and-leaf plot for Age

Histogram of age distribution:

• The histogram depicts the frequency distribution of ages.

• The younger to middle-aged groups (20-60 years) have the highest frequencies.

Histogram of Age Groups:

• Peaks occur around ages 20-30 and 50-60.

• The distribution is more even between 30 and 60.

• Ages above 70 occur less frequently.
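The frequencies a histogram displays are simply counts per bin. A minimal sketch of that counting step, using only the standard library, is shown here; `bin_counts`, `BIN_WIDTH`, and the `sample_ages` values are hypothetical and chosen for illustration, not taken from the dataset.

```python
from collections import Counter

BIN_WIDTH = 10  # assumed bin width of ten years


def bin_counts(values, width=BIN_WIDTH):
    """Count values per bin [start, start + width), keyed by bin start."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical ages for illustration only, not the assignment's data.
sample_ages = [18, 23, 25, 29, 34, 41, 52, 55, 58, 63, 77]
age_bins = bin_counts(sample_ages)

# Print a text histogram: one '#' per observation in the bin.
for start, count in age_bins.items():
    print(f"{start:>3}-{start + BIN_WIDTH - 1:<3} {'#' * count}")
```

Peaks such as those described for ages 20-30 and 50-60 would appear as the longest rows in this kind of output.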

Age distribution using a stem-and-leaf plot:

• The stem-and-leaf plot shows every individual age value.

• The data are grouped by tens (e.g., 10-19, 20-29, 30-39), with the tens digit as the stem and the units digit as the leaf.

Age group using a stem-and-leaf plot:

• The distribution is less dense beyond 60 years in both representations.
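Grouping by tens can be sketched in plain Python as follows. The `stem_and_leaf` helper and the `sample_ages` values are hypothetical, and the code assumes integer ages; it is an illustration of the idea rather than the code used in the assignment.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group integers by tens: stem = value // 10, leaf = value % 10."""
    stems = defaultdict(list)
    for v in sorted(values):  # sorting keeps stems and leaves in ascending order
        stems[v // 10].append(v % 10)
    return dict(stems)


# Hypothetical ages for illustration only, not the assignment's data.
sample_ages = [18, 23, 25, 29, 34, 41, 52, 55, 58, 63, 77]
plot = stem_and_leaf(sample_ages)

for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Unlike the histogram, each row keeps the individual values, so the row for stem 2 reads back as the ages 23, 25, and 29.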

BMI shape and distribution for histogram

• The histogram indicates a bimodal distribution with two peaks.

• The first peak occurs in the 5-10 BMI range.

• The second peak occurs in the 20-25 BMI range.

• Frequencies drop sharply in the 10-15 BMI range.

• The highest frequency (14) is observed in the 25-30 BMI range.

• The stem-and-leaf plot gives a detailed breakdown of individual BMI values.

• Each stem contains clusters of values around 6, 8, and 9, confirming the

I. Gender and BMI

Gender Underweight Normal weight Overweight Obese

# Perform cross-tabulation of Gender and BMI categories

II. Age and Label

# Convert the Label column to numeric values

# Perform cross-tabulation of Age category and Label

III. Age and BMI

# Function to categorize BMI

# Function to categorize age

# Apply the functions to create new columns

# Create a crosstab to summarize the data

# Display the crosstab
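The steps named in the comments above (categorize BMI, categorize age, build and display a crosstab) can be sketched as follows using only the standard library. The BMI thresholds are the commonly used WHO adult cut-offs and are an assumption here, since the assignment's own thresholds are not shown; the age bands and the `records` values are hypothetical.

```python
from collections import Counter


def categorize_bmi(bmi):
    # Assumed WHO adult cut-offs; the assignment's thresholds are not shown.
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal weight"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def categorize_age(age):
    # Hypothetical age bands chosen for illustration.
    if age < 30:
        return "Under 30"
    if age < 60:
        return "30-59"
    return "60 and over"


# Hypothetical (age, BMI) records, not the assignment's data.
records = [(22, 17.9), (25, 23.4), (37, 26.1), (45, 31.2), (58, 24.0), (66, 28.7)]

# Count how many records fall into each (age band, BMI category) cell.
crosstab = Counter((categorize_age(a), categorize_bmi(b)) for a, b in records)

bmi_levels = ["Underweight", "Normal weight", "Overweight", "Obese"]
age_levels = ["Under 30", "30-59", "60 and over"]
print("Age group".ljust(12) + "".join(level.rjust(15) for level in bmi_levels))
for age_level in age_levels:
    row = "".join(str(crosstab[(age_level, level)]).rjust(15) for level in bmi_levels)
    print(age_level.ljust(12) + row)
```

A `Counter` returns 0 for cells with no records, so empty combinations print as zeros rather than raising an error.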

Age and Weight

# Calculate the Pearson correlation matrix

# Extract the Pearson correlation coefficient between Age and Weight

print(f"Pearson correlation coefficient between Age and Weight: {pearson_corr_age_weight:.3f}")
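For reference, the coefficient printed above is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it directly with the standard library; the `pearson` helper and the `ages` and `weights` values are hypothetical and are not the assignment's data or code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only, not the assignment's data.
ages = [20, 30, 40, 50, 60]
weights = [55, 62, 70, 74, 85]
pearson_corr_age_weight = pearson(ages, weights)
print(f"Pearson correlation coefficient between Age and Weight: {pearson_corr_age_weight:.3f}")
```

The result always lies between -1 and 1. When one variable is categorical, such as Gender, it has to be coded numerically first (for example 0 and 1), and the coefficient then measures the difference between the two groups rather than a linear trend.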

Gender and BMI

# Calculate the Pearson correlation coefficient between Gender and BMI

print(f"Pearson correlation coefficient between Gender and BMI: {pearson_corr_gender_bmi:.3f}")

# Calculate the Pearson correlation coefficient between Age and Label

print(f"Pearson correlation coefficient between Age and Label: {pearson_corr_age_label:.3f}")

Height and BMI

# Create a results table

# Display the results

• Gender and BMI

• Age and Label: Data points are dispersed within each

d) Create box plots for:

• The BMI attribute (overall).

• BMI by target class (e.g., Normal Weight, Overweight, Underweight, Obese).
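A box plot draws the quartiles, the extremes, and any points beyond the whisker fences. The sketch below computes those numbers for BMI overall and per class using only the standard library. The `five_number_summary` helper, the 1.5 × IQR outlier rule (a common convention, assumed here), and the `records` values are illustrative and not taken from the assignment.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Quartiles, extremes, and 1.5 * IQR outliers: the numbers a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values),
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Hypothetical (class, BMI) pairs for illustration only, not the assignment's data.
records = [
    ("Underweight", 15.0), ("Underweight", 16.5), ("Underweight", 18.0),
    ("Obese", 31.0), ("Obese", 33.5), ("Obese", 36.0),
]

# Group BMI values by target class.
bmi_by_class = defaultdict(list)
for label, bmi in records:
    bmi_by_class[label].append(bmi)

overall = five_number_summary([bmi for _, bmi in records])
per_class = {label: five_number_summary(v) for label, v in bmi_by_class.items()}

print("Overall:", overall)
for label, stats in per_class.items():
    print(f"{label}:", stats)
```

Comparing the per-class medians and box widths side by side is what the grouped box plot shows visually.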
