Document
Understand the shape and structure of the data and present a description of the dataset.
Dataset Description
df.info()
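As a minimal sketch of this structural overview, the snippet below builds a tiny stand-in frame and inspects its shape and non-null counts. The values are made up for illustration, and the `country` and `year` columns are assumptions; only `lit_rate_adult_pct` and `gov_exp_pct_gdp` are named in this report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],        # hypothetical column
    "year": [2000, 2001, 2000, 2001],       # hypothetical column
    "lit_rate_adult_pct": [np.nan, 85.2, 91.0, np.nan],
    "gov_exp_pct_gdp": [3.1, np.nan, 4.4, 4.6],
})

# Shape: number of rows and columns.
n_rows, n_cols = df.shape
print(f"{n_rows} rows x {n_cols} columns")

# Column names, dtypes, and non-null counts in one summary.
df.info()

# The same non-null counts as a Series, convenient for sorting or filtering.
non_null = df.notna().sum()
print(non_null.sort_values())
```

Sorting the non-null counts puts the sparsest columns first, which makes it easy to see where missing data is concentrated.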
Answer the following questions with respect to the data set chosen.
4. Are there any surprising relationships among the variables? Develop an initial hypothesis.
Summary:
Key Observations:
• Missing Values:
o Many columns have missing values, especially lit_rate_adult_pct (only
1877 non-null values) and gov_exp_pct_gdp.
• Outliers:
o Some values exceed logical bounds, e.g., primary school enrollment exceeding
100%.
• Imbalance:
o Distributions are skewed for some metrics (e.g., literacy rates and primary
school enrollment are concentrated at high values).
• Temporal Consistency:
o Missing values in some years for certain countries could affect time-series
analyses.
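The observations above could be checked with a few one-liners. This is a sketch on invented data: `school_enrol_primary_pct`, `country`, and `year` are hypothetical column names, and the numbers are not from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented and column names other than
# lit_rate_adult_pct are assumptions.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "lit_rate_adult_pct": [np.nan, 85.0, np.nan, 91.0, 92.5, np.nan],
    "school_enrol_primary_pct": [98.0, 104.5, 101.2, 88.0, np.nan, 95.0],
})

# Missing values: share of missing cells per column, worst first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Outliers: how many enrollment values exceed 100%.
over_100 = (df["school_enrol_primary_pct"] > 100).sum()

# Imbalance: sample skewness of the numeric metrics.
skewness = df[["lit_rate_adult_pct", "school_enrol_primary_pct"]].skew()

# Temporal consistency: number of years with a missing literacy rate per country.
gaps_by_country = df["lit_rate_adult_pct"].isna().groupby(df["country"]).sum()

print(missing_share, over_100, skewness, gaps_by_country, sep="\n\n")
```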
Relationship Analysis
1. Handle Missing Values
• Approach:
o Use mean/median imputation for numerical columns.
o Use mode or a placeholder value for categorical variables.
o Remove rows with excessive missing data, if applicable.
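The missing-value approach above can be sketched as follows. The data is invented, `region` is a hypothetical categorical column, and the threshold `min_non_null` is an assumed value that would need tuning on the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data; values are invented and `region` is a hypothetical column.
df = pd.DataFrame({
    "region": ["X", None, "X", "Y", None],
    "lit_rate_adult_pct": [80.0, 90.0, np.nan, 100.0, np.nan],
    "gov_exp_pct_gdp": [3.0, 4.0, 5.0, np.nan, np.nan],
})

# Remove rows with excessive missing data: keep only rows that have
# at least `min_non_null` non-null cells (assumed threshold).
min_non_null = 2
cleaned = df.dropna(thresh=min_non_null).copy()

# Numerical columns: fill gaps with each column's median.
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())

# Categorical columns: fill gaps with each column's most frequent value.
cat_cols = cleaned.select_dtypes(exclude="number").columns
for col in cat_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(cleaned)
```

Rows are dropped before imputing here so that nearly empty rows do not end up consisting entirely of filled-in values; the median is used rather than the mean because it is less sensitive to skewed columns.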
2. Address Outliers
• Approach:
o Use boxplots to identify extreme outliers.
o Replace or cap values exceeding logical bounds (e.g., enrollment rates above 100%).
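A minimal sketch of this step, on invented values: it flags points outside the 1.5 × IQR whiskers (the rule a standard boxplot uses to draw outliers) and then caps enrollment at 100%. The column name `school_enrol_primary_pct` is hypothetical. Whether values above 100% are errors depends on how the rate is defined (gross enrollment ratios can exceed 100%), so the cap itself is an assumption.

```python
import pandas as pd

# Toy enrollment rates; values are invented and the column name is hypothetical.
enrol = pd.Series(
    [88.0, 95.0, 98.0, 101.2, 104.5, 160.0],
    name="school_enrol_primary_pct",
)

# Boxplot whisker rule: points beyond 1.5 * IQR from the quartiles are outliers.
q1, q3 = enrol.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = enrol[(enrol < lower) | (enrol > upper)]

# Cap values above the assumed logical bound of 100%.
capped = enrol.clip(upper=100.0)

print("Flagged as extreme:", flagged.tolist())
print("Values above 100% before capping:", int((enrol > 100).sum()))
print("Maximum after capping:", capped.max())
```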
3. Data Transformation
• Approach:
o Scale/normalize numerical columns if necessary.
o Convert categorical variables to factors (the category dtype in pandas).
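The transformation step might look like the sketch below, which shows two common scalings (z-score standardization and min-max normalization) and a conversion to the pandas `category` dtype, the closest pandas analogue of a factor. The data is invented, and `region` and the derived column names are hypothetical.

```python
import pandas as pd

# Toy data; values are invented and `region` is a hypothetical column.
df = pd.DataFrame({
    "region": ["X", "Y", "X", "Z"],
    "gov_exp_pct_gdp": [2.0, 4.0, 6.0, 8.0],
})

col = df["gov_exp_pct_gdp"]

# Standardize: zero mean and unit (population) standard deviation.
df["gov_exp_z"] = (col - col.mean()) / col.std(ddof=0)

# Normalize: rescale to the [0, 1] range.
df["gov_exp_minmax"] = (col - col.min()) / (col.max() - col.min())

# Store the categorical variable with the category dtype.
df["region"] = df["region"].astype("category")

print(df)
print(df.dtypes)
```

Which scaling to use, if any, depends on the later analysis: standardization suits methods that compare variables on different units, while min-max scaling keeps values in a fixed range.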
Q4. Refine the visualization (by adding additional variables, changing sorting or axis
scales, filtering or subsetting data, etc.) to develop better perspectives, explore
unexpected observations, or sanity check your assumptions and present the results.
• Improve visualization: