Document

The dataset consists of 5892 rows and 11 columns, focusing on educational metrics across various countries, including government expenditure on education, literacy rates, and school enrollment statistics. It contains both categorical and numerical data, with notable quality issues such as missing values and outliers. Initial hypotheses suggest that government investment and pupil-to-teacher ratios significantly impact educational outcomes.

Uploaded by abhishekcbanaj

Q1. Identify a dataset of interest from any application domain of your interest. Understand the shape and structure of the data and present a description of the dataset.

Dataset Description

1. Shape of the Dataset:
o The dataset contains 5892 rows and 11 columns.
2. Columns:
o country: Name of the country.
o country_code: Code representing the country.
o year: Year of the data point.
o gov_exp_pct_gdp: Government expenditure on education as a percentage of GDP.
o lit_rate_adult_pct: Adult literacy rate (percentage).
o pri_comp_rate_pct: Primary completion rate (percentage).
o pupil_teacher_primary: Pupil-to-teacher ratio in primary education.
o pupil_teacher_secondary: Pupil-to-teacher ratio in secondary education.
o school_enrol_primary_pct: Primary school enrollment rate (percentage).
o school_enrol_secondary_pct: Secondary school enrollment rate (percentage).
o school_enrol_tertiary_pct: Tertiary school enrollment rate (percentage).
3. Data Types:
o The dataset includes:
▪ 2 categorical columns (country, country_code).
▪ 1 integer column (year).
▪ 8 float columns (various educational metrics).
4. Missing Values:
o Multiple columns have missing values, particularly lit_rate_adult_pct, gov_exp_pct_gdp, and other educational indicators.

df.info()
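
The structural summary that df.info() reports (row count, column names, non-null counts) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code used for this report: the inline sample, its country names, and the describe_structure helper are hypothetical, and all values are made up.

```python
import csv
import io

# Hypothetical three-row sample with made-up values;
# the real file has 5892 rows and 11 columns.
SAMPLE_CSV = """country,country_code,year,gov_exp_pct_gdp,lit_rate_adult_pct
Country A,AAA,2010,3.5,
Country A,AAA,2011,3.4,60.0
Country B,BBB,2010,,95.0
"""

def describe_structure(csv_text):
    """Return (row count, column names, non-null count per column)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    # An empty cell is treated as a missing value.
    non_null = {c: sum(1 for r in rows if r[c] != "") for c in columns}
    return len(rows), columns, non_null

n_rows, columns, non_null = describe_structure(SAMPLE_CSV)
print(f"{n_rows} rows x {len(columns)} columns")
for name in columns:
    print(f"{name:<22} {non_null[name]} non-null")
```

With pandas loaded, df.shape and df.info() give the same information directly, along with the inferred data type of each column.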

Q2. Answer the following questions with respect to the chosen dataset.

1. What variables does the dataset contain?

2. How are they distributed?

3. Are there any notable data quality issues?

4. Are there any surprising relationships among the variables? Develop initial hypotheses.

1. What variables does the dataset contain?

The dataset contains the following variables:

• country: Name of the country.
• country_code: Short code representing the country (e.g., AFG for Afghanistan).
• year: Year of the observation.
• gov_exp_pct_gdp: Government expenditure on education as a percentage of GDP.
• lit_rate_adult_pct: Adult literacy rate (percentage).
• pri_comp_rate_pct: Primary school completion rate (percentage).
• pupil_teacher_primary: Pupil-to-teacher ratio in primary education.
• pupil_teacher_secondary: Pupil-to-teacher ratio in secondary education.
• school_enrol_primary_pct: Percentage of primary school enrollment.
• school_enrol_secondary_pct: Percentage of secondary school enrollment.
• school_enrol_tertiary_pct: Percentage of tertiary school enrollment.

2. How are they distributed?

Here’s how we’ll analyze the distributions:

• Use summary statistics for numeric variables.
• Visualize distributions for key variables (e.g., histograms, box plots).
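
As a rough sketch of the summary-statistics step, the function below computes the count, mean, quartiles, and range of one numeric column while skipping missing entries. The summarize helper and the sample values are assumptions made for illustration; they are not taken from the dataset.

```python
import statistics

def summarize(values):
    """Summary statistics for one numeric column, skipping missing (None) entries."""
    present = sorted(v for v in values if v is not None)
    # Quartiles by linear interpolation between the observed data points.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "min": present[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": present[-1],
    }

# Made-up values standing in for a column such as gov_exp_pct_gdp.
sample = [2.0, 3.0, None, 4.0, 5.0, 6.0]
stats = summarize(sample)
for key, value in stats.items():
    print(f"{key:>6}: {value}")
```

A mean well above the median, or a maximum far beyond the third quartile, is a quick hint that a column is right-skewed or contains outliers. In pandas, df.describe() reports comparable figures for every numeric column at once.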

Distribution Analysis

Summary:

• Variables like gov_exp_pct_gdp and pupil_teacher_primary are moderately spread, with most values concentrated near their means.
• Some metrics (e.g., school_enrol_primary_pct, school_enrol_secondary_pct) have outliers: their maximum values exceed the expected ranges.
• Adult literacy rates (lit_rate_adult_pct) generally cluster near the higher percentages, with some countries exhibiting very low literacy.

Key Observations:

• Government expenditure on education (gov_exp_pct_gdp): Mean is ~4.3% of GDP, with values ranging up to 15.8%.
• Literacy rates (lit_rate_adult_pct): Values center around 79.5%, but some are as low as 14%.
• School enrollment: Primary enrollment exceeds 100% in some cases (likely due to late enrollments or repeated grades).
• Pupil-teacher ratios: Wide variation between primary and secondary levels.

3. Are there any notable data quality issues?

Yes, the dataset has a few notable data quality issues:

• Missing Values:
o Many columns have missing values, especially lit_rate_adult_pct (only 1877 non-null values out of 5892 rows, about 32%) and gov_exp_pct_gdp.
• Outliers:
o Some values exceed logical bounds, e.g., primary school enrollment exceeding 100%.
• Imbalance:
o Several metrics are skewed toward high values (e.g., literacy rates and primary school enrollment).
• Temporal Consistency:
o Missing values in some years for certain countries could affect time-series analyses.
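
The temporal-consistency issue can be checked with a sketch like the one below, which lists, for each country, the years inside its own observed span where a given column has no value. The missing_years helper and the sample records are hypothetical, and the values are made up.

```python
from collections import defaultdict

def missing_years(records, column):
    """Map each country to the years in its own span where `column` has no value."""
    seen = defaultdict(set)      # every year that appears for the country
    observed = defaultdict(set)  # years where the column is populated
    for rec in records:
        seen[rec["country"]].add(rec["year"])
        if rec.get(column) is not None:
            observed[rec["country"]].add(rec["year"])
    gaps = {}
    for country, years in seen.items():
        span = range(min(years), max(years) + 1)
        gaps[country] = [y for y in span if y not in observed[country]]
    return gaps

# Made-up records: Country A lacks a 2011 value and has no 2012 row at all.
records = [
    {"country": "Country A", "year": 2010, "lit_rate_adult_pct": 70.0},
    {"country": "Country A", "year": 2011, "lit_rate_adult_pct": None},
    {"country": "Country A", "year": 2013, "lit_rate_adult_pct": 72.0},
    {"country": "Country B", "year": 2010, "lit_rate_adult_pct": 90.0},
    {"country": "Country B", "year": 2011, "lit_rate_adult_pct": 91.0},
]
gaps = missing_years(records, "lit_rate_adult_pct")
print(gaps)
```

The check treats a year with no row and a year with an empty cell the same way, since both leave a hole in the time series.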

4. Are there any surprising relationships among the variables?

Relationship Analysis

From the correlation matrix, here are the key observations:

1. Strong Positive Correlations:
o lit_rate_adult_pct and school_enrol_secondary_pct (0.88): Countries with higher adult literacy rates tend to have higher secondary school enrollment.
o lit_rate_adult_pct and pri_comp_rate_pct (0.86): Higher literacy is associated with better primary school completion rates.
o school_enrol_secondary_pct and school_enrol_tertiary_pct (0.78): Strong linkage between secondary and tertiary enrollment rates.
2. Negative Correlations:
o pupil_teacher_primary and pri_comp_rate_pct (-0.78): Higher pupil-to-teacher ratios are associated with lower primary completion rates.
o pupil_teacher_primary and lit_rate_adult_pct (-0.82): Higher pupil-to-teacher ratios are associated with lower adult literacy.
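
Each coefficient in a correlation matrix like the one above is a Pearson correlation. Because most columns have missing values, a pair of columns can only be compared on the rows where both are present. The sketch below shows that calculation under stated assumptions: the pearson helper and the sample values are hypothetical and are not the figures behind the reported coefficients.

```python
import math

def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Made-up values: completion falls by 10 points for every 10 extra pupils per teacher.
pupil_teacher = [20.0, 30.0, None, 40.0, 50.0]
completion = [95.0, 85.0, 70.0, 75.0, 65.0]
r = pearson(pupil_teacher, completion)
print(round(r, 2))
```

Since each pair of columns may be compared on a different subset of rows, coefficients involving sparse columns such as lit_rate_adult_pct rest on far fewer observations than the full 5892 rows. A coefficient also describes association only and does not by itself show that one variable causes the other.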

Initial Hypotheses

Based on the data:

1. Government investment matters: Countries with higher education expenditure as a percentage of GDP might have better enrollment rates and literacy.
2. Quality of education impacts outcomes: Lower pupil-to-teacher ratios likely lead to higher literacy rates and primary completion rates.
3. Secondary and tertiary education linkage: Countries with higher secondary enrollment rates may also show higher tertiary enrollment due to a smoother education pipeline.
4. Regional disparities: Developing countries may exhibit higher pupil-to-teacher ratios and lower literacy or enrollment rates.

Q3. Resolve the data quality issues identified using R functions.

1. Handle Missing Values

• Approach:
o Use mean/median imputation for numerical columns.
o Use mode or a placeholder value for categorical variables.
o Remove rows with excessive missing data, if applicable.
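
The imputation and row-removal steps can be sketched as follows. The question asks for R; this sketch uses Python to match the df.info() call earlier in the document. The helper names, the threshold, and the sample values are all assumptions made for illustration.

```python
import statistics

def impute_median(values):
    """Replace missing (None) entries with the median of the observed values."""
    present = [v for v in values if v is not None]
    fill = statistics.median(present)
    return [fill if v is None else v for v in values]

def drop_sparse_rows(rows, max_missing):
    """Keep only rows with at most `max_missing` missing fields."""
    return [r for r in rows if sum(v is None for v in r.values()) <= max_missing]

# Made-up values standing in for a numeric column.
column = [3.0, None, 5.0, 4.0, None]
imputed = impute_median(column)
print(imputed)

# Made-up rows: the first one is missing two of its three fields.
rows = [
    {"gov_exp_pct_gdp": 1.0, "lit_rate_adult_pct": None, "pri_comp_rate_pct": None},
    {"gov_exp_pct_gdp": 2.0, "lit_rate_adult_pct": 3.0, "pri_comp_rate_pct": None},
]
kept = drop_sparse_rows(rows, max_missing=1)
print(len(kept))
```

The median is less sensitive to outliers than the mean, which matters for skewed columns. In R the same idea can be written as x[is.na(x)] <- median(x, na.rm = TRUE). For a column as sparse as lit_rate_adult_pct, imputing most of the values would flatten its distribution, so dropping or flagging may be the safer choice.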

2. Address Outliers

• Approach:
o Use boxplots to identify extreme outliers.
o Replace or cap values exceeding logical bounds (e.g., enrollment rates above 100%).
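
A boxplot flags points beyond 1.5 times the interquartile range from the quartiles. The sketch below computes those fences and then caps values at an upper bound. It is an illustration only: the helper names and sample values are hypothetical, and the cap of 100 follows the example given in the approach above.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Lower and upper boxplot fences: quartiles minus/plus k times the IQR."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap(values, upper):
    """Limit every present value to `upper`, leaving missing entries as None."""
    return [v if v is None else min(v, upper) for v in values]

# Made-up values standing in for school_enrol_primary_pct.
enrol = [90.0, 150.0, 95.0, None, 105.0, 100.0]
low, high = iqr_fences(enrol)
outliers = [v for v in enrol if v is not None and not low <= v <= high]
capped = cap(enrol, 100.0)
print(low, high, outliers)
print(capped)
```

In R, pmin(x, 100) performs the same capping. If the enrollment columns are gross ratios, values above 100% can be legitimate (over-age or repeating pupils are counted), so capping should be a deliberate choice rather than a default.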

3. Data Transformation

• Approach:
o Scale/normalize numerical columns if necessary.
o Convert categorical variables to factors.
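
The two transformations can be sketched as follows: z-score scaling for a numeric column and integer coding for a categorical one, which is roughly what scale() and factor() do in R. The helper names and sample values are hypothetical.

```python
import statistics

def standardize(values):
    """Z-score scaling: subtract the mean and divide by the sample standard deviation."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    sd = statistics.stdev(present)
    return [None if v is None else (v - mean) / sd for v in values]

def encode_categories(labels):
    """Return the sorted distinct levels and the integer code of each label."""
    levels = sorted(set(labels))
    index = {level: i for i, level in enumerate(levels)}
    return levels, [index[label] for label in labels]

# Made-up values; missing entries stay missing after scaling.
scaled = standardize([2.0, None, 4.0, 6.0])
levels, codes = encode_categories(["AFG", "ALB", "AFG"])
print(scaled)
print(levels, codes)
```

Scaling matters mainly when columns with different units (percentages versus pupil-to-teacher ratios) are combined in one model or distance measure; it does not change correlations between columns.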

Q4. Refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions and present the results.

• Filter or subset data:
o Focus on specific regions, years, or countries.
o Group countries based on development level or region (neither is among the 11 columns, so this requires joining an external lookup keyed on country_code).
• Add additional variables:
o Compare metrics like government expenditure and literacy.
o Add layers to visualizations (e.g., year trends or country categories).
• Improve visualization:
o Sort data for clarity.
o Use log scales for skewed data.
o Add annotations to highlight key insights.

1. Compare Literacy vs. Government Expenditure by Year
2. Trend of Enrollment Rates Over Time
3. Heatmap of Pupil-Teacher Ratio by Region and Year
4. Boxplot: Regional Variation in Enrollment Rates
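
A trend plot such as item 2 needs one value per year, so the country-level rows have to be aggregated first. The sketch below averages a column by year while skipping missing values. It is a minimal illustration: the mean_by_year helper and the sample records are hypothetical, and the resulting pairs could be passed to any plotting library.

```python
from collections import defaultdict

def mean_by_year(records, column):
    """Average `column` per year over the rows where it is present."""
    totals = defaultdict(lambda: [0.0, 0])  # year -> [running sum, count]
    for rec in records:
        value = rec.get(column)
        if value is not None:
            totals[rec["year"]][0] += value
            totals[rec["year"]][1] += 1
    return {year: total / count for year, (total, count) in sorted(totals.items())}

# Made-up records standing in for school_enrol_primary_pct.
records = [
    {"year": 2011, "school_enrol_primary_pct": 85.0},
    {"year": 2010, "school_enrol_primary_pct": 80.0},
    {"year": 2011, "school_enrol_primary_pct": None},
    {"year": 2010, "school_enrol_primary_pct": 90.0},
    {"year": 2011, "school_enrol_primary_pct": 95.0},
]
trend = mean_by_year(records, "school_enrol_primary_pct")
print(trend)
```

Because the set of reporting countries changes from year to year, a shift in the yearly mean can reflect which countries reported rather than a real change, so it is worth plotting the per-year count alongside the mean.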