The dataset consists of 5892 rows and 11 columns, focusing on educational metrics across various countries, including government expenditure on education, literacy rates, and school enrollment statistics. It contains both categorical and numerical data, with notable quality issues such as missing values and outliers. Initial hypotheses suggest that government investment and pupil-to-teacher ratios significantly impact educational outcomes.

Uploaded by

abhishekcbanaj

Q1. Identify a dataset of interest from any application domain of your interest. Understand the shape and structure of the data and present a description of the dataset.

Dataset Description

1. Shape of the Dataset:
o The dataset contains 5892 rows and 11 columns.
2. Columns:
o country: Name of the country.
o country_code: Code representing the country.
o year: Year of the data point.
o gov_exp_pct_gdp: Government expenditure on education as a percentage of GDP.
o lit_rate_adult_pct: Adult literacy rate (percentage).
o pri_comp_rate_pct: Primary completion rate (percentage).
o pupil_teacher_primary: Pupil-to-teacher ratio in primary education.
o pupil_teacher_secondary: Pupil-to-teacher ratio in secondary education.
o school_enrol_primary_pct: Primary school enrollment rate (percentage).
o school_enrol_secondary_pct: Secondary school enrollment rate (percentage).
o school_enrol_tertiary_pct: Tertiary school enrollment rate (percentage).
3. Data Types:
o The dataset includes:
▪ 2 categorical columns (country, country_code).
▪ 1 integer column (year).
▪ 8 float columns (various educational metrics).
4. Summary Statistics:
o The dataset has missing values in multiple columns, particularly in lit_rate_adult_pct, gov_exp_pct_gdp, and other educational indicators.

df.info()
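
A minimal sketch of the structural checks behind this description, assuming the data sits in a pandas DataFrame named df (the name used in the call above). The rows are made-up toy values that reuse a few of the dataset's column names; they are not the real data.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; values are invented, column names follow the dataset.
df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "country_code": ["AAA", "AAA", "BBB", "BBB"],
    "year": [2000, 2001, 2000, 2001],
    "gov_exp_pct_gdp": [4.1, np.nan, 3.2, 3.5],
    "lit_rate_adult_pct": [np.nan, 88.0, np.nan, 61.5],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one dtype per column: text, integer year, float metrics
df.info()          # prints per-column non-null counts and dtypes

# The same non-null counts as a Series, so later code can use them.
non_null = df.notna().sum()
print(non_null)
```

On the real file, df.shape would be expected to report the 5892 rows and 11 columns listed above.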

Q2. Answer the following questions with respect to the dataset chosen.

1. What variables does the dataset contain?

2. How are they distributed?

3. Are there any notable data quality issues?

4. Are there any surprising relationships among the variables? Develop initial hypotheses.

1. What variables does the dataset contain?

The dataset contains the following variables:

• country: Name of the country.

• country_code: Short code representing the country (e.g., AFG for Afghanistan).
• year: Year of the observation.
• gov_exp_pct_gdp: Government expenditure on education as a percentage of GDP.
• lit_rate_adult_pct: Adult literacy rate (percentage).
• pri_comp_rate_pct: Primary school completion rate (percentage).
• pupil_teacher_primary: Pupil-to-teacher ratio in primary education.
• pupil_teacher_secondary: Pupil-to-teacher ratio in secondary education.
• school_enrol_primary_pct: Percentage of primary school enrollment.
• school_enrol_secondary_pct: Percentage of secondary school enrollment.
• school_enrol_tertiary_pct: Percentage of tertiary school enrollment.

2. How are they distributed?

Here’s how we’ll analyze the distributions:

• Use summary statistics for numeric variables.

• Visualize distributions for key variables (e.g., histograms, box plots).

Distribution Analysis

Summary:

• Variables like gov_exp_pct_gdp and pupil_teacher_primary show moderate spread, with most values concentrated near their means.
• Some metrics (e.g., school_enrol_primary_pct, school_enrol_secondary_pct) have outliers: their maximum values exceed the expected 0-100% range.
• Adult literacy rates (lit_rate_adult_pct) generally cluster at higher percentages, with some countries exhibiting very low literacy.

Key Observations:

• Government expenditure on education (gov_exp_pct_gdp): Mean is ~4.3% of GDP, with a wide spread up to 15.8%.
• Literacy rates (lit_rate_adult_pct): Values center around 79.5%, but some are as low as 14%.
• School enrollment: Primary enrollment exceeds 100% in some cases (likely due to late enrollments or repeats, as happens with gross enrollment ratios).
• Pupil-teacher ratios: Wide variation between primary and secondary levels.
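
Figures like the mean, maximum, and skew quoted above can be read off a summary table. The sketch below assumes pandas and uses invented toy values for two of the dataset's columns, so the numbers it prints are not the ones reported here.

```python
import pandas as pd

# Toy values for illustration only; not the real data.
df = pd.DataFrame({
    "gov_exp_pct_gdp": [2.0, 4.0, 4.5, 5.0, 15.0],
    "school_enrol_primary_pct": [85.0, 98.0, 101.0, 104.0, 140.0],
})

# describe() gives count, mean, std, min, quartiles and max per numeric column.
summary = (
    df.describe()
      .T[["mean", "50%", "min", "max"]]
      .assign(skew=df.skew())   # positive skew means a long right tail
)
print(summary)
```

A mean well above the median (the 50% column) together with a positive skew is the pattern a single very large value, such as an enrollment figure far above 100%, tends to produce.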

3. Are there any notable data quality issues?

Yes, the dataset has a few notable data quality issues:

• Missing Values:
o Many columns have missing values, especially lit_rate_adult_pct (only 1877 non-null values out of 5892 rows) and gov_exp_pct_gdp.
• Outliers:
o Some values exceed the nominal 0-100% range, e.g., primary school enrollment above 100%.
• Imbalance:
o Several distributions are skewed, e.g., literacy rates and primary school enrollment are concentrated at high values.
• Temporal Consistency:
o Missing values in some years for certain countries could affect time-series analyses.
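
The three kinds of issue listed above can each be checked with a few lines. This is a sketch on invented toy rows, assuming pandas; the column names follow the dataset, the values do not.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [2000, 2001, 2003, 2000, 2001],
    "lit_rate_adult_pct": [np.nan, 70.0, np.nan, np.nan, 95.0],
    "school_enrol_primary_pct": [99.0, 104.5, 101.0, 88.0, np.nan],
})

# Missing values: fraction of missing entries per column.
missing_share = df.isna().mean()

# Range check: rows whose primary enrollment is above 100%.
over_100 = df[df["school_enrol_primary_pct"] > 100]

# Temporal consistency: years absent for each country within the overall year range.
all_years = set(range(int(df["year"].min()), int(df["year"].max()) + 1))
gaps = {c: sorted(all_years - set(g["year"])) for c, g in df.groupby("country")}

print(missing_share)
print(over_100)
print(gaps)
```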

Relationship Analysis

From the correlation matrix, here are the key observations:

1. Strong Positive Correlations:
o lit_rate_adult_pct and school_enrol_secondary_pct (0.88): Countries with higher adult literacy rates tend to have higher secondary school enrollment.
o lit_rate_adult_pct and pri_comp_rate_pct (0.86): Higher literacy is associated with better primary school completion rates.
o school_enrol_secondary_pct and school_enrol_tertiary_pct (0.78): Strong linkage between secondary and tertiary enrollment rates.
2. Negative Correlations:
o pupil_teacher_primary and pri_comp_rate_pct (-0.78): Higher pupil-to-teacher ratios are associated with lower primary completion rates.
o pupil_teacher_primary and lit_rate_adult_pct (-0.82): Higher pupil-to-teacher ratios are associated with lower adult literacy.
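
A correlation matrix of this kind can be produced as sketched below, assuming pandas and Pearson correlation (the document does not state which coefficient was used). The toy values are invented, so the coefficients differ from those reported above.

```python
import pandas as pd

# Toy values for illustration only.
df = pd.DataFrame({
    "lit_rate_adult_pct": [40.0, 55.0, 70.0, 85.0, 95.0],
    "school_enrol_secondary_pct": [25.0, 45.0, 60.0, 80.0, 98.0],
    "pupil_teacher_primary": [55.0, 45.0, 38.0, 25.0, 15.0],
})

# Pearson correlation by default; missing values are excluded pair by pair.
corr = df.corr()

# Keep each unordered pair once and sort from most positive to most negative.
pairs = corr.stack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first < second].sort_values(ascending=False)

print(corr.round(2))
print(pairs.round(2))
```

Because missing values are dropped pairwise, any coefficient involving a sparsely populated column such as lit_rate_adult_pct rests on only the rows where both variables are present.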

Initial Hypotheses

Based on the data:

1. Government investment matters: Countries with higher education expenditure as a percentage of GDP might have better enrollment rates and literacy.
2. Quality of education impacts outcomes: Lower pupil-to-teacher ratios likely lead to higher literacy rates and primary completion rates.
3. Secondary and tertiary education linkage: Countries with higher secondary enrollment rates may also show higher tertiary enrollment due to a smoother education pipeline.
4. Regional disparities: Developing countries may exhibit higher pupil-to-teacher ratios and lower literacy or enrollment rates.

Q3. Resolve the data quality issues identified using R functions.

1. Handle Missing Values

• Approach:
o Use mean/median imputation for numerical columns.
o Use mode or a placeholder value for categorical variables.
o Remove rows with excessive missing data, if applicable.
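
The question asks for R, but the only code shown in this document is a pandas call, so the sketch below illustrates the same approach in pandas. The toy values and the "more than half missing" threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gov_exp_pct_gdp": [4.0, np.nan, 6.0, np.nan],
    "lit_rate_adult_pct": [80.0, 90.0, np.nan, np.nan],
    "pri_comp_rate_pct": [70.0, 95.0, 85.0, np.nan],
})
num_cols = df.select_dtypes("number").columns

# Drop rows where more than half of the numeric metrics are missing
# (the threshold is an assumption, not taken from the document).
min_present = int(np.ceil(len(num_cols) / 2))
kept = df.dropna(subset=num_cols, thresh=min_present).copy()

# Fill the remaining gaps with each column's median.
kept[num_cols] = kept[num_cols].fillna(kept[num_cols].median())
print(kept)
```

The median is less sensitive to extreme values than the mean. For panel data like this, imputing within each country rather than across all rows may be a more defensible variant.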

2. Address Outliers

• Approach:
o Use boxplots to identify extreme outliers.
o Replace or cap values exceeding logical bounds (e.g., enrollment rates above 100%).
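
A sketch of both steps in pandas (not R), on invented toy values. It flags outliers with the 1.5 × IQR rule that standard box plots use for their whiskers, then caps them; the choice of that rule is an assumption, since the document does not name one.

```python
import pandas as pd

# Toy values for illustration only.
s = pd.Series([92.0, 97.0, 99.0, 101.0, 103.0, 150.0],
              name="school_enrol_primary_pct")

# Box-plot fences: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (s < lower) | (s > upper)     # True only for values outside the fences
capped = s.clip(lower=lower, upper=upper)  # pull extremes back to the fences

# Hard cap at a logical bound; only appropriate if values above 100% are treated as errors.
bounded = s.clip(upper=100)

print(lower, upper)
print(capped.tolist())
```

Capping at the fences keeps moderately high enrollment figures intact, whereas a hard cap at 100% would also flatten values that may be legitimate.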

3. Data Transformation

• Approach:
o Scale/normalize numerical columns if necessary.
o Convert categorical variables to factors.
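
A sketch of these transformations in pandas (not R), on invented toy values. The new column names ptr_primary_z and ptr_primary_01 are hypothetical, and pandas' category dtype stands in for an R factor.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "country_code": ["AAA", "BBB", "AAA", "CCC"],
    "pupil_teacher_primary": [20.0, 30.0, 40.0, 50.0],
})
col = df["pupil_teacher_primary"]

# Standardize: subtract the mean and divide by the sample standard deviation.
df["ptr_primary_z"] = (col - col.mean()) / col.std()

# Normalize: rescale to the range 0 to 1.
df["ptr_primary_01"] = (col - col.min()) / (col.max() - col.min())

# Categorical conversion, the pandas analogue of an R factor.
df["country_code"] = df["country_code"].astype("category")

print(df)
```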

Q4. Refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions and present the results.

• Filter or subset data:
o Focus on specific regions, years, or countries.
o Group countries based on development level or region.
• Add additional variables:
o Compare metrics like government expenditure and literacy.
o Add layers to visualizations (e.g., year trends or country categories).
• Improve visualization:
o Sort data for clarity.
o Use log scales for skewed data.
o Add annotations to highlight key insights.

Refined visualizations:

1. Compare Literacy vs. Government Expenditure by Year
2. Trend of Enrollment Rates Over Time
3. Heatmap of Pupil-Teacher Ratio by Region and Year
4. Boxplot: Regional Variation in Enrollment Rates
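
The tables behind a trend line and a region-by-year heatmap can be prepared as sketched below, assuming pandas. The dataset's 11 columns include no region, so the region_of lookup is a hypothetical mapping added for illustration, and all values are invented.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "country_code": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC"],
    "year": [2000, 2010, 2000, 2010, 2000, 2010],
    "school_enrol_primary_pct": [80.0, 95.0, 100.0, 102.0, 60.0, 90.0],
    "pupil_teacher_primary": [50.0, 40.0, 20.0, 18.0, 60.0, 45.0],
})

# Hypothetical region lookup: the dataset itself has no region column.
region_of = {"AAA": "Region 1", "BBB": "Region 2", "CCC": "Region 1"}
df["region"] = df["country_code"].map(region_of)

# Trend of enrollment over time: mean primary enrollment per year.
trend = df.groupby("year")["school_enrol_primary_pct"].mean()

# Heatmap input: mean pupil-teacher ratio with regions as rows and years as columns.
heat = df.pivot_table(index="region", columns="year",
                      values="pupil_teacher_primary", aggfunc="mean")

print(trend)
print(heat)
```

Each table can then be handed to a plotting library; aggregating first keeps the filtering and grouping choices explicit and easy to sanity check.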
