Visualization, 7(3-2): Empowering the Future: The Role of Information Technology in Building Resilience - November 2023 2065-2074
INTERNATIONAL JOURNAL ON INFORMATICS VISUALIZATION
journal homepage : www.joiv.org/index.php/joiv
Abstract— Teachers, schools, and parents contribute to equipping students with essential knowledge and skills during their education
years. As students approach the end of their education, they are randomly selected to participate in the Programme for International
Student Assessment (PISA) to assess their reading proficiency. Existing work on analyzing PISA achievement results concentrates solely
on identifying factors related to Parent, alone or in combination with Student. Limited work has been proposed on how factors related to
Teacher and School affect students' reading proficiency in PISA. This study focuses on identifying the factors related to Teacher
and/or School that affect East Asian students' reading proficiency in PISA. The PISA achievement results of East Asian students are
chosen as the study domain because these students have consistently been the top performers in PISA over the past decade. Decision Tree (DT), Naïve
Bayes (NB), K-Nearest Neighbors (KNN) and Random Forest (RF) are compared. Hamming score is used as the evaluation metric. The
results indicate that RF produces the best predictive models, with the highest Hamming score of 0.8427. Based on the findings, School-related
factors such as the number of disciplinary cases in a school, the size of the school, the availability of computers with Internet access,
and the quality and educational qualifications of teachers have a higher impact on the PISA achievement results. The identified factors can
be used as a reference in assessing a school's current teaching and learning environment and in organizing extra activities as part of
intervention programs to cultivate reading habits and enhance reading abilities among students.
Manuscript received 15 Jan. 2023; revised 14 Apr. 2023; accepted 24 Aug. 2023. Date of publication 30 Nov. 2023.
International Journal on Informatics Visualization is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.
flexibility to decide as to whether they want to attempt the questionnaires. Next, school principals are required to answer the school questionnaire, which covers the school's management and the learning environment. Finally, a set of Parent questionnaires is distributed to parents of the students who are participating in the PISA study. This questionnaire will mainly focus on gathering feedback from parents regarding the extent of their involvement in their children's studies.
The compilation of responses from PISA studies from the year 2000 to 2018 can be downloaded from the Organization for Economic Co-operation and Development (OECD) website. Based on the outcome of the past cycles of PISA studies that have been released over the years, generally East Asian students were top achievers in PISA assessments [5]. From the compilation of past cycles of PISA studies, numerous research studies have been done to forecast academic achievement using data mining techniques. However, the majority of the research studies have concentrated on determining key factors that affect academic achievement across three domains from aspects such as learning time for a subject [6], [7], parents' education [6], and the usage of ICT [8], [9].
A student's academic achievement is not solely based on their hard work and parental support. Teachers also play important roles in molding the students because they spend most of their time in schools during their schooling years [10]. The presence of teachers and parents has formed a triangle of support for students during the learning process in their schooling years. Despite many research studies on a compilation of responses from PISA studies over the years, there are fewer research studies that predict students' reading proficiency using the compilation of PISA responses from Teacher questionnaires [11], [12], School questionnaires [13]–[15] or a combination of both Teacher and School questionnaires [16]. Specifically, to the best of our knowledge, no specific research study analyzes the aspects of Teachers and/or School and how they impacted East Asian students' reading proficiency in PISA. The research from these aspects is crucial since the outcome of this research can aid teachers and school administrators in developing better teaching and learning strategies and setting a more conducive learning environment for students. It is believed that better teaching and learning strategies can increase students' engagement and understanding of a particular topic taught in school, while a conducive learning environment ensures that students have the necessary facilities to facilitate the learning process.
This research study aims to achieve the following three objectives. The first is to identify factors associated with Teachers and/or Schools that contribute to the academic achievement of East Asian students in the Reading domain. The second is to determine the most appropriate predictive model for predicting East Asian students' proficiency level in the Reading domain. Finally, it is to analyze how the factors associated with Teachers and/or School will impact students' reading proficiency. The compilation of responses based on the PISA study in 2018 is used in this research study because they are the latest data released publicly by OECD. Specifically, the responses from a total of 6 out of 8 East Asian countries as listed by World Population Review, namely China, Hong Kong, Japan, Macau, South Korea and Taiwan, are used in this research study.
This paper is organized as follows. The literature review and research methodology are discussed in Section II. The results of the proposed methodologies are shown in Section III. The conclusion of this research is presented in Section IV, and references are provided in the last part of this paper.
II. MATERIALS AND METHOD
A. Literature Reviews
Many studies use the compilation of responses based on PISA studies over the years to identify factors that affect students' academic achievement. Since the PISA study involves 79 countries and the size of the responses is huge, researchers commonly focus on one or a combination of the following data dimensions for analysis: geographical locations, domains, and responses from different types of participants in the PISA studies.
Using data responses from PISA in the year 2018, the researchers [17] compare a series of machine learning algorithms to determine the relevant algorithm that can accurately predict the reading achievement of Macau students. The machine learning algorithms that are selected for analysis include multiple tree-based ensemble machine learning methods such as Random Forest (RF), Gradient and Extreme Gradient Boosting, Extra Tree, and TreeBag. The researchers chose RF as their main statistical method since it has outperformed other machine learning algorithms with R² of 0.43 and Root Mean Square Error (RMSE) of 66.17. Hierarchical Linear Modeling (HLM) is proposed to ascertain each factor's impact as discovered through RF. They have concluded that the most crucial category is personal, consisting of 13 factors that represent student characteristics. The authors also conclude that there is a need to include more factors in the analysis by considering other questionnaires, not just the student questionnaire. Using micro-level data from the PISA study in the year 2018 for China, the researcher [14] sets out to investigate the causes of the gap in academic performance between schools in urban and rural locations in China. As schools in urban and rural locations may differ significantly, the study employed Shapley flow, a graph-based approach, to analyze causal relations between a set of factors and average PISA scores that have been obtained concerning the best and worst schools in urban and rural areas. The researcher has also used XGBoost and Linear Regression to determine the causal structure. The Shapley values are hypothesized to show how certain factors affect how well the schools perform academically. The researcher has commented that intermediate learning outcomes and student characteristics affect the academic performance of schools in urban areas. For schools in rural areas, the characteristics of schools affect the school's academic performance. The researcher has highlighted that although the study uses the latest data from the PISA study, the data is limited to responses from Beijing, Shanghai, Jiangsu and Zhejiang, and does not contain responses from participants in other provinces in China.
In a research study, the researchers [18] have sought to identify the factors that affect high and low performers in the Global Competence test that is administered to students in
Hong Kong, China. The publicly available responses from the PISA study in the year 2018 are used in the study. Support Vector Machine (SVM) is proposed in this study and compared with SVM-based recursive feature elimination Cross Validation (SVM-RFE-CV). Generally, the researchers have reported that SVM is a good classifier with performance metrics comprising accuracy (ACC), F-score and Area under curve (AUC) of more than 0.80. SVM-RFE-CV has identified 30 optimal factors. This study has discovered that students' global competency is affected by their perspective-taking capacity, adaptability, awareness of intercultural communication and respect for people from different cultures.
The researchers [13] have conducted research to find the factors that genuinely affect Singaporean students' reading proficiency using the responses from the PISA study in the year 2015. SVM is proposed, and accuracy is used to measure the effectiveness of SVM. SVM-based recursive feature elimination (SVM-RFE) is used to identify and rank the factors. The outcome reveals that the SVM model produces an ACC of 0.78 and the most important factor is the learning time that students spent on test language (LMINS).
The researchers [9] use the Activity Region Finder (ARF) algorithm to uncover factors contributing to high achievement in the Reading domain for students from Turkey and China. Two sets of data from the PISA study are created by separating students' responses from Turkey and China. Feature selection using RF is applied on both sets of data, with the total factors reduced to 20. RF produces an ACC of 75% when applied to the Turkish dataset and 77% when applied to the Chinese dataset. The ability to comprehend text is the main factor determining the high achievement in the Reading domain for students from Turkey and China.
There is also research work conducted using the entire PISA dataset. The researchers [19] compare the efficiency of RF and XGBoost in predicting the self-efficacy of students from 74 different nations. The researchers have reported that XGBoost is a slightly better predictive algorithm and students' non-cognitive factors are the most important factors. XGBoost is reported to have RMSE of 9.776, R² of 0.458 and Mean Absolute Error (MAE) of 7.271, which are lower than RF when these algorithms are applied to test data. Other researchers [20] propose an educational data mining approach consisting of a combination of clustering and classification techniques to detect and analyze factors related to country, school, and student that might affect students' academic performance. The researchers group the schools according to their average performance levels in Science, Mathematics and Reading domains using k-means clustering. Three groups have been established: low performance, high performance, and medium performance. Socioeconomic country indicators such as Gross Domestic Product (GDP), GDP adjusted by Purchasing Power Parity (PPP), and GDP per capita are added to existing data for further analysis. The C4.5 algorithm is used to build the decision tree, and a confusion matrix is reported. The results reveal that socioeconomic factors are important in determining students' academic performance. The researchers [21] have used RF to determine significant attributes that contribute to the reading proficiency of Filipino students in PISA. RF has identified 53 attributes of which 26 of them are selected based on the mapping with Bronfenbrenner's bioecological framework. These 26 attributes will serve as input into the HLM model. The result shows that the 26 attributes do contribute to students' reading proficiency.
There are other studies on predicting the risk of dropouts using log datasets [22]–[26]. In work by [22], weighted attributes are introduced prior to the SVM Classifier, resulting in better performance than non-weighted attributes. In work by [23], DT, RF, SVM and Deep Neural Network (DNN) are compared, with RF outperforming the others. FWTS-CNN [24] combines a weighted features approach with a time series and convolutional neural network (CNN). It outperforms CNN when it is applied to the KDD Cup 2015 dataset. DeepFM [25] is a DNN and factorization machine hybrid, achieving 99% in validation data. In work by [26], multiple linear regression (MLR), multilayer perceptron (MLP) and classification and regression tree (CART) are compared, with MLP and CART performing better than MLR. Other than predicting the risk of dropout, there are studies on using machine learning to predict graduation [27], [28]. The work in [29] has created a mobile application with deep learning to enhance English and Arabic vocabulary among children. The mobile application has recorded more than 90% accuracy for image classification.
In summary, many studies have been conducted in order to identify the factors that influence students' performance in the PISA assessments. However, only a few research studies focus on examining how the factors related to Teacher and School can influence students' performance. To our knowledge, no research analyzes the impact of factors related to Teacher and/or School on East Asian students' reading proficiency levels. Hence, in this paper, we would like to investigate how factors related to Teacher and/or School are associated with East Asian students' academic achievement in the Reading domain using the compilation of responses from the PISA study in the year 2018.
B. Research Methodology
Here, we describe the proposed methodology with its graphical illustration as shown in Fig. 1. Each step will be explained as follows.
Fig. 1 Proposed Methodology
1) Formulating Research Questions: Based on the objectives as described in the second last paragraph of Section I, the following research questions have been formulated.
• What is the supervised learning technique that is reliable for predicting East Asian students' reading proficiency in PISA?
• Among the factors associated with Teacher and/or School, which factors are more important in
determining East Asian students' reading proficiency in PISA?
• How do the important factors associated with Teacher and/or School impact East Asian students' reading proficiency in PISA?
2) Data Collection: As mentioned in the second last paragraph in Section I, the compilation of responses based on the PISA study in the year 2018 is used in this research study because they are the latest data released publicly by OECD. The responses from the Student questionnaire, School questionnaire and Teacher questionnaires are chosen to be part of the dataset that will be used in this study.
3) Data Pre-processing: The data pre-processing is applied for the first time on the original datasets that are downloaded from the OECD website so that they can be suitably used in exploratory data analysis to gain a preliminary understanding of the characteristics of the datasets. Based on the preliminary insights, the data pre-processing is later re-applied for the second time to merge and prepare several sets of datasets as input for more complex exploratory data analysis and machine learning algorithms. The following three paragraphs describe the data pre-processing steps that are conducted in the first round.
Students' responses from East Asian countries are extracted from the original data of students' responses in the PISA study. The extracted data consists of 41871 rows. In PISA 2018 data, each domain has its own PISA band definition where levels 2 and below are classified as low performers while levels 5 and 6 are classified as top performers. OECD does not explicitly define the remaining levels 3 and 4, but students who obtain these levels are assumed to be medium performers. Since machine learning algorithms only accept numeric codes for nominal values, the value '0' represents low performers, the value '1' represents medium performers and the value '2' represents top performers. On the whole, a total of 4 attributes are selected. The attribute CNT is used to filter rows of data from East Asian countries only. Furthermore, the attributes CNT and CNTSCHID allow us to perform data merging. Since our aim is to predict reading proficiency, the attribute Proficiency Level Read_Mean is considered. Table I shows the attributes that represent the columns after pre-processing the student questionnaire dataset.
TABLE I
LIST OF ATTRIBUTES IN THE STUDENT QUESTIONNAIRE DATASET
Attribute(s) | Description
CNT | Country
CNTSCHID | The unique ID that represents a school in a country
CNTSTUID | The unique ID that represents a student from a country
Proficiency Level Read_Mean | Performance level in Reading
There are two questionnaires: the General Teacher and the Test Language Teacher. The General Teacher questionnaire is to be answered by teachers who teach Science and Mathematics, while the Test Language Teacher questionnaire is to be answered by language teachers. Responses from both questionnaires are selected to form the responses from teachers. Although the responses from the Test Language Teacher questionnaire seem to be more relevant due to the association of reading with language, the PISA reading framework in 2018 has been revised to focus not only on assessing the traditional forms of Reading but also on assessing students' digital literacy skill in searching for digital reading sources. This refers to technical skills. Hence, the responses from both questionnaires are selected. The responses from teachers who work in East Asian countries are selected. Although there are seven East Asian countries, only teachers from Chinese Taipei, Macao, Hong Kong, and Korea have answered the Teacher questionnaire. Hence, only teachers' responses from four East Asian countries are selected. Firstly, the rows that capture teachers' IDs without any responses are deleted. Such a situation happens because the teacher questionnaire is optional for teachers, and teachers can opt not to answer the questionnaire. A total of 26 relevant attributes related to Teacher are selected. Additional cleaning is done where rows that have more than 17 missing values are dropped. The numerical columns that have missing values are set to 0. Eventually, the data is reduced to 14,105 rows. Table II shows the list of attributes that represent columns in the Teacher questionnaire dataset and the rationale for selecting these attributes to form the Teacher questionnaire dataset.
TABLE II
LIST OF ATTRIBUTES IN THE TEACHER QUESTIONNAIRE DATASET
Attribute(s) | Description / Rationale
CNT | Country
CNTSCHID | The unique ID that represents a school in a country
CNTTCHID | The unique ID that represents a teacher from a country
TEACHERID | To analyze whether the subject taught (General, Reading) could affect students' performance in specific domains.
STTMG1, STTMG2, STTMG3 | To understand whether the subject taught overlaps with the initial education and affects student proficiency level in the subject
NTEACH1, NTEACH2, NTEACH3 | To analyze whether the subjects that the teacher taught could improve students' assessment.
EXCHT, COLT | To understand whether the exchange and co-ordination of teaching practices could help students improve their results.
SATJOB, SATTEACH | To determine how teachers' attitudes towards their current job environment help in students' assessment.
SEFFCM | To determine how teachers control the classroom environment
SEFFREL | To determine how teachers' maintaining positive relations with students helps students to achieve good results.
SEFFINS | To determine whether teachers provide clear instructions to students.
TCOTLCOMP | To determine whether teachers use computers in teaching
TCSTIMREAD, TCSTRATREAD | To determine whether the strategy used by language teachers could have an impact on students' proficiency in Reading
TCICTUSE | To determine whether technology could help students in their proficiency.
Attribute(s) | Description / Rationale
TCDISCLIMA | To determine how language teachers manage student discipline.
TCDIRINS | To determine how teachers provide instruction
FEEDBACK, FEEDBINSTR | To determine whether the practice of giving feedback to students could have an impact on students' proficiency.
ADAPTINSTR | To determine whether students could follow teachers' instructions.
Since our focus is on East Asian countries, the responses from school principals in East Asian countries are selected. A total of 16 relevant attributes related to School are selected. Rows with at least 13 missing values are dropped. Furthermore, there are a few rows with missing values for the attributes SCHLTYPE and CLSIZE, which are also deleted. The numerical columns that have missing values are set to 0. Eventually, the data is reduced to 1069 rows. Table III shows the list of attributes that represent columns in the school questionnaire dataset and the rationale for selecting these attributes.
TABLE III
LIST OF ATTRIBUTES IN THE SCHOOL QUESTIONNAIRE DATASET
Attribute(s) | Description / Rationale
CNT | Country
CNTSCHID | The unique ID that represents a school in a country
SCHLTYPE | To identify whether school type (private or public) could affect student performance.
SCHSIZE | To determine whether the size of the school has an impact on students' performance.
CLSIZE | To determine whether the class size has an impact on students' performance.
RATCMP1, RATCMP2 | To determine whether the quantity of computers with internet access may influence students' performance.
PROATCE | To determine whether the proportion of fully certified teachers could result in excellent student performance.
PROAT5AB | To determine whether the proportion of teachers with an ISCED 5A bachelor qualification could result in excellent student performance.
PROAT5AM | To determine whether the proportion of teachers with an ISCED 5A master qualification could result in excellent student performance.
PROAT6 | To determine whether the proportion of teachers with an ISCED level 6 qualification could result in excellent student performance.
CREACTIV | To understand how creative extra-curricular activities at school could assist students in achieving excellent results.
STAFFSHORT, EDUSHORT | To understand how staff and education materials shortage could affect students' academic performance.
STUBEHA, TEACHBEHA | To investigate how school climate and teacher's behavior could affect overall students' performance.
This paragraph describes the data pre-processing that is conducted in the second round. Since we need to identify whether the attributes related to Teacher and/or School are significantly influencing the students' proficiency level, a total of three datasets are created from this step as follows.
• A dataset having Teacher-related attributes/factors with student proficiency level. This dataset is formed by merging the processed Student questionnaire dataset with the processed Teacher questionnaire dataset.
• A dataset having School-related attributes/factors with student proficiency level. This dataset is formed by merging the processed Student questionnaire dataset with the processed School questionnaire dataset.
• A dataset having Teacher-related and School-related attributes/factors with student proficiency level. This dataset is formed by merging the processed Student questionnaire dataset with the processed School and Teacher questionnaire datasets.
The key attributes to be used when merging the datasets are CNTSCHID and CNT. Further data pre-processing is also conducted where key attributes and attributes that uniquely identify the student and teacher are removed. Table IV shows the dimensions of the three datasets after pre-processing.
TABLE IV
DIMENSION OF THE THREE DATASETS AFTER PRE-PROCESSING
Shape / Data | Teacher-related Factors | School-related Factors | Teacher-related and School-related Factors
No. of rows | 739975 | 39814 | 699004
No. of columns | 26 | 17 | 40
4) Exploratory Data Analysis: Similar to the data pre-processing step, the Exploratory Data Analysis (EDA) is conducted twice. The first round is to obtain a preliminary understanding of the characteristics of the datasets, which serves as a guide to further pre-process the datasets. The second round provides more complex EDA after the data merging process. The results from both rounds of EDA are presented in Section III.
5) Feature Selection: In this step, the most relevant features from the Teacher and School variables are chosen using a feature selection method. This study employs recursive feature elimination cross-validation (RFE-CV) to determine the most important features (attributes) affecting students' reading proficiency level. RFE-CV chooses the best features by eliminating features of low importance using recursive feature elimination and then picking the best subset based on the model's cross-validation score. RFE-CV is chosen as the feature selection algorithm over recursive feature elimination (RFE) because RFE requires a user to specify the total number of features to be retained. RFE-CV shows the features that have high importance by fitting the model multiple times and, at each step, removing the weakest features according to the importance of the features. The outcome of this step helps to determine the important attributes that should be selected to form the input dataset to our predictive algorithms.
6) Construction of Predictive Models: Different supervised data mining algorithms have been selected to compare the accuracy of predicting students' reading proficiency level. The algorithms such as Random Forest (RF), Decision Tree (DT), Naïve Bayes (NB) and K-Nearest
Neighbors (KNN) are used to predict and compare the accuracy of students' reading proficiency level. These algorithms are chosen because they are used by most of the research to predict students' academic performance based on Student-related factors in the PISA dataset. Hence, in this study, we would like to find the most suitable algorithm that can produce high accuracy in predicting reading proficiency level using Teacher-related and/or School-related factors. Since this study predicts multiclass labels, classification evaluation metrics such as Accuracy, Precision, Recall and F1-score are not suitable to be used to evaluate and compare the performance of each predictive model that is generated. In this research, Hamming score [30] is used to evaluate and compare the performance of each supervised machine learning algorithm. Hamming score is one of the metrics used to evaluate the performance of any classification algorithm by calculating the percentage of its correct predictions. Hamming score and accuracy are interchangeable for binary and multiclass cases, but the Accuracy score is calculated using the number of True Positives, True Negatives, False Positives and False Negatives whereas the Hamming score is calculated using the number of correct predictions. Hamming score differs from Hamming loss. To calculate Hamming score, a multi-class problem confusion matrix must be built first. Then the total number of correctly predicted classes is divided by the total number of samples used to build the predictive model. A Hamming score of more than 0.7 is regarded as a good score. The study's outcome will be further discussed as the outcome contributes to the answer to the third research question.
III. RESULTS AND DISCUSSION
A. Results from EDA
Here we describe the outcome from the first and second round of EDA. For each graphical representation, a short illustration is provided. Fig. 2 shows the total number of students that have participated in the PISA assessments. Beijing-Shanghai-Jiangsu-Zhejiang (B-S-J-Z) has the highest number of students participating in the assessment, while Macao has the lowest. This is because China has the largest population compared to other East Asian countries. Besides, China provides free education to students for both primary and secondary schools. As a result, many parents will send their children to school. Hence, there will be more students participating in PISA assessments.
Fig. 3 shows the proficiency level of reading for each country. B-S-J-Z has the highest number of students who obtained medium and high proficiency levels in Reading. It can be seen that most of the students only achieve a medium proficiency level in Reading. It is possible that the difficulty of reading instruction is why most students gain reading competency at the medium level. To comprehend the language and grammar of a document, Reading requires effort. For students to grasp a language, they must put in more time and practice.
Fig. 3 Reading proficiency level of each country
Fig. 4 shows the total number of teachers in each subject. It shows that most teachers teach general subjects, including Mathematics and Science, and that only about 5223 teachers teach Reading or language-specific subjects. The number of language teachers may be lower than that of general teachers because teaching language is much more complicated than teaching general subjects. Teachers who teach languages need to understand the grammatical structure of the language. As a result, most teachers prefer to teach subjects unrelated to languages because they are easier to teach, and students are more engaged in learning general subjects.
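The per-country breakdown in Fig. 3 rests on the band encoding described in the data pre-processing step (levels 2 and below as 0, levels 3 and 4 as 1, levels 5 and 6 as 2). The following is a minimal sketch of that encoding and tally, not the authors' code. It assumes proficiency levels are already available as integers; the function names, country codes and sample records are hypothetical and are not taken from the PISA files.

```python
from collections import Counter, defaultdict


def encode_band(level):
    """Map an integer proficiency level to 0 (low), 1 (medium) or 2 (top)."""
    if level <= 2:
        return 0
    if level <= 4:
        return 1
    return 2


def tally_by_country(records):
    """Count encoded bands per country from (country, level) pairs."""
    counts = defaultdict(Counter)
    for country, level in records:
        counts[country][encode_band(level)] += 1
    return counts


# Hypothetical toy records: (country code, proficiency level).
sample = [("QCI", 5), ("QCI", 3), ("MAC", 2), ("MAC", 4), ("KOR", 6)]
tally = tally_by_country(sample)

for country in sorted(tally):
    # Print low / medium / top counts for each country.
    print(country, [tally[country][band] for band in (0, 1, 2)])
```

Keeping the encoding in one small function makes the band thresholds easy to audit against the OECD band definitions before any counts are plotted.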
a high correlation between FEEDBINSTR and ADAPTINSTR. Teachers who always try to let students engage in reading (TCSTIMREAD) will likely provide and receive feedback and always change the structure of the lessons if a student has difficulty in a lesson.
As reflected in Fig. 7, China has the highest number of public schools and Hong Kong has the highest number of private, government-dependent schools. This diagram shows that most of the schools that participated in this assessment are public schools.
From Fig. 11, the attribute with the highest correlation value is CREACTIV, at 0.13.
Schools that provide extracurricular activities can help to increase students' interest in Reading.
The lowest correlation value is -0.187398, for STUBEHA. A safe and positive learning
environment can be built if the school has few or no disciplinary cases. This can
improve students' attention and reduce anxiety.
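Correlation values like those discussed above can be illustrated with the Pearson coefficient. The text does not name the coefficient used, so Pearson is an assumption here. The helper names and the toy values are hypothetical and are not PISA responses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, target):
    """Return (attribute, correlation) pairs sorted from highest to lowest."""
    pairs = [(name, pearson(values, target)) for name, values in columns.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy data: encoded proficiency (0/1/2) and two made-up attribute columns.
proficiency = [0, 1, 1, 2, 2]
columns = {
    "CREACTIV": [0, 1, 2, 2, 3],
    "STUBEHA": [3, 2, 2, 1, 0],
}
ranked = rank_by_correlation(columns, proficiency)
```

A positive coefficient means the attribute tends to rise with the proficiency band, and a negative one means it tends to fall, which is how the CREACTIV and STUBEHA values are read in the text.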
TABLE VI
HAMMING SCORE FOR EACH PREDICTIVE MODEL THAT IS GENERATED USING SCHOOL-RELATED ATTRIBUTES
Domain: Reading
Algorithm | RF | DT | NB | KNN
Hamming Score | 0.8427 | 0.8420 | 0.6533 | 0.8155
TABLE VII
HAMMING SCORE FOR EACH PREDICTIVE MODEL THAT IS GENERATED USING TEACHER AND SCHOOL-RELATED ATTRIBUTES
Domain: Reading
Algorithm | RF | DT | NB | KNN
Hamming Score | 0.8201 | 0.8178 | 0.6583 | 0.7843
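The Hamming scores in these tables follow the definition given in the Construction of Predictive Models step: build a multiclass confusion matrix, then divide the number of correctly predicted samples by the total number of samples. The following is a minimal sketch under that definition. The function names and the toy labels (0 = low, 1 = medium, 2 = top) are hypothetical, and this is not the authors' implementation.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Build an n_classes x n_classes matrix; rows are true labels, columns predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def hamming_score(matrix):
    """Fraction of samples on the diagonal, i.e. correctly predicted samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")
    return correct / total


# Toy labels: 6 of the 8 predictions match the true class.
y_true = [0, 1, 1, 2, 2, 1, 0, 1]
y_pred = [0, 1, 2, 2, 1, 1, 0, 1]
score = hamming_score(confusion_matrix(y_true, y_pred, 3))
print(score)  # 0.75
```

Under the rule of thumb stated in the paper, this toy score of 0.75 would count as good because it exceeds 0.7.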
Fig. 10 Correlation values for School-related attributes towards students' reading proficiency
The tables above show the Hamming scores of each model.
School-related attributes have the greatest influence on
students' reading proficiency levels. Since the input based on
School-related attributes generally produces high Hamming
scores for the above four predictive algorithms, we will only
show the list of important attributes related to School.
Additional experiments are conducted to determine the Hamming
scores of the predictive algorithms when they are given the
input data without feature selection. The results reveal a
minor difference ranging from 0.0001 to 0.001, with slightly
better Hamming scores reported when we use the input data that
has undergone feature selection.

C. Output from Feature Selection
Fig. 12 shows the list of important School-related attributes
based on importance scores. The higher the importance score,
the more important the attribute is. From the total of 24
School-related attributes, 22 School-related attributes are
selected after the feature selection step.

Fig. 12 The list of features sorted according to importance scores

D. Critical Analysis
Based on Tables V, VI and VII, it is clear that RF performs
well in predicting students’ reading levels. The Hamming
scores of the RF models are higher than those of DT, NB, and
KNN. From Fig. 12, it is possible to establish that eleven
features or factors have importance scores greater than 0.04.
Scores greater than 0.1 are found for student behavior
(STUBEHA) and school size (SCHSIZE). Student behavior has the
greatest influence on students’ reading proficiency level.
Students' misbehavior such as truancy, tendency to skip
lessons, lacking respect for teachers, threatening or bullying
other students, and lacking attention in class can affect
other students’ academic achievement because it creates a
feeling of insecurity and disharmony in school. A school with
a higher disciplinary incidence will create a bad classroom
environment. In an environment with high absenteeism,
students who constantly miss class struggle to learn and may
find it difficult to concentrate or catch up with
language-related subjects. In reality, the key to mastering
reading skills is to concentrate in lessons and have constant
reading practice with reading materials.
School size affects students’ academic achievement because a
bigger school will have better infrastructure and facilities.
With better support facilities, teachers can deploy various
teaching and learning strategies to make the lessons more
entertaining, such as using Kahoot and other online games. A
highly populated school can promote better interactivity for
students where many discussion activities can be organized.
Discussion activities can enhance communication skills as well
as motivate the students to read, recognize, and use words
effectively in their conversations. The proportion of
computers available in schools (RATCMP1) also influences
students' reading competency. Having computers diversifies
reading resources rather than confining them to traditional
printed materials. Students can use computers to search for
e-books to supplement their reading materials and study a wide
range of texts, articles, and online books that are relevant
to their interests and reading levels. Furthermore, teachers
can use computers to support teaching, such as organizing
quizzes using Kahoot and other online games. In short, schools
with adequate computer resources can optimize the potential
benefits of technology in education.
Following that, teacher behaviors (TEACHBEHA) may influence
students' reading proficiency levels. If teachers are
regularly absent, students will miss important lessons. Aside
from teachers’ behaviors, teachers' qualifications (PROAT5AB,
PROAT5AM, PROATCE) may also influence students' achievement.
Teachers who are fully certified and possess higher
educational qualifications are adept at developing assessments
that accurately assess student learning outcomes. They may
identify areas of difficulty that students will likely face
and provide important comments to help students better
understand their weaknesses.
Other significant features or factors that influence students’
reading proficiency levels are the number of extracurricular
activities available at schools (CREACTIV), shortage of staff
(STAFFSHORT) and shortage of educational materials (EDUSHORT).
Book clubs and story-telling competitions are examples of
extracurricular activities that can be conducted to cultivate
reading habits among students. Shortage of manpower and
educational material can affect the daily operations of a
school and students’ learning.

IV. CONCLUSION
This study focuses on predicting factors that affect East
Asian students' reading proficiency levels using PISA data.
Using Teacher and School datasets, supervised machine learning
approaches based on RF, DT, NB, and KNN are used to predict
East Asian students' reading proficiency levels. To find the
most relevant attributes from the Teacher and School datasets,
RFE-CV is used for feature selection. Based on the result of
this study, School-related factors have the greatest influence
on students' achievement. This can be seen from Tables V, VI
and VII, where the Hamming scores for School-related factors
are greater than the Hamming scores for Teacher-related
factors and for the combination of Teacher-related and
School-related factors.
Furthermore, RF outperforms DT, NB, and KNN regardless of the
type of dataset being applied. The school management needs to
provide a conducive teaching and learning environment because
it provides a sense of security for students to pursue
learning easily. With better staff quality, infrastructure and
availability of essential learning materials, a school can be
a great learning ground for students, as more engaging lessons
can be prepared and various exciting learning activities can
be organized for students.
Nevertheless, not all teachers in East Asian countries provide
responses to the teacher’s questionnaires because these
questionnaires are optional for teachers. The lack of responses
from certain East Asian countries can potentially introduce
bias into our analysis. In the future, other supervised data
mining approaches can be utilized to improve the performance of
the predictive models.
learning approach ( ¿Cuál es la explicación del rendimiento de los
REFERENCES estudiantes macaenses? Una perspectiva integradora mediante la
adopción del enfoque del aprendizaje automático ),” Journal for the
[1] R. C. Anderson, “Becoming a nation of readers: The report of the Study of Education and Development, vol. 46, no. 1, pp. 71–108, Jan.
Commission on Reading.,” 1985. 2023, doi: 10.1080/02103702.2022.2149120.
[2] A. Talwar et al., “Early Academic Success in College: Examining the [18] T. Luo and Y. Peng, “The analysis of influencing factors on the value
Contributions of Reading Literacy Skills, Metacognitive Reading dimension of Asian students’ global competence - based on PISA
Strategies, and Reading Motivation,” Journal of College Reading and 2018,” in 2021 16th International Conference on Computer Science &
Learning, vol. 53, no. 1, pp. 58–87, Jan. 2023, Education (ICCSE), IEEE, Aug. 2021, pp. 1130–1134.
doi: 10.1080/10790195.2022.2137069. doi: 10.1109/ICCSE51940.2021.9569461.
[3] K. Nyarko, N. Kugbey, C. C. Kofi, Y. A. Cole, and K. I. Adentwi, [19] B. Tan and M. Cutumisu, “Employing Tree-based Algorithms to
“En4glish Reading Proficiency and Academic Performance Among Predict Students’ Self-Efficacy in PISA 2018,” in Proceedings of the
Lower Primary School Children in Ghana,” Sage Open, vol. 8, no. 3, 15th International Conference on Educational Data Mining, 2022, p.
p. 215824401879701, Apr. 2018, doi: 10.1177/2158244018797019. 634.
[4] L. Stoffelsma and W. Spooren, “The Relationship Between English [20] A. Gamazo and F. Martínez-Abad, “An Exploration of Factors Linked
Reading Proficiency and Academic Achievement of First-Year to Academic Performance in PISA 2018 Through Data Mining
Science and Mathematics Students in a Multilingual Context,” Int J Techniques,” Front Psychol, vol. 11, p. 575167, Nov. 2020, doi:
Sci Math Educ, vol. 17, no. 5, pp. 905–922, Jun. 2019, 10.3389/fpsyg.2020.575167.
doi: 10.1007/s10763-018-9905-z. [21] J. Y. Haw and R. B. King, “Understanding Filipino students’
[5] Oecd, “PISA 2018 results: Combined executive summaries,” J Chem achievement in PISA: The roles of personal characteristics, proximal
Inf Model., vol. 53, no. 9, pp. 1689–1699, 2019. processes, and social contexts,” Social Psychology of Education, vol.
[6] N. Aksu, G. Aksu, and S. Saracaloglu, “Prediction of the Factors 26, no. 4, pp. 1089–1126, Aug. 2023, doi: 10.1007/s11218-023-
Affecting PISA Mathematics Literacy of Students from Different 09773-3.
Countries by Using Data Mining Methods,” International Electronic [22] Z. Yujiao, L. W. Ang, S. Shaomin, and S. Palaniappan, “Dropout
Journal of Elementary Education, vol. 14, no. 5, pp. 613–629, 2022. Prediction Model for College Students in MOOCs Based on Weighted
[7] A. Bozak and E. C. Aybek, “Comparison of Artificial Neural Multi-feature and SVM,” Journal of Informatics and Web
Networks and Logistic Regression Analysis in PISA Science Literacy Engineering, vol. 2, no. 2, pp. 29–42, 2023,
Success Prediction.,” International Journal of Contemporary doi: 10.33093/jiwe.2023.2.2.3.
Educational Research, vol. 7, no. 2, pp. 99–111, 2020. [23] H. S. Park and S. J. Yoo, “Early Dropout Prediction in Online Learning
[8] O. Lezhnina and G. Kismihók, “Combining statistical and machine of University using Machine Learning,” JOIV : International Journal
learning methods to explore German students’ attitudes towards ICT on Informatics Visualization, vol. 5, no. 4, p. 347, Dec. 2021,
in PISA,” International Journal of Research & Method in Education, doi: 10.30630/joiv.5.4.732.
vol. 45, no. 2, pp. 180–199, Mar. 2022, [24] Y. Zheng, Z. Gao, Y. Wang, and Q. Fu, “MOOC Dropout Prediction
doi: 10.1080/1743727X.2021.1963226. Using FWTS-CNN Model Based on Fused Feature Weighting and
[9] S. Kılıç Depren and Ö. Depren, “Cross-Cultural Comparisons of the Time Series,” IEEE Access, vol. 8, pp. 225324–225335, 2020, doi:
Factors Influencing the High Reading Achievement in Turkey and 10.1109/ACCESS.2020.3045157.
China: Evidence from PISA 2018,” The Asia-Pacific Education [25] N. M. Alruwais, “Deep FM-Based Predictive Model for Student
Researcher, vol. 31, no. 4, pp. 427–437, Aug. 2022, Dropout in Online Classes,” IEEE Access, vol. 11, pp. 96954–96970,
doi: 10.1007/s40299-021-00584-8. 2023, doi: 10.1109/ACCESS.2023.3312150.
[10] C. Nunes, T. Oliveira, M. Castelli, and F. Cruz-Jesus, “Determinants [26] Y. Tong and Z. Zhan, “An evaluation model based on procedural
of academic achievement: How parents and teachers influence high behaviors for predicting MOOC learning performance: students’
school students’ performance,” Heliyon, vol. 9, no. 2, p. e13335, Feb. online learning behavior analytics and algorithms construction,”
2023, doi: 10.1016/j.heliyon.2023.e13335. Interactive Technology and Smart Education, vol. 20, no. 3, pp. 291–
[11] S. Li, X. Liu, Y. Yang, and J. Tripp, “Effects of Teacher Professional 312, Sep. 2023, doi: 10.1108/ITSE-10-2022-0133.
Development and Science Classroom Learning Environment on [27] D. Fahrudy and S. ’Uyun, “Classification of Student Graduation using
Students’ Science Achievement,” Res Sci Educ, vol. 52, no. 4, pp. Naïve Bayes by Comparing between Random Oversampling and
1031–1053, Aug. 2022, doi: 10.1007/s11165-020-09979-x. Feature Selections of Information Gain and Forward Selection,”
[12] J. G. Mora-Ruano, M. Schurig, and E. Wittmann, “Instructional JOIV : International Journal on Informatics Visualization, vol. 6, no.
Leadership as a Vehicle for Teacher Collaboration and Student 4, p. 798, Dec. 2022, doi: 10.30630/joiv.6.4.982.
Achievement. What the German PISA 2015 Sample Tells Us,” Front [28] R. Mehdi and M. Nachouki, “A neuro-fuzzy model for predicting and
Educ (Lausanne), vol. 6, p. 582773, Feb. 2021, analyzing student graduation performance in computing programs,”
doi: 10.3389/feduc.2021.582773. Educ Inf Technol (Dordr), vol. 28, no. 3, pp. 2455–2484, Mar. 2023,
[13] X. Dong and J. Hu, “An Exploration of Impact Factors Influencing doi: 10.1007/s10639-022-11205-2.
Students’ Reading Literacy in Singapore with Machine Learning [29] H. Mohd Nasir, N. M. A. Brahin, F. E. Mohd Sani @ Ariffin, M. S.
Approaches,” Int J Engl Linguist, vol. 9, no. 5, p. 52, Aug. 2019, Mispan, and N. H. Abd Wahab, “AI Educational Mobile App using
doi: 10.5539/ijel.v9n5p52. Deep Learning Approach,” JOIV : International Journal on
[14] H. Lee, “What drives the performance of Chinese urban and rural Informatics Visualization, vol. 7, no. 3, p. 952, Sep. 2023, doi:
secondary schools: A machine learning approach using PISA 2018,” 10.30630/joiv.7.3.1247.
Cities, vol. 123, p. 103609, Apr. 2022, [30] Hasty.ai, “Hamming score.” Accessed: Jun. 01, 2023. [Online].
doi: 10.1016/j.cities.2022.103609. Available: https://hasty.ai/docs/mp-wiki/metrics/hamming-score
2074