A MANTEL-HAENSZEL APPROACH
FERDINAND S. AZUELA
JUNE 2023
APPROVAL SHEET
Accepted and approved in partial fulfillment of the requirements for the degree of
The researcher could not have finished this academic endeavor without the
commendable assistance and guidance extended by God through His instruments. Thus, he
extends his heartfelt gratitude to the following:
Dr. Paulo V. Cenas, the Executive Director, for instilling in him the value of
Dr. Arlene N. Mendoza, the adviser, for being the key person in inculcating the
learning attitude through her constant supervision and guidance as well as her valuable
comments and suggestions that benefited him much in the completion and success of this
study;
Dr. Melody C. De Vera, the critic reader, for being the backbone of this scholarly
journey as she generously shared her time, wisdom, and expertise in enhancing the
manuscript;
Dr. Michael Howard D. Morada, the chairperson of the panel, for his brilliant ideas
for the improvement of this study and for his words of encouragement; Dr. Rodelio M.
Garin, Dr. Christopher J. Cocal, Dr. Joseph B. Campit, Dr. Ana Perla B. Guzman, the
distinguished members of the committee for oral examination and Mr. Bobby F. Roaring,
the genius statistician, for imparting their unparalleled expertise, wisdom, untiring
assistance, and immeasurable effort to help the researcher improve and finish his study;
The School Heads of public and private junior high schools in the Municipality of
Victoria, for allowing the researcher to administer the questionnaires and gather the needed
data;
His parents and parents-in-law, for their love, care, and constant prayers for his
academic life and for being his constant source of strength;
No words of thanks can sum up the gratitude and indebtedness to his beloved and
supportive wife, Dr. Angela Francesca D. Azuela, who is always by his side in times when
he needs her most and who helped him throughout the accomplishment of this research.
All others, who in one way or another extended a helping hand for the completion
of this research.
Above all, to the Almighty God for all His wonderful blessings, kindness, and guidance.
DEDICATION
For our Almighty God, for the gift of wisdom and for all the blessings.
May this bring back to Him all the glory and honors.
FSA
TABLE OF CONTENTS
TITLE PAGE
APPROVAL SHEET
ACKNOWLEDGEMENT
DEDICATION
TABLE OF CONTENTS
ABSTRACT
CHAPTER
1 THE PROBLEM
Definition of Terms
Related Literature
Related Studies
Conceptual Framework
Research Instrument
Students' Profile
School Type
Summary
Salient Findings
Conclusions
Recommendations
BIBLIOGRAPHY
APPENDICES
LIST OF FIGURES
1 Research Paradigm
This study aimed to analyze and detect biased items of the Grade 7 Mathematics
Achievement Test using the Mantel-Haenszel method, based on the students' type of school
and learning styles. It was also designed to strengthen test validity as a basis for test
standardization.
A total of 207 Grade 7 students were randomly selected from all public and private
schools in the municipality of Victoria. A 60-item test was constructed and subjected to
validation and reliability testing. Experts' judgment and item analysis were performed to
establish content validity; the relationship between the constructed test and students'
performance in the 2nd quarter was tested for concurrent validity using the Pearson
product-moment correlation and linear regression; and Principal Component Analysis
(PCA) was used to describe the test's construct validity. A descriptive and developmental
research design was used.
Eighteen test questions were detected as significantly biased and eliminated, their
Mantel-Haenszel chi-square values exceeding the critical value (3.8415) at the 0.05 level of
significance. After the process of validation and removal of the significantly biased
questions, the revised test version consisted of thirty-one essential questions, with a much
higher percentage of questions at the optimum level of the difficulty index and with
acceptable discrimination indices. An increase in concurrent validity was also evident,
indicating a less homogeneous set of test scores. This indicates that the revised test version
was more valid than the original version.
The findings indicate that the Mantel-Haenszel chi-square analysis can be used to
detect large amounts of differential item functioning (DIF), which may strengthen the
validity and reliability of the test questionnaire.
Chapter 1
THE PROBLEM
The study was anchored on the results reported under the Trends in International
Mathematics and Science Study (TIMSS) 2019 Assessment Framework, which found that
Filipino pupils lagged behind other nations in the international assessment for Grade 4
Mathematics. The country received a score of 297, which is much lower than the TIMSS
scale centerpoint of 500, and the lowest among the 58 participating countries. The said
assessment was topped by the country's East Asian neighbors, such as Singapore, Chinese
Taipei, Korea, Japan, and Hong Kong. Meanwhile, only 19 percent of the Grade 4 Filipino
students who took part in the assessment reached the Low international benchmark.
Before the release of the TIMSS results, the Programme for International Student
Assessment (PISA) 2018 had assessed 15-year-old learners on the extent to which they
acquired essential information and abilities. The evaluation focused on the core school
disciplines of Reading, Mathematics, and Science as well as the students' global
competence. The assessment resulted in a 353-point score for mathematics, which was
much lower than the average score of 489. This reveals that Filipinos aged 15 trailed
behind other countries in terms of global competency.
Similarly, the low performance of Filipino learners was also evident in the National
Achievement Test (NAT) of private and public school learners from 2009 to 2014.
Statistics show that the mean percentage score in Mathematics was far from the target of
75% mastery level. The National Achievement Test 2018 results recorded the lowest MPS
in the history of the DepEd standardized exam. The recently concluded assessment of
learning loss among private schools nationwide, Grades 2 to 12, brought about by the
pandemic, based on the Philippine Assessment for Learning Loss Solutions (PALLS)
during the last quarter of 2022, bannered a 47.5% average score in mathematics, much
lower than the 60% passing percentage set by DepEd. The assessment consists of 75
multiple-choice items covering the three core subjects of the previous grade level.
These claims on the results of international and national assessments are credible
in the sense that these assessments go through the process of standardization; that is, each
is consistent with what it intends to measure. The school-based results come from such
assessments. The results of the school assessment program determine the numeracy level
of Grade 1-7 learners based on the standardized tool crafted July 22-24, 2020, and piloted
in November 2020, per RM No. 194, s. 2020. Based on the results of the Grade 7 learners
of Victoria National High School on the post-test of the PAN administration in school year
2021-2022, it was found that 368 or 54.68% of the Grade 7 examinees were non-numerates,
while only 62 or 9.21% of the examinees were numerates. This may imply that most
learners scored 0–49% on the test. Meanwhile, 346 (46.95%) of the Grade 7 examinees
were non-numerates, while only 31 (4.21%) were numerates as they started school year
2022-2023. The results of the program did not justify the teaching and learning practices
during the school year. Making fair and right decisions based on test results, and selecting
students who have ability and interest by DepEd standards, require sound and valid tests.
The performance described above signifies poor learning outcomes, which may be caused
by different factors. Thus, this study might provide an innovative strategy to strengthen
test quality to address the problem.
Achievement gaps occur when the differences in assessment results between the
performances of two groups of students are statistically significant, with one group
outperforming the other. Thus, the results of this study could be used as a basis for
addressing achievement gaps defined through student assessment results. Therefore,
testing should be performed with carefully prepared instruments. Tests and testing are
important to help children develop their capacity to deal with standardized exams. As a
result, the validity of the instruments must be tested by test developers to confirm that they
are appropriate for the students for whom they are intended. A test should be designed to
assess learners' critical thinking while considering the individual variability in IQ levels.
The evaluation instrument should be validated to obtain accurate outcomes for the
learners.
With these in mind, this study provides a validated and item-bias-free Mathematics
achievement test to ease the achievement gap. Biased items are those that behave
differently for people from two different cultures. Item bias exists if an item proves to be
more difficult for one group than the other, or easier for one group than the other, assuming
that students' average ability level remains constant. Bias is defined as the presence of a
trait in an item that results in differing performance for people of the same ability who
come from various ethnic, sex, cultural, or religious groups. If the proportion of correct
responses is the same in both groups, the question is considered unbiased (Rudner et al.,
1980), a property that each test item should possess. Determining the bias of the items in a
test is important to increase test validity and reliability. Kristjansson et al. (2005) pointed
out that item bias is an important factor that threatens the validity of measurements, and
that test and item bias detection methods should be used as much as possible. Test bias
distorts results by allowing examinees' characteristics to influence the measure of the main
construct. Item bias is a possible threat to validity (Akcan & Kabasakal, 2019; Clauser &
Mazor, 1998). Therefore, research on this matter is of importance.
In this regard, teachers are likely to use test scores to assess student performance;
therefore, creating bias-free test questions is crucial. This study will be of considerable
assistance to both teachers and students, as item developers and users of the instruments,
providing accurate assessments for students and a ready-made assessment tool for teachers.
Since item bias studies in the Philippine setting using different methods to analyze this
kind of bias are insufficient, the study may also provide opportunities, especially in the
educational system, to include this method in validating test items to improve the quality
of assessment. The Mantel-Haenszel method works on items scored dichotomously, such
as right-wrong or yes-no, and is the most extensively used method for detecting item bias.
Previous research also has shown that, compared to other procedures, the Mantel-Haenszel
approach is effective in detecting item bias (Narayanan, 1995). Moreover, in Skaggs's
(1992) research on the consistency of detecting item bias across different test
administrations, the Mantel-Haenszel method was one of the most consistent ways to
detect item bias among males and females. The method is easy to implement, does not
require knowledge of Item Response Theory (IRT), and is often used for DIF detection
(Holland & Thayer, 1988; Wainer, 2010; Diaz et al., 2021).
The purpose of this study is to use the Mantel-Haenszel chi-square method to
analyze, detect, and remove biased items in the Mathematics achievement test to improve
its validity and reliability.
The study aimed to analyze and detect biased items of the Grade 7 Mathematics
Achievement Test using the Mantel-Haenszel method. Specifically, it sought to answer the
following questions:
1. What is the profile of the student-respondents in terms of school type and
learning style?
2. What is the content, construct, and concurrent validity, and internal consistency
reliability of the original Mathematics Test?
3. What are the detected bias items based on the students' profile?
4. What is the content, construct and concurrent validity, and internal consistency
reliability of the revised Mathematics Test after the detection and removal of the
bias items?
5. How do the original and the revised test versions compare in terms of:
a. Content validity;
b. Construct validity; and
c. Concurrent validity?
6. How do the original and the revised test versions compare in terms of internal
consistency reliability?
The study aimed to analyze and detect bias items in the Grade 7 Mathematics Test
using the Mantel-Haenszel chi-square method. This study is significant because it would
benefit the following:
Students. The results of the study will lead to the production of validated and item-
bias-free assessments, through which teachers will be able to address students' difficulties
specifically.
Teachers. The study will provide assessment instruments free from item bias for
teachers as assessment developers. The result of the study will be a useful instrument to
identify difficulties among the students and come up with different interventions in the
teaching and learning process. It will serve as a guide in crafting quality assessments and
will produce validated assessment instruments for Grade 7 Mathematics that can be used
in the classroom.
Researchers. The study will also serve as a basis for future researchers concerned
with providing validated assessment instruments to increase the level of mastery and
achievement of learners.
The study covered the analysis and detection of bias items in the Grade 7
Mathematics Achievement Test. Randomly selected Grade 7 students enrolled in different
public and private schools in the Municipality of Victoria served as the respondents.
Teacher-made achievement test items covering the second quarter's most essential
learning competencies were constructed, and tests of content, construct, and concurrent
validity were conducted. The internal consistency reliability using Kuder-Richardson 20
was also tested. The results of the test were tabulated. Moreover, the results of every test
item were tallied, organized, and interpreted. Using the Mantel-Haenszel method, biased
items were detected and removed.
Definition of Terms
Academic Performance. This refers to the students' grade in Mathematics in the
second grading period. It was composed of 50% written works and 50% performance tasks
in the form of summative tests (DepEd Order No. 031, s. 2020).
Assessment. It is a way of assessing learners' progress. In this study, it refers to
the administration of the Grade 7 Mathematics Achievement Test.
Auditory Learning Style. This refers to the learners who learn through verbal
channels, acquiring information through listening. They interpret meaning through the
tone of a sound as well as through the quickness and accentuation of speech (Mašić et al.,
2020; Gilakjani, 2012). It is recommended that those learners make sure that they can hear
well, recite information, and have conversations for better memorization.
Concurrent Validity. This refers to the validity established by associating the
current academic performance of the respondents with the raw scores they garnered on the
test conducted.
Construct Validity. This refers to the validity showing that the items on the test
measure the construct they are intended to measure; in this study, it was described through
Principal Component Analysis.
Content Validity. It refers to the content property or traits of the test items,
established through checking by the experts. Content validity studies pertain to the
adequacy of the test items as a sample from a well-specified content domain and are
associated with experts' judgment.
Test Reliability. It refers to the extent to which the test scores are consistent.
Item Bias. It refers to invalidity or systematic error in how a test item measures a
construct for the members of a particular group (Villas, 2019). It is a statistically significant
difference in performance across two or more groups of examinees, due to characteristics
of the item unrelated to the construct, that makes a group's probability of success
substantially higher or lower than that for the overall population.
Learning Styles. It refers to the students' ways of learning (i.e., Visual, Auditory,
and Tactile). Learning involves determining one's preferred learning methods and
adapting one's learning to fit those preferences (Antoniuk, 2019). According to Mašić et
al. (2020) and Cornett (1983), learning styles are the overall patterns that give direction to
learning. In this study, they were identified based on the questionnaire adopted from
O'Brien.
Mantel-Haenszel Method. The Mantel-Haenszel statistic (MH) can be used for
comparing two cultural groups when the observed item scores are dichotomous.
School Type. It refers to whether the school of the respondents is public or private.
Tactile Learning Style. This refers to the learners who enjoy creating things with
their hands and making sense of information through touch. This suggests learning through
writing, highlighting, underlining, labeling, and role-playing.
Test Validity. This refers to the concurrent, construct, and content validity of the
test.
Visual Learning Styles. These refer to the learners who learn by seeing (Mašić et
al., 2020) or watching demonstrations. They prefer learning via visual channels. Visual
students need the visual stimulation of bulletin boards, videos, and movies, and they must
write information down to retain it.
Chapter 2
REVIEW OF RELATED LITERATURE AND STUDIES
This chapter presents the discussion of the literature and studies related to this research.
These materials provided the researcher with some insights, theories, concepts, and ideas
which contributed to the conceptualization and formulation of the framework of the study.
Related Literature
Reforms have been introduced in Philippine education that possibly addressed the
low achievement level of the Philippines in international assessments, such as the Trends
in International Mathematics and Science Study (TIMSS) in 2003. The Philippines ranked
34th in 2nd-year high school mathematics out of the 38 countries assessed. For Grade 4
Math, the Philippines ranked 23rd among the participating countries. In 2008, even though
only science high schools participated in the advanced mathematics category, the
Philippines ranked at the bottom (Department of Education, 2010). After the assessment,
the country did not participate in any international assessment, instead relying on the
results of the National Achievement Test (NAT). The overall mean percentage scores on
the NAT were low across the years. The low achievement level of students is reflected in
the National Achievement Test (NAT) results from 2009 to 2014 (DepEd, 2014). Statistics
show that the mean percentage score (MPS) of mathematics subjects in the NAT was far
from its target of 75% mastery level (Austria, 2020). It can also be seen in a publication
last September 26, 2019, that the Grade 6 NAT scores were at a low mastery level. The 2018
NAT results showed that the national average mean percentage score (MPS) was 37.44,
the lowest in the history of the DepEd standardized exam. In the Grade 6 NAT 2009 overall
mastery level, only 6.82% of the takers from public schools were described as closely
approximating mastery, which is better than the 0.36% of takers from private schools. A
total of 52.16% of the takers from public schools tallied moving towards mastery,
compared to only 20.26% of the takers from private schools. On the other hand, a higher
percentage of average mastery was tallied among private school takers, with 69.13%,
compared to 37.76% of the public school takers. In addition, low mastery was tallied
among 10.25% and 3.23% of the takers, respectively (Benito, 2010).
The mastery level of high school students in public and private schools was also
described. Of the takers from public schools, 0.13% tallied closely approximating mastery,
while only 0.01% of private school takers fell under this mastery level. Meanwhile, 12.35%
of the public school takers were described as moving towards mastery, compared to 5.04%
of the takers in private schools. Private schools tallied 71.69% of their takers described as
average, compared to 67.89% of public school takers. Furthermore, 19.60%, 0.02%, and
0.01% of the public school takers were described as having low mastery, very low mastery,
and absolutely no mastery, respectively. Meanwhile, 23.24% and 0.01% of private school
takers were described as having low mastery and very low mastery, respectively (Benito,
2010). Surveys of firms and investors showed that the low performance of the country's
students and graduates in Mathematics, Science, and English may constrain national
development.
Thus, the K to 12 reform was pushed through in school year 2012-2013. According
to the Department of Education, only the Philippines, Angola, and Djibouti had a 10-year
basic schooling cycle. The K to 12 program aims to make Philippine education at par with
the rest of the world, with 12 years of basic schooling already a global standard. It provides
sufficient time for learners to achieve mastery of skills and concepts, develop learners
holistically as lifelong learners, and prepare learners for tertiary education, middle-level
skills development, employment, and entrepreneurship.
The salient features of the K to 12 that will contribute to the competency of the
learners and increase the achievement level of the country are strengthening early
childhood education, ensuring seamless learning through the spiral progression evident in
every grade level, building proficiency through the mother tongue in grades 1 to 3, gearing
up for the future by adding two years in high school (senior high school), and nurturing
the holistically developed Filipino in line with the Department of Education's vision. The
Mathematics curriculum targets critical thinking and problem solving, with learners
communicating reasoning; learners should also demonstrate key concepts and principles
in every grade level. Upon completing Grade 10, learners should demonstrate
understanding and appreciation of key skills and concepts involving numbers and number
sense through the lessons from sets and real numbers; measurement, using the lessons on
the conversion of units; and patterns and algebra, using the lessons on linear equations and
inequalities in one and two variables, linear functions, and systems of linear inequalities
and equations in two variables. Some other lessons in algebra are exponents and radicals,
quadratic equations, inequalities and functions, and lessons from polynomial functions and
equations. Learners should also demonstrate understanding and appreciation of key
concepts and skills in geometry through lessons on congruence, inequality and similarity,
and basic trigonometry, and lessons from statistics and probability.
Furthermore, the reform of education did not satisfy its goals and aims of achieving
learning standards, based on the international assessments conducted by PISA and TIMSS
in 2018 and 2019, respectively. According to the OECD PISA 2018 database, the
Philippines scored 353 in mathematics, a 136-point difference from the average score of
489. It was shown that 15-year-old girls outperformed boys in mathematics by 12 points,
unlike the typical scores across OECD countries, where 15-year-old boys outperformed
girls by five points. Only 19% of students assessed in the Philippines attained Level 2 or
higher in mathematics. Students belonging to this level can interpret and recognize
situations without direct instructions and can represent them mathematically (PISA, 2018).
Meanwhile, in TIMSS 2019, in which 64 countries participated, Grade 4 Filipino students
ranked last with an average scale score of 297. Against the international mathematics
achievement centerpoint of 500, it is evident that Filipino learners were behind by 203
scale score points. It was also tallied that among the participating countries, there were 27
where boys outperformed girls and four where girls outperformed boys in Grade 4
Mathematics. In the results for Grade 8 Mathematics, boys in six countries outperformed
girls in terms of achievement level score, but there were seven countries where girls
outperformed boys in Grade 8 (IEA's TIMSS, 2019).
Reform in education alone does not explain the results of the international
assessments, which show low achievement levels across many participating countries.
Thus, continuous educational reform should be geared toward various innovations and
strategies that can sustain students' interests corresponding to their needs. To address the
emerging low achievement, improving curricula and instructional practices and
approaches will bring a positive result in a short period; assessing that improvement
validly is an equally important matter.
Individuals have an innate ability to learn naturally. Naturally, they can find ways
to learn as quickly as possible. Learning occurs when one observes a change in learners'
behavior resulting from what has been experienced. Some students prefer to acquire
knowledge in particular ways, which makes learning styles crucial in delivering teaching
and learning processes. This is evident in studies showing that effective learning can be
achieved once the delivery of instruction is tailored to students' preferences. The way
learners concentrate on, store, and remember new and/or difficult information refers to
learning style (Prashning, 2005). It is not an ability but a preference for the way an
individual uses his or her abilities (Sternberg, 1994). There are various learning style
schemes; however, these are often categorized by sensory approaches such as visual, aural,
and tactile.
The present study will analyze item bias using the Mantel-Haenszel method and
will focus on the students' learning styles, specifically visual, auditory, and tactile learners,
as a variable in detecting biased items. Visual learners are students who learn best through
what they see. That is why teachers are advised to use visual aids such as pictures,
illustrations, graphs, and films, and why demonstrating the activity is essential for these
learners. Studies on learning styles noted that at least 40% of all students are visual learners
and that the application of visual aids is generally beneficial in supporting learners'
processing and retention of information (Masic et al., 2020; Clarke et al., 2006).
Auditory learners are students who are fond of acquiring information through
listening. These students interpret meaning through the tone of a sound as well as through
the quickness and accentuation of speech. This learning style suggests that students should
hear well, recite information, and have conversations for better memorization (Masic et
al., 2020; Gilakjani, 2012). Tactile learners are students who enjoy creating things with
their hands and making sense of information through touch. Writing, highlighting,
underlining, labeling, and role-playing help these types of learners retain information.
Learning style is an indicator of how students like to learn (Keefe, 2011). Banaga (2016)
emphasized that learning style covers individuals' natural or habitual patterns of acquiring
and processing information in any learning situation. Ishak and Awang (2017) described it
as a motive and strategy that involves the planning and learning system of students to learn
and achieve their aspirations. Moreover, it is also a way in which individuals absorb and
retain new information or skills, regardless of how they are described. The use of learning
styles is required to differentiate instruction. Ideally, this includes all the learning styles so
that students can learn in a way that suits them best.
Assessment in Education
In response to the COVID-19 pandemic, the Department of Education crafted the
Basic Education-Learning Continuity Plan (BE-LCP), which seeks to continue the
teaching-learning process while ensuring the health, safety, and welfare of the learners,
teachers, and personnel of the department with the use of different learning modalities. It
also specifies the rules, appropriate guidelines, and directives through projects, programs,
and activities. The plan also cited some challenges that might be faced in the
implementation of distance learning technologies by teachers, learners, and stakeholders.
These include Internet connectivity among students and learners, which is the major
limiting factor; the capacity of teachers to use technology in the learning delivery; and
early grade levels, which should be accompanied by parents and guardians in using
technology.
Varied assessments will be employed in the BE-LCP to identify the methods to be
used in learning delivery, such as direct teaching, discussions, and queries about the lesson
and activities. The materials component focuses on the learning materials needed and the
references used in constructing content lessons. Learning activities, such as case studies,
group discussions, and presentations, facilitate learning. Lastly, assessments are crucial
during the pandemic, as assessment is a way to gauge student progress. Through the
learning continuity plan, these four aspects of the teaching-learning process are sustained.
Assessment plays a vital role in the teaching-learning process during the pandemic,
as teachers must assess learners in distance learning programs and provide feedback and
formative guidance to students. When a teacher fails to provide regular feedback, students
may fail to gauge their learning levels and struggle to improve their new knowledge and
skills in the self-learning modality. Assessment practices should also be analyzed for
quality and equity implications. Teachers may also use text messaging and other available
platforms to reach learners.
Another challenge for teachers in the new normal education is to rapidly change
the practices of the face-to-face teaching and learning process, such as giving daily tasks,
responsibilities, and accountabilities. The development of new alternatives and varied
formative and summative assessments is required. These challenges for the teachers,
brought about by the pandemic, were a roller coaster ride set against the aim of the K to 12
curriculum to develop learners holistically.
The BE-LCP also managed to streamline the educational competencies expected
of the learners, as was evident from the Most Essential Learning Competencies (MELCs)
produced by the Department of Education, which trimmed the curriculum's 14,171
learning competencies down to 5,689 in the Most Essential Learning Competencies
(MELCs). Some mathematical competencies were removed from the K to 12 curriculum,
retaining 543 out of 741. Learners are graded through summative assessments such as
written works and performance tasks in any learning delivery modality. Under the standard
grading system, grades are composed of written work (40%), performance tasks (40%),
and quarterly assessments (20%). Written work includes long tests, unit tests, or any
activities that assess students' written skills in expressing their ideas. The performance
task includes a skill presentation and demonstration; written works may also be included
in this component. Finally, quarterly assessments are administered at the end of the quarter
(DepEd Order No. 8, s. 2015). The learning
continuity plan of DepEd has been crafted to ensure the teaching and learning process amid
the pandemic. Based on the interim guidelines for grading and assessment in light of the
basic education learning continuity plan, students' mathematics achievement level was
composed of written works (50%) and performance tasks (50%), whatever the form of
modality. Summative tests will continue in the form of written works and performance
tasks under the BE-LCP of the Department of Education. Teacher-made items can be used
if they are re-evaluated to find their difficulty and discrimination indices as well as
reliability, or the
teacher may use test items stored and used repeatedly. There are two important
characteristics of an item that will be of interest to teachers. These are item difficulty and
discrimination indices. The difficulty index of an item is defined as the number of students
who answer the item correctly divided by the total number of students. Difficult items
tend to discriminate between those who know and those who do not know the answer.
Conversely, the easy items could not discriminate between the two groups of students.
Therefore, we are interested in deriving a measure that will tell us whether an item can
discriminate between these two groups of students. This measure is called the
discrimination index (Santos, 2007). Items can be reused, so their performance should be
identified not only within a test form but also across all test forms.
The item difficulty index indicates that the higher its value, the easier the item is to
answer. It is important in determining whether students have learned the concept being
tested. A high difficulty value indicates that a greater proportion of the sample answered
the question correctly. A lower difficulty value indicates that a smaller proportion of the
sample understood the question and answered correctly. This may be because the item was
coded incorrectly, or because of ambiguity in the item, confusing language, or ambiguity
in the response options. Such items are candidates for modification or deletion from the
pool of items (Boateng et al., 2018). The desirable difficulty levels are slightly higher than
midway between chance and perfect scores for the item, that is, if the goal is maximizing
item discrimination.
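To make the computation concrete, the difficulty index can be sketched as follows. This is a minimal illustration in Python; the response matrix and variable names are hypothetical, not data from the study, and scoring is assumed dichotomous (1 = correct, 0 = wrong):

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = examinees, columns = items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])

# Difficulty index: percentage of examinees answering each item correctly;
# higher values mean easier items.
difficulty = responses.mean(axis=0) * 100
print(difficulty)   # [ 60.  60.  20. 100.]
```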
The discrimination index indicates how well an item distinguishes between examinees based
on how well they know the material being tested. It reflects the degree to which the item
and the test, as a whole, measure unitary ability or attributes. This was calculated by
subtracting the proportion of examinees in the lower group from the proportion of
examinees in the upper group who answered the item correctly or endorsed it in the
expected direction. It enables the identification of items that differentiate correctly between those
who are knowledgeable about a subject and those who are not (positive discriminating
items), items that are poorly designed such that more knowledgeable get them wrong and
less knowledgeable get them right (negative discriminating item), and items that fail to
differentiate between participants who are knowledgeable about a subject and those who
are not (non-discriminating item), according to Boateng (2018). Items with low indices
were often ambiguously worded and should be examined. Items with negative indices
should be examined to determine why a negative value was obtained. Boateng (2018) states
how the item discrimination index improves test items. First, items that are too easy or too
difficult and items that discriminate negatively should be reexamined and modified.
Finally, items that do not discriminate at all should be revised or removed.
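The discrimination index computation can likewise be sketched as follows. The upper/lower 27% split used here is an assumption of the sketch, since the text does not state the grouping fraction used; the data are hypothetical:

```python
import numpy as np

def discrimination_index(item, totals, frac=0.27):
    """D = p_upper - p_lower for one 0/1-scored item.

    frac=0.27 (the common upper/lower 27% split) is an assumption of
    this sketch; the study does not state the split fraction it used.
    """
    order = np.argsort(totals)                 # rank examinees by total score
    n = max(1, int(round(frac * len(totals))))
    lower, upper = order[:n], order[-n:]
    return item[upper].mean() - item[lower].mean()

totals = np.array([10, 14, 22, 30, 35, 40, 48, 55])   # hypothetical total scores
item = np.array([0, 0, 0, 1, 0, 1, 1, 1])             # 0/1 answers on one item
print(discrimination_index(item, totals))             # 1.0 -> discriminates well
```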
Another important factor to consider in the construction and design of test questions
is the reliability of the questionnaire. The reliability of a test refers to the extent to which
it is likely to produce consistent scores. High reliability means that the questions of a test
tend to be pulled together. This signifies that the relative scores of the students would show
little change when a parallel test was developed using similar items. Low reliability means
that the questions are unrelated to each other in terms of who answers them correctly. The
KR20 reliability analysis was used in this study to test the internal consistency reliability
of the mathematics achievement test. Generally, the higher the number of test items, the
higher the internal consistency reliability, and vice versa. Thus, a test can be made
reliable by increasing the number of test questions (Pedrajita, 2017; Ferguson & Takane,
1989). Moreover, the identification of the difficulty and discrimination indices alone is not
sufficient to ensure test fairness among students (Villas, 2019; Gatchalian & Lantajo, 2010).
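For reference, the KR-20 coefficient described above can be computed as in the following minimal sketch. The response matrix is hypothetical, and the use of the sample variance (ddof=1) of the total scores is an assumption of the sketch:

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson 20 for a 0/1-scored response matrix
    (rows = examinees, columns = items)."""
    k = responses.shape[1]                          # number of items
    p = responses.mean(axis=0)                      # proportion correct per item
    q = 1 - p
    var_total = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / var_total)

# Hypothetical 0/1 response matrix for 6 examinees on 5 items.
resp = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
])
print(round(kr20(resp), 3))   # higher values signal more consistent scores
```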
The study also measured the concurrent validity of the achievement test through
the relationship between second-quarter academic performance and scores on the test.
Concurrent validity coefficients signify the homogeneity of a group of test scores. The
higher the correlation, the less homogeneous the group of scores, which means that the
larger the range of scores, the larger the correlation coefficient. The more homogeneous
the test becomes, the less valid it is (Pedrajita, 2017; Ferguson & Takane, 1989). The lower
the correlation, the more homogeneous the group of test scores.
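As an illustration of how such a coefficient is obtained, the following short sketch pairs hypothetical second-quarter grades with raw test scores; scipy's pearsonr is one standard implementation of the Pearson product-moment correlation:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired data: second-quarter grades and raw test scores.
grades = np.array([80, 85, 90, 78, 88, 92, 75, 83])
scores = np.array([28, 31, 36, 25, 33, 38, 22, 30])

r, p_value = pearsonr(grades, scores)
print(f"r = {r:.3f}, p = {p_value:.4f}")   # r near +1 => scores track grades
```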
Sutton and Krueger stated in "EDThoughts: What We Know About Mathematics
Teaching and Learning" that "a successful educational system focuses on students'
outcomes and provides necessary support among the students in achieving them."
Strategies to create equitable classrooms with high-quality content were also provided by
Sutton and Krueger. The accurate identification of students' knowledge and mastery is one
of the strategies enumerated. Diagnosing where students struggle, in order to apply
appropriate learning instruction, was also cited. Another strategy is to engage all learners
with higher-order thinking skills. All these strategies can be vital in providing a well-
crafted assessment that will surely identify students' difficulties and the need for
calibrating learning instruction. The literature supports the aim of the study, which is to
provide validated questions free from item bias that will surely address students'
difficulties. The ability to construct high-quality test items is not automatically derived
from knowledge of the subject matter or a formulation of learning processes, although all
of these are prerequisites; it requires knowledge of the techniques and principles of test
construction and the skills in their application.
This literature can be of great help in the ongoing study because it provides the
researcher with a guide in constructing test items for the achievement test in mathematics
despite all the challenges of the present situation. This emphasizes the need to provide
quality, well-constructed test items.
Item Bias
Test equity is a challenging task in the development of educational assessment
instruments. It ensures that no individuals are disadvantaged in any way when dealing with
the instruments. This is primarily achieved by ensuring that a test measures only construct-
relevant characteristics, not factors such as gender, ethnicity, and socio-economic status.
If test equity is not achieved, a test or test item is biased toward a particular group
(Osterlind, 1983), as cited by Diaz et al. (2021). This is related to the test fairness issue
between groups of variables: a condition in which characteristics that are not related to the
construct affect performance. A test is considered biased if the results have disadvantages
for certain students over others based on their identified profile, such as students'
ethnicity, income backgrounds, gender, and
other variables. It is a possible threat to validity (Akcan & Kabasakal, 2019; Clauser &
Mazor, 1998). Identifying test bias requires test developers and educators to determine why
one group of students tends to perform better or worse than another group on a particular
test. This concern grows as school student populations become more diverse and exams
play increasingly vital roles
in evaluating individual performance or access to opportunities. Item bias analysis does not
examine if there are general between-group variations in total score, or whether group A
members would be more likely to answer "yes" to X than members of group B. Item bias
studies, on the other hand, look for intergroup disparities at the score level, i.e., whether
group A members with a certain attitude level have the same average score on a given item
as group B members with the same attitude level. Bias is not the mere presence of a score
difference between groups; in test items, bias is the presence of a systematic error in the
measurement. Items may be judged relatively more or less difficult for a particular group
by comparison with the performance of another group or groups drawn from the same
population.
Test bias can be categorized into construct-, content-, and predictive-validity bias.
When a test measures different characteristics for different groups of examinees, it exhibits
construct-validity bias. The results of an intelligence test, for instance, may reflect an
English-language learner's language proficiency rather than intellectual talents, since
English-language learners are likely to meet vocabulary on the test that they have not
learned. Content-validity bias emerges when a test's subject matter is substantially harder
for one group of pupils than for another. This can happen when a group is scored unfairly
(for instance, when answers that make sense in one group's culture are deemed correct),
when questions are worded in ways that are unfamiliar to some students, or when certain
groups, such as members of various minority groups, have not been given the same
opportunity to learn the material being tested. A subtype of this prejudice, known as item
selection bias, deals with the use of certain test items that are better suited to one group's
linguistic and cultural background. The third category, predictive-validity bias, relates to
how effectively a test predicts future performance for a certain student group. For instance,
a test would be regarded as "unbiased" if it accurately predicted future performance
equally well for all groups of students.
College entrance examinations, given their crucial role in determining admission to higher
education institutions, frequently generate questions regarding both test bias and fairness.
For instance, even though female students often receive higher grades in college, they
frequently score lower than male students on such examinations, which possibly suggests
gender bias in test design.
Such bias can be measured by its magnitude at the item level through Differential
Item Functioning (DIF). DIF is an indicator that an item is potentially biased, and further
review is needed to confirm biased items (Gatchalian & Lantajo, 2010; Osterlind, 1983),
as cited by Villas (2019). DIF may be rooted in item bias, which is a harmful DIF (Diaz et
al., 2021). DIF analysis is a statistical procedure for flagging such items.
This study focuses on detecting and removing bias items, which may reflect validity
bias, based on the learning styles and the type of school of the learners. It may also describe
the magnitude of the bias based on the DIF analysis. Thus, the literature mentioned above
guided the detection and removal of biased items in Mathematics 7.
Mantel-Haenszel Test
Different methods for detecting bias have been developed over the years. Among
these methods is the Mantel-Haenszel analysis, which will be used in this study to detect
and remove bias. The Mantel-Haenszel (MH) method is a common method for detecting
differentially functioning items. It is seen as a practical means of determining biased test
questions because of its simplicity and ease of use, and it provides effect size statistics
indicating whether the detected DIF is damaging. It is easy to understand and implement
and provides a statistical significance test (Diaz et al., 2021; Holland & Thayer, 1988;
Millsap & Everson, 1993). It aims to test whether there is an association between group
membership and item response conditional on the total score (Ukanda et al., 2019; Magis
et al., 2010). It is widely implemented in detecting DIF, as the procedure has demonstrated
external validity and matches examinees on ability when comparing the performance of
all groups on an item. The MH analysis yields a chi-square test with one degree of freedom;
if the computed statistic is greater than the critical chi-square value of 3.84, the item is
tagged as a biased item. The MH procedure is also used to estimate a common odds ratio
that yields a measure of effect size, evaluating the magnitude of DIF (Pedrajita, 2017).
This ratio is transformed to produce the MH delta (DMH). The delta metric of the odds
ratio αMH, as suggested by Holland and Thayer (1988) and cited by Khalid et al. (2021),
is given by the equation DMH = -2.35 ln(αMH). A positive DMH indicates DIF in favor
of the focal group,
while a negative DMH indicates DIF in favor of the reference group. Under the
classification proposed by Zwick and Ercikan (1989), cited by Khalid et al. (2021), an item
with an absolute value of the delta MH less than 1 is a Type A item, signifying negligible
DIF; its MH chi-square test is not statistically significant, and the item is considered to
function properly. Type B items are items with an absolute value of the MH delta between
1 and 1.5, tagged as moderate DIF; items with the lowest delta MH and without alternative
items could still be used. Finally, Type C items signify a large amount of DIF that is
statistically significant; these are the items with an absolute value of the delta MH greater
than 1.5. A critical review of these items is necessary, and they will be selected only in
exceptional circumstances. In addition, because the MH estimator is consistent even when
the sample size per stratum is small, it can be useful in DIF studies, even when there is a
very fine partition of the ability distribution.
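The statistics described above can be sketched as follows. This is a minimal illustration of the standard MH formulas (with the continuity correction), not the exact SPSS routine used in the study; stratifying by raw total score and the toy data are assumptions of the sketch:

```python
import numpy as np

def mantel_haenszel(item, group, totals):
    """Mantel-Haenszel DIF statistics for one 0/1-scored item.

    item   : array of 0/1 responses to the studied item
    group  : array with 0 = reference group, 1 = focal group
    totals : matching variable (here, raw total scores) defining strata

    Returns (chi_square, alpha_MH, delta_MH). A sketch of the standard
    formulas; operational analyses usually pool sparse strata first.
    """
    obs = exp = var = 0.0          # observed/expected counts and variance
    num = den = 0.0                # numerator/denominator of alpha_MH
    for s in np.unique(totals):
        m = totals == s
        ref, foc = m & (group == 0), m & (group == 1)
        a = item[ref].sum()        # reference group, correct
        b = ref.sum() - a          # reference group, wrong
        c = item[foc].sum()        # focal group, correct
        d = foc.sum() - c          # focal group, wrong
        t = a + b + c + d
        if t < 2 or ref.sum() == 0 or foc.sum() == 0:
            continue               # stratum carries no information
        obs += a
        exp += (a + b) * (a + c) / t
        var += (a + b) * (c + d) * (a + c) * (b + d) / (t**2 * (t - 1))
        num += a * d / t
        den += b * c / t
    if var == 0 or den == 0:
        return float("nan"), float("nan"), float("nan")
    chi2 = (abs(obs - exp) - 0.5) ** 2 / var   # 1 df; compare to 3.8415
    alpha = num / den
    delta = -2.35 * np.log(alpha)              # ETS delta metric (DMH)
    return chi2, alpha, delta

# Hypothetical usage with random 0/1 item scores, groups, and totals.
rng = np.random.default_rng(1)
group = rng.integers(0, 2, 200)
totals = rng.integers(0, 61, 200)
item = rng.integers(0, 2, 200)
chi2_val, alpha, delta = mantel_haenszel(item, group, totals)
print(chi2_val > 3.8415, alpha, delta)   # flag; |delta| > 1.5 => Type C
```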
These qualities of the Mantel-Haenszel method, sketched above, were the reason for
using it in the present study. Different studies on MH analysis assert that its results are
robust and dependable.
Related Studies
One study was published in Advanced Nursing by Ibrahim and Hussein (2014). The
primary concern of the research
paper is to assess the Visual, Auditory, and Kinesthetic (VAK) Learning styles of the two
hundred ten (210) nursing students who are enrolled in two Nursing colleges in the
Universities of Mosul and Kirkuk. The results showed that visual learning styles were the
most tallied, with 40% of the 210 nursing students. Auditory and kinesthetic learning styles
accounted for 29.5% and 30.5%, respectively. Based on their sex, females preferred the
auditory learning style (30.3%) more than males did (27.3%), while males preferred the
kinesthetic learning style more than females did.
Apipah et al. (2018) analyzed mathematical connection ability based on students'
learning styles under a visualization, auditory, kinesthetic (VAK) learning model with self-
assessment. The research found that among the VIII-grade students of State Junior High
School 9 Semarang, students with a visual learning style had the highest mathematical
connection ability after taking the assessments. Moreover, students with kinesthetic and
auditory learning styles had average and lowest mathematical connection abilities,
respectively.
Sakinah and Avip (2021) conducted another study that aimed to determine students'
mathematical literacy skills based on their learning styles. The study, "An analysis of
students' mathematical skills assessed from their learning style," showed a result contrary
to that of the study by Apipah et al. (2018). It revealed that the mathematical
literacy skills of students with a kinesthetic learning style were better than those of students
with visual and auditory learning styles. The results of the study revealed low literacy skills
after tallying only 14% of the students who correctly answered the mathematical literacy
question. Furthermore, it was also revealed that visual learners were able to formulate the
given mathematical problems but were lacking in the use of mathematical concepts and
interpreting mathematical problems, while the students who were auditory learners had the
lowest mathematical literacy skills among the three groups.
Moreover, Karlimah and Risfiani (2017) emphasized students with auditory
learning styles in their study on students' mathematical connection ability. It was
concluded that learning facilities suited to auditory learners improved their mathematical
connection ability after the learning materials were improved to suit these learners. The
study suggested providing suited learning facilities for students with this kind of learning
style.
Ishartono et al. (2021), in their study "Visual, Auditory, and Kinesthetic Students:
How They Solve PISA-Oriented Mathematics Problems," revealed that there was no
difference in ability among students with visual, auditory, and kinesthetic learning styles.
Most of the students had visual learning styles, and 30% of them were in the high category,
while 44% and 26% were tallied in the medium and low categories, respectively. It was
also revealed that there were only eight students with auditory and two with kinesthetic
learning styles, who tallied within the medium and low categories.
On the other hand, the study by Mašic et al. (2020), "The Relationship Between
Learning Styles, GPA, School Level, and Gender," found that the most preferred learning
style among 269 middle and high school students in Sarajevo, Bosnia and Herzegovina,
was auditory, followed by visual and tactile. Most of the learners were middle school
students. At both levels, the auditory style was the most preferred learning style, while
contrary results were found for the next preferred learning style: the least preferred
learning style in middle school was visual, while in secondary school it was tactile.
The findings of the studies conducted by Ibrahim and Hussein (2014), Apipah et al.
(2018), Karlimah and Risfiani (2017), Mašic et al. (2020), Ishartono et al. (2021), and
Sakinah and Avip (2021) identify the preferred learning styles of students at different
school levels, which may indicate to the present study that the results of tallying students'
preferred learning styles may vary. Thus, the participants in this study may not be assumed
to share a single dominant learning style.
Bhat and Prasad (2021), in their study on item analysis and the optimization of
multiple-choice questions, aimed to evaluate the difficulty level and discriminating power,
with functional distractors, of multiple-choice questions (MCQs) using item analysis,
analyzing the poor items for writing flaws and optimizing them. Items were categorized
according to their difficulty index, deviations, and correlations. Defective items were
analyzed for proper construction and optimization. Seventeen (17) out of 20 defective
items were optimized and added to the question bank; two items were modified and added,
and one item was dropped. It was concluded that item analysis is a valuable tool for
detecting poor multiple-choice questions.
In another study, a valid, reliable 40-item achievement test in General Mathematics
was constructed. Eight experts reviewed the test for improvement and refinement. The
developed achievement
test was piloted with 425 senior high school students. The items were analyzed and
subjected to a reliability test. It was found that the average item difficulty was 0.40, which
means intermediate, while the average item discrimination was 0.34, which signifies a
good item. Moreover, the test also indicated a reliability coefficient of 0.84,
which means that the internal consistency value was acceptable. The results showed that
the developed achievement test for General Mathematics is an excellent tool for classroom
assessment.
Pedrajita (2017), who screened differentially functioning items in a chemistry
achievement test among public and private junior high school students in the Division of
City Schools in Quezon City, found that 22 out of 50 items displayed statistical bias using
the Mantel-Haenszel analysis. Ten of these items signified bias against private school
examinees, while 12 items indicated bias against public school examinees. The content
validity of the test differed from slightly
to moderately adequate in terms of the number of items retained. The concurrent validity
of the test differed, but all were positive, indicating a moderate relationship between the
examinees' test scores and GPA in Science III. The internal consistency reliability of the
tests differed. The more differentially functioning items were eliminated, the lower the
content and concurrent validity and the internal consistency reliability of the test became.
Eliminating differentially functioning items diminishes the content validity, concurrent
validity, and internal consistency reliability of the test, as it decreases the number of items
in the test, but this could be a basis for enhancing content, concurrent, and internal
consistency reliability by replacing the eliminated DIF items. It was also concluded that
Mantel-Haenszel had a high degree of correspondence with the other methods among the
four used in the study.
entitled "Differential Item Functioning in Grade 8 Math using Logistic Regression, Mantel-
Haenszel, and Logical Data Analysis" was conducted to compare DIF methods in
determining bias items between male and female, low and high English proficiency
square value, thus flagged as DIF and characterized as large evident by delta MH that
favors high English proficient examinees. The same number of items was tallied in
detecting DIF items using the examinees' gender. 25 items were significantly different,
containing a large amount of DIF in favor of female examinees. MH analysis indicated that
all the detected DIF items were classified as large amounts of DIF. The study recommends
that the practice of DIF analysis should be incorporated into test development to ensure
test validity from a unified perspective, especially in the Philippines, where such studies
are scarce. The Mantel-Haenszel method was found robust and useful and can be used
assertively. From a test development perspective, Mantel-Haenszel should be used to
screen unfair items. It provides a magnitude of DIF and an effect size, in addition to a
statistical significance test that facilitates further necessary actions, especially for item
writers and practitioners. This finding supports Michaelides' (2010) study at the European
University Cyprus, an illustration of a Mantel-Haenszel procedure in test equating, which
concluded that MH should be applied in the context of test equating to flag common items
that behave differently across cohorts of examinees. It has the advantage of conditioning
on ability when comparing the performance of two groups on an item. There are guidelines
on the effect size that can be used in the decision-making process of whether to retain or
discard a flagged item.
These enumerated studies will be of great help to the researcher in validating the
claims of the present study on detecting and removing item bias in the mathematics
achievement test.
Conceptual Framework
The study aimed to analyze and detect biased items in the Grade 7 Mathematics
Achievement Test using the Mantel-Haenszel method. Figure 1 shows the conceptual
framework of the study. It started with the construction of the test questions. A 60-item
test was used as the main instrument of the study. To ensure the proper distribution of the
items, the Most Essential Learning Competencies (MELCs) were used in the construction
of the table of specifications. For content validity, at least three experts were asked to
validate the test questions. Afterward, the administration of the test took place. The
respondents of the study were the Grade 7 students of public and private schools in
Victoria, Tarlac. The profile of the students, such as learning styles, type of school, and
academic performance, was gathered through the test questionnaire. A learning style
questionnaire adopted from O'Brien and the University of California, Merced Student
Advising and Learning Center was used to identify the students' learning styles.
The results of the test were used for item analysis: item difficulty and discrimination
indices were computed. The students' academic performance in the previous quarter and
their raw scores on the test were compared to find the concurrent validity.
Bias items were detected and analyzed after the test administration as well. Each
item was described with its corresponding profile (learning styles and school type). SPSS
was employed to compute the MH chi-square of each item; items whose values fall below
the critical value (3.8415 at the 0.05 alpha level, df = 1) are acceptable and not identified
as biased. Lastly, after the elimination of item bias and the validation of the test questions,
the results were compared to the original test questions. The validity was described in
terms of content, construct, and concurrent validity, together with internal consistency
reliability.
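The flagging rule itself is simple to reproduce; the brief sketch below (with hypothetical item names and chi-square values) also shows where the 3.8415 cutoff comes from:

```python
from scipy.stats import chi2

critical = chi2.ppf(0.95, df=1)                       # 3.8415, as used in the study
mh_chi_square = {"item_07": 5.21, "item_12": 1.94}    # hypothetical MH values
biased = [name for name, v in mh_chi_square.items() if v > critical]
print(round(critical, 4), biased)                     # 3.8415 ['item_07']
```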
Chapter 3
RESEARCH METHODOLOGY
This chapter presents the research design, how the population and samples were
determined, the research instrument utilized with the procedures in the validation, how the
data were gathered, and the statistical tools used in the analysis of the gathered data.
Research Design
The study used a descriptive and developmental research design. Developmental
research is concerned with producing and validating a product. Moreover, Gall et al. (2003)
stated that this method has two main objectives, which are (1) to develop a product and
(2) to validate it. This study sought to develop a 30-item validated and free-from-bias
Mathematics Achievement Test in Grade 7 using the Mantel-Haenszel method.
The development method follows three phases: Phase 1, preliminaries and test
construction; Phase 2, test administration; and Phase 3, validation and removal of bias
items according to learning styles and type of school. The final test version consists of
validated and unbiased test items.
The development of the achievement test was based on the Most Essential Learning
Competencies (MELCs) issued amid the pandemic. A table of specifications was prepared
to make sure that the items were properly distributed. The teacher-made test questions
were constructed and further validated by experts. Students' learning styles and type of
school were also determined to serve as grouping variables in the bias analysis.
The study also used descriptive research that seeks to analyze and detect biased
items of the Grade 7 Mathematics Achievement Test using the Mantel Haenszel Method.
Also, the study describes the validity and reliability of the test. McCombes (2019)
defines the aim of descriptive research, that is, "to describe a population, situation or
phenomenon accurately and systematically. It can answer what, when, where, and how
questions, but not why questions." McCombes also described the appropriateness of the
descriptive research design: it is suitable "when the research aimed to identify characteristics,
frequencies, trends, and categories." As descriptive research seeks to describe the situation
and how the variables are naturally distributed, the results provide the researcher the data
to illustrate the basic relationships to have a better understanding of the questions asked
(Thyer, 2009).
After the teacher-made test questions were constructed, validation was done to
describe the content, construct, and concurrent validity of the instrument. The internal
consistency of the instrument was also described using the Kuder-Richardson 20. In
addition, bias items were described and eliminated using the Mantel-Haenszel method
with respect to the respondents' learning styles and the type of school they are enrolled in.
The respondents of the study were meticulously chosen, with a deliberate emphasis
on the random selection of Junior High School students in the 7th grade who actively
participated in the limited face-to-face modality from the public and private schools in the
Municipality of Victoria. This meticulous approach to respondent selection was designed
to yield a balanced and representative sample of students within the chosen municipality,
thereby facilitating the derivation of insightful and meaningful conclusions from the study's
findings.
The construction of the instrument started with drafting the table of specifications
of the achievement test to make sure that test items were distributed evenly. The second-
quarter Grade 7 mathematics competencies were the focus of the study; the MELCs
comprise 10 weeks of competencies. A 60-item multiple-choice test was crafted. Three
experts, consisting of a head teacher, a master teacher, and a Grade 7 teacher, checked
whether each item developed was essential and validated the items.
To facilitate the implementation, the researcher first secured the necessary permits
and letters from the Schools Division Superintendent of Tarlac Province and from the
school heads of the different public and private schools in Victoria, Tarlac. The 60-item
test (see Appendix D) was given to the respondents in a 2-day scheme, since the subject
of Mathematics is allotted an hour per day, four times a week. On the first day, the students
were given an hour to answer the first 30 items and the 15-item learning style questionnaire
adopted from O'Brien and the University of California, Merced Student Advising and
Learning Center. On the second day, the other 30 items were given. The researcher also
followed set procedures to make the testing efficient and effective. Below were the test
administration procedures:
Pre-Administration
1. Count the number of copies of the test questionnaire and answer sheet.
2. Explain to the students how the test will be conducted. A 30-item test will be taken by
the students each day for 60 minutes.
3. Shade the letter of your answer on the answer sheet provided. Shade legibly.
5. No erasures. You may use a clean sheet of paper for your computations.
6. You will take the 30-item test in 60 minutes. (State the time that started and the
Post Administration
Content validity through experts was described using the content validity index before the test administration, followed by item analysis to describe item difficulty and discrimination after the students' scores were checked and tallied. To describe the construct validity of the test, Principal Component Analysis was used. The concurrent validity of the instrument was tested by describing its relationship with the students' second quarter performance using the Pearson product-moment correlation. The internal consistency of the instrument was also described using the Kuder-Richardson 20. The original test version underwent the detection of biased items using the Mantel-Haenszel method with respect to the respondents' learning styles and the type of school in which they were enrolled. Detected biased items were then eliminated.
The validation process was repeated after the elimination of biased items to further describe the effect of bias elimination on the strength of the validity of the achievement test. Lastly, after the combined process of bias elimination and test validation, the revised test version was produced, and its validity and reliability were again described.
Research Instrument
The main instrument utilized in the conduct of the study was a 60-item teacher-made achievement test based on the Most Essential Learning Competencies (MELCs) that was validated and checked through consultations with experts, a Master Teacher, a Head Teacher, and a Mathematics teacher, as well as with the dissertation adviser of the researcher. Furthermore, the construct, concurrent, and content validity of the instrument were described.
Content Validity
The test questions were evaluated by experts (i.e., Master Teachers in Mathematics and Math majors). The experts validated whether each test question is essential or not essential to the Most Essential Learning Competencies (MELCs), the subject matter, and the level of the target respondents. The experts also judged the questionnaire in terms of the clarity of its statements.
A content validity index of items was employed to describe the validity of each item of the test. Table 1 shows the number of experts and its implication for the acceptable content validity index.
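To make the computation concrete, the sketch below shows how an item-level content validity index can be computed in Python. The three-expert panel and the sample ratings are hypothetical, and the helper name item_cvi is ours, not from the study.

# A minimal sketch of the item-level content validity index (I-CVI):
# the proportion of experts rating an item essential (2 on the scale used).
# The function name and the sample ratings are hypothetical.

def item_cvi(ratings):
    """ratings: one rating per expert, 2 = essential, 1 = not essential."""
    return sum(1 for r in ratings if r == 2) / len(ratings)

# With a panel of 3 to 5 experts, an item is typically retained only
# when every expert rates it essential, i.e., I-CVI = 1.0.
print(item_cvi([2, 2, 2]))  # 1.0       -> retained
print(item_cvi([2, 2, 1]))  # about 0.67 -> flagged as non-essential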
The difficulty and discrimination indices were also utilized by the researcher to describe content validity. After computing the difficulty and discrimination indices, the items were plotted in the cross-tabulation shown in the table. Frienberg (1995), cited by Isip (2013), identified an adequate discrimination index as D = 0.3 and above, while the difficulty index must fall within the optimum region (44.51-74.50). After plotting, items that fall outside these regions were changed or reconstructed.
Table 2 shows the discrimination and difficulty indices of the test questions. At 44.51 to 74.50 on the difficulty index and 0.3 up to 1.0 on the discrimination index, questions are acceptable and can be retained. Otherwise, questions were changed or reconstructed.
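As a concrete illustration, the sketch below computes both indices from a 0/1 scored response matrix. The 27% upper/lower grouping is a common convention and an assumption here, not a detail stated in the study; the function name is ours.

import numpy as np

# A minimal sketch of the item analysis above, assuming a 0/1 scored
# response matrix (rows = students, columns = items).

def item_analysis(responses):
    responses = np.asarray(responses)
    totals = responses.sum(axis=1)              # total score per student
    order = np.argsort(totals)
    k = max(1, int(round(0.27 * len(totals))))  # size of each extreme group
    lower, upper = responses[order[:k]], responses[order[-k:]]
    difficulty = responses.mean(axis=0) * 100   # % who answered correctly
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)
    return difficulty, discrimination

# Retain an item when 44.51 <= difficulty <= 74.50 and discrimination
# is at least 0.30; otherwise change or reconstruct it, as in Table 2.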
Construct Validity
Since the data were found suitable and eligible based on the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, Principal Component Analysis (PCA) was used to measure the test's construct validity.
The KMO values for sampling adequacy are interpreted as follows (Shrestha, 2020):

Value        Description
0.80-1.00    Adequate
0.70-0.79    Middling
0.60-0.69    Mediocre
Bartlett's test is designed to test the equality of variances across groups against the alternative that variances are unequal for at least two groups. Bartlett's test of sphericity provides a chi-square output that must be significant; it indicates that the correlation matrix is not an identity matrix, and accordingly it should be significant (p < 0.05).
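The two checks described above can be run in Python with the factor_analyzer package, as sketched below, assuming that package is available. The random matrix is only a stand-in for the actual students-by-items scored responses.

import numpy as np
from factor_analyzer.factor_analyzer import (
    calculate_bartlett_sphericity, calculate_kmo)

# A minimal sketch of the sampling-adequacy checks; the data are
# placeholder 0/1 responses, not the study's actual scores.
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(207, 60)).astype(float)

chi_square, p_value = calculate_bartlett_sphericity(responses)
kmo_per_item, kmo_total = calculate_kmo(responses)

# PCA is supported when the overall KMO is at least mediocre (>= 0.60)
# and Bartlett's test is significant (p < 0.05), i.e., the correlation
# matrix is not an identity matrix.
print(f"Bartlett chi-square = {chi_square:.1f}, p = {p_value:.4f}")
print(f"Overall KMO = {kmo_total:.3f}")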
Concurrent Validity
The respondents' performance during the second quarter was tabulated along with their test scores. To identify the concurrent validity of the test questions, the relationship between the respondents' performance on the test and their performance during the second quarter was sought. The second quarter grades were taken from the final grades composed of 50% written works and 50% performance tasks through standardized summative tests. Linear regression analysis was also used to describe the variance in academic performance explained by the students' scores on the Mathematics achievement test.
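A sketch of this analysis using SciPy is shown below; the score and grade arrays are hypothetical stand-ins for the tabulated data.

import numpy as np
from scipy import stats

# A minimal sketch of the concurrent-validity computation.
scores = np.array([35, 42, 28, 50, 39, 45, 31, 48])  # placeholder data
grades = np.array([80, 86, 78, 92, 84, 88, 79, 90])  # placeholder data

r, p = stats.pearsonr(scores, grades)    # strength of the relationship
reg = stats.linregress(scores, grades)   # grade predicted from score

# r squared is the share of variance in grades explained by the scores,
# the quantity reported by the linear regression analysis.
print(f"r = {r:.3f}, p = {p:.4f}, r^2 = {r**2:.3f}")
print(f"grade = {reg.intercept:.2f} + {reg.slope:.2f} * score")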
Table 4 shows the coefficient-of-reliability scale, based on Cronbach's alpha, used to interpret the test questions' internal consistency. A coefficient from 0.7 to 0.9 indicates acceptable internal consistency, a coefficient of 0.9 and above (a ≥ 0.9) is excellent, and a coefficient below 0.7 indicates that the internal consistency is not acceptable.
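The Kuder-Richardson 20 coefficient used for the dichotomously scored test follows the formula KR20 = (k/(k-1)) (1 - Σ p q / σ²), where p and q are the per-item proportions correct and incorrect and σ² is the variance of total scores. A minimal sketch is below; the helper name is ours.

import numpy as np

# A minimal sketch of Kuder-Richardson 20, assuming a 0/1 scored
# response matrix (rows = students, columns = items).

def kr20(responses):
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                  # number of items
    p = responses.mean(axis=0)              # proportion correct per item
    q = 1.0 - p                             # proportion incorrect per item
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_variance)

# Coefficients between 0.7 and 0.9 read as acceptable; the original
# 60-item test tallied a KR20 of 0.875 (Table 14).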
The learning style questionnaire adopted from O'Brien and the University of California, Merced Student Advising and Learning Center was utilized, with responses rated on the following scale:
3 - Often
2 - Sometimes
1 - Seldom
Statistical Treatment
To describe the respondents' learning styles, frequency counts and percentages were utilized. The Likert scale below is the basis for describing the overall learning style ratings:
1.00-1.66 Seldom
1.67-2.33 Sometimes
2.34-4.00 Often
Frequency counts and percentages were also used to describe the respondents' school type. To describe the respondents' academic performance in Mathematics during the Second Quarter, the levels of proficiency in the progress report prescribed by the Department of Education were used.
To describe the content validity of the test questions, the content validity index (CVI) was used. The rating scale for the essentiality of the test questions to the Most Essential Learning Competencies, the subject matter, and the level of the target respondents was:
1 - not essential
2 - essential
For the degree of clarity of the test questions, a separate rating scale was used. Item difficulty and discrimination indices were also used for content validity.
To describe the construct validity of the test, Principal Component Analysis was utilized, with Kaiser's Criterion and the Scree Plot Test used to identify the number of factors to retain. To describe the concurrent validity, the relationship between the academic performance of the learners during the 2nd quarter and the teacher-made achievement test in mathematics was tested.
To analyze biased items, the Mantel-Haenszel chi-square statistic was used. If the computed value of the Mantel-Haenszel chi-square is less than its critical value (3.8415) at the 0.05 alpha level with one degree of freedom, the item is acceptable and is not identified as a potentially biased item; the critical value serves as the detection threshold for potentially biased items. On the other hand, items with a Mantel-Haenszel chi-square statistic greater than the critical value were flagged as biased items. To describe the degree of bias of the items, Differential Item Functioning analysis through the Mantel-Haenszel Delta (MHD) was used. A positive MHD indicates that DIF is in favor of the focal group, and a negative value indicates that DIF is in favor of the reference group (Khalid et al., 2021). Differential Item Functioning (DIF) analysis is one method for examining bias at the item level by statistically detecting items that function differently across comparable groups. A 2016 classification, as cited in the study of Ukanda et al. (2019), categorized biased items (or DIF) by the size of the effect.
Table 7 shows the detection threshold and the effect size categories of the Mantel-Haenszel chi-square DIF detection method.
Table 7
Detection Threshold and Effect Size of the Mantel-Haenszel Chi-Square DIF Detection Method

Detection Threshold: MH chi-square greater than 3.8415 (df = 1, alpha = 0.05)

Effect Size (Absolute Value)    Category Scale
0.0-1.0                         Negligible
>1.0-1.5                        Moderate
>1.5                            Large
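As a concrete reference for the procedure above, the sketch below computes the Mantel-Haenszel chi-square (with continuity correction) and the MHD effect size for a single item, stratifying examinees on total test score. The function name, the group coding, and the rule for skipping uninformative strata are our assumptions, not details taken from the study.

import numpy as np

# A minimal sketch of the Mantel-Haenszel DIF statistic for one item.
# item: 0/1 scores on the studied item; total: total test scores used
# as the matching variable; group: 0 = reference, 1 = focal.
def mantel_haenszel(item, total, group):
    item, total, group = map(np.asarray, (item, total, group))
    sum_a = sum_ea = sum_var = num = den = 0.0
    for s in np.unique(total):                 # one 2x2 table per stratum
        m = total == s
        a = np.sum((group[m] == 0) & (item[m] == 1))   # reference, correct
        b = np.sum((group[m] == 0) & (item[m] == 0))   # reference, incorrect
        c = np.sum((group[m] == 1) & (item[m] == 1))   # focal, correct
        d = np.sum((group[m] == 1) & (item[m] == 0))   # focal, incorrect
        t = a + b + c + d
        if t < 2 or (a + b) == 0 or (c + d) == 0:
            continue                           # skip uninformative strata
        sum_a += a
        sum_ea += (a + b) * (a + c) / t                      # E(A)
        sum_var += ((a + b) * (c + d) * (a + c) * (b + d)
                    / (t * t * (t - 1)))                     # Var(A)
        num += a * d / t
        den += b * c / t
    chi_square = (abs(sum_a - sum_ea) - 0.5) ** 2 / sum_var  # corrected
    mhd = -2.35 * np.log(num / den)            # delta-scale effect size
    return chi_square, mhd

# Items with chi-square above 3.8415 are flagged as potentially biased;
# a positive MHD favors the focal group, and |MHD| > 1.5 reads as large DIF.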
Chapter 4
This chapter presents the data gathered, the results of the statistical analysis, and the interpretation of the findings based on the objectives of the study. These are presented in tables.
1. Students' Profile
In this study, the Grade 7 students' learning styles, type of school, and academic performance in Mathematics during the second quarter were described.
Table 8
Profile of Grade 7 Students
n=207

Profile                               Category    Frequency    Percentage
Learning Styles                       Auditory    67           32.4
                                      Tactile     64           30.9
                                      Visual      76           36.7
School Type                           Private     92           44.4
                                      Public      115          55.6
Academic performance in Mathematics   75-79       26           12.6
(2nd Quarter)                         80-84       66           31.9
                                      85-89       69           33.3
                                      90-100      46           22.2
Learning styles refer to the ability of learners to perceive and process information, which affects educational outcomes such as the effectiveness of the teaching and learning process (Ibrahim and Hussein, 2016).
Table 8 shows that Grade 7 students prefer to learn by seeing visuals such as graphs, pictures, and other visual instructional materials, as this style tallied the highest number of study respondents among the listed learning styles. This implies that students often learn better if they work in a quiet place and easily understand and follow directions written on the board or on paper. Students can sometimes picture the textbook page and where the answer is located.
The table also shows that Grade 7 students often understood how to do something if it was told to them, rather than reading it themselves. It was also shown that students often do their best in academic subjects by listening to lectures and tapes. Sometimes, students remember things they hear rather than things they see or read. The results imply that these students learn better when the lesson is delivered in an auditory manner, in the sense that it should be discussed by the teacher rather than just read.
Grade 7 students often think better when they have the freedom to move. Students often enjoy working with their hands or making things. They sometimes need to see someone else carry out an instruction before following it. The table shows that the tactile learning style tallied the least among Grade 7 students in the limited face-to-face classes in the Municipality of Victoria.
Among the learning styles, the visual learning style had the highest mean. This result is also evident in the research paper entitled "Assessment of visual, auditory and kinesthetic learning styles" published in the International Journal of Advanced Nursing Studies by Ibrahim and Hussein (2014), wherein visual learners dominated the study, tallying 40% of the 210 respondents from two universities. This might be because the students, as respondents of the study, might perform better with visual materials. Similarly, Apipah et al. (2018) analyzed mathematical connection ability based on students' learning styles in the VAK learning model with assessment among Grade VIII students of State Junior High School 9 Semarang and found that the visual learning style has the highest mathematical connection ability. Moreover, students with kinesthetic and auditory learning styles had average and the lowest mathematical connection abilities, respectively.
On the other hand, the study of Sakinah and Avip (2021) contradicts the claim of Apipah et al. (2018). In the study entitled "An analysis of students' mathematical literacy skills assessed from students' learning style," tactile (kinesthetic) learners were noted to be better than visual and auditory learners even though the study was dominated by visual learners. The skills of students with kinesthetic backgrounds were better than those of students with auditory and visual learning styles in terms of understanding and using mathematical concepts, and they were also more likely to formulate solutions to problems. However, those with auditory learning styles lacked the necessary skills to interpret and use mathematical concepts, while those with visual learning styles were more likely to interpret them. These differences in mathematical literacy skills and ability might serve as evidence that assessment validity and reliability are crucial with regard to the learning styles of students.
The limited face-to-face modality was participated in by private and public schools and implemented through DepEd Memorandum No. 071, s. 2021. Since the Municipality of Victoria was in a low-risk category for Covid-19 cases, all public and private schools were permitted to join the limited face-to-face classes. A total of 207 students from public and private schools qualified as respondents for the research. Fifty-five percent (55.6%) were from public schools and forty-four percent (44.4%) were from private schools, as shown in Table 8. Qualified respondents were Grade 7 students who were present at the time of the research and tallied with a specific learning style.
The National Achievement Test (NAT) is administered annually in both private and public schools. It is a standardized test created to identify students' success levels and opportunities for improvement in five major academic subjects after the school year. According to Department of Education data (DepEd, 2014), students performed poorly in the National Achievement Test (NAT). Data indicate that the mean percentage score (MPS) in the NAT was much below the desired competence level of 75% (Austria, 2020, Interactive SIM in Selected Topics in Algebra). In addition, the Grade 6 NAT results were at a low mastery level, as can be seen in the publication released on September 26, 2019. According to the 2018 NAT results, the national average mean percentage score (MPS) was the lowest it has ever been for a DepEd standardized test, at 37.44.
Students from public and private schools across the country contributed to the results of the National Achievement Tests. In the Grade 6 NAT 2009 overall mastery level, only 6.82% of the takers from public schools were described as closely approximating mastery, which is better than the 0.36% of takers from private schools. A total of 52.16% of the takers from public schools tallied moving towards mastery, compared to only 20.26% of the takers from private schools. On the other hand, a higher percentage of average mastery was tallied among private school takers (69.13%) compared to public school takers (37.76%). In addition, private school takers tallied a higher proportion of low mastery students than public school takers, with 10.25% and 3.23%, respectively (Benito, 2010).
On the other hand, a comparison of the overall achievement level of second year high school students in public and private schools in 2009 was also described. Of the takers from public schools, 0.13% tallied closely approximating mastery, while only 0.01% of private school takers fell under this mastery level. Twelve and thirty-five hundredths percent (12.35%) of the public school takers were described as moving towards mastery compared to 5.04% of the takers in private schools. Private schools tallied 71.69% of their takers described as average, compared to 67.89% of public school takers. Furthermore, 19.60%, 0.02%, and 0.01% of the public school takers were described as having low mastery, very low mastery, and absolutely no mastery, respectively. Meanwhile, 23.24% and 0.01% of private school takers were described as having low mastery and very low mastery, respectively.
The need to improve NAT results is evident. Thus, this study may serve as a basis for an additional process in validating standardized tests, wherever the students come from. The detection and removal of biased items in any standardized achievement test, such as the National Achievement Test, will greatly help increase the mathematical proficiency of the learners.
The academic performance in Mathematics of the students was gathered through documentary analysis. No students who got below 75 were documented, as shown in Table 8. Of the students, 33.3% tallied very satisfactory, while 31.9% tallied satisfactory. Table 8 also shows that there were 46 outstanding students and 26 fairly satisfactory students.
Students' grades were composed of three components prior to the pandemic: written works (40%), performance tasks (40%), and quarterly assessments (20%). Written work includes long tests, unit tests, or any activities that assess students' written skills in expressing their ideas. The performance task includes a skill presentation and demonstration; written works may also be included in this component. Finally, quarterly assessments are administered at the end of the quarter (DepEd Order No. 8, s. 2015). The learning continuity plan of DepEd was crafted to ensure the teaching and learning process amid the pandemic. Based on the interim guidelines for grading and assessment in light of the basic education learning continuity plan, students' mathematics achievement level was composed of written works (50%) and performance tasks (50%) in whatever form of modality. Summative tests continued in the form of written works and performance tasks (DepEd Order No. 031, s. 2020). Thus, the performance of Grade 7 students is based on the different modules answered by modular distance learners and the activities performed by online distance learners. Most students under this modality tallied satisfactory performance.
The academic performance in mathematics of the students during the second quarter was used to compute the concurrent validity of the achievement test by comparing it with the test scores.
A 60-item test was constructed using the Most Essential Learning Competencies crafted in the DepEd learning continuity plan 2020. To ensure the distribution of the items, a table of specifications was prepared.
Content Validity. Three (3) competent content validators checked the essentiality of each item of the Grade 7 mathematics achievement test. Polit and Beck (2006) and Polit et al. (2007) suggested that the content validity index of a question should be 1 if there are 3 to 5 experts who validate the questionnaire. Only 4 of the 60 items failed this criterion, as shown in Table 9.
Table 9
Content Validity of the Grade 7 Mathematics Achievement Test
n=60
Content Validation Frequency Percentage
Accepted (Essential) 56 93.3
Rejected (Non-essential) 4 6.7
Q51, Q57, Q58, and Q60 were non-essential questions based on the Most Essential Learning Competencies. Question 51 approximates the measures of quantities, particularly length, weight/mass, volume, time, angle, temperature, and rate; it should be noted that this item was repeated. Questions 57 and 58 illustrate the linear equation and the linear inequality in one variable, respectively. Question 60 is concerned with finding the solution of the linear equation or inequality in one variable. Validators believed that these questions did not align with the most essential learning competencies (MELCs). Although the results show that most of the items were accepted, questions 8, 10, 14, 22, 24, and 59 need to be revised based on their degree of clarity.
Content validity of the test items can also be examined through item analysis, which provides the discrimination and difficulty indices. The difficulty index is the percentage of the total number of students who answered the item correctly.
Table 10 shows that 37 questions reached the optimum level of difficulty in the Mathematics 7 achievement test, while question 16 tallied as the easiest question with a value of 76.8%. Among the 22 hard questions, Q50, Q17, and Q32 tallied as the most difficult (19.8%, 20.8%, and 24.2%, respectively). Question 16 (Q16) tackles the derivation of the laws of exponents, while question 50 (Q50) tackles solving problems involving equations and inequalities in one variable. Question 17 (Q17) addresses the illustration of the linear equation and inequality in one variable.
Table 10
Summary of Difficulty Index of the Grade 7 Mathematics Achievement Test
n=60
Difficulty Index Frequency Percentage
19.51 - 44.50 (Hard) 22 36.7
44.51-74.50 (Optimum) 37 61.7
74.51-89.5 (Easy) 1 1.6
The discrimination index was likewise used to describe the content validity of the achievement test. The index of discrimination is the difference between the percentage of correct responses in the upper group and the percentage of correct responses in the lower group. Higher values indicate good discrimination, while negative discrimination indicates that the item is easier for low-scoring respondents.
Table 11
Summary of Discrimination Index of the Grade 7 Mathematics Achievement Test
n=60
Discrimination Index Frequency Percentage
Change/Reconstruct(Below .30) 22 36.7
Acceptable/Retained(.30 and Above) 38 63.3
Thirty-eight questions (63.3%) were acceptable and retained, with a discrimination index of 0.30 and above. Questions 52, 23, 45, 46, and 15 tallied the highest and most acceptable discrimination indices, respectively. Q52 asks students to translate English phrases into mathematical phrases and English sentences into mathematical sentences, and vice versa. Q23, Q45, and Q46 ask the students to use models and algebraic methods to find the: (a) product of two binomials; (b) square of a binomial; (c) product of the sum and difference of two terms; (d) cube of a binomial; and (e) product of a binomial and a trinomial. Q15 asks students to deal with the subtraction of polynomials.
Questions 17, 50, and 51 tallied negative discrimination, indicating that these questions are easier for low-scoring respondents and need to be removed or changed. Q17 asked students to illustrate a linear equation and inequality in one variable. Q51 approximates the measures of quantities, particularly length, weight/mass, volume, time, angle, temperature, and rate. Q50 asked students to solve problems involving equations and inequalities in one variable. Questions 17 and 50 tallied as both the most difficult items and items with negative discrimination indices.
Item difficulty indicates the quality of the test items and of the test as a whole. With 61.7% of the items at the optimum level of difficulty, the target of 50% of the items was achieved. Q16, which tallied a high difficulty index value, implied that a greater proportion of the students answered the question correctly. Meanwhile, the 36.7% of the 60 items tagged as hard indicated that a smaller proportion of the students understood the question and answered correctly. According to Boateng et al. (2018), difficulty of an item may be due to the item being wrongly coded, ambiguity in the item, confusing language, or ambiguity in the response options. The study also suggests that a low difficulty value calls for item modification or deletion from the pool of items. Item difficulty is therefore relevant for deciding which items to retain. A positively discriminating item differentiates between those who are knowledgeable about a subject and those who are not (Boateng et al., 2018). The majority of the test questions on the achievement test in mathematics, tallying 57 out of 60, were positively discriminating; Boateng et al. (2018) recommended that weakly discriminating items among these be considered for revision, as the differences could be due to the level of difficulty of the item. On the other hand, poorly designed questions, those that the more knowledgeable get wrong and the less knowledgeable get right, are the negatively discriminating items Q17, Q50, and Q51, which should be re-examined and modified.
An item discrimination index was used to improve the test items. If an item is non-discriminating and fails to discriminate between respondents because it is too easy, too hard, or ambiguous, it should be removed. Questions that are negatively discriminating should be re-examined and modified, while positively discriminating items should be retained. Difficulty levels should be slightly higher than midway between chance and perfect scores for the item. An item will have low discrimination if it is so difficult that almost everyone gets it wrong or guesses the answer, or so easy that almost everyone gets it right (Office of Educational Assessment).
Construct Validity. The Kaiser-Meyer-Olkin (KMO) test described the sampling adequacy as middling, and Bartlett's test of sphericity was highly significant (χ²(1770) = 3389, p < 0.001), supporting the use of Principal Component Analysis (PCA). PCA showed a 21-factor solution based on Kaiser's Criterion and the Scree Plot Test, as shown in Figure 2, which provides support for the construct validity of the mathematics achievement test.
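A sketch of the factor-retention step is shown below; the random matrix stands in for the actual response data, and matplotlib is used only to draw the scree plot.

import numpy as np
import matplotlib.pyplot as plt

# A minimal sketch of Kaiser's Criterion and the scree plot: keep
# components whose eigenvalues exceed 1, and inspect the "elbow".
rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(207, 60)).astype(float)  # placeholder

corr = np.corrcoef(responses, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]   # descending order

n_factors = int(np.sum(eigenvalues > 1))    # Kaiser's Criterion
explained = eigenvalues[:n_factors].sum() / eigenvalues.sum()
print(f"Components retained: {n_factors}")
print(f"Variance explained by retained components: {explained:.1%}")

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Component"); plt.ylabel("Eigenvalue"); plt.title("Scree Plot")
plt.show()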
Table 12
KMO and Bartlett’s Sphericity Test of the Original Test Version
These factors accounted for 64.86% of the variance in scoring. Factor 1 was highly correlated with Q23, Q24, Q27, Q45, and Q46. Factor 2 was highly correlated with Q5, Q16, and Q34. Factor 3 was highly correlated with Q20, Q30, and Q33. Factor 4 was highly correlated with Q13 and Q37. Factor 5 was highly correlated with Q14 and Q57, while Factor 6 was highly correlated with Q49. Factor 7 was highly correlated with Q58, while Factor 8 was highly correlated with Q8 and Q41. Factor 9 was highly but negatively correlated with Q17.
Figure 2
Scree Plot of the Original Test Version
Factor 11 was highly correlated with Q32, while Factor 12 was highly correlated with Q11 and Q56. Factor 13 was highly correlated with Q10 and Q54, and Factor 14 with Q55. Factors 15 and 16 were highly correlated with Q15 and Q9, respectively. Moreover, Factors 18, 20, and 21 were highly correlated with Q60, Q1, and Q52, respectively.
Concurrent Validity. Concurrent validity was assessed using the current performance of the respondents associated with the raw scores garnered during the test. Using the Pearson product-moment correlation (r-value), the relationship between the achievement test and the second-quarter grade in Mathematics 7 was determined. An r-value of 0.500, as shown in Table 13, indicates a moderate positive correlation between the scores on the mathematics achievement test and the students' second-quarter grades. The scores account for 25% (0.500² = 0.25) of the variance in academic performance, as revealed by linear regression.
Table 13
Concurrent Validity of the Grade 7 Mathematics Achievement Test

Variable X                                              Variable Y: Score
                                                        r         Sig.
Academic performance in Mathematics (Second quarter)   .500**    .000
**Correlation was significant at the 0.01 level (2-tailed).
Pedrajita (2017) emphasized that concurrent validity is high if the test scores obtained by the students are highly correlated with their grade point average, and low if the test scores have a low magnitude of correlation with their grades. This may indicate that the test scores of Grade 7 students on the mathematics achievement test are moderately associated with their grades.
Internal Consistency Reliability. The test was given to students with different learning styles who were enrolled in private and public schools in the Municipality of Victoria. Internal consistency reliability examines the consistency of responses across all the individual items derived from a single test administration.
Table 14
Internal Consistency Reliability of the Mathematics Test
Reliability Test Value Interpretation
Kuder-Richardson 20 0.875 Good
(KR20)
Table 14 shows that the constructed achievement test in mathematics has a good internal consistency reliability (KR20 = 0.875). According to the Office of Educational Assessment, tests with high internal consistency consist of items with mostly positive relationships with total test scores. Thus, the constructed achievement test implies a positive relationship between its items and the Grade 7 students' total test scores. High reliability indicates that the questions of a test tend to pull together: students who answered a given question correctly were more likely to answer other questions correctly. Low reliability indicates that the questions tended to be unrelated to each other in terms of who answered them correctly.
Item bias analysis through Differential Item Functioning (DIF) analysis examines whether the construction of an index from two or more variables produces bias with respect to different criteria. DIF is a popular and effective way to study item bias: a statistically significant difference is observed across two or more groups of examinees due to characteristics of the item unrelated to the construct being measured. An item is considered positively or negatively biased for a group within a population if the average expected item score for that group is substantially higher or lower, respectively, than that for the overall population.
Bias items were first detected based on the students' type of school, either public (focal group) or private (reference group), as shown in Table 15. A total of 10 items displayed statistical bias, tallying significant MH chi-square values based on the type of school. These are questions 2, 15, 18, 20, 22, 29, 32, 46, 51, and 59. Q32 tallied the highest amount of Differential Item Functioning (DIF) with an effect size of absolute value 2.49, while Q15 tallied an effect size of absolute value 2.48. A statistically significant chi-square value indicates a large amount of Differential Item Functioning (DIF).
Table 15
Detected Bias Items Based on School Type
Item Chi-Square p-value Item Chi-Square p-value
Q1 0.618 0.432 Q31 3.661 0.056
Q2 4.904* 0.027 Q32 8.086* 0.004
Q3 0.822 0.365 Q33 0.375 0.540
Q4 0.055 0.815 Q34 0.131 0.717
Q5 0.171 0.679 Q35 0.979 0.323
Q6 0.073 0.788 Q36 0.000 0.988
Q7 1.218 0.270 Q37 0.012 0.914
Q8 1.134 0.287 Q38 0.070 0.791
Q9 0.050 0.823 Q39 3.831 0.050
Q10 0.002 0.963 Q40 1.105 0.293
Q11 0.411 0.521 Q41 3.259 0.071
Q12 0.002 0.963 Q42 0.006 0.938
Q13 1.506 0.220 Q43 2.222 0.136
Q14 2.748 0.097 Q44 0.269 0.604
Q15 10.56* 0.001 Q45 3.760 0.052
Q16 1.096 0.295 Q46 4.305* 0.038
Q17 0.044 0.834 Q47 0.020 0.889
Q18 5.724* 0.017 Q48 1.079 0.299
Q19 3.760 0.052 Q49 0.676 0.411
Q20 5.798* 0.016 Q50 1.317 0.251
Q21 0.107 0.744 Q51 6.512* 0.011
Q22 7.042* 0.008 Q52 0.495 0.482
Q23 2.271 0.132 Q53 2.473 0.116
Q24 0.543 0.461 Q54 0.629 0.428
Q25 0.575 0.448 Q55 0.137 0.711
Q26 0.689 0.407 Q56 0.000 0.988
Q27 3.090 0.079 Q57 3.095 0.079
Q28 2.241 0.134 Q58 0.089 0.766
Q29 6.259* 0.012 Q59 6.301* 0.012
Q30 1.909 0.167 Q60 2.045 0.153
Note: Items marked with * were statistically biased (MH chi-square value greater than 3.841).
Among the detected bias items, Q15, Q32, Q46, Q51, and Q59 were interpreted as difficult based on the item difficulty analysis. The discrimination index likewise suggests that Q15, Q32, and Q51 should be changed or reconstructed. The results only confirmed the weaknesses of these items.
All questions displayed statistical bias in favor of the focal group, which consisted of public school learners. Instructional practices and resources may have contributed to these differences in student performance. During the pandemic and post-pandemic periods, more seminars and workshops on the development of learning materials were provided and carried out in public schools than in private schools. Public schools developed supplementary learning materials, a compendium of notes, and instructional support (aside from the SLMs given by DepEd) to make teaching mathematics simpler for teachers and learning easier for students. Meanwhile, private schools relied solely on modules crafted by the department and on existing textbooks and reference books from before the pandemic. However, the detection of biased items with regard to school type may address the problem of test equity. Eliminating these biased items will ensure accurate evaluation among learners, regardless of their school type, which will contribute to recalibrating teaching practices and curriculum development. Thus, the need to conduct bias elimination for test standardization is evident.
Table 16 shows the detected item bias based on the students' learning styles, specifically between the auditory (reference group) and non-auditory (focal group) learners. Only three of the 60 items on the Mathematics 7 achievement test obtained a significant Mantel-Haenszel chi-square value. Q31 had the largest effect size among the items, 1.96, a large amount of DIF. Q57 and Q37 tallied large DIF effect sizes with absolute values of 1.82 and 1.75, respectively. Q31 approximates the measures of quantities, particularly length, weight/mass, volume, time, angle, temperature, and rate, and experts deemed it essential to the MELCs, the subject matter, and the level of the target respondents. This may imply that other factors contributed to its identification as a biased item. This is also true for Q37, which evaluates algebraic expressions for given values of the variables. On the other hand, experts suggested that Q57 is not essential to the MELCs and should be removed. Moreover, the discrimination index of the item suggests changing or reconstructing the question, although it reached the optimum level of difficulty.
Significantly biased items Q31, Q37, and Q57 favored the focal group (the non-auditory learning styles). These learners were either visual or tactile learners who tended to learn through visual aids and to solve problems through trial-and-error approaches. The most essential learning competencies of these identified items require students to approximate, evaluate, and illustrate, which might be learned through a rigid explanation of the concepts in class discussion with the use of illustrations, board work, and examples. With the learning materials (SLMs, LAS, and other printed materials) given to the learners, it can be assumed that during the pandemic and post-pandemic era, visual learners were favored over auditory learners. This aspect of teaching and learning affects learning outcomes, especially when designing tests that maximize equity and avoid biased items. Thus, the factors underlying item bias highlighted the need to conduct the study to ensure a fair assessment.
Table 16
Detected Bias Items as Based on Learning Style (Auditory vs Non-Auditory)
Item Chi-Square p-value Item Chi-Square p-value
Q1 1.839 0.175 Q31 7.100* 0.008
Q2 0.039 0.843 Q32 0.340 0.560
Q3 3.362 0.067 Q33 0.637 0.425
Q4 0.511 0.475 Q34 0.642 0.423
Q5 0.000 0.998 Q35 0.000 0.987
Q6 1.439 0.230 Q36 0.002 0.969
Q7 0.155 0.694 Q37 5.360* 0.021
Q8 2.302 0.129 Q38 0.081 0.776
Q9 0.622 0.430 Q39 1.974 0.160
Q10 2.785 0.095 Q40 0.029 0.865
Q11 3.363 0.067 Q41 1.295 0.255
Q12 1.067 0.302 Q42 2.994 0.084
Q13 0.390 0.532 Q43 0.013 0.910
Q14 0.260 0.610 Q44 0.010 0.919
Q15 0.023 0.881 Q45 0.863 0.353
Q16 1.083 0.298 Q46 0.659 0.417
Q17 0.890 0.346 Q47 0.569 0.451
Q18 0.238 0.626 Q48 0.878 0.349
Q19 0.723 0.395 Q49 0.066 0.797
Q20 0.071 0.790 Q50 0.687 0.407
Q21 1.914 0.166 Q51 0.084 0.771
Q22 1.105 0.293 Q52 0.012 0.912
Q23 0.107 0.743 Q53 1.274 0.259
Q24 0.818 0.366 Q54 1.846 0.174
Q25 0.371 0.542 Q55 0.001 0.976
Q26 0.043 0.836 Q56 0.477 0.490
Q27 0.669 0.413 Q57 5.564* 0.018
Q28 0.089 0.766 Q58 0.477 0.490
Q29 1.189 0.276 Q59 0.009 0.925
Q30 0.434 0.510 Q60 0.000 0.994
Note: Items marked with * were statistically biased (MH chi-square value greater than 3.841).
Between visual (reference group) and non-visual (focal group) learners, six questions were identified as biased items after tallying significant MH chi-square values. As shown in Table 17, Q60 tallied the largest amount of DIF with an effect size of absolute value 2.49, while Q9 and Q6 tallied absolute values of 2.27 and 2.17, respectively. Q11, Q31, and Q48 likewise tallied large amounts of DIF.
Table 17
Detected Bias Items as Based on Learning Style (Visual Vs Nonvisual)
Item Chi-Square p-value Item Chi-Square p-value
Q1 0.134 0.714 Q31 7.3298* 0.007
Q2 1.006 0.316 Q32 1.938 0.164
Q3 0.029 0.864 Q33 1.075 0.300
Q4 0.154 0.695 Q34 2.194 0.139
Q5 1.369 0.242 Q35 1.218 0.270
Q6 7.982* 0.005 Q36 0.004 0.951
Q7 3.676 0.055 Q37 0.552 0.457
Q8 0.947 0.330 Q38 0.001 0.979
Q9 6.764* 0.009 Q39 2.262 0.133
Q10 0.011 0.918 Q40 0.642 0.423
Q11 4.782* 0.029 Q41 0.018 0.892
Q12 0.091 0.762 Q42 3.697 0.055
Q13 0.001 0.974 Q43 0.002 0.963
Q14 3.805 0.051 Q44 0.506 0.477
Q15 0.017 0.896 Q45 0.122 0.727
Q16 3.049 0.081 Q46 0.006 0.936
Q17 2.311 0.128 Q47 0.002 0.965
Q18 1.738 0.187 Q48 7.184* 0.007
Q19 0.051 0.821 Q49 0.264 0.607
Q20 1.984 0.159 Q50 0.026 0.872
Q21 0.001 0.972 Q51 2.719 0.099
Q22 2.201 0.138 Q52 1.738 0.187
Q23 0.805 0.369 Q53 0.269 0.604
Q24 0.055 0.814 Q54 0.046 0.830
Q25 0.183 0.669 Q55 0.032 0.859
Q26 0.400 0.527 Q56 0.057 0.811
Q27 0.472 0.492 Q57 2.993 0.084
Q28 0.249 0.618 Q58 0.003 0.959
Q29 0.580 0.446 Q59 0.037 0.847
Q30 0.019 0.890 Q60 7.439* 0.006
Note: Items marked with * were statistically biased (MH chi-square value greater than 3.841).
It was suggested that Question 6 (Q6) be changed or reconstructed based on its discrimination index, although it reached the optimum level of difficulty. Experts believe that Q6 is essential to the MELCs, the subject matter, and the level of the target respondents, with an acceptable clarity of statement. It tackles illustrating and differentiating related terms in algebra: a. a^n, where n is a positive integer; b. constants and variables; c. literal coefficients and numerical coefficients; d. algebraic expressions, terms, and polynomials; e. number of terms, degree of the term, and degree of the polynomial.
Experts suggested that Q60 be removed from the mathematics achievement test because the item was found to be non-essential to the MELCs. The item was also described as hard based on the difficulty index and as needing to be changed or reconstructed in terms of the discrimination index through item analysis. These findings are contrary to those of Q48 and Q31, which were acceptable in terms of content validity, reached an optimum level on the difficulty index, and had acceptable discrimination indices. Questions 60 and 48 both address finding the solution of the linear equation or inequality in one variable. Questions 9 and 11, meanwhile, are essential to the MELCs, the subject matter, and the level of the target respondents. The said items needed to be changed or reconstructed based on their tallied discrimination indices, but only Q11 was tagged as hard based on the difficulty index of the test questions.
Out of the six questions, five items were biased in favor of the reference group, which consisted of visual learners: Q6, Q9, Q11, Q31, and Q48. Q60 was significantly biased toward the focal group, the non-visual learners. The results indicated that visual learners scored higher on the identified items, which may reflect the extent of representation in mathematics teaching. Representation, as defined by Mainali (2021) and Goldin (2001), is a sign or combination of signs, characters, diagrams, objects, pictures, or graphs that can be utilized in teaching and learning mathematics and can be done through verbal, graphic, algebraic, and numeric forms. Competency questions identified as biased in favor of visual learners were asked in forms that may contribute to differences in performance with regard to students' learning styles. Thus, test equity may be addressed by eliminating biased items from the original test version.
Table 18 shows the detected bias items through the Mantel-Haenszel method based on the learning styles of tactile and non-tactile learners. Q18, Q29, and Q60 were tagged as exhibiting differential item functioning. Based on the results, among the three questions, Q60 had the largest amount of DIF, with an MHD absolute value of 2.172, while Q18 and Q29 tallied DIF amounts with absolute values of 1.55 and 1.54, respectively. Question 60 focuses on finding the solution of the linear equation or inequality in one variable. It was interpreted as hard based on the difficulty index, and a discrimination index at the change/reconstruct level was recorded. Experts also agreed to remove the question because it was not essential to the MELCs, the subject matter, and the target respondents of the study.
Differentiating algebraic expressions, equations, and inequalities, and solving problems involving equations and inequalities in one variable were the focus of questions 18 and 29, respectively. Both items had acceptable difficulty indices and optimum discrimination indices. Experts believe that the items are essential to the MELCs, the subject matter, and the target respondents of the study, and the questions were clearly stated.
Table 18
Detected Bias Items as Based on Learning Style (Tactile Vs Non-Tactile)
Item Chi-Square p-value Item Chi-Square p-value
Q1 0.699 0.403 Q31 0.001 0.980
Q2 1.956 0.162 Q32 0.471 0.492
Q3 2.291 0.130 Q33 0.015 0.904
Q4 1.646 0.200 Q34 0.338 0.561
Q5 1.120 0.290 Q35 0.934 0.334
Q6 2.493 0.114 Q36 0.031 0.861
Q7 2.088 0.148 Q37 2.013 0.156
Q8 0.131 0.717 Q38 0.012 0.912
Q9 3.054 0.081 Q39 0.000 0.996
Q10 2.049 0.152 Q40 0.257 0.612
Q11 0.070 0.791 Q41 0.702 0.402
Q12 2.284 0.131 Q42 0.011 0.917
Q13 0.561 0.454 Q43 0.000 0.999
Q14 1.866 0.172 Q44 0.237 0.626
Q15 0.030 0.862 Q45 0.181 0.670
Q16 0.348 0.555 Q46 0.345 0.557
Q17 0.199 0.656 Q47 0.323 0.570
Q18 4.080* 0.043 Q48 2.879 0.090
Q19 0.226 0.635 Q49 0.016 0.900
Q20 1.101 0.294 Q50 0.671 0.413
Q21 1.639 0.201 Q51 1.618 0.203
Q22 0.111 0.739 Q52 2.001 0.157
Q23 0.206 0.650 Q53 0.200 0.654
Q24 0.269 0.604 Q54 3.065 0.080
Q25 0.000 0.994 Q55 0.000 0.999
Q26 0.089 0.766 Q56 0.089 0.766
Q27 0.001 0.969 Q57 0.188 0.665
Q28 0.004 0.947 Q58 0.361 0.548
Q29 4.186* 0.041 Q59 0.021 0.886
Q30 0.925 0.336 Q60 7.110* 0.008
Note: Items marked with * were statistically biased (MH chi-square value greater than 3.841).
Q18 and Q29 were statistically biased in favor of the focal group, the non-tactile learners (either auditory or visual learners), while Q60 was statistically biased in favor of the reference group, the tactile learners. Each tallied a significant MH chi-square value and a large amount of Differential Item Functioning (DIF). The results evidently show that the amounts of bias contained in the tagged items were all large DIF, which is also evident in the study of Villas (2019): 25 of 40 items on Probability and Statistics were tagged in that study as biased against females and highly English-proficient examinees using the same method, Mantel-Haenszel analysis. This may suggest that items flagged with DIF using the MH method mostly carry large DIF, which is harmful to the test. A critical review of large DIF items is therefore necessary.
Table 19 shows the summary of detected bias items using the Mantel-Haenszel chi-square analysis on the mathematics achievement test of Grade 7 students between types of school and learning styles. The ten items detected based on school type obtained positive Mantel-Haenszel Delta statistics (MHD), which implies that the DIFs were in favor of the public school students. These are questions 2, 15, 18, 20, 22, 29, 32, 46, 51, and 59. In addition, Q18 and Q29 obtained a positive MHD between tactile and non-tactile learners, signifying that DIF is in favor of the non-tactile learners, while Q60 tallied a negative MHD, signifying DIF in favor of the tactile learners.
On the other hand, questions 6, 9, 11, 31, 48, and 60 tallied large amounts of DIF between the visual and non-visual learners using the MH chi-square statistics. Questions 6, 9, 11, 31, and 48 obtained negative MHD values, signifying questions in favor of the visual learners, while Q60 obtained a positive MHD value, signifying DIF in favor of the non-visual learners. Positive MHD values between auditory and non-auditory learners were obtained for questions 31, 37, and 57, signifying that these biased questions favored the non-auditory learners.
Among the detected DIF items, Q51 and Q60 were unacceptable in terms of content: they are not essential to the Most Essential Learning Competencies (MELCs), the subject matter, and the level of Grade 7 students. They are also subject to change or reconstruction after tallying below the 0.30 discrimination index, and they were tagged as hard after tallying below 44.51 on the difficulty index. These results imply that questions 51 and 60 carry content-validity-related bias. Q57 may imply content validity and discrimination index related bias. Content-validity-related bias may also be implied for Q11, Q15, and Q32, which tallied below 44.51 on the difficulty index and below 0.30 on the discrimination index. Q6 and Q9, with an optimum level of difficulty and acceptable content but discrimination indices below 0.30, exhibit discrimination-index-related bias.
Moreover, even though they were tagged as statistically biased items with significant MH chi-square values, Q2, Q18, Q20, Q22, Q29, Q31, Q37, and Q48 have acceptable content validity as well as acceptable difficulty and discrimination indices. Thus, other factors may have contributed to these items being tagged as biased, such as questions that are not demographically and culturally holistic. It may also be due to linguistic and socioeconomic bias affecting some of the respondents, as cited in the Glossary of Education Reform. The test format may also contribute to bias that favors a specific learning style covered in the study.
The study implies that, aside from content validity and the difficulty and discrimination indices, item bias must be considered in developing assessments. A test developer must consider diversity in language, culture, and socioeconomic background.
Content Validity. Of the 60 test questions in the Grade 7 mathematics achievement test, 42 questions (70%) were retained after the detection of biased items using the Mantel-Haenszel chi-square analysis. Only one of the retained questions was suggested for removal by the experts because it was not essential to the MELCs, the subject matter, and the level of the Grade 7 respondents, as shown in Table 20. This sole question, Q58, illustrates the linear inequality in one variable.
Table 20
Content Validity of the Grade 7 Mathematics Achievement Test
After the Detection of Bias Items
n=42
Content Validation Frequency Percentage
Accepted (Essential) 41 97.6
Rejected (Non-essential) 1 2.4
Among the retained questions after the detection of biased items, 26 questions, or 61.9%, reached the optimum level of difficulty, while 15 questions were tagged as hard, as shown in Table 21. Only 2.4 percent, or one question, was tagged as easy.
Table 21
Summary of Difficulty Index of the Grade 7 Mathematics Achievement Test
After the Detection of Bias Items
n=42
Difficulty Index Frequency Percentage
19.51 - 44.50 (Hard) 15 35.7
44.51-74.50 (Optimum) 26 61.9
74.51-89.5 (Easy) 1 2.4
Table 22
Summary of Discrimination Index of the Grade 7 Mathematics Achievement Test
After the Detection of Bias Items
n=42
Discrimination Index Frequency Percentage
Acceptable/Retained(.30 and Above) 29 69.0
Change/Reconstruct(Below .30) 13 31.0
Table 23
Content Validity in terms of Difficulty and Discrimination Index of the Grade 7
Mathematics Achievement Test After the Detection of Bias Items
n=42
Item Analysis
Former Item No.    Difficulty Index (%)    Discrimination Index    New Item No.
Q1*** 37.20% 0.24 Remove
Q3 64.30% 0.43 Q1
Q4 44.90% 0.42 Q2
Q5 69.10% 0.4 Q3
Q7 50.20% 0.48 Q4
Q8*** 33.80% 0.04 Remove
Q10 44.90% 0.34 Q5
Q12 49.30% 0.15 Q6
Q13 53.10% 0.35 Q7
Q14 40.60% 0.41 Q8
Q16 76.80% 0.49 Q9
Q17** 20.80% -0.07 Remove
Q19 48.30% 0.43 Q10
Q21** 30.90% 0.14 Remove
Q23 53.10% 0.54 Q11
Q24 42.50% 0.44 Q12
Q25 69.60% 0.43 Q13
Q26 57.00% 0.37 Q14
Q27 54.10% 0.43 Q15
Q28 41.10% 0.32 Q16
Q30 52.70% 0.32 Q17
Q33 42.00% 0.34 Q18
Q34 58.50% 0.43 Q19
Q35 72.50% 0.43 Q20
Q36 51.70% 0.45 Q21
Q38 56.00% 0.41 Q22
Q39 57.00% 0.52 Q23
Q40 64.30% 0.45 Q24
Q41*** 25.10% 0.19 Remove
Q42 49.80% 0.42 Q25
Q43*** 31.90% 0.25 Remove
Q44 58.00% 0.5 Q26
Q45 48.30% 0.48 Q27
Q47 47.80% 0.24 Q28
Q49 48.30% 0.35 Q29
Q50*** 19.80% -0.05 Remove
Q52 56.50% 0.51 Q30
Q53*** 39.10% 0.18 Remove
Q54 46.40% 0.33 Q31
Q55*** 36.70% 0.19 Remove
Q56*** 43.00% 0.24 Remove
Q58*** 43.00% 0.22 Remove
**Items with poor and negative discrimination indices are considered removed items.
***Items with poor and unacceptable discrimination and difficulty indices are considered removed items.
This shows that the majority of the items were acceptable and retained, tallying 0.30 and above on the discrimination index. Only 13 of the retained items were tagged for change or reconstruction. Questions 1, 8, 41, 43, 50, 53, 55, 56, and 58 will be disregarded, as they tally a poor discrimination index (below 0.30) and at the same time do not meet the optimum difficulty level (44.51%-74.50%). Questions 17 and 21, with poor and negative discrimination indices, will likewise be excluded from the final revision of the Mathematics 7 achievement test.
Meanwhile, Q12 and Q47 tallied a poor discrimination index but a highly acceptable difficulty index; they are subject to inclusion in the final achievement test but need to be revised. A different action will be taken for questions 14, 16, 24, 28, and 33, whose difficulty indices are tagged as easy/hard but which have highly acceptable discrimination indices and are subject to revision.
Construct Validity. The KMO test described the sampling adequacy as middling, and Bartlett's test of sphericity was highly significant (χ²(861) = 1951, p < 0.001), supporting Principal Component Analysis (PCA). PCA showed a 14-factor solution based on Kaiser's Criterion and the Scree Plot Test, providing support for the construct validity of the mathematics achievement test. These factors accounted for 59.8% of the variance in scoring. Factor 1 was highly correlated with Q19, Q23, Q27, Q39, and Q44. Factor 2 was highly correlated with Q3, Q7, and Q40.
Q5, Q16, and Q34 were highly correlated with Factor 3, while Factor 4 was correlated with Q4, Q33, and Q42. Q24 and Q38 were highly correlated with Factor 5, and Q13 with Factor 6. Furthermore, Factor 7 was highly correlated with Q10 and Q54, and Q12 and Q56 were highly correlated with Factor 8.
Table 24
KMO and Bartlett’s Sphericity Test After the Detection of Bias Items
Test Value Description
Moreover, Factor 9 was highly correlated with Q8 and Q41. Q1 tallied a high negative correlation and Q47 a high positive correlation with Factor 10. Q21 was highly correlated with Factor 11, while Q17 was highly negatively correlated with Factor 12. Factor 13 was highly correlated with Q53, while Factor 14 was correlated with Q50 and Q58. No question was highly correlated with two or more factors; thus, no items measured the same construct.
Figure 3
Scree Plot of the Test After the Detection of Bias Items
Concurrent Validity. The concurrent validity of the achievement test of the Grade 7 students after the detection of biased items is shown in Table 25. An r-value of 0.530 indicates a moderate positive correlation between the performance of the students during the second quarter and their scores on the achievement test. The students' scores on the Mathematics achievement test account for 28.1% (0.530² ≈ 0.281) of the variance in their academic performance.
Table 25
Concurrent Validity of the Grade 7 Mathematics Achievement Test
After the Detection of Bias Items
Variable X                                               Variable Y: Score
                                                         r         Sig.
Academic performance in Mathematics (Previous quarter)  .530**    .000
**. Correlation was significant at the 0.01 level (2-tailed).
Pedrajita (2017) stated that the larger the concurrent validity, the less homogeneity there is among groups of test scores; the more homogeneous the test becomes, the less valid it becomes. The detection and removal of biased items using the Mantel-Haenszel chi-square analysis contributed to an increase in the concurrent validity value of 0.030, which indicates lower test homogeneity but a more valid set of test questions. Thus, this result signifies that the detection and removal of biased items increase the validity of a standardized test.
Internal Consistency Reliability. According to the Glossary of Education Reform, tests with more items have higher reliability. Pedrajita (2017) likewise emphasized that in the KR20 reliability analysis, the greater the length of the test, the higher the internal consistency reliability. After the detection of biased items using the Mantel-Haenszel chi-square analysis, with 42 of the 60 items retained, a decrease of 0.031 in the reliability coefficient was computed, as shown in Table 26.
Table 26
Internal Consistency Reliability of the Mathematics Achievement Test
After the Detection of Bias Items
Reliability Test Value Interpretation
Kuder-Richardson 20 0.844 Good
(KR20)
5. Comparison of Content, Construct, and Concurrent Validity of the Original and Revised Test Versions
The main target of the study was to produce a 30-item validated and reliable achievement test in Mathematics 7 free from bias based on the students' type of school and learning styles. After validating and detecting bias using the Mantel-Haenszel chi-square analysis, a total of 31 items were retained from the pool of 60 items covering the second quarter competencies.
Content Validity. From a total of four items that were not essential to the MELCs, the subject matter, and the level of Grade 7 students, no more items were tallied as non-essential by the experts. Thus, 100% of the retained items, a total of 31, were accepted as essential, as shown in Table 27.
Table 27
Comparison of Content Validity of the Grade 7 Mathematics Achievement
Original and Revised Test Versions
Original Test Revised Test
Content Validation (n=60) (n=31)
Frequency Percentage Frequency Percentage
Accepted (Essential) 56 93.3 31 100
Rejected (Non-essential) 4 6.7 0 0
The number of optimum questions decreased by 11 from the original test after the validity tests and the detection of bias items using the MH chi-square analysis, leaving 26 optimum questions retained. Only four items tagged as hard on the difficulty index remained, as shown in Table 28. The sole easy question, Q16, was retained but is subject to revision.
Table 28
Comparison of Difficulty Index of the Grade 7 Mathematics Achievement
Original and Revised Test Versions
Original Test Revised Test
Difficulty Index (n=60) (n=31)
Frequency Percentage Frequency Percentage
19.51 - 44.50 (Hard) 22 36.7 4 12.9
44.51-74.50 (Optimum) 37 61.7 26 83.87
74.51-89.5 (Easy) 1 1.7 1 3.23
Table 29 shows the reduction in the nine acceptable items based on the
discrimination index.
Table 29
Comparison of Discrimination Index of the Grade 7 Mathematics Achievement
Original and Revised Test Versions
Original Test Revised Test
Discrimination Index
Frequency Percentage Frequency Percentage
Change/Reconstruct(Below .30) 22 36.7 2 6.45
Acceptable/Retained(.30 and Above) 38 63.3 29 93.55
Two of the 22 questions tagged for change were modified to measure the same competencies and were included in the final revision of the test. The final achievement test for Mathematics 7 is now composed of 93.55 percent acceptable items and 6.45 percent reconstructed items from the original pool of questions. The two reconstructed questions are subject to revision and optimization through retesting as part of the recommendations.
In general, twenty-four of the retained questions, after the detection of biased items, tallied highly acceptable discrimination and difficulty indices. The questions automatically included in the achievement test in Mathematics 7 were Q3, Q4, Q5, Q7, Q10, Q13, Q19, Q23, Q25, Q26, Q27, Q30, Q34, Q35, Q36, Q38, Q39, Q40, Q42, Q44, Q45, Q49, Q52, and Q54.
Table 30
Itemized Difficulty and Discrimination Index of the Revised Version of the Grade 7
Mathematics Achievement Test
Former Item No.    Difficulty Index (%)    Discrimination Index    New Item No.
Q3 64.30% 0.43 Q1
Q4 44.90% 0.42 Q2
Q5 69.10% 0.4 Q3
Q7 50.20% 0.48 Q4
Q10 44.90% 0.34 Q5
Q12** 49.30% 0.15 Q6
Q13 53.10% 0.35 Q7
Q14* 40.60% 0.41 Q8
Q16* 76.80% 0.49 Q9
Q19 48.30% 0.43 Q10
Q23 53.10% 0.54 Q11
Q24* 42.50% 0.44 Q12
Q25 69.60% 0.43 Q13
Q26 57.00% 0.37 Q14
Q27 54.10% 0.43 Q15
Q28* 41.10% 0.32 Q16
Q30 52.70% 0.32 Q17
Q33* 42.00% 0.34 Q18
Q34 58.50% 0.43 Q19
Q35 72.50% 0.43 Q20
Q36 51.70% 0.45 Q21
Q38 56.00% 0.41 Q22
Q39 57.00% 0.52 Q23
Q40 64.30% 0.45 Q24
Q42 49.80% 0.42 Q25
Q44 58.00% 0.5 Q26
Q45 48.30% 0.48 Q27
Q47** 47.80% 0.24 Q28
Q49 48.30% 0.35 Q29
Q52 56.50% 0.51 Q30
Q54 46.40% 0.33 Q31
*Items with hard/easy item difficulty but with an acceptable index of discrimination are considered for inclusion but subject to revision
**Items with a poor discrimination index but a highly acceptable difficulty index are considered for inclusion
It also shows that questions 14, 16, 24, 28, and 33 were among the 16.13 percent of the questions that need revision to achieve their maximum level of acceptability in terms of difficulty. These questions tackle differentiating algebraic expressions, equations, and inequalities; deriving the laws of exponents; using models and algebraic methods to find the square of a binomial; finding the solution of the linear equation or inequality in one variable; and converting measurements from one unit to another in both Metric and English systems, respectively.
Among the 6.45 percent of the items that needed to be reconstructed based on the discrimination index, as shown in Table 30, were Q12 and Q47. These items ask students to evaluate algebraic expressions for given values of the variables and to find the solution of the linear equation or inequality in one variable, respectively.
Construct Validity. After the elimination of items, the revised test version's Kaiser-Meyer-Olkin (KMO) test yielded an index of 0.806, described as adequate, and Bartlett's test of sphericity was highly significant (χ²(496) = 1520, p < 0.001), suggesting support for Principal Component Analysis (PCA). PCA showed a 10-factor solution based on Kaiser's Criterion and the Scree Plot Test, which provides support for the construct validity of the mathematics achievement test.
Table 31
KMO and Bartlett’s Sphericity Test of the Revised Test Version
Test Value Description
Factor 1 was highly correlated with Q3, Q5, and Q16. Factor 2 was highly correlated with Q27 and Q45, and Factor 3 with Q4 and Q38. Factor 5 was highly correlated with Q13, while Q28 and Q47 were highly correlated with Factor 6. It was also found that Factor 7 was highly correlated with Q38 and Q40, and Factors 8, 9, and 10 were each highly correlated with single items, including Q12.
Figure 4
Scree Plot of the Revised Test Version
The original test version showed no items measuring the same construct. In the revised test version, however, Q38 tallied as highly correlated with two factors, Factor 4 and Factor 7; hence, the item measures the same construct across two factors and needs further review.
Concurrent Validity. The concurrent validity of the Mathematics 7 achievement test was again assessed by comparing the test results with the academic performance of the students in the second quarter. An increase of 0.053 indicates that, after the exclusion of the items identified as biased based on the Mantel-Haenszel chi-square analysis, as well as items with poor difficulty and discrimination indices, the revised version of the achievement test in Mathematics 7 has a higher correlation coefficient between the test scores on the achievement test and the students' academic performance in the second quarter.
Table 32
Comparison of Concurrent Validity of Grade 7 Original and Revised Mathematics
Achievement Test Versions
Variable X                                Original Test Score    Revised Test Score
                                          r        Sig.          r        Sig.
Academic performance in Mathematics
(Previous quarter)                        .500**   .000          .553**   .000
**. Correlation was significant at the 0.01 level (2-tailed).
The scores of the students in the revised test version of the Mathematics achievement test account for 30.6% of the variance in their performance during the second quarter, as revealed by linear regression, compared to the 25% of variance explained by the original test version. This also indicates that the group of test scores is less homogeneous on the revised test questions; thus, the test is more valid (Pedrajita, 2016). The result reflects that the validation process and the detection of item bias using the MH chi-square analysis increased the concurrent validity of the test. In terms of reliability, however, the revised test version decreased by 0.003 using the KR20 analysis. This type of reliability analysis indicates that the higher the number of retained (bias-free) items, the higher the internal consistency reliability (Pedrajita, 2017; Glossary of Education Reform).
Table 33
Internal Consistency Reliability of the Original and Revised Test Versions

                                Original Test                Revised Test
Reliability Test                Value    Interpretation      Value    Interpretation
Kuder-Richardson 20 (KR20)      0.875    Good                0.872    Good
From the 60 items on the mathematics achievement test, 31 items were retained after testing the content, construct, and concurrent validity and after the detection of bias items. Pedrajita's (2017) study makes a similar claim regarding the decrease in reliability: its reliability of 0.71 with 50 items decreased by 0.14 when 28 questions were retained using the Mantel-Haenszel analysis. Thus, a test may be made more reliable by increasing its length (Pedrajita, 2017; Ferguson & Takane, 1989).
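The KR-20 values in Table 33 follow the standard formula KR20 = (k / (k - 1)) × (1 - Σ pᵢqᵢ / σ²), where k is the number of items, pᵢ the proportion answering item i correctly, qᵢ = 1 - pᵢ, and σ² the variance of the total scores; the dependence on k is what explains the small drop when items are removed. A minimal sketch under those definitions, with a hypothetical 0/1 scored matrix scores:

    import numpy as np

    def kr20(scores):
        # scores: (n_students, n_items) array of dichotomous (0/1) item scores
        k = scores.shape[1]
        p = scores.mean(axis=0)                           # proportion correct per item
        q = 1.0 - p
        total_variance = scores.sum(axis=1).var(ddof=1)   # variance of students' total scores
        return (k / (k - 1)) * (1.0 - (p * q).sum() / total_variance)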
The final test version of the achievement test is composed of 31 items distributed across the 10-week Most Essential Learning Competencies (MELCs), as shown in Table 34. A sufficient number of items was tallied on weeks 4, 5, 9, and 10, while two more items are needed on week 1 and on weeks 7 and 8. There is an excess item over the target number of items on weeks 2, 3, and 6. Thus, the output recommends increasing the number of items in the original test version to meet the target number of items per week.
Table 34
Table of Specification of the Revised Test Version of the Achievement Test in Mathematics 7

Week   MELCs                                                        Hours   Target   Item Placement            Total
                                                                            Items    (Revised Test Numbers)
1      Approximates the measures of quantities, particularly          4       3      1                           1
       length, mass/weight, volume, time, angle, temperature,
       and rate.
2      Converts measurements from one unit to another in both         4       3      2, 3, 18, 20                4
       Metric and English systems; solves problems involving
       conversion of units of measurement.
3      Translates English phrases to mathematical phrases and         4       3      4, 5, 19, 21, 30            5
       English sentences to mathematical sentences, and vice
       versa; illustrates and differentiates related terms in
       algebra: (a) aⁿ where n is a positive integer, (b)
       constants and variables, (c) literal coefficients and
       numerical coefficients, (d) algebraic expressions, terms
       and polynomials, (e) number of terms, degree of the term,
       and the degree of the polynomial.
4      Evaluates algebraic expressions given values of the            4       3      6, 7, 23                    3
       variables; adds and subtracts polynomials.
5      Derives the laws of an exponent; multiplies and divides        4       3      9, 10, 26                   3
       polynomials.
6      Uses models and algebraic methods to find the: (a)             4       4      11, 12, 13, 27, 31          5
       product of two binomials, (b) square of a binomial, (c)
       product of the sum and difference of two terms, (d) cube
       of a binomial, (e) product of a binomial and trinomial.
7-8    Solves problems involving algebraic expressions;               8       6      8, 22, 24, 25               4
       differentiates algebraic expressions, equations, and
       inequalities; illustrates linear equations and
       inequality in one variable.
9-10   Finds the solution of a linear equation or inequality in       8       6      14, 15, 16, 17, 28, 29      6
       one variable; solves the linear equation or inequality
       in one variable involving absolute value by graphing and
       algebraic method; solves problems involving equations
       and inequalities in one variable.
Total                                                                40      31                                 31

Item distribution across learning levels: K = 2, C = 6, Ap = 8, An = 5, S = 4, E = 6 (Total = 31).
Remarks: K-Knowledge, C-Comprehension, Ap-Application, An-Analysis, S-Synthesis, E-Evaluation
Chapter 5
SUMMARY, CONCLUSIONS, AND RECOMMENDATIONS
This chapter presents the summary of the study, the conclusions derived from the findings, and the recommendations formulated from the data that were presented, analyzed, and interpreted.
Summary
The study was conducted to detect, analyze, and eliminate biased items in the
mathematics achievement test using the Mantel-Haenszel method as a basis for test
standardization.
It aimed to describe the students' profiles, such as type of school, learning styles,
and academic performance during the second quarter in Mathematics 7. This study used a
descriptive method to analyze the content, construct, and concurrent validity of a teacher-
made achievement test in mathematics. The internal consistency reliability and the
significantly biased items across the student profile are also described.
Three experts were asked to check the test content validity, and the item difficulty and discrimination indices were used to identify construct validity. The relationship between the students' academic performance and achievement test scores was used to describe the concurrent validity of the test. The Kuder-Richardson 20 was used to describe the internal consistency reliability of the achievement test. The Mantel-Haenszel chi-square was used to detect the bias items, while DIF analysis using the MH Delta was used to analyze the magnitude of the differential item functioning of the detected items.
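For reference, the two Mantel-Haenszel statistics named above can be sketched from first principles. This is a minimal sketch, not the study's exact computation: it assumes examinees have already been stratified into matched total-score levels, each contributing a 2×2 table of group (reference/focal) by response (correct/incorrect); the function and variable names are illustrative. The MH common odds ratio is converted to the ETS delta metric via MHD = -2.35 ln(α_MH).

    import numpy as np

    def mantel_haenszel(tables):
        # tables: one 2x2 table per matched score level,
        # rows = reference group / focal group, columns = correct / incorrect
        a_sum = expected = variance = num = den = 0.0
        for (a, b), (c, d) in tables:
            n = a + b + c + d
            a_sum += a
            expected += (a + b) * (a + c) / n
            variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
            num += a * d / n
            den += b * c / n
        chi_square = (abs(a_sum - expected) - 0.5) ** 2 / variance  # continuity-corrected MH chi-square
        alpha_mh = num / den                                        # MH common odds ratio
        mh_delta = -2.35 * np.log(alpha_mh)                         # ETS MH delta (MHD) metric
        return chi_square, alpha_mh, mh_delta

    # chi_square > 3.8415 flags an item at the 0.05 alpha level (df = 1),
    # the same criterion applied in this study.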
Salient Findings
1. The respondents of the study were grade 7 students from public and private schools in the Municipality of Victoria. Most of them were Visual learners, who tallied the greatest number, while Tactile learners tallied the least with 64 or 30.9% of the total learners. Most of the qualified learners came from public schools with 115 students or 55.6%, while 44.4% or 92 students were from private schools. Moreover, among these qualified learners, 33.3% and 31.9% had second quarter grades of 85-89 and 80-84, respectively. Only 12.6 percent or 26 students tallied grades of 75-79 for the second quarter.
2. Three competent content validators found that, out of the 60 items from the original test, four items were non-essential to the subject matter, the MELCs, and grade 7 students. These items were Q51, Q57, Q58, and Q60. Sixty-one point seven percent (61.7%) or 37 questions achieved an optimum level of difficulty index, while 22 questions or 36.7% of the test items were tagged as Hard questions. On the other hand, Q16 was the only item tagged as Easy in terms of the difficulty index. A total of 38 questions out of the 60 items on the original version of the test were suggested to be retained as they tallied an acceptable index of discrimination. Principal Component Analysis (PCA) was used to describe the construct validity of the achievement test, where a total of 21 factors described 64.86% of the total variance. Concurrent validity was described through the relationship between the scores on the original test version and the academic performance of the students in mathematics during the 2nd quarter; it was determined that there is a moderate positive correlation with an r-value of 0.500. In contrast, good internal consistency reliability was determined using the Kuder-Richardson 20, which tallied a value of 0.875.
3. Bias items were detected using the Mantel-Haenszel Chi-Square analysis with the students' profiles as a basis. A total of 10 questions displayed statistical bias (computed chi-square greater than the 3.8415 critical value at the 0.05 alpha level with one degree of freedom) based on the type of school. These are Questions 2, 15, 18, 20, 22, 29, 32, 46, 51, and 59. A positive Mantel-Haenszel delta (MHD) value indicates that the bias was against the public school students.
Between Auditory and non-auditory learning styles, only Q31, Q37, and Q57 out of the 60 questions were tagged as significantly biased against non-auditory learners, as they tallied positive MHD values. Between Visual and non-visual learning styles, out of the 60 items, only 6 items were found statistically biased. These were Questions 6, 9, 11, 31, 48, and 60. Q6, Q9, Q11, Q31, and Q37 were statistically biased against visual learners after tallying negative MHD values. In contrast, Q60 was tagged as biased against non-visual learners with a positive MHD. Q60 was also significantly biased against tactile learners, tallying a negative MHD. Q18 and Q29, together with Q60, were significantly biased between the tactile and non-tactile learners. These questions tallied beyond the critical value and also indicated a large amount of Differential Item Functioning (DIF) that displayed statistical bias. In total, the statistically biased items were Q2, Q6, Q9, Q11, Q15, Q18, Q20, Q22, Q29, Q31, Q32, Q37, Q46, Q48, Q51, Q57, Q59, and Q60.
4. After removing the 18 statistically biased items, Q58 was also suggested for removal by the experts after tallying a non-essential value on the MELCs, the subject matter, and the level of grade 7 students. The difficulty indices of the retained questions reached the optimum level, and twenty-nine questions reached the acceptable value of 0.30 and above on the discrimination index. Positive moderate concurrent validity was denoted by the achievement test after eliminating the statistically biased items, with a 0.530 r-value. A Good interpretation of internal consistency reliability was computed after tallying a 0.844 KR20 value for the mathematics achievement test.
5. The revised test version is composed of 31 questions after testing the validity and
detection of bias items. Four non-essential questions were removed from the original test
based on the experts' judgment on content validity. A total of 18 hard questions were removed, and only four hard questions were retained in the revised version. Twenty-six of the 31 questions on the revised test version were at the optimum level of the difficulty index, while the item tagged as Easy was also retained for optimization. In terms of the discrimination index, only 2 of the 22 questions below the 0.30 discrimination index on the original test version were retained in the revised test version. On the other hand, the remaining 29 questions on the revised test version came from the 38 questions with a discrimination index of 0.30 or greater on the original test version. PCA showed only
a 10-factor solution based on Kaiser’s Criterion and Scree Plot Test that provides support
for the construct validity of the mathematics achievement test, compared to 21 factors of
the original test version. A moderately positive correlation, with an increase of 0.053 in the r-value of the concurrent validity of the revised test version, was evident after the detection and elimination of the biased items. The greater the number of items, the larger the reliability value; thus, there was a 0.003 decrease in the internal consistency reliability of the revised test version of the achievement test.
Conclusions
As an outcome of the findings presented in the previous discussions, the following conclusions were drawn:
1. Based on the results, it is concluded that grade 7 students from public and private schools prefer to learn by seeing visuals such as graphs, pictures, and other visual instructional materials. This learning factor may have contributed to the biased items on the achievement test.
2. The validity and reliability process revealed that most of the developed items were crafted so that they measure what they are intended to measure, with consideration of the Most Essential Learning Competencies, the subject matter, and the level of grade 7 students. Thus, the majority of the questions were retained and subjected to bias elimination analysis.
3. Validated test questions are not necessarily free from bias, as revealed by the study. The detection and removal of these biased items strengthen the validity of the test.
4. Based on the results of the elimination of biased items, the test questions' validity and reliability are strengthened, as the process increases the percentage of valid items that can be retained in the final test version.
5. The final test version of the achievement test in Mathematics 7 has a higher percentage of items with acceptable validity than the original test version.
6. A decrease in internal consistency reliability was observed in the revised test version as the number of retained items decreased.
Recommendations
The following recommendations were formulated by the researcher based on the findings and conclusions of the study:
1. Teachers, as test developers, should include more identified student profiles affecting the test results when detecting and eliminating biased items.
2. The effectiveness of the classical way of validating a test was evident in the results of the study. Thus, the continued use of the process is still recommended in validating tests.
3. The detection and elimination of biased test items is a better add-on process in test validation. Thus, teachers, as test validators, should engage in detecting and eliminating biased items to strengthen the validity of the questionnaire.
4. Re-administration of the achievement test by test developers after the detection of biased
items should be performed to further optimize, refine, and purify the item content of the
test.