Measurement and Evaluation in Education
Dr. A. Sivakumar
G. Thirumoorthy
PREFACE

This book is specially prepared for the students and teachers of Master of Arts, Master of Science and Master of Philosophy in Educational Technology & Education, and Doctor of Philosophy in Educational Technology & Education. It consists of five units which deal with Measurement and Evaluation, Research Instruments, Construction of Tests, Methods of Reliability and Validity, and Tool Construction Procedure. This book is dedicated to all students. Suggestions and comments to improve the contents of this book are welcome.

Dr. A. Sivakumar, M.Sc., M.Ed., M.Phil., Ph.D.
Assistant Professor, KSR College of Education
and
G. Thirumoorthy, M.Sc., M.Ed., M.Phil.
Assistant Professor, Michael Job Memorial College of Education for Women

CONTENTS
Chapter-1: Measurement and Evaluation in Education
1.1. Measurement and Evaluation
1.2. Concept of Evaluation
1.3. Meaning of Evaluation
1.4. Item Formats
1.5. Multiple Choice
1.6. Matching
1.7. Guidelines for Item Preparation
References
Chapter-1
MEASUREMENT AND EVALUATION IN
EDUCATION
Nature of Measurement
• It should be quantitative in nature
• It must be precise and accurate (instrument)
• It must be reliable
• It must be valid
• It must be objective in nature
Measurement refers to the process by which the attributes or
dimensions of some physical object are determined. One exception
seems to be in the use of the word measure in determining the IQ
of a person. The phrase, “this test measures IQ” is commonly used.
The same usage is applied to measuring such things as attitudes or preferences. However,
when we measure, we generally use some standard instrument to
determine how big, tall, heavy, voluminous, hot, cold, fast, or straight
something actually is. Standard instruments refer to physical devices
such as rulers, scales, thermometers, pressure gauges, etc. We measure
to obtain information about what is. Such information may or may not be useful, depending on the accuracy of the instruments we use and our skill at using them. There are few such instruments in the social sciences that approach the validity and reliability of, say, a 12" ruler. We measure how big a classroom is in terms of square feet, we measure the temperature of the room by using a thermometer, and we use an ohm meter to determine the voltage, amperage, and resistance in a circuit. In all of these examples, we are not assessing anything; we are simply collecting information relative to some established rule or standard. Assessment is therefore quite different from measurement, and has uses that suggest very different purposes. When used in a learning objective, the definition provided on ADPRIMA for the behavioral verb measure is: to apply a standard scale or measuring device to an object, series of objects, events, or conditions, according to practices accepted by those who are skilled in the use of the device or scale. An important point in the definition is that the person be skilled in the use of the device or scale. For example, a person who has in his or her possession a working ohm meter, but does not know how to use it properly, could apply it to an electrical circuit, but the obtained results would mean little or nothing in terms of useful information.

The process of measurement, as the word implies, involves carrying out actual measurement in order to assign a quantitative meaning to a quality. Measurement is therefore a process of assigning numerals to objects, quantities or events in order to give quantitative meaning to such qualities. In the classroom, to determine a child's performance, you need to obtain quantitative measures on the individual scores of the child. If the child scores 80 in Mathematics, there is no other interpretation you should give it; we cannot say the student has passed or failed. Measurement stops at ascribing the quantity; it does not make a value judgment on the child's performance.

1.2. CONCEPT OF EVALUATION
• Judgment forming
• Decision making
• Evaluation is a concept that has emerged as a prominent process of assessing, testing and measuring. Its main objective is qualitative improvement.
• Evaluation is a process of making value judgements over a level of performance or achievement. Making value judgements in the evaluation process presupposes a set of objectives.
• Evaluation is the process of determining the extent to which the objectives are achieved.
• Evaluation is concerned not only with the appraisal of achievement, but also with its improvement.
• Evaluation is continuous and dynamic. Evaluation helps in forming the following decisions.

Types of Decisions
• Instructional
• Curricular
• Selection
• Placement or Classification
• Personal

1.3. MEANING OF EVALUATION
• It is a technique by which we come to know to what extent the objectives are being achieved.
• It is a decision-making process which assists in grading and ranking.
According to Barrow and McGee, it is the process of education that involves collection of data from the products which can be used for comparison with preconceived criteria to make judgments.
6. Evaluation includes all the means of collecting information about the student's learning. The evaluator should make use of tests, observation, interviews, rating scales, check lists and value judgement.

Measurement Scales
Measurement scales are used to categorize and/or quantify variables. The four scales of measurement commonly used in statistical analysis are the nominal, ordinal, interval, and ratio scales.

Properties of Measurement Scales
Each scale of measurement satisfies one or more of the following properties of measurement: identity, magnitude, equal intervals, and a minimum value of zero.

Examples of Nominal Scales
Notice that all of these scales are mutually exclusive (there is no overlap) and none of them have any numerical significance. A good way to remember all of this is that "nominal" sounds a lot like "name", and nominal scales are kind of like "names" or labels.
Note: a sub-type of nominal scale with only two categories (e.g. male/female) is called "dichotomous." If you are a student, you can use that to impress your teacher.
Ordinal Scale of Measurement
With ordinal scales, it is the order of the values that is important and significant, but the differences between each one are not really known. Take a look at the example below. In each case, we know that a #4 is better than a #3 or #2, but we don't know, and cannot quantify, how much better it is. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We can't say. Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, etc.
"Ordinal" is easy to remember because it sounds like "order", and that is the key to remember with ordinal scales: it is the order that matters, but that is all you really get from these.
Advanced note: the best way to determine central tendency on a set of ordinal data is to use the mode or median; the mean cannot be defined from an ordinal set.

Interval Scale of Measurement
With interval scales, we know not only the order of the values but also the exact value between units. The classic example of an interval scale is Celsius temperature, because the difference between each value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees. Time is another good example of an interval scale in which the increments are known, consistent, and measurable.
Interval scales are nice because the realm of statistical analysis on these data sets opens up. For example, central tendency can be measured by mode, median, or mean; standard deviation can also be calculated.
Like the others, you can remember the key points of an "interval scale" pretty easily. "Interval" itself means "space in between," which is the important thing to remember: interval scales not only tell us about order, but also about the value between each item.
Here is the problem with interval scales: they don't have a "true zero." For example, there is no such thing as "no temperature" on the Celsius scale. Without a true zero, it is impossible to compute ratios. With interval data, we can add and subtract, but cannot multiply or divide. Confused? Ok, consider this: 10 degrees + 10 degrees = 20 degrees. No problem there. 20 degrees is not twice as hot as 10 degrees, however, because there is no such thing as "no temperature" when it comes to the Celsius scale. I hope that makes sense. Bottom line, interval scales are great, but we cannot calculate ratios, which brings us to our last measurement scale…
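To make the "no true zero" point concrete, the following is a minimal added illustration (not part of the original text); the temperature values are arbitrary examples. Converting Celsius, an interval scale, to Kelvin, a ratio scale with a true zero, shows why "twice as hot" is not a meaningful statement about Celsius readings.

```python
# Illustrative sketch: why ratios are meaningless on an interval scale
# (Celsius) but meaningful on a ratio scale (Kelvin). The temperatures
# used here are arbitrary example values.

def celsius_to_kelvin(c: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0

# Differences are meaningful on both scales (equal intervals):
print(t2_c - t1_c)                                        # 10.0 degrees, a real difference

# A naive ratio on the Celsius scale suggests "twice as hot":
print(t2_c / t1_c)                                        # 2.0, but this is not physically meaningful

# On the Kelvin scale, which has a true zero, the ratio is meaningful:
print(celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c))  # about 1.035, only ~3.5% "hotter"
```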
Ratio Scale of Measurement
Ratio scales are the ultimate nirvana when it comes to measurement scales because they tell us about the order, they tell us the exact value between units, and they also have an absolute zero, which allows for a wide range of both descriptive and inferential statistics to be applied. At the risk of repeating myself, everything above about interval data applies to ratio scales; in addition, ratio scales have a clear definition of zero. Good examples of ratio variables include height and weight.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be meaningfully added, subtracted, multiplied and divided (ratios). Central tendency can be measured by mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of variation, can also be calculated from ratio scales.
(Figure: a device providing two examples of ratio scales, height and weight.)
The ratio scale of measurement satisfies all four of the properties of measurement: identity, magnitude, equal intervals, and a minimum value of zero. The weight of an object would be an example of a ratio scale. Each value on the weight scale has a unique meaning, weights can be rank ordered, units along the weight scale are equal to one another, and the scale has a minimum value of zero. Weight scales have a minimum value of zero because objects at rest can be weightless, but they cannot have negative weight.
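The scale type determines which summary statistics are defensible. The following short sketch is an added illustration, not part of the original text; the variable names and sample values are invented. It simply restates the conventional pairings described above in runnable form.

```python
# Illustrative sketch of which summary statistics suit each scale of
# measurement, as described above. Sample data are invented examples.
from statistics import mode, median, mean, stdev

nominal  = ["red", "blue", "blue", "green"]          # labels only
ordinal  = [1, 2, 2, 3, 4]                           # ranks: 1 = Unhappy ... 4 = Very Happy
interval = [10.0, 20.0, 25.0, 30.0]                  # Celsius temperatures (no true zero)
ratio    = [55.0, 60.0, 72.5, 80.0]                  # weights in kg (true zero)

print(mode(nominal))               # nominal: mode is the only sensible "average"
print(median(ordinal))             # ordinal: median (or mode); the mean is not defined
print(mean(interval))              # interval: mean and standard deviation are meaningful...
print(stdev(interval))
print(ratio[3] / ratio[0])         # ...but only ratio data support statements such as
                                   # "80 kg is about 1.45 times 55 kg"
print(stdev(ratio) / mean(ratio))  # coefficient of variation (ratio scale only)
```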
Characteristics of a Good Evaluation Tool
1. Objective-basedness: Evaluation is making a judgement about some phenomenon or performance on the basis of some pre-determined objectives. Therefore a tool meant for evaluation should measure attainment in terms of criteria determined by instructional objectives. This is possible only if the evaluator is definite about the objectives, the degree of realization of which he is going to evaluate. Therefore each item of the tool should represent an objective.
2. Comprehensiveness: A tool should cover all points expected to be learnt by the pupils. It should also cover all the pre-determined objectives. This is referred to as comprehensiveness.
3. Discriminating power: A good evaluation tool should be able to discriminate among the respondents on the basis of the phenomena measured. Hence, while constructing a tool for evaluation, the discriminating power has to be taken care of. This may be at two levels: first for the test as a whole and then for each item included (a simple item-level index is sketched after this list).
4. Reliability: Reliability of a tool refers to the degree of consistency and accuracy with which it measures what it is intended to measure. If the evaluation gives more or less the same result every time it is used, such evaluation is said to be reliable. Consistency of a tool can be improved by limiting subjectivity of all kinds. Making items on the basis of pre-determined specific objectives, ensuring that the expected answers are definite and objective, providing a clearly spelt-out scheme for scoring, and conducting evaluation under identical and ideal conditions will help in enhancing reliability. The test-retest method, the split-half method and the equivalent form or parallel form method are the important methods generally used to determine the reliability of a tool (see the sketch after this list).
5. Validity: Validity is the most important quality needed for an evaluation tool. If the tool is able to measure what it is intended to measure, it can be said that the tool is valid. It should fulfil the objectives for which it is developed. Validity can be defined as "the accuracy with which it measures what it is intended to measure, or the degree to which it approaches infallibility in measuring what it purports to measure." Content validity, predictive validity,
construct validity, concurrent validity, congruent validity, factorial validity, criterion-related validity, etc. are some of the important types of validity which need to be fulfilled by a tool for evaluation.
6. Objectivity: A tool is said to be objective if it is free from personal bias in interpreting its scope as well as in scoring the responses. Objectivity is one of the primary pre-requisites for maintaining all the other qualities of a good tool.
7. Practicability: A tool, however well it satisfies all the above criteria, may be useless if it is not practically feasible. For example, suppose that, in order to ensure comprehensiveness, it was felt that a thousand items should be given, to be answered in ten hours. This might yield a valid result, but from a practical point of view it is quite impossible.
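As a rough illustration of the item-level discrimination index and the split-half (Spearman-Brown) reliability estimate mentioned in points 3 and 4 above, here is a minimal added sketch. It is not part of the original text; the 0/1 score matrix, group sizes and helper names are invented, and real test analysis would use larger samples and dedicated software.

```python
# Minimal sketch of two classical test statistics discussed above.
# The 0/1 score matrix (rows = students, columns = items) is invented.

scores = [
    [1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0, 0],
]

totals = [sum(row) for row in scores]
order = sorted(range(len(scores)), key=lambda s: totals[s], reverse=True)
half = len(scores) // 2
upper, lower = order[:half], order[-half:]          # high- and low-scoring groups

def discrimination(item: int) -> float:
    """Item discrimination index D = p(upper group correct) - p(lower group correct)."""
    p_upper = sum(scores[s][item] for s in upper) / len(upper)
    p_lower = sum(scores[s][item] for s in lower) / len(lower)
    return p_upper - p_lower

print([round(discrimination(i), 2) for i in range(len(scores[0]))])

# Split-half reliability: correlate odd-item and even-item half scores,
# then step the estimate up with the Spearman-Brown prophecy formula.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

odd_half = [sum(row[0::2]) for row in scores]
even_half = [sum(row[1::2]) for row in scores]
r_half = pearson(odd_half, even_half)
spearman_brown = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(spearman_brown, 2))
```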
Formative and Summative Evaluation
Probably the most basic distinction is that between formative evaluation and summative evaluation. Defining the terms "formative" and "summative" does not have to be difficult, yet the definitions have become confusing in the past few years. This is especially true for formative assessment. In a balanced assessment system, both summative and formative assessments are an integral part of information gathering. Depend too much on one or the other and the reality of student achievement in your classroom becomes unclear.
Formative evaluation: This is evaluation that is carried out while a course, curriculum, educational package, etc. is actually being developed, its main purpose being to find out whether it needs to be improved and, if so, whether it realistically can be. The key feature of all such evaluation is that it is designed to bring about improvement of the course, curriculum or educational package while it is still possible to do so, i.e. while the material has not yet been put into its operational form. In the case of a major course that is to be run throughout a country or internationally, such evaluation must clearly be carried out before the course design is finalised, the necessary resource materials are mass produced, and the course is implemented. In the case of an educational package, it must be carried out before the final package is published.
Summative evaluation: This is evaluation that is carried out once the development phase of a course, curriculum, educational package, etc. has been completed, and the course, curriculum or package is ready to use in its final form. The object of such evaluation is to determine whether it meets its design criteria, i.e. whether it does the job for which it was designed. Summative evaluation may also be carried out in order to compare one course, curriculum, educational package, etc. with another (or several others), e.g. to compare the relative effectiveness of two different courses in the same general area, or to determine which of a number of different textbooks is most suitable for use in a particular course. In such evaluation, the object is not to improve the courses or textbooks being evaluated; rather, it is to choose between them.

Formative and Summative Assessments in the Classroom
Summative Assessments: These are given periodically to determine, at a particular point in time, what students know and do not know. Many associate summative assessments only with standardized tests such as state assessments, but they are also used in, and are an important part of, district and classroom programs. Summative assessment at the district/classroom level is an accountability measure that is generally used as part of the grading process. The list is long, but here are some examples of summative assessments:
• State assessments
• District benchmark or interim assessments
• End-of-unit or chapter tests
• End-of-term or semester exams
• Scores that are used for accountability for schools (AYP) and students (report card grades).
The key is to think of summative assessment as a means to gauge, at a particular point in time, student learning relative to content standards. Although the information that is gleaned from this type of assessment is important, it can only help in evaluating certain aspects of the learning process. Because they are spread out and occur after instruction every few weeks, months, or once a year, summative assessments are tools to help evaluate the effectiveness of programs, school improvement goals, alignment of curriculum, or student placement in specific programs. Summative assessments happen too far down the learning path to provide information at the classroom level and to make instructional
adjustments and interventions during the learning process. It takes formative assessment to accomplish this.
Formative Assessment: This is part of the instructional process. When incorporated into classroom practice, it provides the information needed to adjust teaching and learning while they are happening. In this sense, formative assessment informs both teachers and students about student understanding at a point when timely adjustments can be made. These adjustments help to ensure students achieve targeted standards-based learning goals within a set time frame. Although formative assessment strategies appear in a variety of formats, there are some distinct ways to distinguish them from summative assessments.
One distinction is to think of formative assessment as "practice." We do not hold students accountable in "grade book fashion" for skills and concepts they have just been introduced to or are learning. We must allow for practice. Formative assessment helps teachers determine next steps during the learning process as the instruction approaches the summative assessment of student learning. A good analogy for this is the road test that is required to receive a driver's license. What if, before getting your driver's license, you received a grade every time you sat behind the wheel to practice driving? What if your final grade for the driving test was the average of all of the grades you received while practicing? Because of the initial low grades you received during the process of learning to drive, your final grade would not accurately reflect your ability to drive a car. In the beginning of learning to drive, how confident or motivated to learn would you feel? Would any of the grades you received provide you with guidance on what you needed to do next to improve your driving skills? Your final driving test, or summative assessment, would be the accountability measure that establishes whether or not you have the driving skills necessary for a driver's license, not a reflection of all the driving practice that leads to it. The same holds true for classroom instruction, learning, and assessment.
Another distinction that underpins formative assessment is student involvement. If students are not involved in the assessment process, formative assessment is not practiced or implemented to its full effectiveness. Students need to be involved both as assessors of their own learning and as resources to other students. There are numerous strategies teachers can implement to engage students. In fact, research shows that involvement in and ownership of their work increases students' motivation to learn. This does not mean the absence of teacher involvement. To the contrary, teachers are critical in identifying learning goals, setting clear criteria for success, and designing assessment tasks that provide evidence of student learning.
One of the key components of engaging students in the assessment of their own learning is providing them with descriptive feedback as they learn. In fact, research shows descriptive feedback to be the most significant instructional strategy to move students forward in their learning. Descriptive feedback provides students with an understanding of what they are doing well, links to classroom learning, and gives specific input on how to reach the next step in the learning progression. In other words, descriptive feedback is not a grade, a sticker, or "good job!" A significant body of research indicates that such limited feedback does not lead to improved student learning.
There are many classroom instructional strategies that are part of the repertoire of good teaching. When teachers use sound instructional practice for the purpose of gathering information on student learning, they are applying this information in a formative way. In this sense, formative assessment is pedagogy and clearly cannot be separated from instruction. It is what good teachers do. The distinction lies in what teachers actually do with the information they gather.
Some of the instructional strategies that can be used formatively include the following:
• Criteria and goal setting: Setting criteria and goals with students engages them in instruction and the learning process by creating clear expectations. In order to be successful, students need to understand and know the learning target/goal and the criteria for reaching it. Establishing and defining quality work together, asking students to participate in establishing norm behaviors for classroom culture, and determining what should be included in criteria for success are all examples of this strategy. Using student work, classroom tests, or exemplars of what is expected helps students understand where they are, where they need to be, and an effective process for getting there.
• Observations: These go beyond walking around the room to see if students are on task or need clarification. Observations assist teachers in gathering evidence of student learning to inform instructional
planning. This evidence can be recorded and used as feedback for students about their learning or as anecdotal data shared with them during conferences.
• Questioning strategies: These should be embedded in lesson/unit planning. Asking better questions allows an opportunity for deeper thinking and provides teachers with significant insight into the degree and depth of understanding. Questions of this nature engage students in classroom dialogue that both uncovers and expands learning. An "exit slip" at the end of a class period to determine students' understanding of the day's lesson, or quick checks during instruction such as "thumbs up/down" or "red/green" (stop/go) cards, are also examples of questioning strategies that elicit immediate information about student learning. Helping students ask better questions is another aspect of this formative assessment strategy.
• Self and peer assessment: This helps to create a learning community within a classroom. Students who can reflect while engaged in metacognitive thinking are involved in their learning. When students have been involved in criteria and goal setting, self-evaluation is a logical step in the learning process. With peer evaluation, students see each other as resources for understanding and checking for quality work against previously established criteria.
• Student record keeping: This helps students better understand their own learning as evidenced by their classroom work. This process of students keeping ongoing records of their work not only engages students, it also helps them, beyond a "grade," to see where they started and the progress they are making toward the learning goal.
All of these strategies are integral to the formative assessment process, and they have been suggested by models of effective middle school instruction.

Balancing Assessment
As teachers gather information/data about student learning, several categories may be included. In order to better understand student learning, teachers need to consider information about the products (paper or otherwise) students create and tests they take, observational notes, and reflections on the communication that occurs between teacher and student or among students. When a comprehensive assessment program at the classroom level balances formative and summative student learning/achievement information, a clear picture emerges of where a student is relative to learning targets and standards. Students should be able to articulate this shared information about their own learning. When this happens, student-led conferences, a formative assessment strategy, are valid. The more we know about individual students as they engage in the learning process, the better we can adjust instruction to ensure that all students continue to achieve by moving forward in their learning.

Uses of Evaluation
Students who are completing or have completed a course have observed a teacher for many hours and are in a position to provide potentially useful information concerning the teacher's effectiveness. Some of this information might be difficult or costly to obtain through other channels. Student evaluations provide student perspectives on teacher performance which may be intrinsically valuable for:
1. Continually sensitizing or reminding teachers that students are their customers.
2. Encouraging faculty members to devote the time and effort necessary for good teaching.
3. Serving as a diagnostic tool to identify weaknesses in teacher effectiveness; and
4. Aiding in self-improvement.
The basic goal for the use of student evaluations of teachers is to contribute to high quality in teaching. Student evaluations alone will not provide sufficient information to judge faculty performance in all dimensions of teaching, but student evaluations can provide a triggering mechanism for the identification of superior and unsatisfactory teachers. In this regard, student evaluations can play only a part in helping to make useful distinctions among teachers for:
1. Promotion and tenure decisions
2. Salary increases and
3. Improvement or removal of unsatisfactory teachers.
Student evaluations of teachers may be influenced by such factors as:
• Course rigor
• Class size
• Gender composition of the students in the class
• The student's expected grade in the course
• Course content
• Whether the course is a required or an elective course
• Class level
• The instructor's professorial rank

Types of Evaluation
There are two types of school evaluation:
• Self-evaluation: This is an internal process of school self-reflection, whereby the school carries out a systematic examination of the outcomes of its own agreed courses of action. The school may use an external adviser to assist the self-evaluation. This person may be the school's existing facilitator, or a critical friend (i.e. an outside person chosen by the school, or a teacher not involved in the particular issue being self-evaluated). Such a person may bring objectivity to the exercise. The focus of these guidelines is on self-evaluation.
• External evaluation: This is an evaluation carried out by an external body (e.g. the Dept. of Education & Science, or the school's trustees in relation to issues such as religious formation, finance, and plant management). The School Development Plan can be a valuable resource in this context, as it can give the school the confidence to participate in such external evaluations.

Meaning and Definition of Evaluation Approach
Evaluation approaches are conceptually distinct ways of thinking about, designing, and conducting evaluation efforts. Many of the evaluation approaches in use today make unique contributions to solving important problems, while others refine existing approaches in some way.
Classification systems intended to sort out unique approaches from variations on a theme are presented here to help identify some basic schools of thought for conducting an evaluation. After these approaches are identified, they are summarized in terms of a few important attributes.
Since the mid-1960s, the number of alternative approaches to conducting evaluation efforts has increased dramatically. Factors such as the United States Elementary and Secondary Education Act of 1965, which required educators to evaluate their efforts and results, and the growing public concern for accountability of human service programs contributed to this growth. In addition, over this period of time there has been an international movement towards encouraging evidence-based practice in all professions and in all sectors. Evidence Based Practice (EBP) requires evaluations to deliver the information needed to determine what the best way of achieving results is.

Steps in Evaluation Approach
Evaluation on a broad level is helpful in examining the influence of courses of action on:
• Core issues such as mission, vision, and school aims
• Learning and teaching
• Perceived changes in the climate or environment facing the school
• Planning structures, e.g. task groups, steering group etc.
Specifically, self-evaluation enables the school to:
• Measure the progress of implementation of courses of action
• Examine the impact of these on:
  • The whole school
  • The classroom
  • The individual student and teacher
• Identify areas of success, or areas which require adjustment for future success
• Establish ongoing effective planning
• Write the Annual Report. Apart from being a requirement of some trustee groups in voluntary secondary schools, this is now a requirement under S.20 of the Education Act 1998.
Preliminary Steps in Self-evaluation
The engagement of the stakeholders in the school planning process, where appropriate to an issue, is important. Stakeholders are also known as the school partners. They comprise:
• Patrons - Owners and Trustees
• Board of Management - Appointed by the patron after nomination by the owners, parents and teachers as appropriate
• Staff - Teachers and Support Staff
• Parents - Parents' Association and general parent body
• Students - Students' Council and general student body
• Local Community - Supporters of and participants in the education services of the school.
In advance of undertaking self-evaluation successfully, the school may address the following through the appropriate partners. Ideally this occurs during the design stage of the SDP:
• Philosophy: Set of beliefs among the partners about the intrinsic value of self-evaluation
• Procedures: Means of successfully putting philosophy into action
• Criteria: Statements of desired outcomes used as the basis for measuring success
• Evidence: Information collected to indicate the level of success based on the criteria

Self-Evaluation Tools
Quantitative
• Desk Research
• Closed Questionnaires
• Checklists
• Standard Forms
• Logs, Diaries, Recordings etc.
• Evaluation Grids
Qualitative
• SCOT Analysis
• Open Questionnaires
• Interviews
• Force Field Analysis
• Critical Incident Analysis
• Analysis of Self-Evaluation Profile

Desk Research - use of documentary evidence, e.g. homework journals, copies, exam results, rolls, etc.
Field Research - surveying school partners as appropriate:
• Questionnaires - closed & open
• Checklists - narrow & sharpen focus
• Interviews - structured & unstructured, individual & group
• Standard Forms - promote consistency of data recording
• Logs - diaries, video recordings etc.
• SCOT Analysis - good basis for group discussion
• Evaluation Grids - record interaction between variables.

Further Tools
Apart from the desk research and field research tools which may have already been used during the review stage of the planning process, the following field research tools are useful:
1. Force Field Analysis
2. Spot Check
3. Critical Incident Analysis
4. Self-Evaluation Profile
5. Summative Evaluation Tool

1. Force Field Analysis: The user is asked to identify three things which help and three things which hinder the successful outcome of a specific issue, e.g. ability to understand the teacher.
Use
• It is useful as a means of identifying progress of implementation as well as providing information on the individual/classroom experience.
Advantages
• The teacher can administer this tool quite easily in her/his classroom
• It gives a quick view of the issues affecting the student, and can act as a catalyst for more extensive evaluation
• It is easily adapted to suit different issues.
Disadvantages
• The collation and analysis of responses may be difficult because of the open nature of the responses.
2. Spot Check: The user is asked to circle her/his response to a range of closed questions relevant to the issue, e.g. a specific lesson in your subject.
Use
• It yields an immediate response from the students.
Advantages
• It is a useful tool for measuring the match between teacher and student perceptions of what is going on in the class
• The task group/teacher has complete flexibility in framing the questions to be asked and the language used in the asking
• The template can be adapted to suit any particular set of information that one is seeking.
Disadvantages
• Validity of responses could be a problem.
3. Critical Incident Analysis: The user discusses a chosen incident with the individual/group in order to flesh out the consequences of a specific course of action, e.g. back-answering a teacher. A particular incident that created conflict in the school is taken. The individual/group, with the assistance of a teacher, looks at the incident in relation to the following questions:
• What happened?
• Who was involved?
• What action was taken?
• How effective was the action?
• What was the response to the action taken?
• What other action(s) could have been taken?
• What would have assisted those involved to do things differently?
Use
• It is useful as a means of testing the 'on the ground' reality of policy implementation, i.e. how a school handles problems that arise.
Advantages
• It can provide information on the quality of relationships in the school
• It can inform those with responsibility for implementation of the realities of implementation on the ground.
Disadvantages
• It requires special skills on the part of the teacher
• It can be time-consuming.
4. Self-Evaluation Profile: The user is asked to circle her/his response to a range of closed questions relevant to the issue, e.g. classroom management.
Use
• It is useful for self-evaluation of an action plan, which can be broken down into sub-issues
• It is useful as a way of identifying issues for in-depth evaluation.
Advantages
• It yields information simultaneously on two aspects of implementation:
  • The effect of the issue now, and
  • The effect of the issue over time
• It is capable of being adapted to suit any issue.
Disadvantages
• It allows the respondent to deal only with pre-determined issues.
5. Summative Evaluation Tool: The user draws together the quantitative and qualitative information which has been collected.
Use
• It is useful for in-depth evaluation of specific issues.
Advantages
• It provides the necessary reliability and validity that other tools may not have
• It is a comprehensive way of evaluating any issue.
Disadvantages
• It is not time friendly.

Techniques of Evaluation
Evaluation:
• Tests the usability and functionality of a system
• Occurs in the laboratory, in the field, and/or in collaboration with users
• Evaluates both design and implementation
• Should be considered at all stages in the design life cycle

Goals of Evaluation
• Assess the extent of system functionality
• Assess the effect of the interface on the user
• Identify specific problems
Evaluation tests the usability, functionality and acceptability of an interactive system.
• Evaluation may take place:
  • In the laboratory
  • In the field.
• Some approaches are based on expert evaluation:
  • Analytic methods
  • Review methods
  • Model-based methods.
• Some approaches involve users:
  • Experimental methods
  • Observational methods
  • Query methods.
An evaluation method must be chosen carefully and must be suitable for the job.

Limitations
While several states are implementing some form of standards-based reform, there is very little empirical evidence to prove that standards, assessment, and high-stakes accountability programs are effective in improving public schools.
1. Recent reports on the standards-based reform movement suggest that in many schools the careless implementation of standards and assessment may have negative consequences for students.
2. Vague and unclear standards in several subject areas in several states complicate matters and do not serve as concrete standards defining what students should know and be able to do.
3. Top-down standards imposed by the federal or state government are also problematic. They impose content specifications without taking into account the different needs, opportunities to learn, and skills that may be appropriate for specific districts or regions.

1.4. ITEM FORMATS
Just as there are several types of tests available to help employers make employment decisions, there are also several types of test formats. In this section, the pros and cons of general types of test item formats are described. Also, some general guidelines for using different types of test item formats are provided.

1.5. MULTIPLE CHOICE
Multiple choice questions are composed of one question (stem) with multiple possible answers (choices), including the correct answer and several incorrect answers (distractors). Typically, students select the correct answer by circling the associated number or letter, or filling in the associated circle on the machine-readable response sheet.
Example: Distractors are:
(A) Elements of the exam layout that distract attention from the questions
(B) Incorrect but plausible choices used in multiple choice questions
(C) Unnecessary clauses included in the stem of multiple choice questions
Answer: B
Students can generally respond to this type of question quite quickly. As a result, such questions are often used to test students' knowledge of a broad range of content. Creating these questions can be time consuming because it is often difficult to generate several plausible distractors; however, they can be marked very quickly.

Tips for Writing Good Multiple Choice Items
In the stem, avoid:
• Long or complex sentences
• Trivial statements
• Negatives and double-negatives
• Ambiguity or indefinite terms, absolute statements, and broad generalizations
• Extraneous material
• Item characteristics that provide a clue to the answer
In the stem, do use:
• Your own words, not statements straight out of the textbook
• Single, clearly formulated problems
In the choices, avoid:
• Statements too close to the correct answer
• Completely implausible responses
• 'All of the above' and 'none of the above'
• Overlapping responses (e.g., if 'A' is true then 'C' is also true)
In the choices, do use:
• Plausible and homogeneous distractors
• Statements based on common student misconceptions
• True statements that do not answer the question
• Short options, all of the same length
• Correct options evenly distributed over A, B, C, etc.
• Alternatives that are in logical or numerical order
• At least 3 alternatives
Suggestion: After each lecture during the term, jot down two or three multiple choice questions based on the material for that lecture. Regularly taking a few minutes to compose questions, while the material is fresh in your mind, will allow you to develop a question bank that you can use to construct tests and exams quickly and easily (a small sketch of such a question bank follows).
you can use to construct tests and exams quickly and easily. 3. Matching (c) Only one correct answer but at least three choices
32 Measurement and Evaluation in Education Measurement and Evaluation in Education 33
♦ ♦
Tips for Writing Good Matching Items
Avoid:
• Long stems and options
• Heterogeneous content (e.g., dates mixed with people)
• Implausible responses
Do use:
• Short responses
• 10-15 items on only one page
• Clear directions
• Logically ordered choices (chronological, alphabetical, etc.)

Guidelines for Using Multiple Choice or True/False Test Items
It is generally best to use multiple choice or true/false items when:
• You want to test the breadth of learning, because more material can be covered with this format.
• You want to test different levels of learning.
• You have little time for scoring.
• You are not interested in evaluating how well a test taker can formulate a correct answer.
• You have a clear idea of which material is important and which material is less important.
• You have a large number of test takers.

The Matching Item Format
As was mentioned earlier, the matching format can be considered a type of multiple choice. The matching format is common in curriculum-based tests. It is sometimes used to good advantage and sometimes very poorly done. Some of the strengths of the matching format are:
1. It is easy to construct. Since options are used for more than one item, not nearly as much effort needs to be put into constructing each individual item.
2. It is compact in size. An individual item usually takes only a fraction of the space occupied by one conventional multiple choice item.
3. It is usually time efficient for the test taker. He or she only needs to analyze one set of options for multiple items, provided the matching group is competently designed.
4. It is very useful for working with groups of homogeneous items, for example, matching states with their capitals.
There can also be some serious weaknesses in the matching item format, which could make an entire section of test items invalid. Some things to look out for:
1. Cued answers. A competent test-taker can usually get one or more items correct "for free" by using the process of elimination. A group of ten items with ten options often means that a student needs to know, at most, the answers to nine of the items.
2. Non-homogeneous options. Many groups of matching items are practically worthless because they mix totally unrelated things together as options. In such cases, a skilled student can use the process of elimination to dramatically increase his score, and very little valid testing has taken place.
3. Excessively large groups of items or options. Since each item has the entire set of options as answer possibilities, a student may become overwhelmed with the number of choices from which to select the correct answer.

The True/False Item Format
The true/false (T/F) format is limited in usefulness compared with most other formats, but it is still common. A few reasons for its refusal to fade into oblivion are the relative ease of writing a true/false item and the ease and objectivity of scoring it. There are more problems than benefits, however:
1. T/F items tend to focus on trivial facts rather than significant concepts. As a result, they tend to be either too easy or unreasonably difficult.
2. T/F items are much more likely to be ambiguous or "tricky" to answer. Often the answer turns on a single word. A student may need to analyze multiple words in the item to catch the one that is incorrect.
3. T/F items are too rewarding for guessers, since a random answer has a 50% chance of being correct. On a curriculum-based test, where a passing score typically is 75%-80%, a chance of 50% may not be enough to boost the overall test grade. On a norm-referenced achievement test, guessing with a chance of 50% may significantly affect the overall score (a short worked example of this follows).
Suggestions for True/False Items
1. Avoid vague, indefinite, or broad terms in favor of precise statements. Good test items must be unambiguous, and T/F items even more so.
2. If the correctness of a statement hinges on a particular word or phrase, highlight or emphasize that word or phrase.
3. Avoid negative statements if at all possible. Negative statements are harder to decode, particularly those with two negatives.
4. Include similar numbers of true and false items and make them similar in length.
5. Group T/F items under a common statement, story, illustration, graph, or other material. This reduces the amount of ambiguity possible, since the items come from a specific frame of reference.
6. Avoid generalizations such as all, always, never, or none, since they usually signal a false statement. Also avoid qualifiers like sometimes, generally, often, and can be, since they are often indicators of a true statement.

Interpretative Exercises
An interpretive exercise consists of a series of objective items based on a common set of data. The data may be in the form of written materials, tables, charts, graphs, maps, or pictures. The series of related test items may also take various forms but are most commonly multiple choice or true/false items. Because all students are presented with a common set of data, it is possible to measure a variety of complex learning outcomes. The students can be asked to identify relationships in data, to recognize valid conclusions, to appraise assumptions and inferences, to detect proper applications of data, and the like. The following are examples that are presented in a variety of school subjects at the elementary and secondary levels.

Example 1: Ability to Recognize Inferences
In interpreting written material, it is frequently necessary to draw inferences from the facts given. The following exercise measures the extent to which students are able to recognize warranted and unwarranted inferences drawn from a passage.
Directions: Assuming that the information below is true, it is possible to establish other facts using the ones in this paragraph as a basis for reasoning. This is called drawing inferences. Write the proper symbol in the space provided. Use only the information given in the paragraph as a basis for your responses…
T – if the statement may be inferred as TRUE
F – if the statement may be inferred as UNTRUE
N – if no inference can be drawn about it from the paragraph

Interpretive Exercise
• Usually begins with verbal, tabular or graphic information (a map, a passage from a story, a poem, a cartoon) which is the basis for one or more multiple choice questions
• Can challenge students at various levels of understanding: application, analysis, synthesis, evaluation
• The exercise contains all the information needed to answer the questions
• Readily adaptable to the more important outcomes of the disciplines.

1.7. GUIDELINES FOR ITEM PREPARATION
(i) The items should be worded very carefully. A minor change in the wording is likely to create major differences in meaning, as illustrated below:
(a) Do you approve of segregation of children? Yes/No
(b) You do approve of segregation of children, don't you? Yes/No
(c) Don't you approve of segregation of children? Yes/No
(d) You don't approve of segregation of children, do you? Yes/No
Obviously the meaning of the above questions differs from item to item. Items (c) and (d) are highly suggestive. If accompanied by a sincere, earnest nod of the head, item (b) would cause many respondents to agree, and if items (a) and (d) were accompanied by an unbelieving look, many individuals who approve of segregation would deny it.
(ii) Items should be worded very clearly so that the expected responses would be specific in nature.
(iii) Words in the item should be used in their usual meaning. Those which are likely to be interpreted in more than one way, or misinterpreted, should be avoided. If necessary, ambiguous words should be defined and qualifying terms should be given.
(iv) Items should avoid descriptive words (adjectives and adverbs) such as frequently, occasionally and rarely, because such words have no universal meaning.
(v) Avoid the use of double negatives. Statements such as 'I do not think that pupils will not do home work' or 'The educated persons do not feel coeducation is not good', and so on, mislead respondents.
(vi) Items of a double-barrelled nature should be avoided.
(vii) Items requiring comparison or rating should give the point of reference.
(viii) Clear and complete directions should be provided for all individual items and also for groups of items having certain common characteristics. In preparing the directions, one should observe the golden mean between extreme completeness and detail on the one hand and extreme incompleteness and vagueness on the other.

Chapter-2
RESEARCH INSTRUMENTS

2.1. KINDS OF INSTRUMENTS
On the basis of the merits and limitations of the interview technique, it is used in many ways for research and non-research purposes. This technique was used in the Commonwealth Teacher Training Study to identify the traits most essential for success in teaching. Apart from being an independent data collection tool, it may play an important role in the preparation of questionnaires and check lists which are to be put to extensive use.

2.1.1. Questionnaire
A questionnaire is a self-report data collection instrument that each research participant fills out as part of a research study. Researchers use questionnaires to obtain information about the thoughts, feelings, attitudes, beliefs, values, perceptions, personality and behavioural intentions of research participants.
According to John W. Best (1992), a questionnaire is used when factual information is desired; when opinions rather than facts are desired, an opinionnaire or attitude scale is used.

Forms/Kinds of Questionnaire
The researcher can construct questions in the form of closed, open, pictorial and scale items.
1. Closed form: Questionnaires that call for short check responses are known as the restricted or closed form type. They provide for marking a Yes or No, a short response, or checking an item from a list of suggested responses.
Advantages of the closed form
1. It is easy to fill out.
2. It takes little of the respondents' time.
3. It is relatively objective.
4. It is easy to tabulate and analyze.
5. Answers are standardized.

Limitations of the closed form
It fails to reveal the respondents' motives, does not always obtain information of sufficient scope and depth, and may not discriminate between the finer shades of meaning.
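The contrast between pre-coded closed responses (easy to tabulate) and free-text open responses (harder to standardize) can be shown with a small added sketch; the items, codes and responses below are invented examples, not taken from the book.

```python
# Illustrative sketch: tabulating a pre-coded closed-form item versus
# collecting free-text answers to an open-form item. All data are invented.
from collections import Counter

# Closed form: each response is one of a fixed set of pre-coded options.
closed_codes = {"Yes": 1, "No": 2, "Undecided": 3}
closed_responses = ["Yes", "No", "Yes", "Undecided", "Yes"]
tabulated = Counter(closed_codes[r] for r in closed_responses)
print(tabulated)             # e.g. Counter({1: 3, 2: 1, 3: 1}), ready for analysis

# Open form: free responses in the respondents' own words; they must be
# read and coded by the researcher before any counting is possible.
open_responses = [
    "I like group work because I learn from my friends.",
    "Homework takes too long after evening classes.",
]
print(len(open_responses))   # only the volume is known until manual coding is done
```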
• It should be attractive.
The open form • Directions should be clear and complete.
The open form or unstructured type of questionnaire calls for a • It should be represented in good Psychological order proceeding
free response in respondents own words. from general to more specific responses.
• Double negatives in questions should be avoided.
Advantages of the Open Form Questionnaire
• Putting two questions in one question also should be avoided.
1. Open end questions are flexible.
• It should avoid annoying or embarrassing questions.
2. They can be used when all possible answer categories are not
• It should be designed to collect information which can be used
known.
subsequently as data for analysis.
3. They are preferable to complex issues that cannot be condensed.
• It should consist of a written list of questions.
4. They allow more opportunity for creativity, thinking and self
• The questionnaire should also be used appropriately.
expression.
Designs of Questionnaire
Limitation
After construction of questions on the basis of it’s characteristics
1. There is possibility of collection of worthless or irrelevant it should be designed with some essential routines like:
information.
• Background information about the questionnaire.
2. Data collected through open end questionnaire are not often
• Instructions to the respondent.
standardized from person to person.
• The allocation of serial numbers and
3. Coding is difficult and subjective.
• Coding Boxes.
Pictorial Form
Background Information about
Some questionnaires present respondents with drawings and
photographs rather than written statement from which to choose answers. The Questionnaire Both from ethical and practical point of view,
This form of questionnaire is particularly suitable tool for collecting the researcher needs to provide sufficient background information about
the research and the questionnaire. Each questionnaire should have a cover page, on which some information appears about:
• The sponsor
• The purpose
• Return address and date
• Confidentiality
• Voluntary responses and
• Thanks

Instructions to the Respondent
It is very important that instructions to respondents are presented at the start of the questionnaire, indicating what is expected from the respondents. Specific instructions should be given for each question where the style of questions varies throughout the questionnaire. For example: 'Put a tick mark in the appropriate box', 'Circle the relevant number', etc.

The Allocation of Serial Numbers
Whether dealing with small or large numbers, a good researcher needs to keep good records. Each questionnaire therefore should be numbered.

Coding Boxes
When designing the questionnaire, it is necessary to prevent later complications which might arise at the coding stage. Therefore, you should note the following points:
• Locate coding boxes neatly on the right hand side of the page.
• Allow one coding box for each answer.
• Identify each column in the complete data file underneath the appropriate coding box in the questionnaire.
Besides these, the researcher should also be very careful about the length and appearance of the questionnaire, the wording of the questions, and the order and types of questions while constructing a questionnaire.

Advantages of Questionnaire
The questionnaire is economical. In terms of materials, money and time it can supply a considerable amount of research data.
• It is easier to arrange.
• It supplies standardized answers.
• It encourages pre-coded answers.
• It permits wide coverage.
• It helps in conducting depth study.

Disadvantages
• It is reliable and valid, but slow.
• Pre-coded questions can deter respondents from answering.
• Pre-coded questions can bias the findings towards the researcher.
• Postal questionnaires offer little opportunity to check the truthfulness of the answers.
• It cannot be used with illiterates and small children.

2.1.2. Opinionnaire
"Opinion polling or opinion gauging represents a single question approach. The answers are usually in the form of 'yes' or 'no'. An undecided category is often included. Sometimes a large number of response alternatives is provided." - Anna Anastasi
The terms opinion and attitude are not synonymous, though sometimes we use them synonymously. We have till now discussed the attitude scale. We have also discussed that attitudes are expressed in opinions. You can now understand the difference between an opinionnaire and an attitude scale when we discuss the opinionnaire, its characteristics and purposes. Opinion is what a person says on certain aspects of the issue under consideration. It is an outward expression of an attitude held by an individual. Attitudes of an individual can be inferred or estimated from his statements of opinions. An opinionnaire is defined as a special form of inquiry. It is used by the researcher to collect the opinions of a sample of the population on certain facts or factors of the problem under investigation. These opinions on different facets of the problem under study are further quantified, analysed and interpreted.

Purpose
Opinionnaires are usually used in researches of the descriptive type which demand a survey of the opinions of the concerned individuals. Public
opinion research is an example of opinion survey. Opinion polling enables the researcher to forecast coming happenings in a successful manner.

Characteristics
• The opinionnaire makes use of statements or questions on different aspects of the problem under investigation.
• Responses are expected either on three point or five point scales.
• It uses favourable or unfavourable statements.
• It may be sub-divided into sections.
• The Gallup poll ballots generally make use of questions instead of statements.
• The public opinion polls generally rely on personal contacts rather than mail ballots.

2.1.3. Check List
A checklist is a type of informational job aid used to reduce failure by compensating for potential limits of human memory and attention. It helps to ensure consistency and completeness in carrying out a task. A basic example is the "to do list." A more advanced checklist would be a schedule, which lays out tasks to be done according to time of day or other factors.

Applications
• Pre-flight checklists aid in aviation safety to ensure that critical items are not forgotten.
• Used in medical practice to ensure that clinical practice guidelines are followed. An example is the Surgical Safety Checklist developed for the World Health Organization by Dr. Atul Gawande. Evidence to support surgical checklists is tentative but limited.
• Used in quality assurance of software engineering, to check process compliance, code standardization and error prevention, and others.
• Often used in industry in operations procedures.
• Used in civil litigation to deal with the complexity of discovery and motions practice. An example is the open-source litigation checklist.
• Used by some investors as a critical part of their investment process.
• Can aid in mitigating claims of negligence in public liability claims by providing evidence of a risk management system being in place.
• An ornithological checklist, a list of birds with standardized names that helps ornithologists communicate with the public without the use of scientific names in Latin.
• A popular tool for tracking sports card collections. Randomly inserted in packs, checklist cards provide information on the contents of a sports card set.

Format
Checklists are often presented as lists with small checkboxes down the left hand side of the page. A small tick or checkmark is drawn in the box after the item has been completed. Other formats are also sometimes used. Aviation checklists generally consist of a system and an action divided by a dashed line, and lack a checkbox as they are often read aloud and are usually intended to be reused.

Check List in Education
A simple checklist of what schools can do to instill good behaviour in the classroom has been developed and published today by Charlie Taylor - the head teacher of a special school with some of the toughest behaviour issues and the government's expert adviser on behaviour. The behaviour checklist - entitled 'Getting the simple things right' - follows Charlie Taylor's recent behaviour summit, where outstanding head teachers from schools in areas of high deprivation gathered to discuss the key principles for improving behaviour. What soon became clear was how much similarity there was between the approaches that the head teachers followed. Many of them emphasised the simplicity of their approach but they agreed that most important of all was consistency.

Actions from the checklist include
• Ensuring absolute clarity about the expected standard of pupils' behaviour
• Displaying school rules clearly in classes and around the building. Staff and pupils should know what they are
• Ensuring that children actually receive rewards every time they have earned them and receive a sanction every time they behave badly
• Taking action to deal with poor teaching or staff who fail to follow the behaviour policy
• Ensuring pupils come in from the playground and move around the school in an orderly manner
• Ensuring that the senior leadership team, like the head and assistant head, are a visible presence around the school during the day, including in the lunch hall and playground, and are not confined to offices

2.1.4. Inventory
• An inventory is a list, record or catalog containing a list of traits, preferences, attitudes, interests or abilities used to evaluate personal characteristics or skills. The purpose of an inventory is to make a list about a specific trait, activity or programme and to check to what extent that ability is present. Types of inventories include:
• Interest Inventory and
• Personality Inventory
• Persons differ in their interests, likes and dislikes. Interests are a significant element in the personality pattern of individuals and play an important role in their educational and professional careers. The tools used for describing and measuring interests of individuals are the interest inventories or interest blanks. They are self-report instruments in which the individuals note their own likes and dislikes. They are of the nature of standardized interviews in which the subject gives an introspective report of his feelings about certain situations and phenomena, which is then interpreted in terms of interests.
• The use of interest inventories is most frequent in the areas of educational and vocational guidance and case studies. Distinctive patterns of interest that go with success have been discovered through research in a number of educational and vocational fields. Mechanical, computational, scientific, artistic, literary, musical, social service, clerical and many other areas of interest have been analysed in terms of activities. In terms of specific activities, a person's likes and dislikes are sorted into various interest areas and percentile scores calculated for each area. The area where a person's percentile scores are relatively higher is considered to be the area of his greatest interest, the area in which he would be the happiest and the most successful. As a part of educational surveys of many kinds, children's interest in reading, in games, in dramatics, in other extracurricular activities and in curricular work etc. is studied. One kind of instrument most commonly used in interest measurement is known as Strong's Vocational Interest Inventory. It compares the subject's pattern of interest to the interest patterns of successful individuals in a number of vocational fields. This inventory consists of 400 different items. The subject has to tick mark one of the alternatives, i.e. L (for Like), I (Indifference) or D (Dislike), provided against each item. When the inventory is standardized, the scoring keys and percentile norms are prepared on the basis of the responses of a fairly large number of successful individuals of a particular vocation. A separate scoring key is therefore prepared for each separate vocation or subject area. The subject's responses are scored with the scoring key of a particular vocation in order to know his interest or lack of interest in the vocation concerned. Similarly his responses can be scored with scoring keys standardized for other vocational areas. In this way you can determine one's areas of vocational interest. Besides the well-known interest inventories, there are also personality inventories to measure personality.

2.1.5. TEST
A test is a systematic procedure for observing persons and describing them with either a numerical scale or a category system. Thus a test may give either qualitative or quantitative information. A test commonly refers to a set of items or questions under specific conditions.

Types of Test
• Essay type
• Objective type
Essay Type
It is an item format that requires the student to structure a rather long written response, up to several paragraphs.

Characteristics of Essay Test
• Generally essay tests contain more than one question in the test
• Essay tests are to be answered in writing only
• Essay tests require fairly long answers
• Essay tests are attempted on the basis of recall from memory

Types of Essay Test
• Selective recall (basis given)
• Evaluative recall (basis given)
• Comparison of two things on a single designated basis
• Comparison of two things in general
• Decisions (for and against)
• Explanation of the use or exact meaning of some word, phrase or statement
• Summary of some unit of the text or of some article
• Analysis
• Illustrations or examples
• Application of rules, laws, or principles to new situations
• Discussions
• Criticism
• Inferential thinking

Advantages
• Can measure complex learning outcomes
• Emphasize integration and application of thinking and problem solving
• Can be easily constructed
• Examinee is free to respond
• No guessing as in objective items
• Require less time for typing, duplicating or printing; can be written on the board
• Can be used as a device for measuring and improving language and expression skills

Limitations
• Lack of consistency in judgments even among competent examiners
• They have a halo effect
• Question to question carry effect
• Examinee to examinee carry effect
• Language mechanics effect
• Limited content validity
• Some examiners are too strict and some are too lenient
• Difficult to score objectively
• Time consuming
• Lengthy enumeration of memorized facts

Suggestions for Construction of Essay Tests
• Ask questions that require the examinee to show command of essential knowledge
• Make questions as explicit as possible
• There should be no choice of questions in the question paper
• The test constructor should prepare ideal answers to all questions
• Intimate the examinee about the desired length of the answers
• Make each question relatively short but increase the number of questions
• The test constructor should get his test reviewed by one or more colleagues
• Questions should be so worded that all examinees interpret them in the same way as the examiner wants.

Short answer items require the examinee to respond to the item with a word, short phrase, number or a symbol.
Characteristics
• The test requires a supplied response rather than selection or identification
• It is in the form of a question or incomplete statement
• The test can be answered by a word, a phrase, a number or a symbol

Forms of Short Answer Items
• Question form
• Identification or association form
• Completion form

Advantages
• Very easy to construct
• Low probability of guessing the answer because it has to be supplied by the examinees rather than selected or identified from given answers
• They are good to test the lowest level of the cognitive taxonomy (knowledge, terminology, facts)

Forms of Objective Type Tests
(a) Two choice items
1. True/false items
2. Completion type (if two choices are given against each blank)
(b) More than two choice items
1. Matching items
2. MCQs

True/False Tests (Shooting Questions)
A true-false item consists of a statement or proposition which the examinee must judge and mark as either true or false.

Advantages
• It takes less time to construct true-false items
• High degree of objectivity
A Likert scale typically contains an odd number of options, usually 5 to 7. One end is labeled as the most positive end while the other one is labeled as the most negative one, with the label of 'neutral' in the middle of the scale.
The phrases 'purely negative' and 'mostly negative' could also have been 'extremely disagree' and 'slightly disagree'.
3. Semantic Differential Scale (Max Diff)
A semantic scale is a combination of more than one continuum. It usually contains an odd number of radio buttons with labels at opposite ends. Max Diff scales are often used in trade-off analysis such as conjoint analysis.
Another very commonly used scale in questionnaires is the side-by-side matrix. A common and powerful application of the side-by-side matrix is the importance/satisfaction type of question. First, ask the respondent how important an attribute is, then ask them how satisfied they are with your performance in this area. QuestionPro's logic and loop functions also allow you to run through this question multiple times with other alternatives that the respondent might consider. This yields benchmark data that will allow you to compare your performance against other competing alternatives.
Basic Rules
• Include basic information
• Student’s name
• Rater’s name
• Rater’s position
• Setting in which student was observed
• Rating period (From ___ to ___)
• Date scale was completed
• Other information important to you
• Decide on odd or even number of responses
• Decide whether or not to group items with same content together
• Allow space for comments after each item
• Allow space for comments at end of scale
• Write specific directions, including the purpose of the scale and how to complete it
• Put labels at the top of response choices (on every page)

Here is an example of data from an importance/satisfaction question. The importance rating is the line and the performance ratings are the bars. With this type of data, you can actually see where your company needs to increase its efforts to more closely meet the needs of the customer. While there are many online survey tools and online survey software to choose from, you will find that not all of them have these different types of scales available.

Construction of Rating Scale
Steps in Constructing a Rating Scale
1. Decide what areas you want to measure
2. For each area decide what characteristics you want to measure
3. Define a range for each characteristic
• Decide how many points on the scale
• State the extremes - very good and very bad
• State points between these extremes
4. Arrange items to form the scale
5. Design directions
6. Pilot test the scale
7. Make needed revisions, based on the pilot test

Principles for Preparation
• Use action oriented, precise verbs.
• Each item should deal with an important content area.
• A question can be as long as necessary, but the answer should be short.
• Use precise, simple and accurate language in relation to the subject matter area.
• Provide the necessary space for answers below each question asked.

Scale Construction Techniques
• Arbitrary approach - scales built on an ad hoc basis.
• Consensus approach - a panel of judges evaluates the items.
• Item analysis approach - selection of individual items into the test.
• Cumulative scales - ranking of items.
• Factor scales - inter-correlation of items.
2.1.9. SCORE CARD
A score card is a device similar to a check list in certain respects, and to a rating scale in others. It contains a list of items that pertain to various aspects regarding a phenomenon or situation about an individual, institution, organization or object. It gives predetermined values to the presence of (or assigns a rating to) each aspect or characteristic. The respondents check the items regarding the aspects presented in the situation (or rate them). At the stage of appraisal, the investigator counts the point values and gives the total weighted score to the phenomenon. Usually, the jury technique is employed, where a number of persons are asked to make the assessment and the average score is assigned to the phenomenon.

Score cards are frequently used
• To estimate the socio-economic status of the family, communities, etc.
• To assess institutions in general, and their overall or specific contribution in particular
• To evaluate a literary or academic work, text book and so on.

The socio-economic status of an individual includes the following items with their aspects and point value for each (or rating assigned to each):
• Total income
• Material possessions - land properties, modern amenities
• Educational background of family
• Occupation

The construction of a score card involves three main steps:
• To identify the important aspects of a phenomenon which are to be evaluated
• To select the important aspects of the phenomenon
• To assign point values to each (or to the rating of each)

The evaluation of a phenomenon using a score card suffers from a limitation, namely, certain intangibles connected with the phenomenon do not lend themselves to ratings, and this vitiates the total assessment.

Chapter-3
CONSTRUCTION OF TEST

3.1. TEST CONSTRUCTION
A test item must focus the attention of the examinee on the principle or construct upon which the item is based. Ideally, students who answer a test item incorrectly will do so because their mastery of the principle or construct in focus was inadequate or incomplete. Any characteristic of a test item which distracts the examinee from the major point or focus of an item reduces the effectiveness of that item. Any item answered correctly or incorrectly because of extraneous factors in the item results in misleading feedback to both examinee and examiner.

Test Construction
Writing items requires a decision about the nature of the item or question to which we ask students to respond, that is, whether discrete or integrative; how we will score the item, for example, objectively or subjectively; the skill we purport to test, and so on. We also consider the characteristics of the test takers and the test-taking strategies respondents will need to use. What follows is a short description of these considerations for constructing items. Test construction is based upon practical and scientific rules that are applied before, during and after each item until it finally becomes a part of the test. The following are the stages of constructing a test as followed by the Center.

Preparation
Each year the Center attracts a number of competent specialists in testing and educationalists to attend workshops that deal with the theoretical and applied aspects of the test.
The theoretical aspects include:
1. The concepts on which a test is constructed
2. The goals and objectives of the test
3. The general components, parts and sections of a test
4. The theoretical and technical foundations to write test items
Justifications have to be provided if an item is deemed invalid. All data is entered into the Center's computer.

Item Entry
All items are entered into the computer marked with the relevant judgment except those deemed invalid.
8. It is an important characteristic of a measuring instrument.
9. It refers to the accuracy or precision of a measuring instrument.
According to Hopkins, reliability means the consistency with which a test measures whatever it measures. Reliability is the degree to which an assessment tool produces stable and consistent results.

Types of Reliability
1. Test-retest reliability: Is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.
2. Parallel forms reliability: Is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. Example: If you
4. Internal consistency reliability: Is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.
(a) Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.
(b) Split-half reliability: Is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by "splitting in half" all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two "sets" of items. The entire test is administered to a group of individuals, the total score for each "set" is computed, and finally the split-half reliability is obtained by determining the correlation between the two total "set" scores.

Types of Validity
1. Face Validity: Ascertains that the measure appears to be assessing the intended construct under study. The stakeholders can easily
assess face validity. Although this is not a very "scientific" type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged with the task. Example: If a measure of art appreciation is created, all of the items should be related to the different components and types of art. If the questions are regarding historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.
2. Construct Validity: Is used to ensure that the measure actually measures what it is intended to measure (i.e. the construct), and not other variables. Using a panel of "experts" familiar with the construct is a way in which this type of validity can be assessed. The experts can examine the items and decide what that specific item is intended to measure. Students can be involved in this process to obtain their feedback. Example: A women's studies program may design a cumulative assessment of learning throughout the major. The questions are written with complicated wording and phrasing. This can cause the test to inadvertently become a test of reading comprehension, rather than a test of women's studies. It is important that the measure is actually assessing the intended construct, rather than an extraneous factor.
3. Criterion-Related Validity: Is used to predict future or current performance - it correlates test results with another criterion of interest. Example: If a physics program designed a measure to assess cumulative student learning throughout the major, the new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool.
4. Formative Validity: When applied to outcomes assessment it is used to assess how well a measure is able to provide information to help improve the program under study. Example: When designing a rubric for history one could assess students' knowledge across the discipline. If the measure can provide information that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements.
5. Sampling Validity: (similar to content validity) ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may need to be completed using a panel of "experts" to ensure that the content area is adequately sampled. Additionally, a panel can help limit "expert" bias (i.e. a test reflecting what an individual personally feels are the most important or relevant areas). Example: When designing an assessment of learning in the theatre department, it would not be sufficient to only cover issues related to acting. Other areas of theatre such as lighting, sound, and the functions of stage managers should all be included. The assessment should reflect the content area in its entirety.

3.2.3. Objectivity
• Objectivity is also referred to as rater reliability
• Objectivity is the close agreement between scores assigned by two or more judges

Factors Affecting Objectivity
• The clarity of the scoring system
• The degree to which judges can assign scores accurately (fairly, no bias)

A test has objectivity if it receives the same score when it is examined by the same examiner on two different occasions. By objectivity we mean the degree to which the personal element or judgement is eliminated in scoring. A measuring instrument is said to be highly objective if the score assigned by different but equally competent scorers is not affected by judgement, personal opinion or bias. Objectivity can be determined by finding the co-efficient of correlation between scores assigned to a group of papers by the same examiner on two occasions. It is called the co-efficient of objectivity.
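As a rough illustration of that last point, the following Python sketch (not part of the original text; the scores are invented) computes a co-efficient of objectivity as the correlation between the marks assigned to the same ten papers on two occasions.

```python
# Hypothetical illustration: correlation between two sets of marks
# given to the same ten answer papers (invented data).

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

first_marking = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
second_marking = [13, 14, 10, 17, 15, 10, 16, 12, 11, 18]

print("Co-efficient of objectivity:",
      round(pearson_r(first_marking, second_marking), 3))
```

A value close to 1 would indicate that the scoring is largely free of the scorer's personal judgement.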
Usability
Usability means the degree to which the tests are used without much expenditure of time, money and effort. It also means practicability. Factors that determine usability are: administrability, scorability, interpretability, economy and proper mechanical makeup of the test. Administrability means that the test can be administered with ease, clarity and uniformity. Directions must be made simple, clear and concise. Time limits, oral instructions and sample questions are specified. Provisions for preparation, distribution, and collection of test materials must be definite. Scorability is concerned with the scoring of the test. A good test is easy to score: the scoring directions are clear, the scoring key is simple, answer sheets are available, and machine scoring is made possible as far as practicable. Test results can be useful if, after evaluation, they are interpreted. Correct interpretation and application of test results is very useful for sound educational decisions. An economical test is of low cost. One way to economize cost is to use answer sheets and reusable test booklets. However, test validity and reliability should not be sacrificed for economy. Proper mechanical make-up of the test concerns how tests are printed, what font size is used, and whether the illustrations fit the level of the pupils/students.
Usability is an important criterion in assessing the value of a test. It depends upon a number of factors such as ease of administration, ease of scoring, ease of interpretation and use of scores, low cost, satisfactory format, etc.
1. Economy: More items give more reliability, but in the interest of limiting the interview or observation time the number of measurement questions should be reduced. The choice of data collection method is also often dictated by economic factors. The rising cost of personal interviewing first led to an increased use of telephone surveys and subsequently to the current rise in internet surveys. In standardized tests, the cost of test materials alone can be such a significant expense that it encourages multiple reuse. For fast and economical scoring, computer scoring and scanning are increasingly used.
2. Convenience in administration: A questionnaire or a measurement scale with a set of detailed but clear instructions, with examples, is easier to complete correctly than one that lacks these features. In a well-prepared study, it is not uncommon for the interviewer instructions to be several times longer than the interview questions. Naturally, the more complex the concepts and constructs, the greater is the need for clear and complete instructions. The instrument should be made easier to administer by giving close attention to its design and layout. A long completion time, complex instructions, participants' perceived difficulty with the survey, and their rated enjoyment of the process also influence design.
3. Scoring and interpretability: This aspect of practicality is relevant when persons other than the test designers must interpret the results. It is usually, but not exclusively, an issue with standardized tests. In such cases, the designer of the data collection instrument provides several key pieces of information to make interpretation possible:
• A statement of the functions the test was designed to measure and the procedures by which it was developed.
• Detailed instructions for administration.
• Scoring keys and instructions.
• Norms for appropriate reference groups.
• Evidence about reliability.
• Evidence regarding the inter-correlations of sub-scores.
• Evidence regarding the relationship of the test to other measures.
• Guides for test use.
Chapter-4
METHODS OF RELIABILITY AND VALIDITY

4.1. RELIABILITY METHODS
There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:
• Inter-Rater or Inter-Observer Reliability: Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
• Test-Retest Reliability: Used to assess the consistency of a measure from one time to another.
• Parallel-Forms Reliability: Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
• Internal Consistency Reliability: Used to assess the consistency of results across items within a test.
Let's discuss each of these in turn.

Inter-Rater or Inter-Observer Reliability
Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.
So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you're kind of stuck. Probably it's best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to assure that your raters aren't changing.
There are two major ways to actually estimate inter-rater reliability. If your measurement consists of categories -- the raters are checking off which category each observation falls in -- you can calculate the percent of agreement between the raters. For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.
The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.
You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would have all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for
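The two estimation approaches just described can be sketched in a few lines of Python; the block below is an illustration only (the rating data are invented and are not from the text).

```python
# Hypothetical illustration of the two inter-rater approaches described above.

def percent_agreement(rater_a, rater_b):
    """Share of observations on which two raters chose the same category."""
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * agreements / len(rater_a)

def pearson_r(x, y):
    """Pearson correlation for two continuous rating series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Categorical ratings: each observation placed in one of three categories
cats_a = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
cats_b = [1, 2, 3, 3, 1, 2, 3, 2, 1, 2]
print("Percent agreement:", percent_agreement(cats_a, cats_b), "%")

# Continuous ratings: classroom activity rated on a 1-to-7 scale
activity_a = [3, 4, 5, 2, 6, 4, 3, 5]
activity_b = [3, 5, 5, 2, 6, 3, 4, 5]
print("Correlation between raters:", round(pearson_r(activity_a, activity_b), 3))
```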
Split-Half Reliability
In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half; the split-half reliability estimate is simply the correlation between these two total scores.
Comparison of Reliability Estimators
Each of the reliability estimators has certain advantages and disadvantages. Inter-rater reliability is one of the best ways to estimate reliability when your measure is an observation. However, it requires multiple raters or observers. As an alternative, you could look at the correlation of ratings of the same single observer repeated on two different occasions. For example, let's say you collected videotapes of child-mother interactions and had a rater code the videos for how often the mother smiled at the child. To establish inter-rater reliability you could take a sample of videos and have two raters code them independently. To estimate test-retest reliability you could have a single rater code the same videos on two different occasions. You might use the inter-rater approach especially if you were interested in using a team of raters and you wanted to establish that they yielded consistent results. If you get a suitably high inter-rater reliability you could then justify allowing them to work independently on coding different videos. You might use the test-retest approach when you only have a single rater and don't want to train any others. On the other hand, in some studies it is reasonable to do both to help establish the reliability of the raters or observers.
The parallel forms estimator is typically only used in situations where you intend to use the two forms as alternate measures of the same thing. Both the parallel forms and all of the internal consistency estimators have one major constraint -- you have to have multiple items designed to measure the same construct. This is relatively easy to achieve in certain contexts like achievement testing (it's easy, for instance, to construct lots of similar addition problems for a math test), but for more complex or subjective constructs this can be a real challenge. If you do have lots of items, Cronbach's Alpha tends to be the most frequently used estimate of internal consistency.
The test-retest estimator is especially feasible in most experimental and quasi-experimental designs that use a no-treatment control group. In these designs you always have a control group that is measured on two occasions (pretest and posttest). The main problem with this approach is that you don't have any information about reliability until you collect the posttest and, if the reliability estimate is low, you're pretty much sunk.

4.2. SPLIT HALF
One way to test the reliability of a test is to repeat the test. This is not always possible. Another approach, which is applicable to questionnaires, is to divide the test into even and odd questions and compare the results.
Example 1: 12 students take a test with 50 questions. For each student the total score is recorded along with the sum of the scores for the even questions and the sum of the scores for the odd questions, as shown in Figure 1. Determine whether the test is reliable by using the split-half methodology.
Figure-1: Split-half methodology for Example 1
The statistical test consists of looking at the correlation coefficient (cell G3 of Figure 1). If it is high then the questionnaire is considered to be reliable.
r = CORREL(C4:C15,D4:D15) = 0.667277
One problem with the split-half reliability coefficient is that since only half the number of items is used, the reliability coefficient is reduced. To get a better estimate of the reliability of the full test, we apply the Spearman-Brown correction, namely:
ρ = 2r / (1 + r) = 0.800439
SPLIT_HALF(R1, R2) = split-half measure (with the Spearman-Brown correction) for data in ranges R1 and R2
SPLITHALF(R1, type) = split-half measure for the scores in the first half of the items in R1 vs. the second half of the items if type = 0, and the odd items in R1 vs. the even items if type = 1.
The SPLIT_HALF function ignores any empty cells and cells with non-numeric values. This is not so for the SPLITHALF function. For Example 1, SPLIT_HALF(C4:C15, D4:D15) = 0.800439.
Example 2: Calculate the split-half coefficient of the ten-question questionnaire using a Likert scale (1 to 7) given to 15 people whose results are shown in Figure 2.
Figure-2: Data for Example 2
We first split the questions into the two halves: Q1-Q5 and Q6-Q10, as shown in Figure 3.
Figure-3: Split-half coefficient (Q1-Q5 v. Q6-Q10)
E.g. the formula in cell B23 is =SUM(B4:F4) and the formula in cell C23 is =SUM(G4:K4). The coefficient 0.64451 (cell H24) can be calculated as in Example 1. Alternatively, the coefficient can be calculated by the worksheet formula =SPLIT_HALF(B23:B37,C23:C37) or =SPLITHALF(B4:K18,0).
We can also split the questionnaire into odd and even questions, as shown in Figure 4.
1. A test can be divided into two equal halves in a number of ways and the coefficient of correlation in each case may be different.
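The same odd/even procedure can also be carried out outside a spreadsheet. The Python sketch below (invented item scores, not the data of the figures above) mirrors the steps just described: total the odd and even items for each examinee, correlate the two halves, and apply the Spearman-Brown correction.

```python
# Hypothetical illustration of the split-half procedure described above
# (invented scores for 12 students on 10 items; not the book's figures).

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_reliability(item_scores):
    """Odd/even split-half correlation with the Spearman-Brown correction.

    item_scores: one list per student, one score per item.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    return (2 * r_half) / (1 + r_half)                     # Spearman-Brown

scores = [
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
]
print("Corrected split-half reliability:",
      round(split_half_reliability(scores), 3))
```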
For example, if the half-test correlation (for a 30-item test) between the 15 odd-numbered and 15 even-numbered items on a test turned out to be .50, the full-test (30-item) reliability would be .67 as follows:

reliability = (2 × r half-test) / (1 + r half-test) = (2 × .50) / (1 + .50) = 1.00 / 1.50 = .666 ≈ .67

However, there is another version of the formula, which can be applied to situations other than a simple doubling of the number of items:

reliability = (n × r) / (1 + (n - 1) × r)

Using the more complex formula, we get the same answer as we did with the simpler formula for the split-half reliability adjustment example as follows:

reliability = (2 × .50) / (1 + (2 - 1) × .50) = 1.00 / 1.50 = .666 ≈ .67

We can also use the more complex formula to estimate what the reliability for that same test would be if it had 60 items by using n = 4 (for the number of times we must multiply 15 to get 60; 4 x 15 = 60) as follows:

reliability = (4 × .50) / (1 + (4 - 1) × .50) = 2.00 / 2.50 = .80

Or we can estimate what the reliability would be for various fractions of the test length. For instance, we could estimate the reliability for a 63-item test by using n = 4.2 (for the number of times we must multiply 15 to get 63; 4.2 x 15 = 63) as follows:

reliability = (4.2 × .50) / (1 + (4.2 - 1) × .50) = 2.10 / 2.60 = .807 ≈ .81

We can even estimate the reliability for a shorter version of the test, say a 5-item version, by using a decimal fraction, that is, n = .33 (for the number of times we must multiply 15 to get 5; .33 x 15 = 4.95, or about 5), as follows:

reliability = (.33 × .50) / (1 + (.33 - 1) × .50) = .165 / .665 = .248 ≈ .25

We might want to use this last strategy if we were trying to figure out how short we could make our test and still maintain decent reliability. [For more on the Spearman-Brown formula, see Brown, 1996, pp. 194-196, 204-205, or Brown with Wada, 1999, pp. 220-223, 233-234.]
The Spearman-Brown prophecy formula can be used for adjusting split-half reliability, but more importantly, it can be used for answering what-if questions about test length when you are designing or revising a language test. Unfortunately, the Spearman-Brown formula is limited to estimating differences on one dimension (usually the number of items, or raters). For those interested in doing so on more than one dimension, generalizability theory (G-theory) provides the same sort of answers, but for more dimensions (called facets in G-theory). For instance, in Brown (1999), I used G-theory to examine (separately and together) the effects on reliability of various numbers of items and subtests on the TOEFL, and numbers of languages among the persons taking the test.
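The what-if calculations above can be reproduced with a small helper function. The Python sketch below is only an illustration of the general formula, run with the same half-test correlation of .50 and the same length factors n used in the worked examples.

```python
# Spearman-Brown prophecy formula: predicted reliability when a test is
# lengthened (or shortened) by a factor n, given the current reliability r.

def spearman_brown(r, n):
    """Predicted reliability for a test n times as long as the current one."""
    return (n * r) / (1 + (n - 1) * r)

r_half = 0.50  # correlation between two 15-item halves

for n, label in [(2, "30 items"), (4, "60 items"),
                 (4.2, "63 items"), (0.33, "about 5 items")]:
    print(f"n = {n:>4} ({label}): reliability = {spearman_brown(r_half, n):.2f}")
```

Running it reproduces the values worked out by hand above: roughly .67, .80, .81 and .25.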
4.4. TEST-RETEST RELIABILITY
Test-retest reliability is the degree to which scores are consistent over time. It indicates score variation that occurs from testing session to testing session as a result of errors of measurement. Problems: memory, maturation, learning.
This method involves (i) repetition of a test on the same group immediately or after a lapse of time, and (ii) computation of the correlation between the first and the second set of scores. The correlation co-efficient thus obtained indicates the extent or magnitude of the agreement between the two sets of scores and is often called the coefficient of stability. The estimate of reliability in this case varies according to the length of the time-interval allowed between the two administrations. The product moment method of correlation is a significant method for estimating the reliability of two sets of scores. Thus, a high correlation between two sets of scores indicates that the test is reliable. In other words, it shows that the scores obtained in the first administration resemble the scores obtained in the second administration of the same test.
In this method the time interval plays an important role. Immediate 4.6. ALTERNATE OR PARALLEL FORMS METHOD
repetition of a test may involve (i) immediate memory effects (ii) practice This method involves the administration of equivalent or parallel
effects (iii) confidence effects, induced by familiarity of contents. Intervals forms of the test instead of repetition of a single test. The two equivalent
of six months or long may show ‘maturity effect’. The factors of intervening forms are so constructed as to make them similar (but not identical)
learning and unlearning may lead to lowering of self-correlation. Owing in context, mental process involved, number of items, difficulty level
to difficulties in controlling conditions which influence scores on retest, and in other aspects. Parallel tests have equal mean scores, variances
the test-retest method is generally less useful than are the other methods. and intercorrelations among items. That is, two parallel forms must
be homogeneous or similar in all respects, but not a duplication of
Advantages
test items. The subjects take one form of the test and then as soon as
1. It is generally used for estimating reliability coefficient. possible, the other form. The reliability coefficient may be looked upon
2. It is worthy to use in different situations conveniently. as the coefficient correlation between the scores on two equivalent
forms of test.
3. A test of an adequate length can be used after an interval of many
days between successive testing, Advantages
Limitations 1. Memory, practice and carryover effects are minimised and not
affect the scores.
1. If the test is repeated immediately or after a little time gap, there
may be possibility of carry-over effect, transfer effect, memory 2. The reliability coefficient obtained by this method is a measure
effect, practice effect and confidence effect induced by familiarity of both temporal stability and consistency of response to different
with the material will almost certainly effect scores when the test item samples or test forms.
is administered for a second time. 3. It is useful for the reliability of achievement tests.
2. Index of reliability so obtained is less accurate.
Limitations
3. If the interval between tests is rather long (more than six months)
growth factor and maturity affect the scores and tenders to lower 1. Practice and carry over factors cannot be completely controlled.
down the reliable index. 2. When the tests are not exactly equal the comparison between two
4. On repeating the same test on the same group second time, makes sets of scores obtained from these tests may lead to erroneous
the students disinterested and thus they do not like to take part decisions.
wholeheartedly. 3. Administration of two forms simultaneously creates boredom.
4.5. EQUIVALENT-FORMS OR ALTERNATE-FORMS RELIABILITY
Two tests that are identical in every way except for the actual items included. This approach is used when it is likely that test takers will recall responses made during the first session and when alternate forms are available. Correlate the two sets of scores; the obtained coefficient is called the coefficient of stability or coefficient of equivalence. The problem is the difficulty of constructing two forms that are essentially equivalent. Both of the above require two administrations.

4.6. ALTERNATE OR PARALLEL FORMS METHOD
This method involves the administration of equivalent or parallel forms of the test instead of repetition of a single test. The two equivalent forms are so constructed as to make them similar (but not identical) in content, mental processes involved, number of items, difficulty level and other aspects. Parallel tests have equal mean scores, variances and intercorrelations among items. That is, two parallel forms must be homogeneous or similar in all respects, but not a duplication of test items. The subjects take one form of the test and then, as soon as possible, the other form. The reliability coefficient may be looked upon as the coefficient of correlation between the scores on the two equivalent forms of the test (a computational sketch follows the limitations below).

Advantages
1. Memory, practice and carry-over effects are minimised and do not affect the scores.
2. The reliability coefficient obtained by this method is a measure of both temporal stability and consistency of response to different item samples or test forms.
3. It is useful for estimating the reliability of achievement tests.

Limitations
1. Practice and carry-over factors cannot be completely controlled.
2. When the tests are not exactly equal, a comparison between the two sets of scores obtained from them may lead to erroneous decisions.
3. Administration of two forms in close succession creates boredom.
4. The testing conditions while administering Form B may not be the same.
5. Test scores on the second form of the test are generally higher.
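The correlation itself can be computed directly from the two score lists. The following is a minimal sketch, not taken from the text; the Form A and Form B scores in it are hypothetical, and the same calculation gives the coefficient of stability when the two lists come from a test-retest design.

    # Sketch (not from the text): alternate-forms reliability estimated as the
    # Pearson correlation between Form A and Form B scores for the same group.
    def pearson_r(x, y):
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        var_x = sum((a - mean_x) ** 2 for a in x)
        var_y = sum((b - mean_y) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    form_a = [34, 29, 41, 25, 38, 30, 27, 36]   # hypothetical Form A scores
    form_b = [32, 30, 40, 27, 37, 31, 25, 35]   # hypothetical Form B scores
    print(round(pearson_r(form_a, form_b), 2))  # reliability (equivalence) coefficient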
4.7. RATIONAL EQUIVALENCE METHOD
It is a method based on the consistency of responses to all the items. This method enables us to compute the inter-correlations of the items of the test and the correlation of each item with all the items of the test. In this method, it is assumed that all items have the same or equal difficulty
value, the correlations between the items are equal, all the items measure essentially the same ability, and the test is homogeneous in nature. Like the split-half method, this method also provides a measure of internal consistency. The most popular formula is the Kuder-Richardson formula:

    r1t = [n / (n - 1)] x [(σt² - Σpq) / σt²]

Where r1t = reliability coefficient of the whole test
n = number of items in the test
σt = the SD of the test scores
p = the proportion of the group answering a test item correctly
q = (1 - p) = the proportion of the group answering a test item incorrectly.
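As a worked illustration of the formula above, here is a minimal sketch, not taken from the text; the matrix of right/wrong responses is hypothetical, and the population variance of the total scores is used for σt².

    # Sketch of the Kuder-Richardson (KR-20) formula above, applied to a small
    # hypothetical 0/1 response matrix: rows = examinees, columns = items.
    def kr20(responses):
        n_items = len(responses[0])
        totals = [sum(row) for row in responses]                      # total score per examinee
        mean_t = sum(totals) / len(totals)
        var_t = sum((t - mean_t) ** 2 for t in totals) / len(totals)  # sigma_t squared
        sum_pq = 0.0
        for j in range(n_items):
            p = sum(row[j] for row in responses) / len(responses)     # proportion answering item j correctly
            sum_pq += p * (1 - p)                                     # q = 1 - p
        return (n_items / (n_items - 1)) * ((var_t - sum_pq) / var_t)

    data = [                     # hypothetical responses: 6 examinees, 5 items
        [1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 1, 1],
        [1, 0, 1, 1, 0],
    ]
    print(round(kr20(data), 2))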
Advantages
1. This coefficient provides some indication of how internally consistent or homogeneous the items of the test are.
2. The split-half method measures only equivalence, but the rational equivalence method measures both equivalence and homogeneity.
3. It requires neither the administration of two equivalent forms of the test nor splitting the test into two equal halves.

Limitations
1. The coefficient obtained by this method is generally somewhat lower than the coefficients obtained by other methods.
2. If the items of the test are not highly homogeneous, this method will yield a lower reliability coefficient.
3. The Kuder-Richardson and split-half methods are not appropriate for speed tests.

The population of interest in a study is the "construct" and the sample is your operationalization. If we think of it this way, we are essentially talking about the construct validity of the sampling! Second, I want to use the term construct validity to refer to the general case of translating any construct into an operationalization. Let's use all of the other validity terms to reflect different ways you can demonstrate different aspects of construct validity.

With all that in mind, here's a list of the validity types that are typically mentioned in texts and research papers when talking about the quality of measurement:

4.8. CONSTRUCT VALIDITY
• Translation validity
• Face validity
• Content validity
• Criterion-related validity
• Predictive validity
• Concurrent validity
• Convergent validity
• Discriminant validity

I have to warn you here that I made this list up. I've never heard of "translation" validity before, but I needed a good name to summarize what both face and content validity are getting at, and that one seemed sensible. All of the other labels are commonly known, but the way I've organized them is different than I've seen elsewhere.

Let's see if we can make some sense out of this list. First, as mentioned above, I would like to use the term construct validity to be the overarching category. Construct validity is the approximate truth of the conclusion that your operationalization accurately reflects its construct. All of the other terms address this general issue in different ways. Second, I make a distinction between two broad types: translation validity and criterion-related validity. That's because I think these correspond to the two major ways you can assure/assess the validity of an operationalization. In translation validity, you focus on whether the operationalization is a good reflection of the construct. This approach is definitional in nature -- it assumes you have a good detailed definition of the construct and that you can check the operationalization against it. In criterion-related validity, you examine whether the operationalization behaves the way it should given your theory of the construct. This is a more relational approach to construct validity. It assumes that your operationalization should function in predictable ways in relation to other operationalizations based upon your theory of the construct. (If
all this seems a bit dense, hang in there until you've gone through the discussion below -- then come back and re-read this paragraph). Let's go through the specific validity types.

Translation Validity
I just made this one up today! (See how easy it is to be a methodologist?) I needed a term that described what both face and content validity are getting at. In essence, both of those validity types are attempting to assess the degree to which you accurately translated your construct into the operationalization, and hence the choice of name. Let's look at the two types of translation validity.

4.9. FACE VALIDITY
In face validity, you look at the operationalization and see whether "on its face" it seems like a good translation of the construct. This is probably the weakest way to try to demonstrate construct validity. For instance, you might look at a measure of math ability, read through the questions, and decide that yep, it seems like this is a good measure of math ability (i.e., the label "math ability" seems appropriate for this measure). Or, you might observe a teenage pregnancy prevention program and conclude that, "Yep, this is indeed a teenage pregnancy prevention program." Of course, if this is all you do to assess face validity, it would clearly be weak evidence because it is essentially a subjective judgment call. (Note that just because it is weak evidence doesn't mean that it is wrong. We need to rely on our subjective judgment throughout the research process. It's just that this form of judgment won't be very convincing to others.) We can improve the quality of face validity assessment considerably by making it more systematic. For instance, if you are trying to assess the face validity of a math ability measure, it would be more convincing if you sent the test to a carefully selected sample of experts on math ability testing and they all reported back with the judgment that your measure appears to be a good measure of math ability.

4.10. CONTENT VALIDITY
In content validity, you essentially check the operationalization against the relevant content domain for the construct. This approach assumes that you have a good detailed description of the content domain, something that's not always true. For instance, we might lay out all of the criteria that should be met in a program that claims to be a "teenage pregnancy prevention program." We would probably include in this domain specification the definition of the target group, criteria for deciding whether the program is preventive in nature (as opposed to treatment-oriented), and lots of criteria that spell out the content that should be included, like basic information on pregnancy, the use of abstinence, birth control methods, and so on. Then, armed with these criteria, we could use them as a type of checklist when examining our program. Only programs that meet the criteria can legitimately be defined as "teenage pregnancy prevention programs." This all sounds fairly straightforward, and for many operationalizations it will be. But for other constructs (e.g., self-esteem, intelligence), it will not be easy to decide on the criteria that constitute the content domain.

4.11. CRITERION-RELATED VALIDITY
In criterion-related validity, you check the performance of your operationalization against some criterion. How is this different from content validity? In content validity, the criteria are the construct definition itself -- it is a direct comparison. In criterion-related validity, we usually make a prediction about how the operationalization will perform based on our theory of the construct. The difference among the criterion-related validity types is in the criteria they use as the standard for judgment.

4.12. PREDICTIVE VALIDITY
In predictive validity, we assess the operationalization's ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see if there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity -- it would show that our measure can correctly predict something that we theoretically think it should be able to predict.
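A minimal sketch of such a check follows; it is not from the text, the statistics.correlation function needs Python 3.10 or later, and the measure scores and salary figures are hypothetical.

    # Sketch (not from the text): predictive validity as the correlation between
    # scores on a hypothetical math-ability measure and a later criterion.
    from statistics import correlation   # Python 3.10+

    math_scores = [62, 55, 70, 48, 66, 59, 73, 51]                         # hypothetical measure scores
    salaries = [71000, 64000, 80000, 58000, 76000, 66000, 83000, 60000]    # hypothetical criterion values

    print(round(correlation(math_scores, salaries), 2))   # a high r is evidence of predictive validity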
4.13. CONCURRENT VALIDITY
In concurrent validity, we assess the operationalization's ability to distinguish between groups that it should theoretically be able to
distinguish between. For example, if we come up with a way of assessing manic-depression, our measure should be able to distinguish between people who are diagnosed manic-depressive and those diagnosed paranoid schizophrenic. If we want to assess the concurrent validity of a new measure of empowerment, we might give the measure to both migrant farm workers and to the farm owners, theorizing that our measure should show that the farm owners are higher in empowerment. As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.
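A minimal sketch of this kind of group comparison follows; it is not from the text, and the empowerment scores for the two groups are hypothetical.

    # Sketch (not from the text): concurrent validity checked by seeing whether
    # the measure separates two groups it theoretically ought to separate.
    from statistics import mean, stdev

    farm_owners = [41, 38, 44, 40, 37, 42]    # hypothetical empowerment scores
    farm_workers = [29, 33, 27, 31, 30, 28]

    diff = mean(farm_owners) - mean(farm_workers)
    pooled_sd = ((stdev(farm_owners) ** 2 + stdev(farm_workers) ** 2) / 2) ** 0.5
    print(round(diff, 1), round(diff / pooled_sd, 2))   # mean difference and standardized effect size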
4.14. CONVERGENT VALIDITY
In convergent validity, we examine the degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to. For instance, to show the convergent validity of a Head Start program, we might gather evidence that shows that the program is similar to other Head Start programs. Or, to show the convergent validity of a test of arithmetic skills, we might correlate the scores on our test with scores on other tests that purport to measure basic math ability, where high correlations would be evidence of convergent validity.

prepared to measure whether students can perform multiplication, and the people to whom it is shown all agree that it looks like a good test of multiplication ability, this demonstrates face validity of the test. Face validity is often contrasted with content validity and construct validity.

Some people use the term face validity to refer only to the validity of a test to observers who are not expert in testing methodologies. For instance, if a test is designed to measure whether children are good spellers, and parents are asked whether the test is a good test, this measures the face validity of the test. If an expert is asked instead, some people would argue that this does not measure face validity. This distinction seems too careful for most applications. Generally, face validity means that the test "looks like" it will work, as opposed to "has been shown to work".
• Item analysis data are not synonymous with item validity. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect the internal consistency of items rather than validity.
• The discrimination index is not always a measure of item quality. There is a variety of reasons an item may have low discriminating power: (a) extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample

CHAPTER-5
TOOL CONSTRUCTION PROCEDURE

At this stage, various decisions are taken and certain activities are completed. But before doing all these, it is suggested by Stanley and Hopkins (p. 172) that certain principles should be taken into consideration. These principles of planning are as follows:
• Principle of adequate provision—outcomes of instruction
• Principle of emphasis of the course (approximately)
• Principle of purpose
• Principle of conditions under which the test is administered
The first principle is quite clear. It simply says that the test should be so designed that it measures the most important outcomes of instruction, or objectives. The second principle says that if the test has to cover a large amount of material, it is necessary to determine which part or aspect should receive what weightage in terms of number of items. Proper sampling has to be done, and it should reflect the relative importance of the various components of the course. The third principle explains that the purpose of the test should be made clear and kept in mind. It means clarifying whether the test will employ relative, absolute or criterion-related standards of achievement. The validity and reliability of the test are linked with these purposes, and different methods of establishing reliability and validity are used depending upon them. The fourth principle requires that decisions be taken regarding the number of items to be finally kept in the test, the time to be allowed for completing the test, how frequently the test will be administered, the format of the test, the kinds of items to be included, how the responses of the examinees will be recorded, and so on.

Steps of Planning
Planning of a performance test involves the following steps:
• Identifying the learning outcomes
• Defining the outcomes in terms of specific observable behaviours
• Outlining the subject matter content
• Preparing a table of specifications
• Using the table of specifications

Identifying Learning Outcomes
In constructing an objective-type performance test the first order of business is to identify the instructional objectives which are intended to be measured. This is a difficult job. However, one useful guide is the taxonomy of educational objectives. Learning objectives in other areas such as skills, attitudes and interests are measured by rating scales, checklists, anecdotal records, inventories and similar non-testing procedures. It is only cognitive domain objectives, i.e., knowledge and intellectual abilities and skills, which can be measured through a paper and pencil test. These cognitive objectives of Bloom's taxonomy are divided into two major areas: (1) knowledge and (2) intellectual abilities and skills. These are further divided into five areas which are as follows:
• Knowledge
1. Knowledge of specifics
2. Knowledge of terms
3. Knowledge of specific facts
4. Knowledge of ways and means of dealing with specifics
5. Knowledge of conventions
6. Knowledge of trends and sequences
7. Knowledge of classifications and categories
8. Knowledge of criteria
9. Knowledge of methodology
10. Knowledge of the universals and abstractions in a field
11. Knowledge of principles and generalizations
12. Knowledge of theories and structures
• Intellectual abilities and skills
(i) Comprehension (understanding the meaning)
(a) Translation
(b) Interpretation
(c) Extrapolation
(ii) Application
(iii) Analysis
(a) Analysis of elements
(b) Analysis of relationships
(c) Analysis of organizational principles
(iv) Synthesis
(a) Production of a unique communication
(b) Production of a plan or proposed set of operations
(c) Derivation of a set of abstract relations
(v) Evaluation
(a) Judgments in terms of internal evidence
(b) Judgments in terms of external criteria
All these objectives are arranged in order of increasing complexity. Subdivisions within each area are also in order of increasing complexity. The whole structure of objectives is hierarchical in nature. This taxonomy is useful in planning the performance test.

Defining Objectives in Specific Terms
Having identified the learning objectives, define them in specific behavioural terms which provide evidence that the outcomes have been achieved. For this purpose the objectives are written in sentences which use action verbs such as 'recognizes' and 'identifies'.
Outlining Subject Matter Content
Taxonomical learning objectives are general and may apply to any topic or area of the subject matter. A performance test is designed to measure these objectives, which cover students' abilities and skills, their mental development and their reactions, as well as their knowledge of the subject matter being taught. Therefore, it is essential to identify how much and which aspects of the subject matter will be covered by the test. Only the major elements need to be listed.

Preparing a Table of Specifications
Having identified the learning outcomes and outlined the course content, a table of specifications is prepared, which relates outcomes to content and indicates the relative weight to be assigned to each of the areas of the subject matter. The table ensures that the test will measure a representative sample of the learning outcomes and the subject matter content. Table 10.1 roughly illustrates how it is done.

Table 10.1: Specification of course content and learning outcomes [the table itself is not reproduced here]

The figures in the cells indicate the number of items (weightage) to be given to each area and the outcome.
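To make the idea of the blueprint concrete, here is a minimal sketch; it is not the book's Table 10.1, and the content areas, objectives and item counts in it are hypothetical.

    # Sketch (not the book's Table 10.1): a table of specifications represented
    # as a grid of content areas x objectives, each cell holding a number of items.
    blueprint = {
        "Number systems": {"Knowledge": 4, "Comprehension": 3, "Application": 3},
        "Algebra":        {"Knowledge": 3, "Comprehension": 4, "Application": 5},
        "Geometry":       {"Knowledge": 3, "Comprehension": 3, "Application": 2},
    }

    for area, cells in blueprint.items():
        print(area, cells, "row total =", sum(cells.values()))
    print("test total =", sum(sum(cells.values()) for cells in blueprint.values()))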
Using Table of Specifications
While preparing the performance test, items have to be constructed according to the table of specifications. Thus, the table serves as a guide, a blueprint for constructing the test.

Preparing the Test
Some of the most important points to be kept in mind while preparing the test, as mentioned by Stanley (1964), are as follows:
• Have more items in the first draft of the test than are to be kept in the final form
• Most of the items should have about 50 per cent difficulty level
• After a gap of some time, make a critical revision of the test
• The items should be arranged in ascending order of difficulty
• A regular sequence in the pattern of correct responses should be avoided
• The directions to the examinees should be as clear, complete and concise as possible

There are a variety of item types which can be chosen for constructing a performance test, such as completion type, true–false, matching and multiple-choice type. Of all these, the multiple-choice type tends to provide the highest quality items. Such items provide a more adequate measure of learning outcomes than the other types and, in addition, can measure a variety of outcomes ranging from simple to complex. Hence, the following section is devoted to listing a few rules for constructing multiple-choice items.

Rules of Constructing Multiple-choice Items
Following are the important rules:
• Present a single, clearly formulated problem in the stem of the item
• State the stem in simple, clear language
• Avoid repeating the same material over again in the alternatives
• State the stem in positive form, unless not possible
• If using the negative form of the stem, underline the negative wording
• Make certain that the intended answer is correct or clearly the best
• Make all alternatives grammatically consistent with the stem and parallel in form
• Avoid verbal clues which might enable students to select the correct answer or to eliminate an incorrect alternative
Similar wording in both the stem and the correct answer, stating the correct answer in textbook language or stereotyped phraseology, stating the correct answer in greater detail, including absolute terms (e.g., all, any, never, only) in the distracters, including two responses that are all-inclusive, and including two responses that have the same meaning should all be avoided.
• Avoid use of the alternatives 'all of the above' and 'none of the above'
• Vary the position of the correct answer in a random manner
• Control the difficulty of the item by varying the problem either in the stem or by changing the alternatives
(For explanation and examples of these rules consult N. E. Gronlund, Constructing Achievement Tests, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1968.)
Having written out the items, they are arranged and assembled in the form of a complete test. The completed test is then reviewed and shortcomings are removed. After this the directions for the examinees are prepared.

The directions should contain information about: (i) the purpose of the test, (ii) the time allowed for completing the test, (iii) the way of recording the answers (on a separate answer sheet or in the test booklet), (iv) an instruction not to guess, and so on.

The test and directions, thus prepared, are then reproduced in the form of photocopied, cyclostyled or printed material.

5.2. ITEM ANALYSIS
For the purpose of item analysis the preliminary form of the test is administered to a representative sample of the population for which it is meant. Then it is scored and all the test booklets are arranged serially in a pile according to the size of the scores, the topmost score being at the top.

Purpose of item-analysis: The purpose is to find out how the items and the distractors of the items in the test are working. In other words, to find out which items are good and which are bad, so that the bad items, which are ineffective, may be eliminated and, finally, a test of good items may be constructed.

Bases of item-analysis: The item-analysis is done based on the following:
• Item-difficulty index
• Item-validity index
• Effectiveness of the distractors index
These three statistical indices constitute the criteria on the basis of which an item is selected or rejected.

Procedure of item-analysis: There are a number of item-analysis procedures that might be applied. Downie (1967) presents a detailed discussion of these. But the most simple, popular and effective procedure is described here and is illustrated by taking an example of 47 test papers arranged serially from top to bottom according to the magnitude of the total scores. The following steps are involved: By taking approximately one-third of the total from the top and one-third from the bottom, top and bottom groups are formed and the middle group is set aside. Out of the group of 47, a group of the top 15 and another group of the bottom 15 are formed; the 17 papers in the middle are set aside.

For each item, the number of students in the top and bottom groups selecting each alternative is counted and recorded. The results for each item are presented item-wise on a sheet of paper [the specimen tabulation is not reproduced here].

Such tables are prepared for all the items. After this, item statistics (the difficulty index and the validity index) are calculated for all the items as follows:

Item difficulty: This is calculated by applying the formula:
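The book's formula is not reproduced at this point, so the following minimal sketch uses the common upper/lower-group definitions (difficulty as the proportion correct across both groups; discrimination, the validity index, as the difference between the groups); the counts are hypothetical.

    # Sketch: item difficulty and discrimination from top/bottom groups of equal
    # size (15 and 15, as in the 47-paper example above). r_upper and r_lower are
    # the numbers answering the item correctly in the upper and lower groups.
    def item_difficulty(r_upper, r_lower, n_per_group):
        return (r_upper + r_lower) / (2 * n_per_group)    # proportion answering correctly

    def item_discrimination(r_upper, r_lower, n_per_group):
        return (r_upper - r_lower) / n_per_group          # discrimination (validity) index

    r_u, r_l, n = 12, 5, 15                               # hypothetical counts for one item
    print(round(item_difficulty(r_u, r_l, n), 2), round(item_discrimination(r_u, r_l, n), 2))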