

Measurement and Evaluation
in Education

Dr. A. Sivakumar
G. Thirumoorthy

A.P.H. PUBLISHING CORPORATION


4435–36/7, ANSARI ROAD, DARYA GANJ,
NEW DELHI-110002
Published by
S.B. Nangia
A.P.H. Publishing Corporation
4435–36/7, Ansari Road, Darya Ganj,
New Delhi-110002
Phone: 011–23274050
e-mail: [email protected]
2019

© Reserved

Typeset by
Ideal Publishing Solutions
C-90, J.D. Cambridge School,
West Vinod Nagar, Delhi-110092

Printed at
BALAJI OFFSET
Navin Shahdara, Delhi-110032


PREFACE

Educational measurement and evaluation refers to the use of educational assessments and the analysis of data, such as scores obtained from educational assessments, to understand the abilities of students. The approaches overlap with those of psychometrics. Educational measurement and evaluation is the assigning of numerals to traits such as achievement, interest, attitudes, aptitudes, intelligence and performance. The purpose of principles and training in educational measurement and evaluation is typically to measure abilities and levels of attainment of students in areas such as reading, writing, mathematics, science, humanities, social science and so forth. Traditionally, attention focuses on whether assessments are reliable and valid. Educational measurement and evaluation is largely concerned with the analysis of data from educational assessments or tests. Typically, this means using total scores on assessments, whether they are multiple choice or open-ended and marked using marking instructions or guides. In technical terms, the pattern of scores of students on individual items is used to infer so-called scale locations of students, the "measurements". This process is one form of scaling. Essentially, higher total scores give higher scale locations, consistent with the traditional and everyday use of total scores. One of the aims of applying theory and techniques in educational measurement and evaluation is to try to place the results of different tests administered to different groups of students on a single or common scale through processes known as test equating. The rationale is that, because different assessments usually have different difficulties, the total scores cannot be directly compared. The goal of trying to place the results on a common scale is to allow comparison of the scale locations inferred from the totals via scaling processes. This book is specially prepared for the students and teachers of Master of Arts, Master of Science, Master of Philosophy in Educational Technology & Education and Doctor of Philosophy in Educational Technology & Education. It consists of five units which deal with Measurement and evaluation, Research instruments, Construction of tests, Methods of reliability and validity, and Tool construction procedure. This book is dedicated to all students. Suggestions and comments to improve the contents of this book will be welcomed.
Dr. A. Sivakumar
M.Sc., M.Ed., M.Phil., Ph.D.
Assistant Professor,
KSR College of Education
and
G. Thirumoorthy
M.Sc., M.Ed., M.Phil.
Assistant Professor,
Michael Job Memorial College of Education for Women


CONTENTS

Chapter-1: Measurement and Evaluation in Education 1
1.1. Measurement and Evaluation 1
1.2. Concept of Evaluation 2
1.3. Meaning of Evaluation 3
1.4. Item Formats 29
1.5. Multiple Choices 29
1.6. Matching 31
1.7. Guidelines for Item Preparation 35

Chapter-2: Research Instruments 37
2.1. Kinds of Instruments 37
2.1.1. Questionnaire 37
2.1.2. Opinionnaire 41
2.1.3. Check List 42
2.1.4. Inventory 44
2.1.5. Test 45
2.1.6. Schedule 52
2.1.7. Attitude Scale 53
2.1.8. Meaning and Definition of Rating Scale 56
2.1.9. Score Card 62

Chapter-3: Construction of Test 63
3.1. Test Construction 63
3.2. Characteristics of a Good Test 72
3.2.1. Validity 72
3.2.2. Reliability 73
3.2.3. Objectivity 81

Chapter-4: Methods of Reliability and Validity 84
4.1. Reliability Methods 84
4.2. Split Half 91
4.3. Spearman-Brown Prophecy Formula 95
4.4. Test-retest Reliability 97
4.5. Equivalent-Forms or Alternate-Forms Reliability 98
4.6. Alternate or Parallel Forms Method 99
4.7. Rational Equivalence Method 99
4.8. Construct Validity 101
4.9. Face Validity 102
4.10. Content Validity 102
4.11. Criterion-Related Validity 103
4.12. Predictive Validity 103
4.13. Concurrent Validity 103
4.14. Convergent Validity 104
4.15. Discriminant Validity 104

Chapter-5: Tool Construction Procedure 106
5.1. Steps of Tool Construction 106
5.2. Item Analysis 118
5.3. Nature of Achievement Tests 162
5.4. Internal Consistency 167
5.5. Reliability and Validity of Achievement Test 169

References 175


ABOUT THE AUTHORS

Dr. A. Sivakumar holds the degrees of M.Sc., M.Ed., M.Phil. (Education) and Ph.D. (Education). He is working as an Assistant Professor at KSR College of Education, Tiruchengode, Namakkal. He has been serving as Guest Lecturer, Department of Educational Technology, Bharathiar University, Coimbatore, Tamil Nadu. He is specialized in Computer Education, Educational Technology, Educational Psychology, Teacher Education and the SPSS package. He has published 32 papers and participated in 23 conferences in Tamil Nadu. He has published 7 books in the field of education.

G. Thirumoorthy holds the degrees of M.Sc., M.Ed. and M.Phil. (Geography). He is working as an Assistant Professor at Michael Job Memorial College of Education for Women, Coimbatore. His areas of interest are Environmental Education, Educational Technology, Educational Psychology and Teacher Education. He has published 10 papers and participated in 15 conferences in Tamil Nadu.
Chapter-1
MEASUREMENT AND EVALUATION IN
EDUCATION

1.1. MEASUREMENT AND EVALUATION


Meaning of Measurement
• It is the collection of information in numeric form.
• It is the record of performance or the information which is required
to make judgment.
According to R.N. Patel: Measurement is an act or process that
involves the assignment of numerical values to whatever is being tested.
So it involves the quantity of something.

Nature of Measurement
• It should be quantitative in nature
• It must be precise and accurate (instrument)
• It must be reliable
• It must be valid
• It must be objective in nature
Measurement refers to the process by which the attributes or
dimensions of some physical object are determined. One exception
seems to be in the use of the word measure in determining the IQ
of a person. The phrase, “this test measures IQ” is commonly used.
Measuring such things as attitudes or preferences also applies. However,
when we measure, we generally use some standard instrument to
determine how big, tall, heavy, voluminous, hot, cold, fast, or straight
something actually is. Standard instruments refer to physical devices
such as rulers, scales, thermometers, pressure gauges, etc. We measure
to obtain information about what is. Such information may or may not be useful, depending on the accuracy of the instruments we use, and our skill at using them. There are few such instruments in the social sciences that approach the validity and reliability of, say, a 12" ruler. We measure how big a classroom is in terms of square feet, we measure the temperature of the room by using a thermometer, and we use an Ohm meter to determine the voltage, amperage, and resistance in a circuit. In all of these examples, we are not assessing anything; we are simply collecting information relative to some established rule or standard. Assessment is therefore quite different from measurement, and has uses that suggest very different purposes. When used in a learning objective, the definition provided on the ADPRIMA site for the behavioral verb measure is: To apply a standard scale or measuring device to an object, series of objects, events, or conditions, according to practices accepted by those who are skilled in the use of the device or scale. An important point in the definition is that the person be skilled in the use of the device or scale. For example, a person who has in his or her possession a working Ohm meter, but does not know how to use it properly, could apply it to an electrical circuit, but the obtained results would mean little or nothing in terms of useful information.
The process of measurement, as the word implies, involves carrying out actual measurement in order to assign a quantitative meaning to a quality. Measurement is therefore a process of assigning numerals to objects, quantities or events in order to give quantitative meaning to such qualities. In the classroom, to determine a child's performance, you need to obtain quantitative measures on the individual scores of the child. If the child scores 80 in Mathematics, there is no other interpretation you should give it; you cannot say the student has passed or failed. Measurement stops at ascribing the quantity; it does not make a value judgment on the child's performance.

1.2. CONCEPT OF EVALUATION

• Science of providing information for decision making
• Includes measurement, assessment and testing
• Information gathering
• Information processing
• Judgment forming
• Decision making
• Evaluation is a concept that has emerged as a prominent process of assessing, testing and measuring. Its main objective is qualitative improvement.
• Evaluation is a process of making value judgements over a level of performance or achievement. Making value judgements in the evaluation process presupposes a set of objectives.
• Evaluation is the process of determining the extent to which the objectives are achieved.
• It is concerned not only with the appraisal of achievement, but also with its improvement.
• Evaluation is continuous and dynamic. Evaluation helps in forming the following decisions.

Types of Decisions
• Instructional
• Curricular
• Selection
• Placement or Classification
• Personal

1.3. MEANING OF EVALUATION

• It is a technique by which we come to know to what extent the objectives are being achieved.
• It is a decision making process which assists in grading and ranking.
According to Barrow and McGee, it is the process of education that involves collection of data from the products which can be used for comparison with preconceived criteria to make judgment.

Nature of Evaluation
• It is a systematic process
• It is a continuous, dynamic process
• Identifies strengths and weaknesses of the program
• Involves a variety of tests and techniques of measurement
• Emphasis on the major objective of an educational program
• Based upon the data obtained from the test
• It is a decision making process
Evaluation is perhaps the most complex and least understood of the terms. Inherent in the idea of evaluation is "value." When we evaluate, what we are doing is engaging in some process that is designed to provide information that will help us make a judgment about a given situation. Generally, any evaluation process requires information about the situation in question. A situation is an umbrella term that takes into account such ideas as objectives, goals, standards, procedures, and so on. When we evaluate, we are saying that the process will yield information regarding the worthiness, appropriateness, goodness, validity, legality, etc., of something for which a reliable measurement or assessment has been made. For example, I often ask my students: if they wanted to determine the temperature of the classroom, they would need to get a thermometer and take several readings at different spots, and perhaps average the readings. That is simple measuring. The average temperature tells us nothing about whether or not it is appropriate for learning. In order to do that, students would have to be polled in some reliable and valid way. That polling process is what evaluation is all about. A classroom average temperature of 75 degrees is simply information. It is the context of the temperature for a particular purpose that provides the criteria for evaluation. A temperature of 75 degrees may not be very good for some students, while for others, it is ideal for learning. We evaluate every day. Teachers, in particular, are constantly evaluating students, and such evaluations are usually done in the context of comparisons between what was intended (learning, progress, behavior) and what was obtained. When used in a learning objective, the definition provided on the ADPRIMA site for the behavioral verb evaluate is: To classify objects, situations, people, conditions, etc., according to defined criteria of quality. Indication of quality must be given in the defined criteria of each class category. Evaluation differs from general classification only in this respect.
Work in curriculum development and general methods in education gave importance to objectives in education and also distinguished between instructional and behavioural objectives. Curriculum implementation and lesson delivery often culminate in ascertaining whether the objectives set out to be achieved were actually achieved. This is often called evaluation. This unit introduces some important concepts associated with ascertaining whether objectives have been achieved or not. Basically, the unit takes us through the meanings of test, measurement, assessment and evaluation in education. Evaluation is the structured interpretation and giving of meaning to predicted or actual impacts of proposals or results. It looks at original objectives, and at what is either predicted or what was accomplished and how it was accomplished. So evaluation can be formative, that is, taking place during the development of a concept or proposal, project or organisation, with the intention of improving the value or effectiveness of the proposal, project, or organisation. It can also be summative, drawing lessons from a completed action or project or an organisation at a later point in time or circumstance. Evaluation is inherently a theoretically informed approach (whether explicitly or not), and consequently any particular definition of evaluation would have to be tailored to its context – the theory, needs, purpose, and methodology of the evaluation process itself. Having said this, evaluation has been defined as:
• A systematic, rigorous, and meticulous application of scientific methods to assess the design, implementation, improvement, or outcomes of a program. It is a resource-intensive process, frequently requiring resources such as evaluative expertise, labour, time, and a sizable budget.
• "The critical assessment, in as objective a manner as possible, of the degree to which a service or its component parts fulfils stated goals" (St Leger and Wordsworth-Bell). The focus of this definition is on attaining objective knowledge, and scientifically or quantitatively measuring predetermined and external concepts.
• "A study designed to assist some audience to assess an object's merit and worth" (Stufflebeam). In this definition the focus is on facts as well as value-laden judgments of the program's outcomes and worth.

The Concepts of Test, Measurement, and Evaluation in Education
These concepts are often used interchangeably by practitioners as if they have the same meaning. This is not so. As a teacher, you
should be able to distinguish one from the other and use any particular one at the appropriate time to discuss issues in the classroom.

Measurement
The process of measurement, as the word implies, involves carrying out actual measurement in order to assign a quantitative meaning to a quality, i.e. what is the length of the chalkboard? Determining this must be physically done. Measurement is therefore a process of assigning numerals to objects, quantities or events in order to give quantitative meaning to such qualities. In the classroom, to determine a child's performance, you need to obtain quantitative measures on the individual scores of the child. If the child scores 80 in Mathematics, there is no other interpretation you should give it; you cannot say he has passed or failed.

Evaluation
Evaluation adds the ingredient of value judgement to assessment. It is concerned with the application of its findings and implies some judgement of the effectiveness, social utility or desirability of a product, process or progress in terms of carefully defined and agreed upon objectives or values. Evaluation often includes recommendations for constructive action. Thus, evaluation is a qualitative measure of the prevailing situation. It calls for evidence of effectiveness, suitability, or goodness of the programme. It is the estimation of the worth of a thing, process or programme in order to reach meaningful decisions about that thing, process or programme.

The Purposes of Evaluation
According to Ogunniyi (1984), educational evaluation is carried out from time to time for the following purposes:
(i) To determine the relative effectiveness of the programme in terms of students' behavioural output;
(ii) To make reliable decisions about educational planning;
(iii) To ascertain the worth of time, energy and resources invested in a programme;
(iv) To identify students' growth or lack of growth in acquiring desirable knowledge, skills, attitudes and societal values;
(v) To help teachers determine the effectiveness of their teaching techniques and learning materials;
(vi) To help motivate students to want to learn more as they discover their progress or lack of progress in given tasks;
(vii) To encourage students to develop a sense of discipline and systematic study habits;
(viii) To provide educational administrators with adequate information about teachers' effectiveness and school needs;
(ix) To acquaint parents or guardians with their children's performances;
(x) To identify problems that might hinder or prevent the achievement of set goals;
(xi) To predict the general trend in the development of the teaching-learning process;
(xii) To ensure an economical and efficient management of scarce resources;
(xiii) To provide an objective basis for determining the promotion of students from one class to another as well as the award of certificates;
(xiv) To provide a just basis for determining at what level of education the possessor of a certificate should enter a career.
Other definitions of evaluation as given by practitioners are:
1. A systematic process of determining what the actual outcomes are, but it also involves judgment of the desirability of whatever outcomes are demonstrated (Travers, 1955).
2. The process of ascertaining the decision of concern, selecting appropriate information and collecting and analyzing information in order to report summary data useful to decision makers in selecting among alternatives (Alkin, 1970).
3. The process of delineating, obtaining and providing useful information for judging decision alternatives (Stufflebeam et al., 1971).

Types of Evaluation
There are two main levels of evaluation, viz. programme level and student level. Each of the two levels can involve either of the two
main types of evaluation – formative and summative – at various stages. Programme evaluation has to do with the determination of whether a programme has been successfully implemented or not. Student evaluation determines how well a student is performing in a programme of study.

Formative Evaluation
The purpose of formative evaluation is to find out whether, after a learning experience, students are able to do what they were previously unable to do. Its ultimate goal is usually to help students perform well at the end of a programme. Formative evaluation enables the teacher to:
1. Draw more reliable inferences about his students than an external assessor, although he may not be as objective as the latter;
2. Identify the levels of cognitive process of his students;
3. Choose the most suitable teaching techniques and materials;
4. Determine the feasibility of a programme within the classroom setting;
5. Determine areas needing modification or improvement in the teaching-learning process; and
6. Determine to a great extent the outcome of summative evaluation. (Ogunniyi, 1984)
Thus, formative evaluation attempts to:
(i) Identify the content (i.e. knowledge or skill) which has not been mastered by the students;
(ii) Appraise the level of cognitive abilities such as memorization, classification, comparison, analysis, explanation, quantification, application and so on; and
(iii) Specify the relationships between content and levels of cognitive abilities.
In other words, formative evaluation provides the evaluator with useful information about the strengths or weaknesses of the student within an instructional context.

Summative Evaluation
Summative evaluation often attempts to determine the extent to which the broad objectives of a programme have been achieved (i.e. SSSCE (NECO or WAEC), promotion, Grade Two, NABTEB and other public examinations). It is concerned with purposes, progress and outcomes of the teaching-learning process. Summative evaluation is judgmental in nature and often carries threat with it, in that the student may have no knowledge of the evaluator and failure has a far-reaching effect on the students. However, it is more objective than formative evaluation. Some of the underlying assumptions of summative evaluation are that:
1. The programme's objectives are achievable;
2. The teaching-learning process has been conducted efficiently;
3. The teacher-student-material interactions have been conducive to learning;
4. The teaching techniques, learning materials and audio-visual aids are adequate and have been judiciously dispensed; and
5. There is uniformity in classroom conditions for all learners.

Factors to be Considered for Successful Evaluation
1. Sampling technique – an appropriate sampling procedure must be adopted.
2. Evaluation itself must be well organized.
   • Treatment
   • Conducive atmosphere
   • Intended and unintended outcomes and their implications considered.
3. Objectivity of the instrument.
   • Feasibility of the investigation
   • Resolution of ethical issues
   • Reliability of the test (accuracy of data in terms of stability, repeatability and precision)
   • Validity – the test should measure what it is supposed to measure, and the characteristics to be measured must be reflected.
4. Rationale of the evaluation instrument.
5. It must be ensured that disparities in students' performances are related to the content of the test rather than to the techniques used in administering the instrument.
6. The format used must be the most economical and efficient.
7. Teachers must have been adequately prepared. They must be qualified to teach the subjects allotted to them.

Characteristics of Evaluation
1. It involves assessment of all the teaching-learning outcomes in terms of overall behavioural changes. It goes beyond the knowledge objectives to cover skill, application, interest, attitude and appreciation objectives. Therefore, the area and field of testing the stipulated objectives has been greatly increased by adopting this new term.
2. It involves forming judgments and taking decisions about the child's progress and the difficulties encountered by him, and taking corrective measures to improve his learning.
3. Evaluation requires interpretation of data in a careful manner.
4. Evaluation is continuous. It is not confined to one particular class or stage of education. It is to be conducted continuously as the student passes from one stage to another, from one class to another, and from one school to another. It starts at the time the child seeks admission to a particular grade in the form of placement evaluation; it continues as the child proceeds from one unit of instruction to another in the form of formative and diagnostic evaluation, and ends in summative evaluation at the end of instruction in a particular grade.
5. Evaluation is comprehensive. It is not simply concerned with the academic status of the student but with all aspects of his growth, which include both cognitive and non-cognitive aspects.
6. Evaluation includes all the means of collecting information about the student's learning. The evaluator should make use of tests, observation, interviews, rating scales, check lists and value judgement to gather complete and reliable information about the students.

Measurement Scales
Measurement scales are used to categorize and/or quantify variables. The four scales of measurement commonly used in statistical analysis are: nominal, ordinal, interval, and ratio scales.

Properties of Measurement Scales
Each scale of measurement satisfies one or more of the following properties of measurement.
• Identity – Each value on the measurement scale has a unique meaning.
• Magnitude – Values on the measurement scale have an ordered relationship to one another. That is, some values are larger and some are smaller.
• Equal intervals – Scale units along the scale are equal to one another. This means, for example, that the difference between 1 and 2 would be equal to the difference between 19 and 20.
• A minimum value of zero – The scale has a true zero point, below which no values exist.

Nominal Scale of Measurement
The nominal scale of measurement only satisfies the identity property of measurement. Values assigned to variables represent a descriptive category, but have no inherent numerical value with respect to magnitude. Gender is an example of a variable that is measured on a nominal scale. Individuals may be classified as "male" or "female", but neither value represents more or less "gender" than the other. Religion and political affiliation are other examples of variables that are normally measured on a nominal scale. Nominal scales are used for labeling variables, without any quantitative value. "Nominal" scales could simply be called "labels." Here are some examples below. Notice that all of these scales are mutually exclusive (no overlap) and none of them have any numerical significance. A good way to remember all of this is that "nominal" sounds a lot like "name", and nominal scales are kind of like "names" or labels.

[Figure: Examples of Nominal Scales]

Note: a sub-type of nominal scale with only two categories (e.g. male/female) is called "dichotomous." If you are a student, you can use that to impress your teacher.
Ordinal Scale of Measurement
With ordinal scales, it is the order of the values that is important and significant, but the differences between them are not really known. Take a look at the example below. In each case, we know that a #4 is better than a #3 or a #2, but we don't know–and cannot quantify–how much better it is. For example, is the difference between "OK" and "Unhappy" the same as the difference between "Very Happy" and "Happy"? We can't say. Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, discomfort, etc. "Ordinal" is easy to remember because it sounds like "order", and that's the key to remember with "ordinal scales"–it is the order that matters, but that's all you really get from these.
Advanced note: The best way to determine central tendency on a set of ordinal data is to use the mode or median; the mean cannot be defined from an ordinal set.

[Figure: Example of Ordinal Scales]

The ordinal scale has the property of both identity and magnitude. Each value on the ordinal scale has a unique meaning, and it has an ordered relationship to every other value on the scale. An example of an ordinal scale in action would be the results of a horse race, reported as "win", "place", and "show". We know the rank order in which the horses finished the race. The horse that won finished ahead of the horse that placed, and the horse that placed finished ahead of the horse that showed. However, we cannot tell from this ordinal scale whether it was a close race or whether the winning horse won by a mile.

Interval Scale of Measurement
Interval scales are numeric scales in which we know not only the order, but also the exact differences between the values. The classic example of an interval scale is Celsius temperature, because the difference between each value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees. Time is another good example of an interval scale in which the increments are known, consistent, and measurable. Interval scales are nice because the realm of statistical analysis on these data sets opens up. For example, central tendency can be measured by mode, median, or mean; standard deviation can also be calculated. Like the others, you can remember the key points of an "interval scale" pretty easily. "Interval" itself means "space in between," which is the important thing to remember–interval scales not only tell us about order, but also about the value between each item. Here's the problem with interval scales: they don't have a "true zero." For example, there is no such thing as "no temperature." Without a true zero, it is impossible to compute ratios. With interval data, we can add and subtract, but cannot multiply or divide. Confused? Ok, consider this: 10 degrees + 10 degrees = 20 degrees. No problem there. 20 degrees is not twice as hot as 10 degrees, however, because there is no such thing as "no temperature" when it comes to the Celsius scale. I hope that makes sense. Bottom line, interval scales are great, but we cannot calculate ratios, which brings us to our last measurement scale.

[Figure: Example of Interval Scale]

The interval scale of measurement has the properties of identity, magnitude, and equal intervals. A perfect example of an interval scale is the Fahrenheit scale to measure temperature. The scale is made up of equal temperature units, so that the difference between 40 and 50 degrees Fahrenheit is equal to the difference between 50 and 60 degrees Fahrenheit.
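To make the distinction between these scales concrete, the short Python sketch below is an illustration added here (it is not part of the original text, and the data are invented). It shows which summary statistics are defensible at each level so far: counts and the mode for nominal data, the median for ordinal ratings, and differences, but not ratios, for interval readings such as Celsius temperatures unless the values are first shifted to a scale with a true zero, such as Kelvin.

from collections import Counter
import statistics

# Nominal data (identity only): political affiliation of ten respondents.
affiliation = ["A", "B", "A", "C", "B", "A", "A", "C", "B", "A"]
print("Frequencies:", dict(Counter(affiliation)))    # counts and mode are meaningful
# Coding A=1, B=2, C=3 and averaging would be meaningless on a nominal scale.

# Ordinal data (identity + magnitude): satisfaction coded 1 (Unhappy) to 5 (Very Happy).
ratings = [2, 4, 3, 5, 4, 1, 4]
print("Median rating:", statistics.median(ratings))  # order is meaningful
print("Modal rating:", statistics.mode(ratings))
# A mean assumes equal spacing between categories, which ordinal data do not guarantee.

# Interval data (equal intervals, no true zero): Celsius temperatures.
t1, t2 = 10.0, 20.0
print("Difference:", t2 - t1, "degrees")             # meaningful
print("Naive ratio:", t2 / t1)                       # 2.0, but NOT "twice as hot"
k1, k2 = t1 + 273.15, t2 + 273.15                    # Kelvin has a true zero
print("Ratio on the Kelvin scale:", round(k2 / k1, 3))

On a ratio scale such as the weight scale introduced next, a ratio of two values would itself be meaningful, with no such conversion needed.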
Ratio Scale of Measurement
Ratio scales are the ultimate nirvana when it comes to measurement scales because they tell us about the order, they tell us the exact value between units, AND they also have an absolute zero–which allows for a wide range of both descriptive and inferential statistics to be applied. At the risk of repeating myself, everything above about interval data applies to ratio scales, plus ratio scales have a clear definition of zero. Good examples of ratio variables include height and weight.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be meaningfully added, subtracted, multiplied and divided (ratios). Central tendency can be measured by mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of variation, can also be calculated from ratio scales.

[Figure: This device provides two examples of ratio scales (height and weight)]

The ratio scale of measurement satisfies all four of the properties of measurement: identity, magnitude, equal intervals, and a minimum value of zero. The weight of an object would be an example of a ratio scale. Each value on the weight scale has a unique meaning, weights can be rank ordered, units along the weight scale are equal to one another, and the scale has a minimum value of zero. Weight scales have a minimum value of zero because objects at rest can be weightless, but they cannot have negative weight.

Characteristics of Good Evaluation

Characteristics of a Good Evaluation Tool
1. Objective-basedness: Evaluation is making judgement about some phenomena or performance on the basis of some pre-determined objectives. Therefore a tool meant for evaluation should measure attainment in terms of criteria determined by instructional objectives. This is possible only if the evaluator is definite about the objectives, the degree of realization of which he is going to evaluate. Therefore each item of the tool should represent an objective.
2. Comprehensiveness: A tool should cover all points expected to be learnt by the pupils. It should also cover all the pre-determined objectives. This is referred to as comprehensiveness.
3. Discriminating power: A good evaluation tool should be able to discriminate between the respondents on the basis of the phenomena measured. Hence, while constructing a tool for evaluation, the discriminating power has to be taken care of. This may be at two levels – first for the test as a whole and then for each item included.
4. Reliability: Reliability of a tool refers to the degree of consistency and accuracy with which it measures what it is intended to measure. If the evaluation gives more or less the same result every time it is used, such evaluation is said to be reliable. Consistency of a tool can be improved by limiting subjectivity of all kinds. Making items on the basis of pre-determined specific objectives, ensuring that the expected answers are definite and objective, providing a clearly spelt-out scheme for scoring, and conducting evaluation under identical and ideal conditions will help in enhancing reliability. The test-retest method, the split-half method and the equivalent-form or parallel-form method are the important methods generally used to determine the reliability of a tool (a worked split-half sketch follows this list).
5. Validity: Validity is the most important quality needed for an evaluation tool. If the tool is able to measure what it is intended to measure, it can be said that the tool is valid. It should fulfil the objectives for which it is developed. Validity can be defined as "the accuracy with which it measures what it is intended to measure, or the degree to which it approaches infallibility in measuring what it purports to measure." Content validity, predictive validity, construct validity, concurrent validity, congruent validity, factorial validity, criterion-related validity, etc. are some of the important types of validity which need to be fulfilled by a tool for evaluation.
6. Objectivity: A tool is said to be objective if it is free from personal bias in interpreting its scope as well as in scoring the responses. Objectivity is one of the primary pre-requisites required for maintaining all other qualities of a good tool.
7. Practicability: A tool, however well it satisfies all the above criteria, may be useless if it is not practically feasible. For example, suppose that, in order to ensure comprehensiveness, it was felt that a thousand items should be given to be answered in ten hours. This may yield a valid result, but from a practical point of view it is quite impossible.
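As a hedged illustration of the split-half method mentioned in point 4 above (the score matrix, the odd/even split and the helper function below are invented for this sketch and are not taken from the book), the following Python code correlates two half-test scores and then steps the result up to an estimate of full-test reliability with the standard Spearman-Brown prophecy formula, r_full = 2r / (1 + r).

from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical item scores (1 = correct, 0 = wrong) for 6 pupils on 8 items.
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
]

# Split the test into odd- and even-numbered items and total each half per pupil.
odd_half  = [sum(pupil[0::2]) for pupil in scores]
even_half = [sum(pupil[1::2]) for pupil in scores]

r_half = pearson_r(odd_half, even_half)
r_full = 2 * r_half / (1 + r_half)      # Spearman-Brown correction

print("Half-test correlation:", round(r_half, 3))
print("Estimated full-test reliability:", round(r_full, 3))

The same correlation routine could equally be pointed at two administrations of the same test (test-retest) or at two parallel forms; only the pair of score lists changes.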
Formative and Summative Evaluation
Probably the most basic distinction is that between formative evaluation and summative evaluation. Defining the terms "formative" and "summative" does not have to be difficult, yet the definitions have become confusing in the past few years. This is especially true for formative assessment. In a balanced assessment system, both summative and formative assessments are an integral part of information gathering. Depend too much on one or the other and the reality of student achievement in your classroom becomes unclear.
Formative evaluation: This is evaluation that is carried out while a course, curriculum, educational package, etc. is actually being developed, its main purpose being to find out whether it needs to be, and if so, whether it realistically can be, improved. The key feature of all such evaluation is that it is designed to bring about improvement of the course, curriculum or educational package while it is still possible to do so, i.e. while the material has not yet been put into its operational form. In the case of a major course that is to be run throughout a country or internationally, such evaluation must clearly be carried out before the course design is finalised, the necessary resource materials mass produced, and the course implemented. In the case of an educational package, it must be carried out before the final package is published.
Summative evaluation: This is evaluation that is carried out once the development phase of a course, curriculum, educational package, etc. has been completed, and the course, curriculum or package is ready to use in its final form. The object of such evaluation is to determine whether it meets its design criteria, i.e. whether it does the job for which it was designed. Summative evaluation may also be carried out in order to compare one course, curriculum, educational package, etc. with another (or several others), e.g. to compare the relative effectiveness of two different courses in the same general area, or to determine which of a number of different textbooks is most suitable for use in a particular course. In such evaluation, the object is not to improve the courses or textbooks being evaluated; rather, it is to choose between them.

Formative and Summative Assessments in the Classroom
Summative Assessments: Are given periodically to determine, at a particular point in time, what students know and do not know. Many associate summative assessments only with standardized tests such as state assessments, but they are also used in and are an important part of district and classroom programs. Summative assessment at the district/classroom level is an accountability measure that is generally used as part of the grading process. The list is long, but here are some examples of summative assessments:
• State assessments
• District benchmark or interim assessments
• End-of-unit or chapter tests
• End-of-term or semester exams
• Scores that are used for accountability for schools (AYP) and students (report card grades).
The key is to think of summative assessment as a means to gauge, at a particular point in time, student learning relative to content standards. Although the information that is gleaned from this type of assessment is important, it can only help in evaluating certain aspects of the learning process. Because they are spread out and occur after instruction every few weeks, months, or once a year, summative assessments are tools to help evaluate the effectiveness of programs, school improvement goals, alignment of curriculum, or student placement in specific programs. Summative assessments happen too far down the learning path to provide information at the classroom level and to make instructional
adjustments and interventions during the learning process. It takes formative assessment to accomplish this.
Formative Assessment: Is part of the instructional process. When incorporated into classroom practice, it provides the information needed to adjust teaching and learning while they are happening. In this sense, formative assessment informs both teachers and students about student understanding at a point when timely adjustments can be made. These adjustments help to ensure students achieve targeted standards-based learning goals within a set time frame. Although formative assessment strategies appear in a variety of formats, there are some distinct ways to distinguish them from summative assessments.
One distinction is to think of formative assessment as "practice." We do not hold students accountable in "grade book fashion" for skills and concepts they have just been introduced to or are learning. We must allow for practice. Formative assessment helps teachers determine next steps during the learning process as the instruction approaches the summative assessment of student learning. A good analogy for this is the road test that is required to receive a driver's license. What if, before getting your driver's license, you received a grade every time you sat behind the wheel to practice driving? What if your final grade for the driving test was the average of all of the grades you received while practicing? Because of the initial low grades you received during the process of learning to drive, your final grade would not accurately reflect your ability to drive a car. In the beginning of learning to drive, how confident or motivated to learn would you feel? Would any of the grades you received provide you with guidance on what you needed to do next to improve your driving skills? Your final driving test, or summative assessment, would be the accountability measure that establishes whether or not you have the driving skills necessary for a driver's license—not a reflection of all the driving practice that leads to it. The same holds true for classroom instruction, learning, and assessment.
Another distinction that underpins formative assessment is student involvement. If students are not involved in the assessment process, formative assessment is not practiced or implemented to its full effectiveness. Students need to be involved both as assessors of their own learning and as resources to other students. There are numerous strategies teachers can implement to engage students. In fact, research shows that involvement in and ownership of their work increases students' motivation to learn. This does not mean the absence of teacher involvement. To the contrary, teachers are critical in identifying learning goals, setting clear criteria for success, and designing assessment tasks that provide evidence of student learning.
One of the key components of engaging students in the assessment of their own learning is providing them with descriptive feedback as they learn. In fact, research shows descriptive feedback to be the most significant instructional strategy to move students forward in their learning. Descriptive feedback provides students with an understanding of what they are doing well, links to classroom learning, and gives specific input on how to reach the next step in the learning progression. In other words, descriptive feedback is not a grade, a sticker, or "good job!" A significant body of research indicates that such limited feedback does not lead to improved student learning.
There are many classroom instructional strategies that are part of the repertoire of good teaching. When teachers use sound instructional practice for the purpose of gathering information on student learning, they are applying this information in a formative way. In this sense, formative assessment is pedagogy and clearly cannot be separated from instruction. It is what good teachers do. The distinction lies in what teachers actually do with the information they gather.
Some of the instructional strategies that can be used formatively include the following:
• Criteria and goal setting with students engages them in instruction and the learning process by creating clear expectations. In order to be successful, students need to understand and know the learning target/goal and the criteria for reaching it. Establishing and defining quality work together, asking students to participate in establishing norm behaviors for classroom culture, and determining what should be included in criteria for success are all examples of this strategy. Using student work, classroom tests, or exemplars of what is expected helps students understand where they are, where they need to be, and an effective process for getting there.
• Observations go beyond walking around the room to see if students are on task or need clarification. Observations assist teachers in gathering evidence of student learning to inform instructional
planning. This evidence can be recorded and used as feedback for students about their learning or as anecdotal data shared with them during conferences.
• Questioning strategies should be embedded in lesson/unit planning. Asking better questions allows an opportunity for deeper thinking and provides teachers with significant insight into the degree and depth of understanding. Questions of this nature engage students in classroom dialogue that both uncovers and expands learning. An "exit slip" at the end of a class period to determine students' understanding of the day's lesson, or quick checks during instruction such as "thumbs up/down" or "red/green" (stop/go) cards, are also examples of questioning strategies that elicit immediate information about student learning. Helping students ask better questions is another aspect of this formative assessment strategy.
• Self and peer assessment helps to create a learning community within a classroom. Students who can reflect while engaged in metacognitive thinking are involved in their learning. When students have been involved in criteria and goal setting, self-evaluation is a logical step in the learning process. With peer evaluation, students see each other as resources for understanding and checking for quality work against previously established criteria.
• Student record keeping helps students better understand their own learning as evidenced by their classroom work. This process of students keeping ongoing records of their work not only engages students, it also helps them, beyond a "grade," to see where they started and the progress they are making toward the learning goal.
All of these strategies are integral to the formative assessment process, and they have been suggested by models of effective middle school instruction.

Balancing Assessment
As teachers gather information/data about student learning, several categories may be included. In order to better understand student learning, teachers need to consider information about the products (paper or otherwise) students create and tests they take, observational notes, and reflections on the communication that occurs between teacher and student or among students. When a comprehensive assessment program at the classroom level balances formative and summative student learning/achievement information, a clear picture emerges of where a student is relative to learning targets and standards. Students should be able to articulate this shared information about their own learning. When this happens, student-led conferences, a formative assessment strategy, are valid. The more we know about individual students as they engage in the learning process, the better we can adjust instruction to ensure that all students continue to achieve by moving forward in their learning.

Uses of Evaluation
Students who are completing or have completed a course have observed a teacher for many hours and are in a position to provide potentially useful information concerning the teacher's effectiveness. Some of this information might be difficult or costly to obtain through other channels. Student evaluations provide student perspectives on teacher performance which may be intrinsically valuable for:
1. Continually sensitizing or reminding teachers that students are their customers.
2. Encouraging faculty members to devote the time and effort necessary for good teaching.
3. Serving as a diagnostic tool to identify weaknesses in teacher effectiveness; and
4. Aiding in self-improvement.
The basic goal for the use of student evaluations of teachers is to contribute to high quality in teaching. Student evaluations alone will not provide sufficient information to judge faculty performance in all dimensions of teaching, but student evaluations can provide a triggering mechanism for the identification of superior and unsatisfactory teachers. In this regard, student evaluations can play only a part in helping to make useful distinctions among teachers for:
1. Promotion and tenure decisions;
2. Salary increases; and
3. Improvement or removal of unsatisfactory teachers.
Student evaluations of teachers may be influenced by such factors as:
• Course rigor
• Class size
• Gender composition of the students in the class
• The student's expected grade in the course
• Course content
• Whether the course is a required or an elective course
• Class level
• The instructor's professorial rank

Types of Evaluation
There are two types of school evaluation:
• Self-evaluation: This is an internal process of school self-reflection, whereby the school carries out a systematic examination of the outcomes of its own agreed courses of action. The school may use an external adviser to assist the self-evaluation. This person may be the school's existing facilitator, or a critical friend (i.e. an outside person chosen by the school, or a teacher not involved in the particular issue being self-evaluated). Such a person may bring objectivity to the exercise. The focus of these guidelines is on self-evaluation.
• External evaluation: This is an evaluation carried out by an external body (e.g. the Dept. of Education & Science, or the school's trustees in relation to issues such as Religious Formation, Finance, and Plant Management). The School Development Plan can be a valuable resource in this context as it can give the school the confidence to participate in such external evaluations.

Meaning and Definition of Evaluation Approach
Evaluation approaches are conceptually distinct ways of thinking about, designing, and conducting evaluation efforts. Many of the evaluation approaches in use today make unique contributions to solving important problems, while others refine existing approaches in some way.
Classification systems intended to sort out unique approaches from variations on a theme are presented here to help identify some basic schools of thought for conducting an evaluation. After these approaches are identified, they are summarized in terms of a few important attributes.
Since the mid-1960s, the number of alternative approaches to conducting evaluation efforts has increased dramatically. Factors such as the United States Elementary and Secondary Education Act of 1965, which required educators to evaluate their efforts and results, and the growing public concern for accountability of human service programs contributed to this growth. In addition, over this period of time there has been an international movement towards encouraging evidence-based practice in all professions and in all sectors. Evidence Based Practice (EBP) requires evaluations to deliver the information needed to determine what the best way of achieving results is.

Steps in Evaluation Approach
Evaluation on a broad level is helpful in examining the influence of courses of action on:
• Core issues such as mission, vision, and school aims
• Learning and teaching
• Perceived changes in the climate or environment facing the school
• Planning structures, e.g. task groups, steering group, etc.
Specifically, self-evaluation enables the school to:
• Measure the progress of implementation of courses of action
• Examine the impact of these on:
  • The whole school
  • The classroom
  • The individual student and teacher
• Identify areas of success, or areas which require adjustment for future success
• Establish ongoing effective planning
• Write the Annual Report. Apart from being a requirement of some trustee groups in voluntary secondary schools, this is now a requirement under S.20 of the Education Act 1998.

Preliminary Steps in Self-evaluation
The engagement of the stakeholders in the school planning process, where appropriate to an issue, is important. Stakeholders are also known as the school partners. They comprise:
• Patrons - Owners, and Trustees
• Board of Management - Appointed by the patron after nomination by the owners, parents, & teachers as appropriate
• Staff - Teachers and Support Staff
• Parents – Parents’ Association and general parent body
• Students – Students’ Council and general student body
• Local Community - Supporters of and participants in the education services of the school.
In advance of undertaking self-evaluation successfully the school may address the following through the appropriate partners. Ideally this occurs during the design stage of SDP:
• Philosophy: Set of beliefs among the partners about the intrinsic value of selfevaluation
• Procedures: Means of successfully putting philosophy into action
• Criteria: Statements of desired outcomes used as the basis for measuring success
• Evidence: Information collected to indicate level of success based on criteria

Self-Evaluation Tools

Quantitative
• Desk Research
• Closed Questionnaires
• Checklists
• Standard Forms
• Logs, Diaries, Recordings etc
• Evaluation Grids

Qualitative
• SCOT Analysis
• Open questionnaires
• Interviews
• Force field analysis
• Critical incident analysis
• Self-evaluation profile

Desk Research - use of documentary evidence e.g. Homework journals, copies, exam results, rolls, etc.
Field Research - surveying school partners as appropriate:
• Questionnaires - closed & open
• Checklists - narrow & sharpen focus
• Interviews - structured & unstructured, individual & group
• Standard Forms - promote consistency of data recording
• Logs - diaries, video recordings etc.
• SCOT Analysis - good basis for group discussion
• Evaluation Grids - records interaction between variables.

Further Tools

Apart from the desk research and field research tools which may have already been used during the review stage of the planning process, the following field research tools are useful:
1. Force Field Analysis
2. Spot Check
3. Critical Incident Analysis
4. Self-Evaluation Profile
5. Summative Evaluation Tool

1. Force Field Analysis: The user is asked to identify three things which help and three things which hinder the successful outcome of a specific issue e.g. Ability to understand the teacher

Use
• It is useful as a means of identifying progress of implementation as well as providing information on the individual/classroom experience.

Advantages
• The teacher can administer this tool quite easily in her/his classroom
• It gives a quick view of the issues affecting the student, and can act as a catalyst for more extensive evaluation
• It is easily adapted to suit different issues.
Disadvantages
• The collation and analysis of response may be difficult because of the open nature of the responses.

2. Spot Check: The user is asked to circle her/his response to a range of closed questions relevant to the issue e.g. A specific lesson in your subject.

Use
• It yields an immediate response from the students.

Advantages
• It is a useful tool for measuring the match between teacher and student perception of what is going on in the class
• The task group/teacher has complete flexibility in framing the questions to be asked and the language used in the asking
• The template can be adapted to suit any particular set of information that one is seeking.

Disadvantages:
• Validity of response could be a problem.

3. Critical Incident Analysis: The user discusses a chosen incident with the individual/group in order to flesh out the consequences of a specific course of action e.g. Back-answering a teacher. A particular incident that created conflict in the school is taken. The individual/group, with the assistance of a teacher, looks at the incident in relation to the following questions:
• What happened?
• Who was involved?
• What action was taken?
• How effective was the action?
• What was the response to the action taken?
• What other action(s) could have been taken?
• What would have assisted those involved to do things differently?

Use
• It is useful as a means of testing the ‘on the ground’ reality of policy implementation, i.e. how a school handles problems that arise.

Advantages
• It can provide information on the quality of relationships in the school
• It can inform those with responsibility for implementation of the realities of implementation on the ground.

Disadvantages
• It requires special skills on the part of the teacher
• It can be time-consuming.

4. Self-Evaluation Profile: The user is asked to circle her/his response to a range of closed questions relevant to the issue e.g. Classroom Management

Use
• It is useful for self-evaluation of an action plan, which can be broken down into sub-issues
• It is useful as a way of identifying issues for in-depth evaluation.

Advantages
• It yields information simultaneously on two aspects of implementation:
  • The effect of the issue now, and
  • The effect of the issue over time
• It is capable of being adapted to suit any issue.

Disadvantages
• It allows the respondent to deal only with pre-determined issues.

5. Summative Evaluation Tool: The user draws together quantitative and qualitative information, which has been collected.
Use
• It is useful for in-depth evaluation of specific issues.

Advantages
• It provides the necessary reliability, and validity that other tools may not have
• It is a comprehensive way of evaluating any issue.

Disadvantages:
• It is not time friendly

Techniques of Evaluation

Evaluation
• Tests usability and functionality of system
• Occurs in laboratory, field and/or in collaboration with users
• Evaluates both design and implementation
• Should be considered at all stages in the design life cycle

Goals of Evaluation
• Assess extent of system functionality
• Assess effect of interface on user
• Identify specific problems
Evaluation tests the usability, functionality and acceptability of an interactive system.
• Evaluation may take place:
  • In the laboratory
  • In the field.
• Some approaches are based on expert evaluation:
  • Analytic Methods
  • Review Methods
  • Model-based methods.
• Some approaches involve users:
  • Experimental Methods
  • Observational Methods
  • Query Methods.
An evaluation method must be chosen carefully and must be suitable for the job.

Limitations

While several states are implementing some form of standards-based reform, there is very little empirical evidence to prove that standards, assessment, and high-stakes accountability programs are effective in improving public schools.
1. Recent reports on the standards-based reform movement suggest that in many schools the careless implementation of standards and assessment may have negative consequences for students.
2. Vague and unclear standards in several subject areas in several states complicate matters and do not serve as concrete standards defining what students should know and be able to do.
3. Top-down standards imposed by the federal or state government are also problematic. They impose content specifications without taking into account the different needs, opportunities to learn, and skills that may be appropriate for specific districts or regions.

1.4. ITEM FORMATS

Just as there are several types of tests available to help employers make employment decisions, there are also several types of test formats. In this section, the pros and cons of general types of test item formats are described. Also, some general guidelines for using different types of test item formats are provided.

1.5. MULTIPLE CHOICES

Multiple choice questions are composed of one question (stem) with multiple possible answers (choices), including the correct answer and several incorrect answers (distractors). Typically, students select the correct answer by circling the associated number or letter, or filling in the associated circle on the machine-readable response sheet.
Example: Distractors are:
(A) Elements of the exam layout that distract attention from the questions
(B) Incorrect but plausible choices used in multiple choice questions
(C) Unnecessary clauses included in the stem of multiple choice questions
Answer: B
Students can generally respond to these type of questions quite quickly. As a result, they are often used to test student’s knowledge of a broad range of content. Creating these questions can be time consuming because it is often difficult to generate several plausible distractors. However, they can be marked very quickly.

Tips for Writing good Multiple Choice items

Avoid – in the stem:
• Long / complex sentences
• Trivial statements
• Negatives and double-negatives
• Ambiguity or indefinite terms, absolute statements, and broad generalization
• Extraneous material
• Item characteristics that provide a clue to the answer
Avoid – in the choices:
• Statements too close to the correct answer
• Completely implausible responses
• ‘All of the above,’ ‘none of the above’
• Overlapping responses (e.g., if ‘A’ is true then ‘C’ is also true)
Do use – in the stem:
• Your own words – not statements straight out of the textbook
• Single, clearly formulated problems
Do use – in the choices:
• Plausible and homogeneous distractors
• Statements based on common student misconceptions
• True statements that do not answer the questions
• Short options – and all same length
• Correct options evenly distributed over A, B, C, etc.
• Alternatives that are in logical or numerical order
• At least 3 alternatives

Suggestion: After each lecture during the term, jot down two or three multiple choice questions based on the material for that lecture. Regularly taking a few minutes to compose questions, while the material is fresh in your mind, will allow you to develop a question bank that you can use to construct tests and exams quickly and easily.

True/False

True/false questions are only composed of a statement. Students respond to the questions by indicating whether the statement is true or false. For example: True/false questions have only two possible answers (Answer: True).
Like multiple choice questions, true/false questions:
• Are most often used to assess familiarity with course content and to check for popular misconceptions
• Allow students to respond quickly so exams can use a large number of them to test knowledge of a broad range of content
• Are easy and quick to grade but time consuming to create
True/false questions provide students with a 50% chance of guessing the right answer. For this reason, multiple choice questions are often used instead of true/false questions.

Tips for Writing Good True/False Items

Avoid:
• Negatives and double-negatives
• Long / complex sentences
• Trivial material
• Broad generalizations
• Ambiguous or indefinite terms
Do use:
• Your own words
• The same number of true and false statements (50/50) or slightly more false statements than true (60/40) – students are more likely to answer true
• One central idea in each item

1.6. MATCHING

Students respond to matching questions by pairing each of a set of stems (e.g., definitions) with one of the choices provided on the exam. These questions are often used to assess recognition and recall and so are most often used in courses where acquisition of detailed knowledge is an important goal. They are generally quick and easy to create and mark, but students require more time to respond to these questions than a similar number of multiple choice or true/false items.
Example: Match each question type with one attribute:
1. Multiple Choice (a) Only two possible answers
2. True/False (b) Equal number of stems and choices
3. Matching (c) Only one correct answer but at least three choices
Tips for Writing Good Matching Items There can also be some serious weaknesses in the matching item
format, which could make an entire section of test items invalid. Some
Avoid Do use
things to look out for:
Long stems and options Short responses 10-15 items on only one page
1. Cued answers. A competent test-taker can usually get one or more
Heterogeneous content Clear directions
(e.g., dates mixed with people) Logically ordered choices (chronological, items correct “for free”, by using the process of elimination. A
Implausible responses alphabetical, etc.) group of ten items with ten options often means that a student
needs to know, at most, the answers to nine of the items.
2. Non-homogenous options. Many, many groups of matching items
Guidelines for Using Multiple Choice or True-False Test Items are practically worthless because they mix totally unrelated things
It is generally best to use multiple-choice or true-false items when: together as options. In such cases, a skilled student can use the
• You want to test the breadth of learning because more material process of elimination to dramatically increase his score, and very
can be covered with this format. little valid testing has taken place.
• You want to test different levels of learning. 3. Excessively large groups of items or options. Since each item has
the entire set of options as answer possibilities, a student may
• You have little time for scoring.
become overwhelmed with the amount of choices from which to
• You are not interested in evaluating how well a test taker can select the correct answer.
formulate a correct answer.
• You have a clear idea of which material is important and which The True/False Item Format
material is less important. The true/false (T/F) format is limited in usefulness compared with
• You have a large number of test takers. most other formats, but is still common. A few reasons for its refusal
to fade into oblivion are the relative ease of writing a true/false item
The Matching Item Format and the ease and objectivity of scoring it. There are more problems
As was mentioned earlier, the matching format can be considered a than benefits, however:
type of multiple choice. The matching format is common in curriculum- 1. T/F items tend to focus on trivial facts, rather than significant
based tests. It is sometimes used to good advantage and sometimes concepts. As a result, they tend to be either too easy or unreasonably
very poorly done. Some of the strengths of the matching format are: difficult.
1. It is easy to construct. Since options are used for more than one 2. T/F items are much more likely to be ambiguous or “tricky” to
item, not nearly as much effort needs to be put into constructing answer. Often the answer turns on a single word. A student may
each individual item. need to analyze multiple words in the item to catch the one that
2. It is compact in size. An individual item usually takes only a is incorrect.
fraction of the space occupied by one conventional MC item. 3. T/F items are too rewarding for guessers, since a random answer
3. It is usually time efficient for the test taker. He only needs to analyze has a 50% chance of being correct. On a curriculum-based test,
one set of options for multiple items, provided the matching group where a passing score typically is 75% - 80%, a chance of 50%
is competently designed. may not be enough to boost the overall test grade. On a norm-
4. It is very useful for working with groups of homogenous items, referenced achievement test, guessing with a chance of 50% may
for example, matching states with their capitals. significantly affect the overall score.
Suggestions for True/False Items Directions: Assuming that the information below is true, it is
possible to establish other facts using the ones in this paragraph as a
1. Avoid vague, indefinite, or broad terms in favor of precise statements.
basis for reasoning. This is called drawing inferences.
Good test items must be unambiguous, and T/F items even more so.
Write the proper symbol in the space provided. Use only the
2. If the correctness of a statement hinges on a particular word or information given in the paragraph as a basis for your responses…
phrase, highlight or emphasize that word or phrase. T – if the statement may be inferred as TRUE
3. Avoid negative statements if at all possible. Negative statements F – if the statement may be inferred as UNTRUE
are harder to decode, particularly those with two negatives. N –if no inference can be drawn about it from the paragraph
4. Include similar numbers of true and false items and make them
Interpretive Exercise
similar in length.
5. Group T/F items under a common statement, story, illustration, • Usually begins with verbal, tabular or graphic information which
graph, or other material. This reduces the amount of ambiguity is the basis for 1 or more multiple choice questions.
possible, since the items come from a specific frame of reference. • map, passage from a story, a poem, a cartoon
6. Avoid generalizations such as all, always, never, or none, since • Can challenge students at various levels of understanding
they usually trigger a false statement. Also avoid qualifiers like
• application, analysis, synthesis, evaluation
sometimes, generally, often, and can be, since they are often
indicators of a true statement. • Exercise contains all information needed to answer questions
• Readily adaptive to the more important outcomes of disciplines.
Interpretative Exercises
An interpretive question exercise consists of a series of objective 1.7. GUIDELINES FOR ITEM PREPARATION
items based on a common set of data. The data may in the form of (i) The items should be worded very carefully. A minor change in
written materials, tables, charts, graphs, maps, or pictures. The series of the wording is likely to create major differences in the meaning
related test items may also take various forms but are most commonly as illustrated below:
multiple-choice or true-false items. Because all students are presented (a) Do you approve of seggregation of children? Yes/ No
with a common set of data, it is possible to measure a variety of complex
(b) You do approve of seggregation of children, don’t you? Yes/No
learning outcomes. The students can be asked to identify relationships
© Don’t you approve the seggregation of children? Yes/No
in data, to recognize valid conclusions, to appraise assumptions and
inferences, to detect proper applications of data, and the like. The following (d) You don’t approve of seggregation of children, do you? Yes/ No
are examples that are presented in a variety of school subjects at the Obviously the meaning of the above questions differs from item to
elementary and secondary levels. item. The item numbers c and d are highly suggestive. If accompanied
by a sincere, earnest nod of the head, item. b would cause many
Example 1 respondents to agree, and of items a and d were accompanied by
Ability to Recognize Inferences an unbelieving look, many individuals who approve of segregation
would deny it.
In interpreting written material, it is frequently necessary to draw
inferences from the facts given. The following exercise measures the (ii) Item should be worded very clearly so that the expected responses
extent to which students are able to recognize warranted and unwarranted would be specific in nature.
inferences drawn from a passage:
(iii) Words of the item should be used in the usual meaning. Those
which are likely to be interpreted in more than one way, or
misinterpreted, should be avoided. If necessary ambiguous
words should be defined and qualifying terms should be given.
(iv) Items should avoid descriptive words (adjective and adverbsa0 Chapter-2
such as frequently, occasionally and rarely, because such words
have no universal meaning RESEARCH INSTRUMENTS
(v) Avoid the use of double negatives. Statements such as –‘ I do
not think that pupils will not do home works’. ‘The educated
persons do not feel coeducation is not good’, and so on, mislead
respondents. 2.1. KINDS OF INSTRUMENTS
(vi) Items of double barreled nature should be avoided. On the basis of the merits and limitations of the interview techniques
(vii) Items requiring comparison or rating should give the point of it is used in many ways for research and non-research purposes. This
reference. technique was used in common wealth teacher training study to know
(viii) Clear and complete directions should be provided for all the traits must essentials for success in teaching. Apart from being an
individual items and also for groups of items having certain independent data collection tool, it may play an important role in the
common characteristics. In preparing the direction one should preparation of questionnaires and check lists which are to be put to
observe the golden mean between extreme incompleteness and extensive use.
detail on the one hand and extreme incompleteness and vagueness
2.1.1. Questionnaire
on the other hand.
Questionnaire is a self report data collection instrument that each
research participant fills out as part of a research study. Researchers use
questionnaire to obtain information about the thoughts, feelings, attitudes
beliefs, values, perceptions, personality and behavioral intentions of
research participants.
According to John W. Best (1992) a questionnaire is used when
factual information is desired, when opinion rather than facts are desired,
an opinionnaire or Attitude scale is used.

Forms/Kinds of Questionnaire
The researcher can construct questions in the form of a closed,
open pictorial and scale items.
1. Close form Questionnaire that calls for short check responses as
the, restricted or close form type. They provide for marking a Yes
or No a short response or checking an item from a list of suggested
responses.
Advantages of the close form data from children and adults who had not developed reading ability.
Pictures often capture the attention of respondents more readily than
1. It is easy to fill out.
printed words, lessen subjects’ resistance and stimulate the interest in
2. It takes little time by respondents questions. “To get better answers, ask better questions”
3. It is relatively objective
4. Easy to tabulate and analyze Characteristics of A Good Questionnaire:
5. Answers are standardized. • Questionnaire should deal with important or significant topic to
create interest among respondents.
Limitations of the close form • It should seek only that data which can not be obtained from other
It fails to reveal the respondents’ motives and does not always get sources.
information of sufficient scope and in depth and may not discriminate • It should be as short as possible but should be comprehensive.
between the finer shades of meaning.
• It should be attractive.
The open form • Directions should be clear and complete.
The open form or unstructured type of questionnaire calls for a • It should be represented in good Psychological order proceeding
free response in respondents own words. from general to more specific responses.
• Double negatives in questions should be avoided.
Advantages of the Open Form Questionnaire
• Putting two questions in one question also should be avoided.
1. Open end questions are flexible.
• It should avoid annoying or embarrassing questions.
2. They can be used when all possible answer categories are not
• It should be designed to collect information which can be used
known.
subsequently as data for analysis.
3. They are preferable to complex issues that cannot be condensed.
• It should consist of a written list of questions.
4. They allow more opportunity for creativity, thinking and self
• The questionnaire should also be used appropriately.
expression.
Designs of Questionnaire
Limitation
After construction of questions on the basis of it’s characteristics
1. There is possibility of collection of worthless or irrelevant it should be designed with some essential routines like:
information.
• Background information about the questionnaire.
2. Data collected through open end questionnaire are not often
• Instructions to the respondent.
standardized from person to person.
• The allocation of serial numbers and
3. Coding is difficult and subjective.
• Coding Boxes.
Pictorial Form
Background Information about
Some questionnaires present respondents with drawings and
photographs rather than written statement from which to choose answers. The Questionnaire Both from ethical and practical point of view,
This form of questionnaire is particularly suitable tool for collecting the researcher needs to provide sufficient background information about
the research and the questionnaire. Each questionnaires should have a • It is easier to arrange.
cover page, on which some information appears about: • It supplies standardized answers
• The sponsor • It encourages pre-coded answers.
• The purpose • It permits wide coverage.
• Return address and date • It helps in conducting depth study.
• Confidentiality
Disadvantages
• Voluntary responses and
• Thanks • It is reliable and valid, but slow.
• Pre-coding questions can deter them from answering.
Instructions to the Respondent
• Pre-coded questions can bias the findings towards the researcher.
It is very important that respondents are instructed to go presented
• Postal questionnaire offer little opportunities to check the truthfulness
at the start of the questionnaire which indicates what is expected from
of the answers.
the respondents. Specific instructions should be given for each question
where the style of questions varies through out the questionnaire. For • It can not be used with illiterate and small children.
Example – Put a tick mark in the appropriate box and circle the relevant
2.1.2. Opinionnaire
number etc.
“Opinion polling or opinion gauging represents a single question
The Allocation of Serial Numbers approach. The answers are usually in the form of ‘yes’ or ‘no’. An
Whether dealing with small or large numbers, a good researcher needs undecided category is often included. Sometimes large number of
to keep good records. Each questionnaire therefore should be numbered. response alternatives if provided.” - Anna Anastusi
The terms opinion and attitude are not synonymous, through
Coding Boxes sometimes we used it synonymously. We have till now discussed that
When designing the questionnaire, it is necessary to prevent later attitudes scale. We have also discussed that attitudes are impressed
complications which might arise at the coding stage. Therefore, you opinions. You can now understand the difference between opinionnaire
should note the following points: and attitude scale, when we discuss of out opinionnaire, it is characteristics
and purposes. Opinion is what a person says on certain aspects of the
• Locate coding boxes neatly on the right hand side of the page.
issue under considerations. It is an outward expression of an attitude
• Allow one coding box for each answer. held by an individual. Attitudes of an individual can be inferred or
• Identify each column in the complete data file underneath the estimated from his statements of opinions. An opinionnaire is defined
appropriate coding box in the questionnaire. as a special form of inquiry. It is used by the researcher to collect
Besides these, the researcher should also be very careful about the the opinions of a sample of population on certain facts or factors the
length and appearance of the questionnaire, wording of the questions, problem under investigation. These opinions on different facts of the
order and types of questions while constructing a questionnaire problem under study are further quantified, analysed and interpreted.
Advantages of Questionnaire Purpose
Questionnaire is economical. In terms of materials, money and Opinionnaire are usually used in researches of the descriptive type
time it can supply a considerable amount of research data. which demands survey of opinions of the concerned individuals. Public
opinion research is an example of opinion survey. Opinion polling enables • Used by some investors as a critical part of their investment process
the researcher to forecast the coming happenings in successful manner. • Can aid in mitigating claims of negligence in public liability claims
by providing evidence of a risk management system being in place.
Characteristics
• an ornithological checklist, a list of birds with standardized names
• The opinionnaire makes use of statements or questions on different that helps ornithologists communicate with the public without the
aspects of the problem under investigation. use of scientific names in Latin.
• Responses are expected either on three point or five point scales. • A popular tool for tracking sports card collections. Randomly
• It uses favourable or unfavourable statements. inserted in packs, checklist cards provide information on the contents
• It may be sub-divided into sections. of sports card set.
• The gally poll ballots generally make use of questions instead of Format
statements.
Checklists are often presented as lists with small checkboxes down
• The public opinion polls generally rely on personal contacts rather the left hand side of the page. A small tick or checkmarks drawn in
than mail ballots. the box after the item has been completed. Other formats are also
sometimes used. Aviation checklists generally consist of a system and
2.1.3. Check List
an action divided by a dashed line, and lacks a checkbox as they are
A checklist is a type of informational job aid used to reduce failure often read aloud and are usually intended to be reused.
by compensating for potential limits of human memory and attention.
It helps to ensure consistency and completeness in carrying out a task. Check List in Education
A basic example is the “to do list.” A more advanced checklist would A simple checklist of what schools can do to instill good behaviour
be a schedule, which lays out tasks to be done according to time of in the classroom has been developed and published today by Charlie
day or other factors. Taylor - the head teacher of a special school with some of the toughest
behaviour issues and the government’s expert adviser on behaviour.
Applications
The behaviour checklist - entitled ‘Getting the simple things right’ -
• Pre-flight checklists aid in aviation safety to ensure that critical follows Charlie Taylor’s recent behaviour summit, where outstanding
items are not forgotten head teachers from schools in areas of high deprivation gathered to
• Use in medical practice to ensure that clinical practice guidelines discuss the key principles for improving behaviour. What soon became
are followed. An example is the Surgical Safety Checklist developed clear was how much similarity there was between the approaches that
for the World Health Organization by Dr. Atul Gawande. Evidence the head teachers followed. Many of them emphasised the simplicity
to support surgical checklists is tentative but limited. of their approach but they agreed that most important of all was
• Used in quality assurance of software engineering, to check process consistency.
compliance, code standardization and error prevention, and others.
Actions from the checklist include
• Often used in industry in operations procedures.
• used in civil litigation to deal with the complexity of discovery • Ensuring absolute clarity about the expected standard of pupils’
and motions practice. An example is the open-source litigation behaviour
checklist. • Displaying school rules clearly in classes and around the building.
Staff and pupils should know what they are
• Ensuring that children actually receive rewards every time they social service, clerical and many other areas of interest have been
have earned them and receive a sanction every time they behave analysed informs of activities. In terms of specific activities, a
badly person’s likes and dislikes are sorted into various interest areas
• Taking action to deal with poor teaching or staff who fail to follow and percentile scores calculated for each area. The area where a
the behaviour policy person’s percentile scores are relatively higher is considered to
be the area of his greatest interests, the area in which he would
• Ensuring pupils come in from the playground and move around
be the happiest and the most successful. As a part of educational
the school in an orderly manner
surveys of many kinds, children’s interest in reading, in games,
• Ensuring that the senior leadership team like the head and assistant in dramatics, in other extracurricular activities and in curricular
head are a visible presence around the school during the day, work etc. is studied. One kind of instrument, most commonly used
including in the lunch hall and playground, and are not confined in interest measurement is known as Strong’s Vocational Interest
to offices Inventory. It compares the subject’s pattern of interest to the interest
patterns of successful individuals in a number of vocational fields.
2.1.4. Inventory
This inventory consists of the 400 different items. The subject has
• Inventory is a list, record or catalog containing list of traits, to tick mark one of the alternatives i. e. L(for like), I(indifference)
preferences, attitudes, interests or abilities used to evaluate personal or D(Dislike) provided against each item. When the inventory is
characteristics or skills. The purpose of inventory is to make a list standardized, the scoring keys and percentile norms are prepared
about a specific trait, activity or programme and to check to what on the basis of the responses of a fairly large number of successful
extent the presence of that ability types of Inventories like individuals of a particular vocation. A separate scoring key is
• Internet Inventory and therefore prepared for each separate vocation or subject area. The
• Personality Inventory subject’s responses are scored with the scoring key of a particular
vocation in order to know his interest or lack of interest or lack of
• Interest Inventory interest in the vocation concerned. Similarly his responses can be
• Persons differ in their interests, likes and dislikes. Internets are scored with scoring keys standardized for other vocational areas.
significant element in the personality pattern of individuals and play In this way you can determine one’s areas of vocational interest.
an important role in their educational and professional careers. The Another well known interest inventories, there are also personality
tools used for describing and measuring interests of individuals inventories to measure the personality.
are the internet inventories or interest blanks. They are self report
instruments in which the individuals note their own likes and 2.1.5. TEST
dislikes. They are of the nature of standardized interviews in which Test is a systematic procedure for observing persons and describing
the subject gives an introspective report of his feelings about certain them with either a numerical scale or a category system. Thus test may
situations and phenomena which is then interpreted in terms of give either qualitative or quantitative information. Test commonly refers
internets. to a set of items or questions under specific conditions.
• The use of interest inventories is most frequent in the areas of
Types of Test
educational and vocational guidance and case studies. Distinctive
patterns of interest that go with success have been discovered • Essay type
through research in a number of educational and vocational fields. • Objective type
Mechanical, computational, scientific, artifice, literary, musical,
Essay Type • Require less time for typing, duplicating or printing, can be written
It is an item format that requires the student to structure a rather on board
long written response up to several paragraphs • Can be used as device for measuring and improving language and
expression skills
Characteristics of Essay Test
Limitations
• Generally essay tests contain more than one question in the test
Lack of consistency in judgments even among competent
• Essay tests are to be answered in writing only
examiners
• Essay test tests require completely long answers
• They have holo effect
• Essay tests are attempted on the basis of recalling the memory
• Question to question carry effect
Types of Essay Test • Examinee to examinee carry effect
Selective recall (basis given) • Language mechanic effect
• Evaluation recall (basis given) • Limited content validity
• Comparison of two things on a single designated basis • Some examiners are too strict and some are too lenient
• Comparison of two things in general • Difficult to score objectively
• Decisions (for and against) • Time consuming
• Explanation of the use exact meaning of some word, phrase or • Lengthy enumeration of memorized facts
statement
• Summary of some unit of the text or of some article Suggestions for Construction of Essay Tests
• Analysis • Ask questions that require the examinee to show command of
• Illustrations or examples essential knowledge
• Application of rules, laws, or principles to new situations • Make questions as explicit as possible
• Discussions
• Should be no choice in questioning question paper
• Criticism
• Test constructor should prepare ideal answers to all questions
• Inferential thinking
• Intimate the examinee about desired length of the answers
Advantages • Make each question relatively short but increase number of
• Can measure complex learning outcomes questions
• Emphasize integration and application of thinking and problem • Test constructor should get his test reviewed y one ao more
solving colleagues
• Can be easily constructed • Questions should be so worded that all examinees interpret them
• Examinee free to respond in the same way as the examiner wants. Short answer items require
the examinee to respond to the item with a word, short phrase,
• No guessing as in objective item number or a symbol.
Characteristics Forms of Objective Type Tests
• The test has supply response rather than select or identify (a) Two choice items
• In the form of question or incomplete statement
1. true/false items
• The test can be answered by a word, a phrase, a number or symbol
2. Completion type (if two choices are given against each blank)
Forms of Short Answer Items (b) More than two choice items

• Question form
1. Matching items
• Identification or association form
2. MCQs.
• Completion form True/False Tests (Shooting Questions)
Advantages A true false item consists of a statement or proposition which the
• Very easy to construct examinee must judge and mark as either true or false.
• Low probability of guessing the answer because it has to be supplied Advantages
by the examinees rather than select identify from the given answers
• They are good to test the lowest level of cognitive taxonomy • It takes less time to construct true false items
(knowledge, terminology, facts) • High degree of objectivity

Limitations • Teacher can examine students on more material

• They are unsuitable for measuring complex learning outcomes Limitations


Suggestions for Construction of Short Answer Tests • High degree of guessing
• As for as possible question form should be used • Largely limited to learning outcomes in the knowledge area
• The question should not be picked up exactly from the book • They expose students to error which is psychologically undesirable
• The question should not provide any clue • They may encourage students to study and accept only oversimplified
statements of truth and factual learning.
• The scoring key should be prepared
• The blank space is to be completed by an important word rather Suggestions
than trivial words
• Balance between true and false items
Objective Type Tests • Each statement should be unequivocally true or false. it should
• Any test having clear and unambiguous scoring criteria (gilbert sax) not be partly true or partly false

• Test that can be objectively scored • Double negatives should be avoided


• Long and complex statements should not be used as they measure
Characteristics reading comprehension
• They can be reliably scored • Only one idea should be measured in one statement
• They allow for adequate content sampling
• explain which judgment is to be used true/false, yes/no, correct/ • Incomplete sentences should not be used for premise
incorrect
Multiple Choice Items
• Clues should be avoided
Multiple choice items consist of two parts: a stem and number
• Statements should not be taken directly from the textbook
of options or alternatives. the stem is a question or statement that is
Matching Type Tests answered or completed by one of the alternatives. all incorrect or less
appropriate alternatives are called distracters or foils and the student’s
A test consisting of a two column format, premises and responses
task is to select the correct or best alternative from all the options.
that requires the student to take a correspondence between the two
Forms of Mcqs
Advantages
1. The correct answer form
• Simple to construct and score
2. The best answer form
• Well suited to measure association
3. The incomplete statement form
• Reduce the effect of guessing
4. The negative form
• They can be used to evaluate examinee’s understanding of concepts,
principle, schemes for classifying objects, ideas or events. 5. The combined response form
6. Substitution form
Limitations
Advantages
• They generally provide clues
• They are restricted to factual information which encourages • They can measure complex level knowledge i.e. understanding,
memorization judgment, ability to solve problems
• If the same number of items are written in both the columns, the • A substantial amount of course content can be tested because the
matching type is converted to MCQs at late stage and in the end examinees do not require much time for writing the answer
it is converted to true and false category. • Objectivity in scoring even a layman can score
• They can check discrimination ability of students
Suggestions
• Reduce the effect of guessing
• Homogeneous items should be selected
• Can be easily adapted for machine scoring
• No clue should be provided in both the columns
• This format is helpful in item analysis
• Clear instruction to attempt
• All the items should be printed on the same page Suggestions
• Premise should be written in the left hand columned be numbered, • Stem should introduce what is expected of the examinee
responses should be written in the right hand column and be lettered. • Specific determiners should be avoided
• Responses should be more than the premises to ensure that examinee • Vocabulary according to the level of students
has to think even upto last premise
• All the choices should be plausible
• Clear directions
• Test items should have defensible correct or best answer 2.1.7. Attitude Scale
• The correct choice should not be at the same place in all or most Attitude scale is a form of appraisal procedure and it is also one
of the items of the enquiry term. Attitude scales have been designed to measure
• The choice like “none of the above” “all the above” should be attitude of a subject of group of subjects towards issues, institutions
avoided and group of peoples.
“An attitude may be defined as a learned emotional response set
• Each item should pose only one problem
for or against something.” - Barr David Johnson.
• Teacher should construct MCQs on daily basis An attitude is spoken of as a tendency of an individual to read
in a certain way towards a Phenomenon. It is what a person feels or
2.1.6. Schedule
believes in. It is the inner feeling of an individual. It may be positive,
• A schedule is generally filled by the research worker or enumerator, negative or neutral. Opinion and attitude are used sometimes in a
who can interpret the questions when necessary. synonymous manner but there is a difference between two. You will
• Data collection is more expensive as money is spent on enumerators be able to know when we will discuss about opinionnaire. An opinion
and in imparting trainings to them. Money is also spent in preparing may not lead to any kind of activity in a particular direction. But an
schedules. attitude compels one to act either favourably or unfavorably according
• Non response is very low because this is filled by enumerators to what they perceive to be correct. We can evaluate attitude through
who are able to get answers to all questions. But even in this their questionnaire. But it is ill adapted for scaling accurately the intensity
remains the danger of interviewer bias and cheating. of an attitude. Therefore, Attitude scale is essential as it attempts to
minimise the difficulty of opinionnaire and questionnaire by defining
• Identity of respondent is not known.
the attitude in terms of a single attitude object. All items, therefore,
• Information is collected well in time as they are filled by may be constructed with graduations of favour or disfavor.
enumerators.
• Direct personal contact is established. Purpose of Attitude Scale
• The information can be gathered even when the respondents happen In educational research, these scales are used especially for finding
to be illiterate. the attitudes of persons on different issues like:
• There remains the difficulty in sending enumerators over a relatively • Co-education
wider area. • Religious education
• The information collected is generally complete and accurate as • Corporal punishment
enumerators can remove difficulties if any faced by respondents in • Democracy in schools
correctly understanding the questions. As a result the information
• Linguistic prejudices
collected through schedule is relatively more accurate than that
obtained through questionnaires. • International co-operation etc.
• It depends upon the honesty and competence of enumerators Characteristics of Attitude Scale
• This may not be the case as schedules are to be filled in by Attitude scale should have the following characteristics.
enumerators and not by respondents.
• It provides for quantitative measure on a one-dimensional scale
• Along with schedule observation method can also be used. of continuum.
• It uses statements from the extreme positive to extreme negative position.
• It generally uses a five point scale as we have discussed in rating scale.
• It could be standardized and norms are worked out.
• It disguises the attitude object rather than directly asking about the attitude on the subject.

Examples of Some Attitude Scale

Two popular and useful methods of measuring attitudes indirectly, commonly used for research purposes are:
• Thurstone Technique of scaled values.
• Likert’s method of summated ratings.

Thurstone Technique

The Thurstone Technique is used when attitude is accepted as a uni-dimensional linear continuum. The procedure is simple. A large number of statements of various shades of favourable and unfavourable opinion are written on slips of paper, which a large number of judges, exercising complete detachment, sort out into eleven piles ranging from the most hostile statements to the most favourable ones. The opinions are carefully worded so as to be clear and unequivocal. The judges are asked not to express their own opinion but to sort the statements at their face value. The items which bring out a marked disagreement between the judges in assigning a position are discarded. Tabulations are made which indicate the number of judges who placed each item in each category. The next step consists of calculating cumulated proportions for each item, and ogives are constructed. Scale values of each item are read from the ogives, the value of each item being that point along the baseline, in terms of scale value units, above and below which 50% of the judges placed the item. It will be the median of the frequency distribution in which the score ranges from 0 to 11. The respondent is to give his reaction to each statement by endorsing or rejecting it. The median value of the statements that he checks establishes his score, or quantifies his opinion: he obtains a score as the average of the values of the statements he endorses. The Thurstone technique is also known as the technique of equal appearing intervals.

The Likert Scale

The Likert scale uses items worded for or against the proposition, with a five point rating response indicating the strength of the respondent’s approval or disapproval of the statement. This method removes the necessity of submitting items to judges for working out scaled values for each item. It yields scores very similar to those obtained from the Thurstone scale, and it is an improvement over the Thurstone method. The first step is the collection of a number of statements about the subject in question. Statements may or may not be correct, but they must be representative of opinion held by a substantial number of people. They must express definite favourableness or unfavourableness to a particular point of view. The number of favourable and unfavourable statements should be approximately equal. A trial test may be administered to a number of subjects, and only those items that correlate with the total test should be retained. The Likert scaling technique assigns a scale value to each of the five responses. All favourable statements are scored from maximum to minimum, i.e. from a score of 5 for strongly agree down to a score of 1 for strongly disagree. A negative statement, or a statement opposing the proposition, is scored in the opposite order, i.e. from a score of 1 for strongly agree up to a score of 5 for strongly disagree.
The total of these scores on all the items measures a respondent’s favourableness towards the subject in question. If a scale consists of 30 items, say, the following score values will be of interest:
30 x 5 = 150 most favourable response possible
30 x 3 = 90 a neutral attitude
30 x 1 = 30 most unfavourable attitude
It is thus known as a method of summated ratings. The summed-up score of any individual would fall between 30 and 150. Scores above 90 indicate a favourable attitude and scores below 90 an unfavourable attitude.

Limitations of Attitude Scale:
In the attitude scale the following limitations may occur:
• An individual may express socially acceptable opinions and conceal his real attitude.
• An individual may not be a good judge of himself and may not be clearly aware of his real attitude.
• He may not have been controlled with a real situation to discover students information for setting goals and improving performance. In
what his real attitude towards a specific phenomenon was. a rating scale, the descriptive word is more important than the related
• There is no basis for believing that the five positions indicated in the Likert scale are equally spaced.
• It is unlikely that the statements are of equal value in 'for-ness' or 'against-ness'.
• It is doubtful whether equal scores obtained by several individuals would indicate equal favourableness towards a given position.
• It is unlikely that a respondent can validly react to a short statement on a printed form in the absence of a real-life qualifying situation.
• In spite of anonymity of response, individuals tend to respond according to what they should feel rather than what they really feel.
However, until more precise measures are developed, the attitude scale remains the best device for the purpose of measuring attitudes and beliefs in social research.

2.1.8. Meaning and Definition of Rating Scale
A rating scale is a set of categories designed to elicit information about a quantitative or a qualitative attribute. In the social sciences, particularly psychology, common examples are the Likert response scale and 1-10 rating scales in which a person selects the number which is considered to reflect the perceived quality of a product.
A rating scale is a method that requires the rater to assign a value, sometimes numeric, to the rated object, as a measure of some rated attribute.

Types of rating scales
Rating Scales: Allow teachers to indicate the degree or frequency of the behaviours, skills and strategies displayed by the learner. To continue the light switch analogy, a rating scale is like a dimmer switch that provides for a range of performance levels. Rating scales state the criteria and provide three or four response selections to describe the quality or frequency of student work.
Teachers can use rating scales to record observations and students can use them as self-assessment tools. Teaching students to use descriptive words, such as always, usually, sometimes and never, helps them pinpoint specific strengths and needs. Rating scales also give a number for each scale point; the more precise and descriptive the words for each scale point, the more reliable the tool.
Effective rating scales use descriptors with clearly understood measures, such as frequency. Scales that rely on subjective descriptors of quality, such as fair, good or excellent, are less effective because the single adjective does not contain enough information on what criteria are indicated at each of these points on the scale.
All rating scales can be classified into one of three classifications:
1. Some data are measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. Attitude and opinion scales are usually ordinal; one example is a Likert response scale:
Statement: e.g. "I could not live without my computer".
Response options:
1. Strongly disagree
2. Disagree
3. Agree
4. Strongly agree
2. Some data are measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. A good example is a Fahrenheit/Celsius temperature scale where the differences between numbers matter, but placement of zero does not.
3. Some data are measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include age, income, price, costs, sales revenue, sales volume and market share.
More than one rating scale question is required to measure an attitude or perception due to the requirement for statistical comparisons between the categories in the polytomous Rasch model for ordered categories. In terms of classical test theory, more than one question is required to obtain an index of internal reliability such as Cronbach's alpha, which is a basic criterion for assessing the effectiveness of a rating scale and, more generally, a psychometric instrument.
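Since Cronbach's alpha is mentioned above as a basic criterion for judging a rating scale, a small illustration may help. The sketch below (Python, with entirely hypothetical response data and variable names) simply applies the alpha formula to a respondents-by-items matrix; it is only a minimal illustration, not a procedure prescribed in this book.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for a (respondents x items) matrix of ratings."""
    data = np.asarray(responses, dtype=float)
    k = data.shape[1]                              # number of items in the scale
    item_variances = data.var(axis=0, ddof=1)      # variance of each item
    total_variance = data.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical ratings of six respondents on four Likert items (1-5)
ratings = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
]
print(round(cronbach_alpha(ratings), 2))  # a value close to 1 indicates high internal consistency
```

The same computation can be applied to any set of items intended to measure one attitude or perception.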
1. Graphic Rating Scale
A graphic rating scale, also known as a continuous rating scale, usually looks like the figure drawn above. The ends of the continuum are sometimes labeled with opposite values. Respondents are required to make a mark at any point on the scale that they find appropriate. Sometimes there are numbers along the markings of the line too; at other times there are no markings at all on the line.

2. Likert Scale
A Likert scale typically contains an odd number of options, usually 5 to 7. One end is labeled as the most positive end while the other one is labeled as the most negative one, with the label of 'neutral' in the middle of the scale. The phrases 'purely negative' and 'mostly negative' could also have been 'extremely disagree' and 'slightly disagree'.

3. Semantic Differential Scale (Max Diff)
A semantic scale is a combination of more than one continuum. It usually contains an odd number of radio buttons with labels at opposite ends. Max Diff scales are often used in trade-off analysis such as conjoint analysis.
MaxDiff analysis can be used in new product features research or even market segmentation research to get accurate orderings of the most important product features; it discriminates among feature strengths more effectively than derived importance methodologies. Like other trade-off analyses, the analysis derives utilities for each of the most important product features, which can be used to derive optimal products, to put respondents into groups with similar preference structures through market segmentation, or to prioritize strategic product goals. The forced-choice nature of the tasks lets you disentangle the relative feature importance in cases where average Likert-style ratings might all be very similar.

4. Side-by-Side Matrix
Another very commonly used scale in questionnaires is the side-by-side matrix. A common and powerful application of the side-by-side matrix is the importance/satisfaction type of question. First, ask the respondent how important an attribute is, then ask them how satisfied they are with your performance in this area. QuestionPro's logic and loop functions also allow you to run through this question multiple times with other alternatives that the respondent might consider. This yields benchmark data that will allow you to compare your performance against other competing alternatives.
Here is an example of data from an importance/satisfaction question. The importance rating is the line and the performance ratings are the bars. With this type of data, you can actually see where your company needs to increase its efforts to more closely meet the needs of the customer.
While there are many online survey tools and online survey software to choose from, you will find that not all of them have these different types of scales available.

CONSTRUCTION OF RATING SCALE

STEPS IN CONSTRUCTING A RATING SCALE
1. Decide what areas you want to measure
2. For each area decide what characteristics you want to measure
3. Define a range for each characteristic
• Decide how many points on the scale
• State the extremes - very good and very bad
• State points between these extremes
4. Arrange items to form the scale
5. Design directions
6. Pilot test the scale
7. Make needed revisions, based on the pilot test

Basic Rules
• Include basic information
• Student's name
• Rater's name
• Rater's position
• Setting in which student was observed
• Rating period (From ___ to ___)
• Date scale was completed
• Other information important to you
• Decide on odd or even number of responses
• Decide whether or not to group items with same content together
• Allow space for comments after each item
• Allow space for comments at end of scale
• Write specific directions, including the purpose of the scale and how to complete it
• Put labels at top of response choices (on every page)

Principles for Preparation
• Use action-oriented, precise verbs.
• Each item should deal with an important content area.
• A question can be as long as necessary, but the answer should be short.
• Use precise, simple and accurate language in relation to the subject matter area.
• Provide the necessary space for answers below each question asked.

Scale Construction Techniques
• Arbitrary approach - scales constructed on an ad hoc basis.
• Consensus approach - a panel of judges evaluates the items.
• Item analysis approach - individual items are selected into the test on the basis of item analysis.
• Cumulative scales - based on ranking of items.
• Factor scales - based on inter-correlation of items.
2.1.9. SCORE CARD
A score card is a device similar to a check list in certain respects, and to a rating scale in others. It contains a list of items that pertain to various aspects regarding a phenomenon or situation about an individual, institution, organization or object. It gives predetermined values to (or assigns a rating to) the presence of each aspect or characteristic. The respondents check the items regarding the aspects presented in the situation (or rate them). At the stage of appraisal, the investigator counts the point values and gives the total weighted score to the phenomenon. Usually, the jury technique is employed, where a number of persons are asked to make the assessment and the average score is assigned to the phenomenon.
Score cards are frequently used
• To estimate the socio-economic status of families, communities, etc.
• To assess institutions in general, and their overall or specific contribution in particular
• To evaluate a literary or academic work, a text book and so on.
The socio-economic status of an individual includes the following items with their aspects and the point value for each (or rating assigned to each):
• Total income
• Material possessions - land, properties, modern amenities
• Educational background of the family
• Occupation
The construction of a score card involves three main steps:
• To identify the important aspects of a phenomenon which are to be evaluated
• To select the important aspects of the phenomenon
• To assign point values to each (or to the rating of each)
The evaluation of a phenomenon using a score card suffers from a limitation, namely, certain intangibles connected with the phenomenon do not lend themselves to ratings, and this vitiates the total assessment.

Chapter-3
CONSTRUCTION OF TEST

3.1. TEST CONSTRUCTION
A test item must focus the attention of the examinee on the principle or construct upon which the item is based. Ideally, students who answer a test item incorrectly will do so because their mastery of the principle or construct in focus was inadequate or incomplete. Any characteristics of a test item which distract the examinee from the major point or focus of an item reduce the effectiveness of that item. Any item answered correctly or incorrectly because of extraneous factors in the item results in misleading feedback to both examinee and examiner.

Test Construction
Writing items requires a decision about the nature of the item or question to which we ask students to respond, that is, whether discrete or integrative; how we will score the item, for example, objectively or subjectively; the skill we purport to test; and so on. We also consider the characteristics of the test takers and the test-taking strategies respondents will need to use. What follows is a short description of these considerations for constructing items. Test construction is based upon practical and scientific rules that are applied before, during and after each item until it finally becomes a part of the test. The following are the stages of constructing a test as followed by the Center.

Preparation
Each year the Center attracts a number of competent specialists in testing and educationalists to attend workshops that deal with the theoretical and applied aspects of the test.
The theoretical aspects include
1. The concepts on which a test is constructed
2. The goals and objectives of the test
3. The general components, parts and sections of a test
4. The theoretical and technical foundations to write test items

The applied training includes
1. Discussing different types of test items. This aims at putting theoretical and technical concepts into practice.
2. Collective training on writing items for various parts of the test
3. Constructing items on an individual basis (outside workshops), which is intended as peer teaching

Test Writing
• Each item's writer is charged with making questions that are relevant to his/her specialty.
• Each item's writer is given a code number that carries his/her personal data for confidentiality.
• Each item is given a code number that is kept intact. Each item is kept in the question bank even if it is not to be used in the test.

Referring
Various items of the test are referred by a 3-member committee:
1. A specialist in the content area
2. A specialist in measurement
3. An experienced educator
Each item is either
1. Accepted as is, or
2. Accepted after modifications, or
3. Rejected as invalid.
Based on a set questionnaire, the committee judges each item from various dimensions including the nature of the item, item difficulty, item conformity to the content controls, item bias, and item quality. Justifications have to be provided if an item is deemed invalid. All data is entered into the Center's computer.

Item Entry
All items are entered into the computer marked with the relevant judgment, except those deemed invalid.

Review
All the computerized items are reviewed by four testing experts to verify:
1. Strict implementation of referring committee comments.
2. Accuracy of numbers and graphs.
3. No grammatical or spelling errors or misprints.
4. Full compliance of items with the test's pre-set standards and regulations.
5. An answer key is provided for all test items.
6. Screening out the items that seem too complicated or might take too much time to answer.

Item Trial (pilot testing)
Trial items are included in the actual test and they go through the same stages of test preparation above, but they do not count towards the final score.

Item Analysis
Items are statistically analyzed so that the valid ones are selected for the test, while the invalid ones are rejected.

Test Construction
The test is constructed in its final form. Items are randomly selected from those deposited in the question bank in a manner that adequately represents the various parts of the test dimensions.

Test Production
The test is finally produced. Trial questions are included and various versions of the test are prepared to avoid cheating. The test is printed in booklets that include test instructions and test items.
Equivalence of Various Test Versions
The Center produces multiple versions of the same test. However, the Center has been cautious about discrepancies between various versions of the test in terms of difficulty, discrimination or content. An equivalence check is run during test construction, based on scientific criteria, to ensure test discrimination and differential capacity. Although it is more accurate to run the equivalence test during the construction stage, the equivalence criterion is applied during and after the test.

Item Writing
An Effective Item
• Discriminates students who understand the content from those who do not.
• Focuses on important information.
Item writing is the preparation of assessment tasks which can reveal the knowledge and skill of students when their responses to these tasks are inspected. Tasks which confuse, which do not engage the students, or which offend, always obscure important evidence by either failing to gather appropriate information or by distracting the student from the intended task. Sound assessment tasks will be those which students want to tackle, those which make clear what is required of the students, and those which provide evidence of the intellectual capabilities of the students. Remember, items are needed for each important aspect as reflected in the test specification. Some item writers fall into the trap of measuring what is easy to measure rather than what is important to measure. This enables superficial question quotas to be met but at the expense of validity – using questions that are easy to write rather than those which are important distorts the assessment process, and therefore conveys inappropriate information about the curriculum to students, teachers, and school communities.
There must be a match between what is taught and what is assessed. However, there must also be an effort to test for more complex levels of understanding, with care taken to avoid over-sampling items that assess only basic levels of knowledge. Tests that are too difficult (and have an insufficient floor) tend to lead to frustration and deflated scores, whereas tests that are too easy (and have an insufficient ceiling) facilitate a decline in motivation and lead to inflated scores. Tests can be improved by maintaining and developing a pool of valid items from which future tests can be drawn and that cover a reasonable span of difficulty levels. Item analysis helps improve test items and identify unfair or biased items. Results should be used to refine test item wording. In addition, closer examination of items will also reveal which questions were most difficult, perhaps indicating a concept that needs to be taught more thoroughly. If a particular distracter (that is, an incorrect answer choice) is the most often chosen answer, and especially if that distracter positively correlates with a high total score, the item must be examined more closely for correctness. This situation also provides an opportunity to identify and examine common misconceptions among students about a particular concept.
In general, once test items have been created, the value of these items can be systematically assessed using several methods representative of item analysis: a) a test item's level of difficulty, b) an item's capacity to discriminate, and c) the item characteristic curve. Difficulty is assessed by examining the number of persons correctly endorsing the answer. Discrimination can be examined by comparing the number of persons getting a particular item correct with the total test score. Finally, the item characteristic curve can be used to plot the likelihood of answering correctly against the level of success on the test.
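To make the difficulty and discrimination ideas concrete, here is a minimal Python sketch. The 0/1 response matrix is hypothetical, and the upper/lower 27% grouping follows the conventional item-analysis procedure described later in this chapter; the code is an illustration only, not the Center's actual procedure.

```python
import numpy as np

def item_analysis(responses, tail=0.27):
    """Difficulty and discrimination indices for a 0/1 (students x items) matrix."""
    data = np.asarray(responses)
    totals = data.sum(axis=1)                        # total score of each student
    order = np.argsort(totals)                       # students ranked from low to high
    n_tail = max(1, int(round(tail * len(totals))))  # size of the 27% groups
    low, high = order[:n_tail], order[-n_tail:]

    difficulty = data.mean(axis=0)                   # proportion answering each item correctly
    discrimination = data[high].mean(axis=0) - data[low].mean(axis=0)
    return difficulty, discrimination

# Hypothetical responses of 10 students to 4 items (1 = correct, 0 = wrong)
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
]
p, d = item_analysis(responses)
print("difficulty:", p.round(2))        # values near 1 mean very easy items, near 0 very hard
print("discrimination:", d.round(2))    # positive values separate high from low scorers
```

Plotting, for each item, the proportion of correct answers within successive total-score bands would give a rough empirical item characteristic curve of the kind mentioned above.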
Test Standardization: Steps of Test Standardization
Standardized tests are carefully constructed tests with a uniform procedure of scoring, administering and interpreting the test results. They consist of items of high quality. The items are pretested and selected on the basis of difficulty value, discrimination power, and relationship to clearly defined objectives in behavioural terms. Any person can administer the test, as the directions for administering, the time limits and the scoring are given. These are norm-based tests; norms may be based on age, grade, sex etc. Reliability and validity of the test are established beforehand. A manual is supplied which explains the purposes and uses of the test.

Steps for construction of a Standardized test
1. Planning the test.
2. Preparing the test.
3. Try-out of the test.
4. Reliability of the final test.
5. Validity of the final test.
6. Preparation of norms for the final test.
7. Preparation of manual and reproduction of the test.

1. Planning: For a standardized test a systematic and satisfactory planning is necessary. For this the test constructor should carefully fix up the objectives of the test. He should determine the nature of the content or topics, the item types (like long answer, short answer and very short answer type) and the types of objectives (like knowledge, understanding, application and skill) to be included. A blue-print should be prepared. The method of sampling and a detailed arrangement for the preliminary administration and the final administration should be determined. The probable length of the test, the number of questions and the time limit for test completion should also be determined. Clear-cut instructions for test scoring and its administration procedure should also be determined.
2. Writing the items of the test: Writing the items of the test is a creative art. It depends upon the item writer's intuition, imagination, experience and practice. Requirements of writing the items are:
• Complete mastery over the subject-matter. In order to write correct items the test constructor must be fully acquainted with all facts, fallacies, principles and misconceptions of the subject-matter.
• The test writer must be aware of the ability and intelligence level of the persons for whom the test is meant.
• The item writer must have a large vocabulary so that confusion in writing items may be avoided. The vocabulary used in the items should be simple enough to be understood by all.
• After the test items are written they must be arranged properly and assembled into a test. Items should be arranged from easy to difficult.
• The test constructor should give clear-cut instructions about the purpose of the test, the time limit and the procedure for recording the answers.
• After writing down the items, they must be submitted to a group of experts of language and of the subject.
3. Preliminary Administration: After the modification of items according to the suggestions of experts, the test is ready for experimental try-out to find out the major weaknesses and inadequacies of the items. It helps in finding out the ambiguous items, non-functioning distractors in multiple-choice questions, and very difficult or very easy items. It also helps in determining a reasonable time limit and the number of items to be included in the final test, avoiding overlapping and identifying any vagueness in the instructions.
Try-out is done in three stages:
(a) Preliminary try-out
It is done individually to improve and modify the language difficulty and ambiguity of items. It is done on around 100 individuals and the workability of the items is observed so that items can be modified.
(b) The proper try-out
It is done on around 400 individuals. Its sample should be similar to those for whom the test is intended. The purpose of this try-out is to select good items for the test and reject the poor items. This step includes two activities:
1. Item analysis – A test should neither be too easy nor too difficult, and each item should discriminate validly among high and low achievers. The procedure used to judge the quality of an item is called item analysis. It includes the following steps:
• The test papers should be arranged from the highest to the lowest score.
• 27% of the test papers from the highest end and 27% from the lowest end are selected.
• Then the number of pupils in the upper and lower groups who selected each alternative for each test item is counted.
After item analysis only good items with appropriate difficulty level and satisfactory discriminating power are retained and form the final test. The desired number of items is selected according to the blue-print and arranged in order of difficulty in the final draft. The time limit is set.
(c) Final try-out
The final try-out is done on a large sample of individuals for estimating the reliability and validity of the test. This final try-out indicates how effective the test really will be when it is administered on the sample for which it is really intended.
4. Reliability of the test: When the test is finally composed, the final test is again administered on a fresh sample in order to compute the reliability coefficient. This time also the sample should not be less than 100. Reliability is calculated through the test-retest method, the split-half method and the equivalent-form method. Reliability shows the consistency of test scores.
5. Validity of the final test: Validity refers to what the test measures and how well it measures it. If a test measures well the trait that it intends to measure, then the test can be said to be a valid one. It is the correlation of the test with some outside independent criterion.
6. Norms of the final test: The test constructor also prepares norms of the test. Norms are defined as average performance. They are prepared to meaningfully interpret the scores obtained on the test. The obtained scores on the test by themselves convey no meaning regarding the ability or trait being measured. But when these are compared with norms, a meaningful inference can be immediately drawn. The norms may be age norms, grade norms etc. Similar norms cannot be used for all tests.
7. Preparation of manual and reproduction of the test: Preparation of the manual is the last step in test construction, in which the psychometric properties of the test, norms and references are reported. It gives a clear indication regarding the procedures of test administration, the scoring methods and time limits. It also includes instructions regarding the test. A standardized test assesses the rate of development of students' ability. It helps in diagnosing the learning difficulties of the students. It also helps the teacher to assess the effectiveness of his teaching and the school instructional programme.

Characteristics of Good Evaluation
1. Objective-basedness: Evaluation is making judgement about some phenomena or performance on the basis of some pre-determined objectives. Therefore a tool meant for evaluation should measure attainment in terms of criteria determined by instructional objectives. This is possible only if the evaluator is definite about the objectives, the degree of realization of which he is going to evaluate. Therefore each item of the tool should represent an objective.
2. Comprehensiveness: A tool should cover all points expected to be learnt by the pupils. It should also cover all the pre-determined objectives. This is referred to as comprehensiveness.
3. Discriminating power: A good evaluation tool should be able to discriminate the respondents on the basis of the phenomena measured. Hence while constructing a tool for evaluation, the discrimination power has to be taken care of. This may be at two levels - first for the test as a whole and then for each item included.
4. Reliability: Reliability of a tool refers to the degree of consistency and accuracy with which it measures what it is intended to measure. If the evaluation gives more or less the same result every time it is used, such evaluation is said to be reliable. Consistency of a tool can be improved by limiting subjectivity of all kinds. Making items on the basis of pre-determined specific objectives, ensuring that the expected answers are definite and objective, providing a clearly spelt-out scheme for scoring and conducting evaluation under identical and ideal conditions will help in enhancing reliability. The test-retest method, split-half method and equivalent form or parallel form method are the important methods generally used to determine the reliability of a tool.
5. Validity: Validity is the most important quality needed for an evaluation tool. If the tool is able to measure what it is intended to measure, it can be said that the tool is valid. It should fulfill the objectives for which it is developed. Validity can be defined as "the accuracy with which it measures what it is intended to measure, or the degree to which it approaches infallibility in measuring what it purports to measure". Content validity, predictive validity, construct validity, concurrent validity, congruent validity, factorial validity, criterion-related validity, etc. are some of the important types of validity which a tool for evaluation needs to fulfill.
6. Objectivity: A tool is said to be objective if it is free from personal bias in interpreting its scope as well as in scoring the responses. Objectivity is one of the most primary pre-requisites required for maintaining all other qualities of a good tool.
7. Practicability: A tool, however well it satisfies all the above criteria, may be useless if it is not practically feasible. For example, suppose that, in order to ensure comprehensiveness, it was felt that a thousand items should be given to be answered in ten hours. This may yield a valid result, but from a practical point of view it is quite impossible.

3.2. CHARACTERISTICS OF A GOOD TEST
Whether a test is standardized or teacher-made, it should possess the qualities of a good measuring instrument. The qualities of a good test are: validity, reliability, and usability.

3.2.1. Validity
Validity is the most important characteristic of a good test. Validity refers to the extent to which the test serves its purpose, or the efficiency with which it measures what it intends to measure. The validity of a test concerns what the test measures and how well it does so. For example, in order to judge the validity of a test, it is necessary to consider what behavior the test is supposed to measure. A test may reveal consistent scores, but if it is not useful for the purpose, then it is not valid. For example, a test for grade V students given to grade IV is not valid. Validity is classified into four types: content validity, concurrent validity, predictive validity, and construct validity.
Content validity: Means the extent to which the content of the test is truly representative of the content of the course. A well constructed achievement test should cover the objectives of instruction, not just its subject matter. Three domains of behavior are included: cognitive, affective and psychomotor.
Concurrent validity: Is the degree to which the test agrees or correlates with a criterion which is set up as an acceptable measure. The criterion is always available at the time of testing. Concurrent validity, or criterion-related validity, establishes a statistical means to interpret and correlate test results. For example, a teacher wants to validate an achievement test in Science (X) he constructed. He administers this test to his students. The result of this test can be compared to another Science test (Y), which has been proven valid. If the relationship between X and Y is high, this means that the achievement test in Science is valid. According to Garrett, a highly reliable test is always a valid measure of some function.
Predictive validity: Is evaluated by relating the test to some actual achievement of the students which the test is supposed to predict, that is, later success. The criterion measure for this type is important because the future outcome of the testee is predicted. The criterion measure against which the test scores are validated becomes available only after a long period.
Construct validity: Is the extent to which the test measures a theoretical trait. Test items must include factors that make up a psychological construct like intelligence, critical thinking, reading comprehension or mathematical aptitude.
Factors that influence validity are:
1. Inappropriateness of test items – items that measure knowledge cannot measure skill.
2. Directions – unclear directions reduce validity. Directions that do not clearly indicate how the pupils should answer and record their answers affect the validity of test items.
3. Reading vocabulary and sentence structure – too difficult and complicated vocabulary and sentence structure will not measure what the test intends to measure.
4. Level of difficulty of items – too difficult or too easy test items cannot discriminate between bright and slow pupils and will lower validity.
5. Poorly constructed test items – test items that provide clues and items that are ambiguous confuse the students and will not reveal a true measure.
6. Length of the test – a test should be of sufficient length to measure what it is supposed to measure. A test that is too short cannot adequately sample the performance we want to measure.
7. Arrangement of items – test items should be arranged according to difficulty, from the easiest items to the difficult ones. Difficult items, when encountered early, may cause a mental block and may also cause students to spend too much time on that item.
8. Patterns of answers – when students can detect the pattern of correct answers, they are liable to guess, and this lowers validity.
3.2.2. Reliability
Reliability means consistency and accuracy. It refers to the extent to which a test is dependable, self-consistent and stable. In other words, the test agrees with itself. It is concerned with the consistency of responses from moment to moment; even if the person takes the same test twice, the test yields the same result. For example, if a student got a score of 90 in an English achievement test this Monday and gets 30 on the same test given on Friday, then both scores cannot be relied upon.
Inconsistency of individual scores may be affected by the person scoring the test, by limited sampling of certain areas of the subject matter, and particularly by the examinee himself. If the examinee's mood is unstable, this may affect his score.
Factors that affect reliability are:
1. Length of the test. As a general rule, the longer the test, the higher the reliability. A longer test provides a more adequate sample of the behavior being measured and is less distorted by chance factors like guessing.
2. Difficulty of the test. When a test is too easy or too difficult, it cannot show the differences among individuals; thus it is unreliable. Ideally, achievement tests should be constructed such that the average score is 50 percent correct and the scores range from near zero to near perfect.
3. Objectivity. Objectivity eliminates the bias, opinions or judgments of the person who checks the test. Reliability is greater when tests can be scored objectively.
4. Heterogeneity of the student group. Reliability is higher when test scores are spread over a range of abilities. Measurement errors are relatively smaller in a heterogeneous group than in a group that is more homogeneous.
5. Limited time. A test in which speed is a factor is more reliable than a test that is conducted over a longer time.
A reliable test, however, is not always valid.

Methods of Estimating Reliability of a Test
1. Test-retest method. The same instrument is administered twice to the same group of subjects. The correlation between the scores of the first and second administrations of the test is determined by the Spearman rank correlation coefficient (Spearman rho) or by the Pearson Product-Moment Correlation Coefficient.
The formula using Spearman rho is:
rs = 1 – 6ΣD² / (N³ – N)
where ΣD² = the sum of the squared differences between ranks, and N = the total number of cases.
For example, 10 students were used as samples to test the reliability of an achievement test in Biology. After two administrations of the test, the data and the computation of Spearman rho are presented in the table below:

Students   Scores          Ranks           Difference         Difference
S.N.       S1      S2      R1      R2      between ranks, D   squared, D²
1.         89      90      2       1.5     0.5                0.25
2.         85      85      4.5     4       0.5                0.25
3.         77      76      9       9       0                  0
4.         80      81      7.5     8       0.5                0.25
5.         83      83      6       6.5     0.5                0.25
6.         87      85      3       4       1.0                1.00
7.         90      90      1       1.5     0.5                0.25
8.         73      72      10      10      0                  0
9.         85      85      4.5     4       0.5                0.25
10.        80      83      7.5     6.5     1.0                1.00
Total                                                          ΣD² = 3.5

rs = 1 – 6ΣD²/(N³ – N) = 1 – 6(3.5)/(10³ – 10) = 1 – 21/990 = 1 – 0.0212 = 0.98
The rs value obtained is 0.98, which means a very high relationship; hence the achievement test in Biology is reliable.
The Pearson Product-Moment Correlation Coefficient can also be used in the test-retest method of estimating the reliability of a test. The raw-score formula is:
r = [NΣXY – (ΣX)(ΣY)] / √{[NΣX² – (ΣX)²] [NΣY² – (ΣY)²]}
Using the same data as for Spearman rho, the scores for the 1st and 2nd administrations may be presented in this way:

X (S1)   Y (S2)   X²       Y²       XY
89       90       7921     8100     8010
85       85       7225     7225     7225
77       76       5929     5776     5852
80       81       6400     6561     6480
83       83       6889     6889     6889
87       85       7569     7225     7395
90       90       8100     8100     8100
73       72       5329     5184     5256
85       85       7225     7225     7225
80       83       6400     6889     6640
ΣX = 829   ΣY = 830   ΣX² = 68987   ΣY² = 69174   ΣXY = 69072

2. Alternate-forms method. This is the second method of establishing the reliability of test results. In this method, we give two forms of a test, similar in content, type of items, difficulty and other respects, in close succession to the same group of students. To test the reliability, the correlation technique is used (refer to the formula used for the Pearson Product-Moment Correlation Coefficient).
3. Split-half method. The test may be administered once, but the test items are divided into two halves. The most common procedure is to divide the test into odd and even items. The results are correlated and the r obtained is the reliability coefficient for a half test. The Spearman-Brown formula is then used:
rt = 2rht / (1 + rht)
where rt = reliability of the whole test and rht = reliability of half of the test.
For example, if rht is 0.69, what is rt?
rt = 2(0.69) / (1 + 0.69) = 1.38 / 1.69 = 0.82 – a very high relationship, so the test is reliable.
The split-half method is applicable to measuring instruments that are not highly speeded. If the measuring instrument includes easy items and the subject is able to answer correctly all or nearly all items within the time limit of the test, the scores on the two halves would be about similar and the correlation would be close to +1.00.
4. Kuder-Richardson Formula 21 is the last method of establishing the reliability of a test. Like the split-half method, the test is conducted only once. This method assumes that all items are of equal difficulty. The formula is:
KR21 = [k / (k – 1)] × [1 – X(k – X) / (kS²)]
Where:
X = the mean of the obtained scores
S = the standard deviation
k = the total number of items
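The calculations walked through above can be reproduced in a few lines of code. The sketch below (Python) uses the Biology test-retest scores from the table and the 0.69 half-test reliability quoted in the Spearman-Brown example; it simply re-implements the formulas given in this section and is offered only as an illustration.

```python
from math import sqrt

s1 = [89, 85, 77, 80, 83, 87, 90, 73, 85, 80]   # first administration
s2 = [90, 85, 76, 81, 83, 85, 90, 72, 85, 83]   # second administration
n = len(s1)

def ranks(scores):
    """Ranks with rank 1 for the highest score; ties share the average rank."""
    out = []
    for x in scores:
        higher = sum(1 for y in scores if y > x)
        equal = sum(1 for y in scores if y == x)
        out.append(higher + (equal + 1) / 2)
    return out

# Spearman rho: rs = 1 - 6*sum(D^2) / (N^3 - N)
d2 = sum((a - b) ** 2 for a, b in zip(ranks(s1), ranks(s2)))
rho = 1 - 6 * d2 / (n ** 3 - n)

# Pearson product-moment correlation from raw scores
sx, sy = sum(s1), sum(s2)
sxx, syy = sum(x * x for x in s1), sum(y * y for y in s2)
sxy = sum(x * y for x, y in zip(s1, s2))
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Spearman-Brown correction: whole-test reliability from a half-test reliability
rht = 0.69
rt = 2 * rht / (1 + rht)

print(round(rho, 2), round(r, 2), round(rt, 2))   # approximately 0.98, 0.97 and 0.82
```

The same pattern extends to KR-21: once the mean, standard deviation and number of items are known, the formula above is a single arithmetic expression.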
Usability
Usability means the degree to which the tests are used without much expenditure of time, money and effort. It also means practicability. Factors that determine usability are: administrability, scorability, interpretability, economy and proper mechanical makeup of the test.
Administrability means that the test can be administered with ease, clarity and uniformity. Directions must be made simple, clear and concise. Time limits, oral instructions and sample questions are specified. Provisions for preparation, distribution, and collection of test materials must be definite. Scorability is concerned with the scoring of the test. A good test is easy to score: the scoring direction is clear, the scoring key is simple, an answer sheet is available, and machine scoring is made possible as far as it can be. Test results can be useful only if they are interpreted after evaluation. Correct interpretation and application of test results is very useful for sound educational decisions. An economical test is of low cost. One way to economize cost is to use answer sheets and reusable tests. However, test validity and reliability should not be sacrificed for economy. Proper mechanical make-up of the test concerns how tests are printed, what font sizes are used, and whether the illustrations fit the level of pupils/students.

Establishing Norms/Standards
The concept of reliability relates to the question of 'accuracy' with which we measure 'what'. A test must be reliable, as it must have the ability to consistently yield the same results when repeated measurements are taken of the same individual under the same conditions. A test is said to be reliable if it functions in a consistent manner. A reliable test gives stable and trustworthy results. In measuring reliability, the emphasis is upon the agreement of the test with itself.
Characteristics of Reliability
Following are the characteristics of reliability:
1. Reliability is consistency of test scores.
2. It is the measure of variable error or chance error or measurement error.
3. It is a function of test length.
4. It refers to the stability of the test for a certain population.
5. It is self-correlation.
6. It is the coefficient of stability and internal consistency.
7. It is reproducibility of the scores.
8. It is an important characteristic of a measuring instrument.
9. It refers to the accuracy or precision of a measuring instrument.
According to Hopkin, reliability means the consistency with which a test measures whatever it measures. Reliability is the degree to which an assessment tool produces stable and consistent results.

Types of Reliability
1. Test-retest reliability: Is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time. Example: A test designed to assess student learning in psychology could be given to a group of students twice, with the second administration perhaps coming a week after the first. The obtained correlation coefficient would indicate the stability of the scores.
2. Parallel forms reliability: Is a measure of reliability obtained by administering different versions of an assessment tool (both versions must contain items that probe the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from the two versions can then be correlated in order to evaluate the consistency of results across alternate versions. Example: If you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set of items that all pertain to critical thinking and then randomly split the questions up into two sets, which would represent the parallel forms.
3. Inter-rater reliability: Is a measure of reliability used to assess the degree to which different judges or raters agree in their assessment decisions. Inter-rater reliability is useful because human observers will not necessarily interpret answers the same way; raters may disagree as to how well certain responses or material demonstrate knowledge of the construct or skill being assessed. Example: Inter-rater reliability might be employed when different judges are evaluating the degree to which art portfolios meet certain standards. Inter-rater reliability is especially useful when judgments can be considered relatively subjective. Thus, the use of this type of reliability would probably be more likely when evaluating artwork as opposed to math problems.
4. Internal consistency reliability: Is a measure of reliability used to evaluate the degree to which different test items that probe the same construct produce similar results.
(a) Average inter-item correlation is a subtype of internal consistency reliability. It is obtained by taking all of the items on a test that probe the same construct (e.g., reading comprehension), determining the correlation coefficient for each pair of items, and finally taking the average of all of these correlation coefficients. This final step yields the average inter-item correlation.
(b) Split-half reliability: Is another subtype of internal consistency reliability. The process of obtaining split-half reliability is begun by "splitting in half" all items of a test that are intended to probe the same area of knowledge (e.g., World War II) in order to form two "sets" of items. The entire test is administered to a group of individuals, the total score for each "set" is computed, and finally the split-half reliability is obtained by determining the correlation between the two total "set" scores.
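The two internal-consistency subtypes just described can be illustrated with a short sketch. The item responses below are hypothetical; the functions simply follow the definitions given above (the average of all pairwise item correlations, and the correlation between odd-item and even-item totals).

```python
import numpy as np

# Hypothetical 0-5 ratings of eight examinees on six items probing one construct
items = np.array([
    [4, 5, 4, 4, 5, 4],
    [2, 3, 2, 3, 2, 2],
    [5, 5, 4, 5, 5, 4],
    [3, 3, 3, 2, 3, 3],
    [4, 4, 5, 4, 4, 5],
    [1, 2, 2, 1, 2, 1],
    [3, 4, 3, 3, 4, 3],
    [5, 4, 5, 5, 4, 5],
])

def average_inter_item_correlation(data):
    """Mean of the correlations between every pair of items."""
    corr = np.corrcoef(data, rowvar=False)          # item-by-item correlation matrix
    upper = corr[np.triu_indices_from(corr, k=1)]   # each pair counted once
    return upper.mean()

def split_half_reliability(data):
    """Correlation between odd-item and even-item total scores."""
    odd_totals = data[:, 0::2].sum(axis=1)
    even_totals = data[:, 1::2].sum(axis=1)
    return np.corrcoef(odd_totals, even_totals)[0, 1]

print(round(average_inter_item_correlation(items), 2))
print(round(split_half_reliability(items), 2))
```

The half-test correlation produced by the second function is the value that the Spearman-Brown formula, given earlier, steps up to a whole-test reliability.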
Types of Validity
1. Face Validity: Ascertains that the measure appears to be assessing the intended construct under study. The stakeholders can easily assess face validity. Although this is not a very "scientific" type of validity, it may be an essential component in enlisting the motivation of stakeholders. If the stakeholders do not believe the measure is an accurate assessment of the ability, they may become disengaged with the task. Example: If a measure of art appreciation is created, all of the items should be related to the different components and types of art. If the questions are regarding historical time periods, with no reference to any artistic movement, stakeholders may not be motivated to give their best effort or invest in this measure because they do not believe it is a true assessment of art appreciation.
2. Construct Validity: Is used to ensure that the measure is actually measuring what it is intended to measure (i.e. the construct), and not other variables. Using a panel of "experts" familiar with the construct is a way in which this type of validity can be assessed. The experts can examine the items and decide what that specific item is intended to measure. Students can be involved in this process to obtain their feedback. Example: A women's studies program may design a cumulative assessment of learning throughout the major. If the questions are written with complicated wording and phrasing, this can cause the test to inadvertently become a test of reading comprehension, rather than a test of women's studies. It is important that the measure is actually assessing the intended construct, rather than an extraneous factor.
3. Criterion-Related Validity: Is used to predict future or current performance - it correlates test results with another criterion of interest. Example: A physics program might design a measure to assess cumulative student learning throughout the major. The new measure could be correlated with a standardized measure of ability in this discipline, such as an ETS field test or the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment tool.
4. Formative Validity: When applied to outcomes assessment, it is used to assess how well a measure is able to provide information to help improve the program under study. Example: When designing a rubric for history, one could assess students' knowledge across the discipline. If the measure can provide information that students are lacking knowledge in a certain area, for instance the Civil Rights Movement, then that assessment tool is providing meaningful information that can be used to improve the course or program requirements.
5. Sampling Validity: (similar to content validity) Ensures that the measure covers the broad range of areas within the concept under study. Not everything can be covered, so items need to be sampled from all of the domains. This may need to be completed using a panel of "experts" to ensure that the content area is adequately sampled. Additionally, a panel can help limit "expert" bias (i.e. a test reflecting what an individual personally feels are the most important or relevant areas). Example: When designing an assessment of learning in the theatre department, it would not be sufficient to only cover issues related to acting. Other areas of theatre such as lighting, sound, and the functions of stage managers should all be included. The assessment should reflect the content area in its entirety.

3.2.3. Objectivity
• Objectivity is also referred to as rater reliability
• Objectivity is the close agreement between scores assigned by two or more judges
Factors Affecting Objectivity
• The clarity of the scoring system
• The degree to which judges can assign scores accurately (fairly, without bias)
A test has objectivity if it receives the same score when it is examined by the same examiner on two different occasions. By objectivity we mean the degree to which the personal element or judgement is eliminated in scoring. A measuring instrument is said to be highly objective if the score assigned by different but equally competent scorers is not affected by judgement, personal opinion or bias. Objectivity can be determined by finding the co-efficient of correlation between scores assigned to a group of papers by the same examiner on two occasions. It is called the co-efficient of objectivity.
Usability
Usability means the degree to which the tests are used without much expenditure of time, money and effort. It also means practicability. Factors that determine usability are: administrability, scorability, interpretability, economy and proper mechanical makeup of the test.
Administrability means that the test can be administered with ease, clarity and uniformity. Directions must be made simple, clear and concise. Time limits, oral instructions and sample questions are specified. Provisions for preparation, distribution, and collection of test materials must be definite. Scorability is concerned with the scoring of the test. A good test is easy to score: the scoring direction is clear, the scoring key is simple, an answer sheet is available, and machine scoring is made possible as far as it can be. Test results can be useful only if they are interpreted after evaluation. Correct interpretation and application of test results is very useful for sound educational decisions. An economical test is of low cost. One way to economize cost is to use answer sheets and reusable tests. However, test validity and reliability should not be sacrificed for economy. Proper mechanical make-up of the test concerns how tests are printed, what font sizes are used, and whether the illustrations fit the level of pupils/students.
Usability is an important criterion in assessing the value of a test. It depends upon a number of factors such as ease of administration, ease of scoring, ease of interpretation and use of scores, low cost, satisfactory format etc.
1. Economy: More items give more reliability, but in the interest of limiting the interview or observation time the number of measurement questions should be reduced. The choice of data collection method is also often dictated by economic factors. The rising cost of personal interviewing first led to an increased use of telephone surveys and subsequently to the current rise in internet surveys. In standardized tests, the cost of test materials alone can be such a significant expense that it encourages multiple reuse. For fast and economical scoring, computer scoring and scanning are increasingly used.
2. Convenience in administration: A questionnaire or a measurement scale with a set of detailed but clear instructions, with examples, is easier to complete correctly than one that lacks these features. In a well-prepared study, it is not uncommon for the interviewer instructions to be several times longer than the interview questions. Naturally, the more complex the concepts and constructs, the greater is the need for clear and complete instructions. The instrument should be made easier to administer by giving close attention to its design and layout. A long completion time, complex instructions, participants' perceived difficulty with the survey, and their rated enjoyment of the process also influence design.
3. Scoring and interpretability: This aspect of practicality is relevant when persons other than the test designers must interpret the results. It is usually, but not exclusively, an issue with standardized tests. In such a case, the designer of the data collection instrument provides several key pieces of information to make interpretation possible:
• A statement of the functions the test was designed to measure and the procedures by which it was developed.
• Detailed instructions for administration.
• Scoring keys and instructions.
• Norms for appropriate reference groups.
• Evidence about reliability.
• Evidence regarding the inter-correlations of sub-scores.
• Evidence regarding the relationship of the test to other measures.
• Guides for test use.
Chapter-4
METHODS OF RELIABILITY AND VALIDITY

4.1. RELIABILITY METHODS
There are four general classes of reliability estimates, each of which estimates reliability in a different way. They are:
• Inter-Rater or Inter-Observer Reliability: Used to assess the degree to which different raters/observers give consistent estimates of the same phenomenon.
• Test-Retest Reliability: Used to assess the consistency of a measure from one time to another.
• Parallel-Forms Reliability: Used to assess the consistency of the results of two tests constructed in the same way from the same content domain.
• Internal Consistency Reliability: Used to assess the consistency of results across items within a test.
Let's discuss each of these in turn.

Inter-Rater or Inter-Observer Reliability
Whenever you use humans as a part of your measurement procedure, you have to worry about whether the results you get are reliable or consistent. People are notorious for their inconsistency. We are easily distractible. We get tired of doing repetitive tasks. We daydream. We misinterpret.
So how do we determine whether two observers are being consistent in their observations? You probably should establish inter-rater reliability outside of the context of the measurement in your study. After all, if you use data from your study to establish reliability, and you find that reliability is low, you're kind of stuck. Probably it's best to do this as a side study or pilot study. And, if your study goes on for a long time, you may want to reestablish inter-rater reliability from time to time to assure that your raters aren't changing.
There are two major ways to actually estimate inter-rater reliability. If your measurement consists of categories -- the raters are checking off which category each observation falls in -- you can calculate the percent of agreement between the raters. For instance, let's say you had 100 observations that were being rated by two raters. For each observation, the rater could check one of three categories. Imagine that on 86 of the 100 observations the raters checked the same category. In this case, the percent of agreement would be 86%. OK, it's a crude measure, but it does give an idea of how much agreement exists, and it works no matter how many categories are used for each observation.
The other major way to estimate inter-rater reliability is appropriate when the measure is a continuous one. There, all you need to do is calculate the correlation between the ratings of the two observers. For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation between these ratings would give you an estimate of the reliability or consistency between the raters.
You might think of this type of reliability as "calibrating" the observers. There are other things you could do to encourage reliability between observers, even if you don't estimate it. For instance, I used to work in a psychiatric unit where every morning a nurse had to do a ten-item rating of each patient on the unit. Of course, we couldn't count on the same nurse being present every day, so we had to find a way to assure that any of the nurses would give comparable ratings. The way we did it was to hold weekly "calibration" meetings where we would have all of the nurses' ratings for several patients and discuss why they chose the specific values they did. If there were disagreements, the nurses would discuss them and attempt to come up with rules for deciding when they would give a "3" or a "4" for a rating on a specific item. Although this was not an estimate of reliability, it probably went a long way toward improving the reliability between raters.
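The two ways of estimating inter-rater reliability described above -- percent agreement for categorical ratings and a correlation for continuous ratings -- are easy to compute. The sketch below uses made-up ratings from two hypothetical raters purely to illustrate the calculations.

```python
import numpy as np

# Categorical case: two raters assign one of three categories to each observation
rater_a = ['low', 'mid', 'mid', 'high', 'low', 'high', 'mid', 'low', 'mid', 'high']
rater_b = ['low', 'mid', 'high', 'high', 'low', 'high', 'mid', 'mid', 'mid', 'high']
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"percent of agreement: {agreement:.0%}")   # 80% here, since they agree on 8 of 10 observations

# Continuous case: two observers rate classroom activity on a 1-to-7 scale every 30 seconds
obs_1 = [3, 4, 5, 4, 6, 2, 3, 5, 4, 6]
obs_2 = [3, 4, 4, 5, 6, 2, 4, 5, 4, 7]
print("correlation:", round(np.corrcoef(obs_1, obs_2)[0, 1], 2))
```

Either number can be tracked over the course of a study to check that the raters remain calibrated.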
deciding when they would give a “3” or a “4” for a rating on a specific be able to generate lots of items that reflect the same construct. This is
item. Although this was not an estimate of reliability, it probably went often no easy feat. Furthermore, this approach makes the assumption
a long way toward improving the reliability between raters. that the randomly divided halves are parallel or equivalent. Even by
chance this will sometimes not be the case. The parallel forms approach
Test-Retest Reliability
We estimate test-retest reliability when we administer the same test to the same sample on two different occasions. This approach assumes that there is no substantial change in the construct being measured between the two occasions. The amount of time allowed between the two measures is critical. We know that if we measure the same thing twice, the correlation between the two observations will depend in part on how much time elapses between the two measurement occasions. The shorter the time gap, the higher the correlation; the longer the time gap, the lower the correlation. This is because the two observations are related over time -- the closer in time we get, the more similar the factors that contribute to error. Since this correlation is the test-retest estimate of reliability, you can obtain considerably different estimates depending on the interval.

Parallel-Forms Reliability
In parallel forms reliability you first have to create two parallel forms. One way to accomplish this is to create a large set of questions that address the same construct and then randomly divide the questions into two sets. You administer both instruments to the same sample of people. The correlation between the two parallel forms is the estimate of reliability. One major problem with this approach is that you have to be able to generate lots of items that reflect the same construct. This is often no easy feat. Furthermore, this approach makes the assumption that the randomly divided halves are parallel or equivalent. Even by chance this will sometimes not be the case. The parallel forms approach is very similar to the split-half reliability described below. The major difference is that parallel forms are constructed so that the two forms can be used independent of each other and considered equivalent measures. For instance, we might be concerned about a testing threat to internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that problem. It would even be better if we randomly assign individuals to receive Form A or B on the pretest and then switch them on the posttest. With split-half reliability we have an instrument that we wish to use as a single measurement instrument and only develop randomly split halves for purposes of estimating reliability.

Internal Consistency Reliability
In internal consistency reliability estimation we use our single measurement instrument administered to a group of people on one occasion to estimate reliability. In effect we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. We are looking at how consistent the results are for different items for the same construct within the measure. There are a wide variety of internal consistency measures that can be used.

Average Inter-item Correlation
The average inter-item correlation uses all of the items on our instrument that are designed to measure the same construct. We first
compute the correlation between each pair of items, as illustrated in the figure. For example, if we have six items we will have 15 different item pairings (i.e., 15 correlations). The average inter-item correlation is simply the average or mean of all these correlations. In the example, we find an average inter-item correlation of .90, with the individual correlations ranging from .84 to .95.

Average Item-total Correlation
This approach also uses the inter-item correlations. In addition, we compute a total score for the six items and use that as a seventh variable in the analysis. The figure shows the six item-to-total correlations at the bottom of the correlation matrix. They range from .82 to .88 in this sample analysis, with the average of these at .85.

Split-Half Reliability
In split-half reliability we randomly divide all items that purport to measure the same construct into two sets. We administer the entire instrument to a sample of people and calculate the total score for each randomly divided half. The split-half reliability estimate, as shown in the figure, is simply the correlation between these two total scores. In the example it is .87.

Cronbach's Alpha (α)
Imagine that we compute one split-half reliability, then randomly divide the items into another set of split halves and recompute, and keep doing this until we have computed all possible split-half estimates of reliability. Cronbach's Alpha is mathematically equivalent to the average of all possible split-half estimates, although that's not how we compute it. Notice that when I say we compute all possible split-half estimates, I don't mean that each time we go and measure a new sample! That would take forever. Instead, we calculate all split-half estimates from the same sample. Because we measured all of our sample on each of the six items, all we have to do is have the computer analysis do the random subsets of items and compute the resulting correlations. The figure shows several of the split-half estimates for our six-item example and lists them as SH with a subscript. Just keep in mind that although Cronbach's Alpha is equivalent to the average of all possible split-half correlations, we would never actually calculate it that way. Some clever mathematician (Cronbach, I presume!) figured out a way to get the mathematical equivalent a lot more quickly.
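The figures referred to in these examples do not survive in this edition, but the calculations are easy to reproduce. The Python sketch below is my own illustration with a made-up matrix of six item scores (it is not the example data from the text): it computes the average inter-item correlation, the item-total correlations and Cronbach's Alpha.

import numpy as np

def average_interitem_correlation(items):
    # Mean of the correlations between every pair of items (columns).
    r = np.corrcoef(items, rowvar=False)        # k x k correlation matrix
    pairs = r[np.triu_indices_from(r, k=1)]     # the k(k-1)/2 distinct pairings
    return pairs.mean()

def item_total_correlations(items):
    # Correlation of each item with the total score across all items.
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total)[0, 1]
                     for j in range(items.shape[1])])

def cronbach_alpha(items):
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical scores of eight people on six items
data = np.array([
    [4, 5, 4, 5, 4, 5],
    [2, 3, 2, 2, 3, 2],
    [5, 5, 4, 5, 5, 4],
    [3, 3, 3, 4, 3, 3],
    [1, 2, 1, 2, 2, 1],
    [4, 4, 5, 4, 4, 4],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 5],
])
print(round(average_interitem_correlation(data), 2))
print(np.round(item_total_correlations(data), 2))
print(round(cronbach_alpha(data), 2))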
Comparison of Reliability Estimators
Each of the reliability estimators has certain advantages and disadvantages. Inter-rater reliability is one of the best ways to estimate reliability when your measure is an observation. However, it requires multiple raters or observers. As an alternative, you could look at the correlation of ratings of the same single observer repeated on two different occasions. For example, let's say you collected videotapes of child-mother interactions and had a rater code the videos for how often the mother smiled at the child. To establish inter-rater reliability you could take a sample of videos and have two raters code them independently. To estimate test-retest reliability you could have a single rater code the same videos on two different occasions. You might use the inter-rater approach especially if you were interested in using a team of raters and you wanted to establish that they yielded consistent results. If you get a suitably high inter-rater reliability you could then justify allowing them to work independently on coding different videos. You might use the test-retest approach when you only have a single rater and don't want to train any others. On the other hand, in some studies it is reasonable to do both to help establish the reliability of the raters or observers.
The parallel forms estimator is typically only used in situations where you intend to use the two forms as alternate measures of the same thing. Both the parallel forms and all of the internal consistency estimators have one major constraint -- you have to have multiple items designed to measure the same construct. This is relatively easy to achieve in certain contexts like achievement testing (it's easy, for instance, to construct lots of similar addition problems for a math test), but for more complex or subjective constructs this can be a real challenge. If you do have lots of items, Cronbach's Alpha tends to be the most frequently used estimate of internal consistency.
The test-retest estimator is especially feasible in most experimental and quasi-experimental designs that use a no-treatment control group. In these designs you always have a control group that is measured on two occasions (pretest and posttest). The main problem with this approach is that you don't have any information about reliability until you collect the posttest and, if the reliability estimate is low, you're pretty much sunk.
4.2. SPLIT HALF
One way to test the reliability of a test is to repeat the test. This is not always possible. Another approach, which is applicable to questionnaires, is to divide the test into even and odd questions and compare the results.
Example 1: 12 students take a test with 50 questions. For each student the total score is recorded along with the sum of the scores for the even questions and the sum of the scores for the odd questions, as shown in Figure 1. Determine whether the test is reliable by using the split-half methodology.
Figure-1: Split-half methodology for Example 1
The statistical test consists of looking at the correlation coefficient (cell G3 of Figure 1). If it is high then the questionnaire is considered to be reliable.
r = CORREL(C4:C15,D4:D15) = 0.667277
One problem with the split-half reliability coefficient is that since only half the number of items is used, the reliability coefficient is reduced. To get a better estimate of the reliability of the full test, we apply the Spearman-Brown correction, namely:
ρ = 2r / (1 + r) = 0.800439
This result shows that the test is quite reliable.
Real Statistics Functions: The Real Statistics Resource Pack contains the following supplemental functions:
SPLIT_HALF(R1, R2) = split-half coefficient (after the Spearman-Brown correction) for the data in ranges R1 and R2
SPLITHALF(R1, type) = split-half measure for the scores in the first half of the items in R1 vs. the second half of the items if type = 0, and the odd items in R1 vs. the even items if type = 1.
The SPLIT_HALF function ignores any empty cells and cells with non-numeric values. This is not so for the SPLITHALF function. For Example 1, SPLIT_HALF(C4:C15, D4:D15) = 0.800439.
Example 2: Calculate the split-half coefficient of the ten-question questionnaire using a Likert scale (1 to 7) given to 15 people whose results are shown in Figure 2.
Figure-2: Data for Example 2
We first split the questions into the two halves: Q1-Q5 and Q6-Q10, as shown in Figure 3.
Figure-3: Split-half coefficient (Q1-Q5 v. Q6-Q10)
E.g. the formula in cell B23 is =SUM(B4:F4) and the formula in cell C23 is =SUM(G4:K4). The coefficient 0.64451 (cell H24) can be calculated as in Example 1. Alternatively, the coefficient can be calculated by the worksheet formula =SPLIT_HALF(B23:B37,C23:C37) or =SPLITHALF(B4:K18,0).
We can also split the questionnaire into odd and even questions, as shown in Figure 4.
Figure-4: Split-half coefficient (odd v. even)
E.g. the formula in cell L23 is =B4+D4+F4+H4+J4 and the formula in cell M23 is =C4+E4+G4+I4+K4. The coefficient 0.698813 (cell R24) can be calculated as in Example 1. Alternatively, the coefficient can be calculated by the supplemental formula =SPLIT_HALF(L23:L37,M23:M37) or =SPLITHALF(B4:K18,1).
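The same calculations can be carried out without a spreadsheet. The Python sketch below is a rough analogue of the odd/even procedure described above (it is not the Real Statistics add-in, and the score matrix is hypothetical): it totals the two halves, correlates them and applies the Spearman-Brown correction.

import numpy as np

def split_half_reliability(items, odd_even=True):
    # Split-half coefficient with the Spearman-Brown correction applied.
    # items: one row per respondent, one column per item.
    items = np.asarray(items, dtype=float)
    if odd_even:
        half1, half2 = items[:, 0::2], items[:, 1::2]   # items 1,3,5,... vs 2,4,6,...
    else:
        mid = items.shape[1] // 2
        half1, half2 = items[:, :mid], items[:, mid:]   # first half vs second half
    r = np.corrcoef(half1.sum(axis=1), half2.sum(axis=1))[0, 1]
    return 2 * r / (1 + r)                              # Spearman-Brown correction

# Hypothetical 1-to-7 Likert responses of six people to ten questions
scores = np.array([
    [5, 6, 5, 6, 4, 5, 6, 5, 5, 6],
    [2, 3, 2, 2, 3, 2, 3, 2, 2, 3],
    [7, 6, 7, 7, 6, 7, 6, 7, 7, 6],
    [4, 4, 3, 4, 4, 3, 4, 4, 3, 4],
    [1, 2, 1, 1, 2, 1, 2, 1, 1, 2],
    [6, 5, 6, 6, 5, 6, 5, 6, 6, 5],
])
print(round(split_half_reliability(scores, odd_even=True), 3))   # odd v. even
print(round(split_half_reliability(scores, odd_even=False), 3))  # first half v. second half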
In this method the test is administered once on the sample, and it is the most appropriate method for homogeneous tests. This method provides the internal consistency of the test scores. All the items of the test are generally arranged in increasing order of difficulty and administered once on the sample. After administering the test it is divided into two comparable or similar or equal parts or halves. The test is divided into two halves only for the purpose of scoring and not for administration. The scores are arranged in two sets obtained from the odd-numbered items and the even-numbered items separately. The odd-numbered items 1, 3, 5, 7, etc. and the even-numbered items 2, 4, 6, 8, etc. form two different sets of items for scoring. After obtaining two scores on the odd and even sets of test items, the coefficient of correlation is calculated. It is really a correlation between two equivalent halves of scores obtained in one sitting. To estimate the reliability of the whole test, the Spearman-Brown Prophecy formula is used:

r1t = 2 r½½ / (1 + r½½)

Where r1t = reliability coefficient of the whole test.
r½½ = reliability coefficient of the half test, found experimentally.

Advantages
1. The carryover effect or practice effect is not there, as the testee is not tested twice.
2. The fluctuation of the individual's ability because of environmental or physical conditions is minimised.
3. The difficulty of constructing parallel forms of the test is eliminated.

Limitations
1. A test can be divided into two equal halves in a number of ways, and the coefficient of correlation in each case may be different.
2. As the test is administered once, chance errors may affect the scores on the two halves in the same way, thus tending to make the reliability coefficient too high.
3. This method cannot be used in power tests and heterogeneous tests.
4. This method cannot be used for estimating the reliability of speed tests.

4.3. SPEARMAN-BROWN PROPHECY FORMULA
The Spearman–Brown prediction formula, also known as the Spearman–Brown prophecy formula, is a formula relating psychometric reliability to test length, used by psychometricians to predict the reliability of a test after changing the test length. The method was published independently by Spearman (1910) and Brown (1910).
The Spearman-Brown prophecy formula is commonly used for adjusting split-half reliability estimates for full-test reliability. To review briefly, split-half reliability is an internal consistency estimate. Split-half reliability is typically calculated in the following steps:
1. Divide whatever test you are analyzing into two halves and score them separately (usually the odd-numbered items are scored separately from the even-numbered items).
2. Calculate a Pearson product-moment correlation coefficient between the students' scores on the even-numbered items and their scores on the odd-numbered items. The resulting coefficient is an estimate of the half-test reliability of your test (i.e., the reliability of the odd-numbered items, or the even-numbered items, but not both combined).
3. Apply the Spearman-Brown prophecy formula to adjust the half-test reliability to full-test reliability. We know that, all other factors being held constant, a longer test will probably be more reliable than a shorter test. The Spearman-Brown prophecy formula was developed to estimate the change in reliability for different numbers of items. The Spearman-Brown formula that is often applied in the split-half adjustment is as follows:

reliability = (2 × r_half-test) / (1 + r_half-test)
For example, if the half-test correlation (for a 30-item test) between the 15 odd-numbered and 15 even-numbered items on a test turned out to be .50, the full-test (30-item) reliability would be .67, as follows:

reliability = (2 × .50) / (1 + .50) = 1.00 / 1.50 = .666 ≈ .67

However, there is another version of the formula, which can be applied to situations other than a simple doubling of the number of items:

reliability = (n × r) / (1 + (n − 1) × r)

where n is the number of times the test length is changed and r is the original reliability. Using the more complex formula, we get the same answer as we did with the simpler formula for the split-half reliability adjustment example, as follows:

reliability = (2 × .50) / (1 + (2 − 1) × .50) = 1.00 / 1.50 = .666 ≈ .67

We can also use the more complex formula to estimate what the reliability for that same test would be if it had 60 items, by using n = 4 (for the number of times we must multiply 15 to get 60; 4 x 15 = 60), as follows:

reliability = (4 × .50) / (1 + (4 − 1) × .50) = 2.00 / 2.50 = .80

Or we can estimate what the reliability would be for various fractions of the test length. For instance, we could estimate the reliability for a 63-item test by using n = 4.2 (for the number of times we must multiply 15 to get 63; 4.2 x 15 = 63), as follows:

reliability = (4.2 × .50) / (1 + (4.2 − 1) × .50) = 2.10 / 2.60 = .807 ≈ .81

We can even estimate the reliability for a shorter version of the test, say a 5-item version, by using a decimal fraction, that is, n = .33 (for the number of times we must multiply 15 to get 5; .33 x 15 = 4.95, or about 5), as follows:

reliability = (.33 × .50) / (1 + (.33 − 1) × .50) = .165 / .665 = .248 ≈ .25

We might want to use this last strategy if we were trying to figure out how short we could make our test and still maintain decent reliability. [For more on the Spearman-Brown formula, see Brown, 1996, pp. 194-196, 204-205, or Brown with Wada, 1999, pp. 220-223, 233-234.]
The Spearman-Brown prophecy formula can be used for adjusting split-half reliability, but more importantly, it can be used for answering what-if questions about test length when you are designing or revising a language test. Unfortunately, the Spearman-Brown formula is limited to estimating differences on one dimension (usually the number of items, or raters). For those interested in doing so on more than one dimension, generalizability theory (G-theory) provides the same sort of answers, but for more dimensions (called facets in G-theory). For instance, in Brown (1999), I used G-theory to examine (separately and together) the effects on reliability of various numbers of items and subtests on the TOEFL, and numbers of languages among the persons taking the test.
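The arithmetic above is easy to script. A minimal Python version of the general formula (my own sketch, not part of the original article) reproduces the worked figures:

def spearman_brown(r, n):
    # Predicted reliability when the test length is changed by a factor of n.
    return (n * r) / (1 + (n - 1) * r)

print(round(spearman_brown(0.50, 2), 2))    # doubling the 15-item half-test: 0.67
print(round(spearman_brown(0.50, 4), 2))    # 60 items (4 x 15): 0.80
print(round(spearman_brown(0.50, 4.2), 2))  # 63 items (4.2 x 15): 0.81
print(round(spearman_brown(0.50, 0.33), 2)) # about 5 items (0.33 x 15): 0.25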
4.4. TEST-RETEST RELIABILITY
Test-retest reliability is the degree to which scores are consistent over time. It indicates score variation that occurs from testing session to testing session as a result of errors of measurement. Problems: memory, maturation, learning.
This method involves (i) repetition of a test on the same group immediately or after a lapse of time, and (ii) computation of the correlation between the first and the second set of scores. The correlation coefficient thus obtained indicates the extent or magnitude of the agreement between the two sets of scores and is often called the coefficient of stability. The estimate of reliability in this case varies according to the length of the time-interval allowed between the two administrations. The product-moment method of correlation is a significant method for estimating the reliability of two sets of scores. Thus, a high correlation between two sets of scores indicates that the test is reliable. In other words, it shows that the scores obtained in the first administration resemble the scores obtained in the second administration of the same test.
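As a brief illustration (a sketch with hypothetical scores, not data from the text), the coefficient of stability is simply the product-moment correlation between the two administrations:

import numpy as np

# Hypothetical scores of the same ten students on two administrations of a test
first_administration = [35, 42, 28, 50, 39, 44, 31, 47, 38, 41]
second_administration = [37, 40, 30, 49, 41, 45, 29, 48, 36, 43]

coefficient_of_stability = np.corrcoef(first_administration, second_administration)[0, 1]
print(round(coefficient_of_stability, 3))   # a high value indicates stable scores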
In this method the time interval plays an important role. Immediate repetition of a test may involve (i) immediate memory effects, (ii) practice effects, and (iii) confidence effects induced by familiarity with the contents. Intervals of six months or longer may show a 'maturity effect'. The factors of intervening learning and unlearning may lead to a lowering of the self-correlation. Owing to difficulties in controlling the conditions which influence scores on retest, the test-retest method is generally less useful than the other methods.
Advantages
1. It is generally used for estimating the reliability coefficient.
2. It is convenient to use in many different situations.
3. A test of adequate length can be used after an interval of many days between successive testings.
Limitations
1. If the test is repeated immediately or after a short time gap, there may be a carry-over effect; the transfer effect, memory effect, practice effect and confidence effect induced by familiarity with the material will almost certainly affect scores when the test is administered a second time.
2. The index of reliability so obtained is less accurate.
3. If the interval between tests is rather long (more than six months), growth and maturity affect the scores and tend to lower the reliability index.
4. Repeating the same test on the same group a second time makes the students disinterested, and thus they do not like to take part wholeheartedly.

4.5. EQUIVALENT-FORMS OR ALTERNATE-FORMS RELIABILITY
Two tests that are identical in every way except for the actual items included. Used when it is likely that test takers will recall responses made during the first session and when alternate forms are available. Correlate the two scores. The obtained coefficient is called the coefficient of stability or coefficient of equivalence. Problem: the difficulty of constructing two forms that are essentially equivalent.
Both of the above require two administrations.

4.6. ALTERNATE OR PARALLEL FORMS METHOD
This method involves the administration of equivalent or parallel forms of the test instead of repetition of a single test. The two equivalent forms are so constructed as to make them similar (but not identical) in content, mental processes involved, number of items, difficulty level and other aspects. Parallel tests have equal mean scores, variances and intercorrelations among items. That is, two parallel forms must be homogeneous or similar in all respects, but not a duplication of test items. The subjects take one form of the test and then, as soon as possible, the other form. The reliability coefficient may be looked upon as the coefficient of correlation between the scores on the two equivalent forms of the test.
Advantages
1. Memory, practice and carryover effects are minimised and do not affect the scores.
2. The reliability coefficient obtained by this method is a measure of both temporal stability and consistency of response to different item samples or test forms.
3. It is useful for the reliability of achievement tests.
Limitations
1. Practice and carry-over factors cannot be completely controlled.
2. When the tests are not exactly equal, the comparison between two sets of scores obtained from these tests may lead to erroneous decisions.
3. Administration of two forms simultaneously creates boredom.
4. The testing conditions while administering Form B may not be the same.
5. Test scores on the second form of the test are generally high.

4.7. RATIONAL EQUIVALENCE METHOD
It is a method based on the consistency of responses to all items. This method enables one to compute the inter-correlations of the items of the test and the correlations of each item with all the items of the test. In this method, it is assumed that all items have the same or equal difficulty
value, the correlations between the items are equal, all the items measure essentially the same ability and the test is homogeneous in nature. Like the split-half method, this method also provides a measure of internal consistency. The most popular formula is the Kuder-Richardson formula:

r1t = [n / (n − 1)] × [(σt² − Σpq) / σt²]

Where r1t = reliability coefficient of the whole test.
n = number of items in the test
σt = the SD of the test scores
p = the proportion of the group answering a test item correctly
q = (1 − p) = the proportion of the group answering a test item incorrectly.
Advantages
1. This coefficient provides some indication of how internally consistent or homogeneous the items of the test are.
2. The split-half method simply measures equivalence, but the rational equivalence method measures both equivalence and homogeneity.
3. It neither requires the administration of two equivalent forms of the test nor requires splitting the test into two equal halves.
Limitations
1. The coefficient obtained by this method is generally somewhat lower than the coefficients obtained by other methods.
2. If the items of the test are not highly homogeneous, this method will yield a lower reliability coefficient.
3. The Kuder-Richardson and split-half methods are not appropriate for speed tests.
speed test.
definitional in nature -- it assumes you have a good detailed definition of
The population of interest in a study is the “construct” and the
the construct and that you can check the operationalization against it. In
sample is your operationalization. If we think of it this way, we are
criterion-related validity, you examine whether the operationalization
essentially talking about the construct validity of the sampling!). Second,
behaves the way it should given your theory of the construct. This is
I want to use the term construct validity to refer to the general case
a more relational approach to construct validity. it assumes that your
of translating any construct into an operationalization. Let’s use all of
operationalization should function in predictable ways in relation to
the other validity terms to reflect different ways you can demonstrate
other operationalizations based upon your theory of the construct. (If
different aspects of construct validity.
all this seems a bit dense, hang in there until you've gone through the discussion below -- then come back and re-read this paragraph.) Let's go through the specific validity types.

Translation Validity
I just made this one up today! (See how easy it is to be a methodologist?) I needed a term that described what both face and content validity are getting at. In essence, both of those validity types are attempting to assess the degree to which you accurately translated your construct into the operationalization, and hence the choice of name. Let's look at the two types of translation validity.

4.9. FACE VALIDITY
In face validity, you look at the operationalization and see whether "on its face" it seems like a good translation of the construct. This is probably the weakest way to try to demonstrate construct validity. For instance, you might look at a measure of math ability, read through the questions, and decide that yep, it seems like this is a good measure of math ability (i.e., the label "math ability" seems appropriate for this measure). Or, you might observe a teenage pregnancy prevention program and conclude that, "Yep, this is indeed a teenage pregnancy prevention program." Of course, if this is all you do to assess face validity, it would clearly be weak evidence because it is essentially a subjective judgment call. (Note that just because it is weak evidence doesn't mean that it is wrong. We need to rely on our subjective judgment throughout the research process. It's just that this form of judgment won't be very convincing to others.) We can improve the quality of face validity assessment considerably by making it more systematic. For instance, if you are trying to assess the face validity of a math ability measure, it would be more convincing if you sent the test to a carefully selected sample of experts on math ability testing and they all reported back with the judgment that your measure appears to be a good measure of math ability.

4.10. CONTENT VALIDITY
In content validity, you essentially check the operationalization against the relevant content domain for the construct. This approach assumes that you have a good detailed description of the content domain, something that's not always true. For instance, we might lay out all of the criteria that should be met in a program that claims to be a "teenage pregnancy prevention program." We would probably include in this domain specification the definition of the target group, criteria for deciding whether the program is preventive in nature (as opposed to treatment-oriented), and lots of criteria that spell out the content that should be included, like basic information on pregnancy, the use of abstinence, birth control methods, and so on. Then, armed with these criteria, we could use them as a type of checklist when examining our program. Only programs that meet the criteria can legitimately be defined as "teenage pregnancy prevention programs." This all sounds fairly straightforward, and for many operationalizations it will be. But for other constructs (e.g., self-esteem, intelligence), it will not be easy to decide on the criteria that constitute the content domain.

4.11. CRITERION-RELATED VALIDITY
In criterion-related validity, you check the performance of your operationalization against some criterion. How is this different from content validity? In content validity, the criteria are the construct definition itself -- it is a direct comparison. In criterion-related validity, we usually make a prediction about how the operationalization will perform based on our theory of the construct. The difference among the criterion-related validity types is in the criteria they use as the standard for judgment.

4.12. PREDICTIVE VALIDITY
In predictive validity, we assess the operationalization's ability to predict something it should theoretically be able to predict. For instance, we might theorize that a measure of math ability should be able to predict how well a person will do in an engineering-based profession. We could give our measure to experienced engineers and see if there is a high correlation between scores on the measure and their salaries as engineers. A high correlation would provide evidence for predictive validity -- it would show that our measure can correctly predict something that we theoretically think it should be able to predict.

4.13. CONCURRENT VALIDITY
In concurrent validity, we assess the operationalization's ability to distinguish between groups that it should theoretically be able to
distinguish between. For example, if we come up with a way of assessing manic-depression, our measure should be able to distinguish between people who are diagnosed manic-depressive and those diagnosed paranoid schizophrenic. If we want to assess the concurrent validity of a new measure of empowerment, we might give the measure to both migrant farm workers and to the farm owners, theorizing that our measure should show that the farm owners are higher in empowerment. As in any discriminating test, the results are more powerful if you are able to show that you can discriminate between two groups that are very similar.

4.14. CONVERGENT VALIDITY
In convergent validity, we examine the degree to which the operationalization is similar to (converges on) other operationalizations that it theoretically should be similar to. For instance, to show the convergent validity of a Head Start program, we might gather evidence that shows that the program is similar to other Head Start programs. Or, to show the convergent validity of a test of arithmetic skills, we might correlate the scores on our test with scores on other tests that purport to measure basic math ability, where high correlations would be evidence of convergent validity.

4.15. DISCRIMINANT VALIDITY
In discriminant validity, we examine the degree to which the operationalization is not similar to (diverges from) other operationalizations that it theoretically should not be similar to. For instance, to show the discriminant validity of a Head Start program, we might gather evidence that shows that the program is not similar to other early childhood programs that don't label themselves as Head Start programs. Or, to show the discriminant validity of a test of arithmetic skills, we might correlate the scores on our test with scores on tests of verbal ability, where low correlations would be evidence of discriminant validity.
Face validity is the extent to which a test is subjectively viewed as covering the concept it purports to measure. It refers to the transparency or relevance of a test as it appears to test participants. In other words, a test can be said to have face validity if it "looks like" it is going to measure what it is supposed to measure. For instance, if a test is prepared to measure whether students can perform multiplication, and the people to whom it is shown all agree that it looks like a good test of multiplication ability, this demonstrates face validity of the test. Face validity is often contrasted with content validity and construct validity.
Some people use the term face validity to refer only to the validity of a test to observers who are not expert in testing methodologies. For instance, if a test is designed to measure whether children are good spellers, and parents are asked whether the test is a good test, this measures the face validity of the test. If an expert is asked instead, some people would argue that this does not measure face validity. This distinction seems too careful for most applications. Generally, face validity means that the test "looks like" it will work, as opposed to "has been shown to work".

CHAPTER-5
TOOL CONSTRUCTION PROCEDURE

5.1. STEPS OF TOOL CONSTRUCTION


A research tool is a device or instrument used to collect information
from the sample or target population. As a researcher, one has to follow the steps below to construct a tool. A research tool may be defined as:
Anything that becomes a means of collecting information for your
study is called a research tool or a research instrument. For example,
observation forms, interview schedules, questionnaires, and interview
guides are all classified as research tools.
Constructing a research tool is the first practical step in carrying out the research process. You need to decide how you will collect the data and then construct a research instrument for this purpose. If planning to collect data specifically for the research, you have to develop a research instrument or select an already developed one. If using secondary data (information already collected for other purposes), develop a form to extract the required data. Field testing a research tool is an important part; but as a rule, field testing should not be carried out on the sample of the present study but on a similar population.
The construction of a research instrument or tool for data collection is the most important aspect of a research project, because anything you say by way of findings or conclusions is based upon the type of information you collect, and the data you collect is entirely dependent upon the questions that you ask of your respondents. The famous saying about computers - "garbage in, garbage out" - is also applicable to data collection. The research tool provides the input into a study, and therefore the quality and validity of the output (the findings) are solely dependent on it.

Guidelines to Construct a Research Tool
The underlying principle behind the guidelines suggested below is to ensure the validity of the instrument by making sure that the questions relate to the objectives of the present study.
Step I: Clearly define and individually list all the specific objectives or research questions for the present study.
Step II: For each objective or research question, list all the associated questions that you want to answer through the present study.
Step III: Take each research question listed in Step II and list the information required to answer it.
Step IV: Formulate question(s) to obtain this information.

How to construct questionnaires
• Deciding which questionnaire to use - closed or open ended, - self or interviewer administered
• Wording and structure of questions
• Questions should be kept short and simple -- avoid double-barreled questions, i.e. two questions in one; ask two questions rather than one.
• Avoid negative questions which have "not" in them, as it is confusing for the respondent to agree or disagree.
• A question should not contain prestige bias -- causing embarrassment or forcing the respondent to give a false answer in order to look good. Questions about educational qualification or income might elicit this type of response.
• Use indirect questions for sensitive issues -- in indirect questions respondents can relate their answer to other people.
• Avoid leading questions: don't lead the respondent to answer in a certain way. e.g. "How often do you wash your car?" assumes that the respondent has a car and that he washes it. Instead, ask a filter question to find out if he has a car, and then, "If you wash your car, how many times a year?"
• When using closed-ended questions, try to make sure that all possible answers are covered so that respondents are not constrained in their answer. A "Don't Know" category also needs to be added.
• Length and ordering of the questions:
• Keep the questionnaire as short as possible. Ask easy questions which respondents will enjoy answering.
• In a combined questionnaire, keep open-ended questions for the end.
• Make questions as interesting as possible and easy to follow by varying the type and length of questions. Group the questions into specific topics, as this makes them easier to understand and follow.
• Layout and spacing are important, as a cluttered questionnaire is less likely to be answered.
In a research process the researchers choose the most appropriate instruments and procedures that provide for the collection and analysis of data upon which hypotheses may be tested. The data-gathering devices that have proved useful in educational research include psychological tests and inventories, questionnaires, opinionnaires, Q methodology, observation, checklists, rating scales, content analysis, interviews, and sociograms. Some research investigations use one of these devices. Others may employ many of them in combination.

Item Analyses
Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors' skills in test construction, and for identifying specific areas of course content which need greater emphasis or clarity.
Because the discrimination index reflects the degree to which an item and the test as a whole are measuring a unitary ability or attribute, values of the coefficient will tend to be lower for tests measuring a wide range of content areas than for more homogeneous tests. Item discrimination indices must always be interpreted in the context of the type of test which is being analyzed. Items with low discrimination indices are often ambiguously worded and should be examined. Items with negative indices should be examined to determine why a negative value was obtained. For example, a negative value may indicate that the item was mis-keyed, so that students who knew the material tended to choose an unkeyed, but correct, response option.
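These item statistics can be computed from a simple 0/1 score matrix. The Python sketch below is an illustration under my own assumptions (it is not ScorePak's actual procedure): difficulty is the proportion answering each item correctly, and discrimination is the difference in that proportion between the top-scoring and bottom-scoring groups of examinees.

import numpy as np

def item_statistics(items, group_fraction=0.27):
    # Difficulty (p) and an upper-minus-lower discrimination index for each item.
    # items: 0/1 scores, one row per examinee, one column per item.
    items = np.asarray(items, dtype=float)
    difficulty = items.mean(axis=0)                 # proportion answering correctly
    order = np.argsort(items.sum(axis=1))           # examinees sorted by total score
    k = max(1, int(round(group_fraction * items.shape[0])))
    lower, upper = items[order[:k]], items[order[-k:]]
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)
    return difficulty, discrimination

# Hypothetical responses of ten examinees to five items
responses = np.array([
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
])
p, d = item_statistics(responses)
print(np.round(p, 2))   # item difficulties
print(np.round(d, 2))   # discrimination indices; negative values flag problem items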
Means
The mean total test score (minus that item) is shown for students who selected each of the possible response alternatives. This information should be looked at in conjunction with the discrimination index; higher total test scores should be obtained by students choosing the correct, or most highly weighted, alternative. Incorrect alternatives with relatively high means should be examined to determine why "better" students chose that particular alternative.

Frequencies and Distribution
The number and percentage of students who choose each alternative are reported. The bar graph on the right shows the percentage choosing each
response; each "#" represents approximately 2.5%. Frequently chosen wrong alternatives may indicate common misconceptions among the students.

Difficulty and Discrimination Distributions
At the end of the Item Analysis report, test items are listed according to their degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick overview of the test, and can be used to identify items which are not performing well and which can perhaps be improved or discarded.

Test Statistics
Two statistics are provided to evaluate the performance of the test as a whole.

Reliability Coefficient
The reliability of a test refers to the extent to which the test is likely to produce consistent scores. The particular reliability coefficient computed by ScorePak reflects three characteristics of the test:
• Intercorrelations among the items -- the greater the relative number of positive relationships, and the stronger those relationships are, the greater the reliability. Item discrimination indices and the test's reliability coefficient are related in this regard.
• Test length -- a test with more items will have a higher reliability, all other things being equal.
• Test content -- generally, the more diverse the subject matter tested and the testing techniques used, the lower the reliability.
Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect reliability). In practice, their approximate range is from .50 to .90 for about 95% of the classroom tests scored by ScorePak. High reliability means that the questions of a test tended to "pull together." Students who answered a given question correctly were more likely to answer other questions correctly. If a parallel test were developed by using similar items, the relative scores of students would show little change. Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly. The resulting test scores reflect peculiarities of the items or the testing situation more than students' knowledge of the subject matter.
As with many statistics, it is dangerous to interpret the magnitude of a reliability coefficient out of context. High reliability should be demanded in situations in which a single test score is used to make major decisions, such as professional licensure examinations. Because classroom examinations are typically combined with other scores to determine grades, the standards for a single test need not be as stringent. The following general guidelines can be used to interpret reliability coefficients for classroom exams:

Reliability      Interpretation
.90 and above    Excellent reliability; at the level of the best standardized tests.
.80 - .90        Very good for a classroom test.
.70 - .80        Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 - .70        Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 - .60        Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below     Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.

The measure of reliability used by ScorePak® is Cronbach's Alpha. This is the general form of the more commonly reported KR-20 and can be applied to tests composed of items with different numbers of points given for different response alternatives. When coefficient alpha is applied to tests in which each item has only one correct answer and all correct answers are worth the same number of points, the resulting coefficient is identical to KR-20.
(Further discussion of test reliability can be found in J. C. Nunnally, Psychometric Theory. New York: McGraw-Hill, 1967, pp. 172-235, see especially formulas 6-26, p. 196.)

Standard Error of Measurement
The standard error of measurement is directly related to the reliability of the test. It is an index of the amount of variability in an individual student's performance due to random measurement error.
If it were possible to administer an infinite number of parallel tests, a student's score would be expected to change from one administration to the next due to a number of factors. For each student, the scores would form a "normal" (bell-shaped) distribution. The mean of the distribution is assumed to be the student's "true score," and reflects what he or she "really" knows about the subject. The standard deviation of the distribution is called the standard error of measurement and reflects the amount of change in the student's score which could be expected from one test administration to another.
Whereas the reliability of a test always varies between 0.00 and 1.00, the standard error of measurement is expressed in the same scale as the test scores. For example, multiplying all test scores by a constant will multiply the standard error of measurement by that same constant, but will leave the reliability coefficient unchanged.
A general rule of thumb to predict the amount of change which can be expected in individual test scores is to multiply the standard error of measurement by 1.5. Only rarely would one expect a student's score to increase or decrease by more than that amount between two such similar tests. The smaller the standard error of measurement, the more accurate the measurement provided by the test.
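The text does not spell out the computation, but the standard formula relating the two quantities is SEM = SD × sqrt(1 − reliability). A short illustrative Python sketch with hypothetical values (my own, not ScorePak output):

import math

def standard_error_of_measurement(sd, reliability):
    # SEM = standard deviation of the test scores times sqrt(1 - reliability).
    return sd * math.sqrt(1 - reliability)

sem = standard_error_of_measurement(sd=10.0, reliability=0.84)
print(round(sem, 2))          # 4.0 score points
print(round(1.5 * sem, 2))    # 6.0 -- the rule-of-thumb band for score change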
A Caution in Interpreting Item Analysis Results
Each of the various item statistics provided by ScorePak provides information which can be used to improve individual test items and to increase the quality of the test as a whole. Such statistics must always be interpreted in the context of the type of test given and the individuals being tested. W. A. Mehrens and I. J. Lehmann provide the following set of cautions in using item analysis results (Measurement and Evaluation in Education and Psychology. New York: Holt, Rinehart and Winston, 1973, 333-334):
• Item analysis data are not synonymous with item validity. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect internal consistency of items rather than validity.
• The discrimination index is not always a measure of item quality. There is a variety of reasons an item may have low discriminating power: (a) extremely difficult or easy items will have low ability to discriminate, but such items are often needed to adequately sample course content and objectives; (b) an item may show low discrimination if the test measures many different content areas and cognitive skills. For example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply principles" may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.
• Item analysis data are tentative. Such data are influenced by the type and number of students being tested, instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.

Construction and Procedure of Research Tools
In a research process the researchers choose the most appropriate instruments and procedures that provide for the collection and analysis of data upon which hypotheses may be tested. The data-gathering devices that have proved useful in educational research include psychological tests and inventories, questionnaires, opinionnaires, Q methodology, observation, checklists, rating scales, content analysis, interviews, and sociograms. Some research investigations use one of these devices. Others may employ many of them in combination.
In general, the construction of all kinds of evaluation tools, with slight differences here and there, involves the following activities:
• Planning
• Preparation
• Try-out
• Standardization

Planning the Construction of Tools
At this stage, various decisions are taken and certain activities are completed. But before doing all these, it is suggested by Stanley and Hopkins (p. 172) that certain principles should be taken into consideration. These principles of planning are as follows:
• Principle of adequate provision - outcomes of instruction
• Principle of emphasis of the course (approximately)
• Principle of purpose
• Principle of conditions under which the test is administered
The first principle is quite clear. It simply says that the test should be so designed that it measures the most important outcomes of instruction or objectives. The second principle says that if the test has to cover a large amount of material, it would be necessary to determine which part or aspect should receive what weightage in terms of number of items. Proper sampling has to be done. This should reflect the relative importance of various components of the course. The third principle explains that the purpose of the test should be made clear and kept in mind. It means clarifying whether the test will employ relative, absolute or criterion-related standards of achievement. The validity and reliability of the tests are linked with these purposes. Different methods of providing reliability and validity are used depending upon these purposes. The fourth principle requires that decisions should be taken regarding the number of items to be finally kept in the test, the time to be allowed for completing the test, how frequently the test will be administered, what will be the format of the test, what kinds of items will be included in the test, how the responses of the examinees will be recorded, and so on.

Steps of Planning
Planning of a performance test involves the following steps:
• Identifying the learning outcomes
• Defining the outcomes in terms of specific observable behaviours
• Outlining the subject matter content
• Preparing a table of specifications
• Using the table of specifications

Identifying Learning Outcomes
In constructing an objective-type performance test the first order of business is to identify the instructional objectives which are intended to be measured. This is a difficult job. However, one useful guide is the taxonomy of educational objectives. Learning objectives in other areas such as skills, attitudes and interests are measured by rating scales, checklists, anecdotal records, inventories and similar non-testing procedures. It is only cognitive domain objectives, i.e., knowledge, and intellectual abilities and skills, which can be measured through paper and pencil tests. These cognitive objectives of Bloom's taxonomy are divided into two major areas: (1) knowledge and (2) intellectual abilities and skills. These are further divided into five areas which are as follows:
• Knowledge
1. Knowledge of specifics
2. Knowledge of terms
3. Knowledge of specific facts
4. Knowledge of ways and means of dealing with specifics
5. Knowledge of conventions
6. Knowledge of trends and sequences
7. Knowledge of classifications and categories
8. Knowledge of criteria
9. Knowledge of methodology
10. Knowledge of the universals and abstractions in a field
11. Knowledge of principles and generalizations
12. Knowledge of theories and structures
• Intellectual abilities and skills
(i) Comprehension (understanding the meaning)
(a) Translation
(b) Interpretation
(c) Extrapolation
(ii) Application
(iii) Analysis
(a) Analysis of elements
(b) Analysis of relationships
(c) Analysis of organizational principles
(iv) Synthesis
(a) Production of a unique communication
(b) Production of a plan or proposed set of operations
(c) Derivation of a set of abstract relations
(v) Evaluation
(a) Judgments in terms of internal evidence
(b) Judgments in terms of external criteria
All these objectives are arranged in order of increasing complexity. Subdivisions within each area are also in order of increasing complexity. The whole structure of objectives is hierarchical in nature. This taxonomy is useful in planning the performance test.

Defining Objectives in Specific Terms
Having identified the learning objectives, the next step is defining them in specific behavioural terms which provide evidence that the outcomes have been achieved. For this purpose the objectives are written in sentences which use action verbs such as 'recognizes' and 'identifies'.

Outlining Subject Matter Content
Taxonomical learning objectives are general and may apply to any topic or area of the subject matter. A performance test is designed to measure these objectives, which cover students' abilities and skills, their mental development and their reactions, as well as the knowledge of the subject matter being taught. Therefore, it is essential to identify how much and which aspects of the subject matter will be covered by the test. Only major elements need to be listed.

Preparing a Table of Specifications
Having identified the learning outcomes and outlined the course content, a table of specifications is prepared, which relates outcomes to content and indicates the relative weight to be assigned to each of the areas of the subject matter. The table ensures that the test will measure a representative sample of the learning outcomes and the subject matter content. Table 10.1 roughly illustrates how it is done.

Table 10.1: Specification of course content and learning outcomes
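For illustration only, a table of specifications for a 30-item test might look like the following; the content areas, objectives and figures are hypothetical and are not taken from the original table.

Content area             Knowledge   Comprehension   Application   Total
Number systems               4             3              3          10
Algebraic expressions        3             4              3          10
Geometry                     3             3              4          10
Total                       10            10             10          30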
The figures in the cells indicate the number of items (weightage) to be given to each area and the outcome.

Using the Table of Specifications
While preparing the performance test, items have to be constructed according to the table of specifications. Thus, the table serves as a guide, a blue-print for constructing the test.

Preparing the Test
Some of the most important points to be kept in mind while preparing the test, as mentioned by Stanley (1964), are as follows:
• Have more items in the first draft of the test than decided to be kept in the final form
• Most of the items should have 50 per cent difficulty levels
• After a gap of some time make a critical revision of the test
• The items should be arranged in ascending order of difficulty
• A regular sequence in the pattern of correct responses should be avoided
• The directions to the examinees should be as clear, complete and concise as possible
There are a variety of item types which can be chosen for constructing a performance test, such as completion type, true–false, matching and multiple-choice type. Of all these, the multiple-choice type tends to provide the highest quality items. They provide a more adequate measure of learning outcomes than the other items. In addition, they can measure a variety of outcomes ranging from simple to complex. Hence, the following section is devoted to listing a few rules of constructing multiple-choice items.

Rules of Constructing Multiple-choice Items
Following are the important rules:
• Present a single, clearly formulated problem in the stem of the item
• State the stem in simple, clear language
• Avoid repeating the same material over again in the alternatives
• State the stem in positive form, unless not possible
• If using negative form of stem, underline the negative wording
• Make certain that the intended answer is correct or clearly the best
• Make all alternatives grammatically consistent with the stem and parallel in form
• Avoid verbal clues, which might enable students to select the correct answer or to eliminate an incorrect alternative
Similar wording in both the stem and the correct answer, stating the correct answer in textbook language or stereotyped phraseology, stating the correct answer in greater detail, including absolute terms (e.g., all, any, never, only) in the distracters, including two responses that are all-inclusive, and including two responses that have the same meaning should all be avoided.
• Avoid use of the alternatives 'all of the above' and 'none of the above'.
• Vary the position of the correct answer in a random manner.
• Control the difficulty of the item by varying the problem either in the stem or by changing the alternatives.
(For explanation and examples of these rules consult N. E. Gronlund, Constructing Achievement Tests, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1968).
Having written out the items, they are arranged and assembled in the form of a complete test. Then, the thus-completed test is reviewed and shortcomings are removed. After this the directions for the examinees are prepared.
The directions should contain information about: (i) the purpose of the test, (ii) the time for completing the test, (iii) the ways of recording the answers (separate answer sheet or in the test booklet), (iv) instructions about whether or not to guess, and so on.
The test and the instructions thus prepared are then reproduced either in the form of photocopied, cyclostyled or printed material.

5.2. ITEM ANALYSIS
For the purpose of item analysis the preliminary form of the test is administered to a representative sample of the population for which it is meant. Then, it is scored and all the test booklets are arranged in a pile serially according to the size of the scores, the topmost score being at the top.
Purpose of item-analysis: The purpose is to find out how the items and the distractors of the items in the test are working. In other words, to find out which item is good and which is bad, so that bad items, which are ineffective, may be eliminated and, finally, a test of good items may be constructed.
Bases of item-analysis: The item-analysis is done based on the following:
• Item-difficulty index
• Item validity index
• Effectiveness of the distractors index
These three statistical indices constitute the criteria on the basis of which an item is selected or rejected.
Procedure of item-analysis: There are a number of item-analysis procedures that might be applied. Downie (1967) presents a detailed discussion of these. But the most simple, popular and effective procedure is described here and is illustrated by taking an example of 47 test papers arranged serially from top to bottom according to the magnitude of the total scores. The following steps are involved. By taking approximately one-third of the total from the top and one-third from the bottom, two groups, top and bottom, are formed and the middle group is set aside. Out of a group of 47, a group of the top 15 and another group of the bottom 15 are formed. The 17 papers in the middle are set aside.
For each item, the number of students in the top and bottom groups selecting each alternative is counted and recorded. The results for each item are presented item-wise on a sheet of paper as follows:
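By way of illustration, such a record for one item might look like the following. Only the figures for the keyed answer (6 in the top group and 4 in the bottom group), which are used in the worked example below, come from the text; the remaining entries are hypothetical.

Item 1                    A     B*    C     D     Omit
Top group (N = 15)        4     6     3     2      0
Bottom group (N = 15)     5     4     4     2      0
(* keyed answer)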
Such kinds of tables are prepared for all the items. After this, item-statistics (difficulty index and validity index) for all the items are calculated as follows:
Item difficulty: This is calculated by applying the formula:

Difficulty index (Dt) = (number of correct responses in the top and bottom groups ÷ total number of students in the two groups) × 100
In the previous example 10 students (6 + 4) from both the groups answered the item correctly out of the total of 30 (total in both groups). So,

Dt = (10 ÷ 30) × 100 = 33.3, or 33

This is the simplest procedure based on the top-bottom groups. It provides a close approximation to the estimate that would be obtained with the total group. The foregoing item has a difficulty level of 33 per cent, which means that out of 100 only 33 students answer the item correctly. The difficulty index is interpreted in such a way that the higher the index, the easier the item, and the lower the index, the more difficult the item. Some people call this the facility index instead of the difficulty index. But the meaning is the same.
Validity index: Validity of an item means the extent to which the item measures the same thing which is measured by the whole test. In a way it is an index of homogeneity of items. This is also known as the discrimination index because it is calculated on the basis of the discrimination an item makes between the top and the bottom groups. The two groups are formed in such a way that the top group is of high-ability students and the bottom group consists of low-ability students on the trait being measured by the whole test. So, if an item also measures the same trait, it must also function in the way the whole test is functioning, i.e., a larger number of students in the top group and a lesser number in the bottom group must answer it correctly. If so, the item is said to be significantly correlated with the whole test. In other words, if the item significantly discriminates between the two groups, it is said to be valid, and that is its validity index. It is a kind of biserial correlation between the item and the test. It is calculated as follows:

Validity index = (correct responses in the top group − correct responses in the bottom group) ÷ T, in which T = total number of students.
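A small Python sketch may make the calculation concrete. It follows the definitions given above; the function names and figures are merely illustrative.

    def item_difficulty(correct_top, correct_bottom, n_top, n_bottom):
        # Percentage of students in the two criterion groups answering correctly.
        return 100.0 * (correct_top + correct_bottom) / (n_top + n_bottom)

    def item_validity(correct_top, correct_bottom, n_top, n_bottom):
        # Discrimination (validity) index: difference between the two groups,
        # divided by the total number of students in the two groups (T).
        return (correct_top - correct_bottom) / (n_top + n_bottom)

    # Worked example from the text: 6 correct in the top 15, 4 correct in the bottom 15.
    print(round(item_difficulty(6, 4, 15, 15), 1))   # 33.3 per cent
    print(round(item_validity(6, 4, 15, 15), 2))     # about 0.07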
The reliability coefficient for the whole test is estimated by applying the following formula, known as the Spearman–Brown prophecy formula:

R = 2r ÷ (1 + r), in which r is the reliability coefficient obtained for the half test.

If the obtained reliability coefficient of the half test is .60, then by the Spearman–Brown formula the reliability (R) of the whole test will be as follows:

R = (2 × .60) ÷ (1 + .60) = 1.20 ÷ 1.60 = .75

It may be seen that the increase in the length of the test increases the reliability coefficient of the test.
One great advantage of the split-half method of reliability is that the test is administered only once, which saves a lot of time and provides greater convenience. Another estimate based on a single administration is obtained by using the Kuder–Richardson formula-21, which runs as

R(KR-21) = 1 − [M(K − M) ÷ (K × s²)], in which
K = number of items in the test
M = mean of the test scores
s = standard deviation of the test scores.

The Kuder–Richardson formula provides a conservative estimate of reliability. In other words, it provides a smaller correlation coefficient than other methods. It is also indicative of the internal consistency of test scores. Like the split-half method, it is not appropriate with speeded tests.
In the case of the parallel form method of reliability, two forms of the test are prepared. These forms are made similar as far as possible with regard to the item statistics, estimated by means and variability of scores judged by observation or computed experimentally. The scores obtained on these two forms, administered to the same group simultaneously, are correlated, which yields a coefficient known as the coefficient of equivalence. In a way, the split-half method also yields a coefficient of equivalence between two halves of the test. An extension of this method is the coefficient of equivalence and stability. This is calculated by finding out a correlation between the two forms of the same test given at two different points of time with a gap of a few days in-between.
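The two reliability estimates discussed above can be computed as in the following sketch; the figures supplied to KR-21 are illustrative only and are not drawn from any actual test.

    def spearman_brown(half_reliability):
        # Reliability of the whole test estimated from the reliability of a half test.
        return (2 * half_reliability) / (1 + half_reliability)

    def kr21(k, mean, sd):
        # Kuder-Richardson formula-21, in the simplified form given above, from the
        # number of items (k), the mean and the standard deviation of the test scores.
        return 1 - (mean * (k - mean)) / (k * sd ** 2)

    print(round(spearman_brown(0.60), 2))   # 0.75, as in the worked example above
    print(round(kr21(50, 30.0, 8.0), 2))    # 0.81 for these illustrative figures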
Factors affecting reliability: There are several factors which adversely affect the reliability of a test. Some of them are as follows:
• Poorly constructed items: If the items of the test have very low or very high difficulty levels, i.e., if most of the items cluster at the extremes, the reliability of the test is decreased. Similarly, if the items have low discrimination indices, the reliability will be adversely affected.
• Length of the test: The length of the test also affects its reliability. A short test of very few items cannot be very reliable. Increasing the length of the test increases its reliability also. The application of the Spearman–Brown formula has clearly demonstrated this fact. But a very lengthy test may also affect reliability adversely by way of causing fatigue, monotony and disinterest in the students.
• Inadequate scoring methods: These also lower reliability. If the scoring is subjective, as is the case with essay-type examinations, it affects reliability adversely. Similarly, giving differential weights to some items also reduces reliability.
• Inadequate time: If sufficient time is not given to the students to complete the test, this also reduces reliability.
Relation with validity: It is also important to note how reliability is related to validity. A reliable test need not be essentially valid. A 100 per cent reliable test may be invalid or may suffer from poor validity. For being reliable a test need not be valid. But the contrary is not possible. A test cannot be valid unless it is reliable. For validity, reliability is an essential condition. If a test does not measure something reliably, it cannot be valid. If a test fails to measure something with consistency, it cannot purport to measure anything at all.
Thus, when a test is standardized all these facts are taken into account while establishing its reliability.

Validity
A standardized test has a proved validity. Hence, standardization of the test involves calculation of its validity also. Validity of a test means the extent to which a test measures what it purports to measure. If it is a test of performance, then its validity would mean the extent to which it measures the objectives or the outcomes of learning. Hence, validity is calculated by correlating the scores of the test with the scores on the criterion (i.e., the scores on some other measure of performance).
Types of validity: French and Michael (1966), and a committee designated by the American Psychological Association, the American Educational Research Association and the National Council on Measurement in Education (1966), have described three types of validities, which is a standard and widely adopted classification system. These three types are as follows:
• Content validity
• Criterion-related validity
• Construct validity
Some writers mention concurrent validity, predictive validity and factorial validity also. But these can be considered as sub-classifications of criterion-related validity only.
Content validity: Content validity means to what extent the test measures the performance objectives identified and stated for the purpose of constructing the test. It is especially important in achievement testing or in constructing a performance test. The main purpose is to know how well the test measures the subject matter content and learning outcomes. It is concerned with the sampling of a specified universe of content. All that is taught cannot be measured in a single test; hence, the test items sample the most critical aspects. But these should be representative of the whole that has been taught. If not so, the content validity cannot be high or even satisfactory. Thus, the content validity of a test explains whether the items of the test cover sufficiently well the appropriate content. Every subject has a definite structure of knowledge consisting of certain facts, concepts, major understandings, principles, generalizations, and so on. A test of high content validity samples all of these in a very judicious manner. If any of these areas is not well represented, the validity of the test cannot be considered to be high or satisfactory.
Content validity is not determined statistically. Hence, it has no statistical index. It is determined by inspection of the test items and by relating them to an outline of the subject matter, or by checking them against the table of specifications. Just by inspection the test-maker or the test-user tries to find out the extent to which the content areas of a subject and taxonomic performance behaviours are reflected in the items. If the items are found to be a good match to the content areas and performance objectives, the validity of the test may be estimated to be high.
Criterion-related validity: This kind of validity is a 'sort of relationship of the test with some external criterion'.
The criterion may be a future performance or phenomenon or a current phenomenon. If the test is related to a future performance or phenomenon, it is known as predictive validity. This is a kind of criterion validity. But when the test is related to a criterion currently available, it is known as concurrent validity.
Criterion validity becomes important when the performance of an individual or a group of individuals on some criterion is to be predicted on the basis of the test under construction. For example, if the test under construction is meant for predicting the success of students in the final examination, the test-maker should establish the predictive validity of the test for examination marks or results. Similarly, a test for vocational selection must have high predictive validity for that vocation. In the case of a selection test, predictive validity is a must. Predictive validity is established by correlating the test scores with the indices of success or performance on the criterion (examination marks or indices of vocational success), which are made available in the near future.
Concurrent validity is obtained by correlating the test scores with the indices of success or performance on the criterion, which is made available currently, at the same time when the test is being constructed. The purpose served by this kind of validity is to have a substitute for the test under construction or just to establish that the test under construction measures the same trait which is measured by another test already constructed and validated. For example, correlating a test of intelligence under construction with an already available test of vocabulary will yield the concurrent validity of the intelligence test, which will indicate whether the intelligence test has any relationship with students' acquisition of vocabulary.
Both predictive and concurrent validities are forms of criterion validity, with the difference that in the case of concurrent validity the criterion is made available at the same time when the test is being constructed, while in the case of predictive validity the criterion is made available at some time in the future. Both of these are expressed in terms of statistical indices, which are obtained by finding out the correlation between the test scores and the scores on the criterion.
Construct validity: Sometimes the test under construction intends to measure some trait which is just a hypothetical entity or quality and not a concrete phenomenon, and which cannot be observed directly. For example, intelligence, aptitude, attitude, and interest are such entities. They can only be indirectly inferred from experimental data or evidence, but cannot be explicitly defined. Hence, when the test-constructors construct such a test (an intelligence or attitude test), it becomes necessary for them to collect evidence to show that the test constructed by them really measures intelligence, interest, attitude and so on. This evidence must enable them to say with confidence that the test measures that trait. Establishing this fact means establishing construct validity. Thus, construct validity means the extent to which a test measures the trait which it was intended to measure.
It is difficult to establish the construct validity of any test, as valid criteria are generally not available. Constructs are hypothetical qualities, which can only be inferred from certain behaviours. For example, intelligence can only be inferred from a student's success in the examination (school marks), teachers' ratings of students, intelligent behaviour and other tests claiming to measure intelligence. But these criteria are not found to be sufficiently valid. Hence establishing the construct validity of a test against less valid criteria creates a problem. And this is a problem ad infinitum.

Norms
All test constructors develop norms for the tests they construct. Norms are derived or transformed scores. All, in the same way, convert the raw scores obtained by a large representative sample on the test into standard scores, which are the units of a scale which is universally understood. Two scales are used for preparing norms: (1) percentile norms and (2) standard scores. Standard scores have a variety of transformations such as sigma (σ) scores, t-scores, T-scores and stanines. They are calculated statistically with the help of the mean and standard deviation. Percentile norms are very popular. Norms help in the interpretation of scores. They render the scores obtained by any person or group of persons meaningful.
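As a brief, purely illustrative sketch of how such transformed scores can be obtained (the norm-group scores below are invented, and the definitions used are the common ones: a T-score has a mean of 50 and a standard deviation of 10, and a percentile rank here is the percentage of norm-group scores falling below a given raw score):

    from statistics import mean, pstdev

    def z_score(x, scores):
        # Sigma (z) score: distance from the mean in standard deviation units.
        return (x - mean(scores)) / pstdev(scores)

    def t_score(x, scores):
        # T-score: standard score with mean 50 and standard deviation 10.
        return 50 + 10 * z_score(x, scores)

    def percentile_rank(x, scores):
        # Percentage of scores in the norm group falling below x.
        return 100.0 * sum(s < x for s in scores) / len(scores)

    norm_group = [32, 35, 38, 40, 41, 43, 45, 47, 50, 52]   # illustrative raw scores
    print(round(t_score(45, norm_group), 1))    # about 54.5
    print(percentile_rank(45, norm_group))      # 60.0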
Characteristics of Standardized Tests
The standardized tests have the following characteristics:
• The standardized tests follow the guidelines and technical recommendations made by the American Educational Research Association and the National Council on Measurements Used in Education (1955). Hence, these tests are carefully developed tools. Their items are selected after a thorough scrutiny. Experienced experts prepare the items. They are subjected to a vigorous analysis and try-out. These test items are then analysed on a larger and more representative sample. Systematic studies are made for establishing their reliability, validity and usability. Obviously, they are of better quality and have wider applicability. They are more reliable and valid in general.
• Standardized tests have a carefully prepared manual, which describes the instructions for their administration and other details. The instructions are well standardized. They give specific information to the examinees as well as to the test user, with the result that the scores obtained on the test can be interpreted uniformly in the same way everywhere and by everyone.
• The standardized tests have standardized procedures for recording the examinees' responses and scoring them. This minimizes the variations in the scores and the errors creeping into the results.
• The standardized tests have various kinds of norms, which are generally established on large samples and for various groups of population. They provide valuable information and great help in interpreting the scores obtained by any single individual or any other group of individuals.
• The scope of applicability and use of standardized tests is generally very wide, both in terms of purposes and populations. They can be used for purposes of survey, selection, guidance, diagnostic assessment in the classroom, assessment of school programmes and instruction, general assessment of student abilities and achievement, and for providing a criterion for the standardization of other tests being newly developed. They can be used at the national and, in some cases, even at the international level.
Comparison with teacher-made tests: Although standardized tests are similar to teacher-made or classroom tests in several respects, there are some differences also between them. They are similar in the following respects, as described by DeCecco and Crawford.
• Both are means of performance assessment
• Both use the same types of test items
• Both require validity, reliability, objectivity and efficiency
The differences between them according to the same authors are:
• The classroom test may have more content validity because it is constructed with special reference to classroom objectives and the subject matter taught in the class. Standardized tests are constructed for common and broad objectives and subject matter. They may, for that reason, miss certain immediate classroom objectives and subject matters. Hence, their content validity for classroom use may be lower than that of the teacher-made test.
• The quality of the items of the standardized test is superior as compared to that of the teacher-made test. The reason is that specialists construct standardized tests.
• The standardized tests have a higher degree of reliability and validity.
• The procedures of test administration and test scoring of standardized tests are carefully described and more adequately standardized as compared to the teacher-made tests.
• The standardized tests have norms, which help in interpreting obtained scores on them meaningfully. Classroom tests, generally, do not have norms.
• All standardized tests have a published manual, which is not the case with teacher-made tests.

Tools of Research
After selecting a research design and deciding who will be included in the study, the next step is to identify or develop suitable tool(s) for collection of the desired information. Tools are nothing but the instruments that help the researchers to gather data. Naturally, the type of information you gather depends on the kind of tools you have used for this purpose. The selection of a tool depends upon the objectives and design of the study, and the type of respondents intended to be covered. Different types of tools are required for collecting information from illiterate populations and young children than from literate and adult respondents. Also, a status study (concerned with the question 'what') requires different types of tools than a process-exploring study (concerned with 'why' and 'how'). Various kinds of tools and tests used in educational research are presented in the following pages. Mainly, the following tools are discussed and described.
• Questionnaire
• Interviews
• Schedule
• Achievement test
• Checklist
• Inventories

Questionnaire
The Penguin Dictionary of Psychology defines a questionnaire as a series of questions dealing with some psychological, social or educational topic or topics, sent or given to a group of individuals with the objective of obtaining data with regard to some problem; sometimes employed for diagnostic purposes or for assessing personality traits. One type of questionnaire is concerned not with what the individuals can do, but with what they have done in the past or do habitually, what opinions they hold, what are their likes and dislikes, what are their fears or hopes and what kinds of persons they think they are. Another type is more objective, seeking to obtain factual information about individuals, educational practices, statistics about pupils and so on. Responses to a questionnaire may be verbal or numerical and they may or may not be amenable to statistical analysis.
The questionnaire as an instrument of research has been very much criticized in the past, since Horace Mann first designed it in 1847. In spite of its abuses and weaknesses, the National Education Association (NEA), however, said as a result of a study in 1930 that the questionnaire as a tool of research could not be condemned outright, although there is a lot of scope for drastic improvement. Today its weaknesses and limitations as well as its strengths are well recognized, and a more serious attempt is made to improve its quality and to limit its use to situations where it is most appropriate. A questionnaire, undoubtedly, has certain advantages in social science research. One major advantage is that it permits wide coverage for a minimum expense of money and effort. It reaches persons who are difficult to contact otherwise. Because of its impersonality it elicits more candid and objective replies. It allows greater uniformity in the manner in which the questions are posed and thus ensures greater comparability in the answers. The disadvantage of the questionnaire is undoubtedly the problem of non-returns, which decreases the size of the sample. This, however, is relatively unimportant wherever the sample is large. The possibility of misinterpretation of the questions is also a great disadvantage of the questionnaire. But constructing questions carefully and making them clear and unambiguous can minimize this. While constructing a questionnaire all these advantages and disadvantages should be kept in mind.

Construction of Questionnaire
The first step in the construction of an adequate questionnaire is to have a full and clear understanding of the objective of the study and the nature of the data needed.
The questionnaire should be neither too short nor too long. It can never be of infinite length. Stating the problem and the objectives more clearly and relating each question to the purpose may help in avoiding unnecessary and irrelevant questions. Each question must be justified on the basis of its contribution to the overall purpose of the study.
The questionnaire, once prepared, should be revised again and again on the basis of more discussion with experts, extensive reading, pilot study and so on.
The use of a five-point rating scale elicits more valid responses and is less frustrating to the respondent who wants to be truthful. Questions which are emotionally toned, too broad, vague, difficult in vocabulary, unnecessary, out of the frame of reference or having more than one idea in one question should be avoided.
Leading questions: These types of questions suggest to the respondents to answer in a specific manner. For example, if I ask 'Whether teacher X still abuses students in the class?', it suggests that teacher X behaves in that way with students. If you rephrase this question as 'How does teacher X behave with students in the class?', this is a neutral question and the respondents may answer this question either way.
Double-meaning questions: The questions should not be stated in a manner that they convey different meanings to different persons. Items should not contain hidden assumptions; for example, asking someone 'When did you stop beating children?' may indicate that you used to beat the children earlier.
Social desirability ('faking good'): It is said that man is a social animal.
We all want to create a positive impression about ourselves in the society and make a conscious effort to conceal the negative aspects of our personality. This tendency is also reflected while answering questions in a questionnaire. Consider the following questions.
1. Don't you abuse children? Yes/No
2. Should primary school children be subjected to corporal punishment? Yes/No
3. Do you help the needy? Yes/No
The answers to these questions in all probability will be 'yes' because answering them negatively will create a poor impression of the respondent in the society. So, attempts should be made to avoid socially desirable types of questions, or they should be worded carefully.

Format of Questions
Generally, two types of questions are used in a questionnaire: open ended and closed end. The type of questions the teacher sets for the annual examination, such as 'answer only five questions' (out of 10 in a 3 h time period), are known as open-ended questions. With open-ended formats the respondents may write whatever they want, which may cover three, five, or eight pages. Their answers may or may not be related to the question; they may describe something at a superficial level or may analyse it in great depth, and so on. In short, the respondents are free to answer the questions the way they wish to. The coding of these types of responses is not easy and coding errors may occur in the data set. A particular response may have positive or negative effects on the researcher and the coding will be affected accordingly. However, these types of questions provide an opportunity to assess the depth of knowledge of the students on a particular topic.
The closed-end formats require the researchers to have a reasonable idea of the likely responses to the question in advance and to specify those responses in the questionnaire. The common response formats in the case of closed-end questions are: Yes/No, True/False, multiple choice and rating scales. In the case of rating scales, the respondents are asked to indicate their views on a 3-point (agree, undecided, disagree), 5-point (strongly agree, agree, undecided, disagree, strongly disagree), 7-point, 9-point or 11-point scale. In some cases, the students are asked to rank a number of items in order of priority. For example, you list 10 things (may be fruits) and the students will order them according to their liking or disliking for a particular fruit. Rank 1 generally indicates high liking.
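For a positively worded statement, the usual practice is to code the five response categories from 5 down to 1 and then summarize the coded responses; the small sketch below is only an illustration with invented data (negatively worded statements are commonly reverse-scored before such coding).

    # Coding responses to a five-point rating-scale item (illustrative data).
    scale = {"strongly agree": 5, "agree": 4, "undecided": 3,
             "disagree": 2, "strongly disagree": 1}

    responses = ["agree", "strongly agree", "undecided", "agree", "disagree"]
    codes = [scale[r] for r in responses]
    print(codes)                      # [4, 5, 3, 4, 2]
    print(sum(codes) / len(codes))    # 3.6, the mean rating for this item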
Validity of Questionnaires
It is difficult to establish the validity of a questionnaire as a whole because it consists of specific and relatively independent questions, each dealing with a specific aspect of the overall situation. Mouly (1964: 252) holds the view that instead of the validity of the total instrument it is the validity of the individual items which should be considered more important. However, this does not mean that the unity and validity of the whole questionnaire should be ignored. These should essentially be considered with respect to the topic under investigation, as in the actual validation of any other instrument of tests and measurement. In the case of a questionnaire, however, face validity is considered most important. Adequate coverage of the topic, clear and unambiguous questions, and checking of obtained responses against an external criterion may contribute significantly to the validity of a questionnaire. Interviews with a small sample of respondents may serve as criteria against which the validity of a questionnaire may be established.

Reliability of Questionnaires
The usual procedures of calculating reliability are often ignored in the case of a questionnaire, because they are difficult to establish and apply here. Split-half reliability is, of course, out of the question. According to Mouly (p. 254), the reason given is that the items of the questionnaire are relatively independent and not additive. This is true in the case of a questionnaire which elicits different kinds of information. But in the case of questionnaires such as the Job Satisfaction Questionnaire and the Teacher Effectiveness Questionnaire, which elicit information about a single unitary trait, split-half reliability may be calculated. However, the test–retest method is considered to be the most feasible approach to the establishment of the reliability of the questionnaire.

Checklist for Evaluating Questionnaires
According to Mouly (1964: 263), the quality of a questionnaire can be improved by the following:
• It should deal with a significant topic.
• The importance of the problem should be clearly stated in the statement of the problem and in the covering letter.
• It should seek information not available elsewhere.
• It should be as brief as possible.
• The directions should be clear, complete and acceptable.
• The questions should be objective and relatively free from ambiguity and other invalidating features.
• Embarrassing questions should be avoided.
Some more suggestions are given in the following sections.

Type of Information Gathered Through Questionnaires
A questionnaire can be used for collecting the following types of information in research.
• Background and demographic information: A questionnaire may be used to collect information about the respondents' age, sex, education, SES, birth order, details of family, nationality/ethnicity, religious involvements and so on.
• Behavioural reports: Questionnaires are used for eliciting information about the past behaviour of the respondents. In such cases it is assumed that respondents have an accurate memory of past events, which may not be true. Also, many of the respondents may not be willing to report such events. Sensitive and socially desirable behaviours are often misreported.
• Attitudes and opinions: Questionnaires are most frequently used to know the attitudes and opinions of individuals. The common procedure is to present a statement and ask people to rate on a scale (usually a 3, 5, 7, 9 or 11 point scale) to what extent they agree or disagree with the statement. An alternative to the rating scale, the forced-choice design, is also used, where two opposing statements are presented and the respondents are asked to choose one out of the two.
• Knowledge: Questionnaires are also used to test the knowledge of respondents on a particular topic.
• Questionnaires are used to gather information about the expectations and aspirations of individuals.

Questionnaire Layout
This section is concerned with the presentation of questions in the questionnaire to gather better quality data and ensure a higher response rate. It may be remembered that the questionnaire is given to the respondents to fill in the necessary information. If the questions are not presented in an attractive manner in the questionnaire, there may not be enough motivation for the respondents to fill in the questionnaire. There are certain guidelines for preparing questionnaires, some of which are mentioned below.
Explanatory note: The questionnaire should be given with certain notes, which are self-explanatory in nature, to foster the respondents' motivation. These should include the broad aim of the study and why you are asking the respondents to fill up the questionnaire. The notes should be written in such a manner that the individuals are encouraged and feel that the researchers will value their responses and treat the responses with respect.
Length of the questionnaire: What should be the length of the questionnaire? How many questions the researchers would like to keep in their questionnaires is a difficult question to answer. There is no hard and fast rule. It depends upon the topic under investigation, the age of the subjects and the method of distribution (e.g., sending the questionnaire by post or delivering it personally). However, the questionnaire should not be so long that the students get tired or bored. On the other hand, very short questionnaires are not taken seriously. Roughly, a questionnaire should not take more than 45 min.
Question order: What should be the order of questions in a questionnaire? Generally, the questions are arranged from easy to difficult. The first few questions are made such that no respondent has any difficulty in answering them. Generally, questions related to age, sex, educational background and monthly income are kept at the beginning. This also develops confidence in the researcher to handle the questions that are placed later on in the questionnaire. It is rare to place extremely sensitive questions right at the beginning. People need time to get accustomed to the type of issues you are interested in. Slowly, the difficulty level of the questions is increased.
Question density: What should be the spread of questions on a particular page? In this computer age, by varying the size of the letters many questions can be printed on one page and thus the questionnaire may look short. It is a different issue whether the students will be able to read the questions. This exercise is undesirable and counter-productive. A clear and self-evident layout will enhance the possibility of getting valid information from the sample.
Interview Method
In general, the term 'interview' refers to an activity where an individual, who is called the interviewee, is asked a few specific questions by an individual, who is called the interviewer. What is to be highlighted in an interview situation is that an interview is a purposeful activity conducted for understanding the opinions and views of the interviewee by the interviewer for some specific purposes. This has a specific significance in educational research.

Interview as a Tool
It refers to an interview schedule, which has a series of questions/issues that serve as the focus for eliciting responses by the interviewer. These schedules are classified as follows:
• Structured
• Semi-structured
• Unstructured
A structured interview schedule is one where the items of the schedule are written in clear terms and in a particular order. The respondents are asked to answer all the items of the schedule in the same order without any change. It is a highly rigid format. These schedules can be debated for their advantages and disadvantages.
A semi-structured interview schedule is one where the items are not structured rigidly and the interviewers have the freedom of altering and adding questions or rearranging the items depending upon the nature of the responses they are getting.
An unstructured interview schedule is one where a broad framework is vaguely worked out and the interviewers have all the freedom of questioning the interviewee as deeply as the situation demands. This is used by specialists (particularly in qualitative research) who are willing to probe very deep into issues and opinions.

PARTICIPANTS IN INTERVIEW SITUATIONS
The participants in different interview situations may vary. There may be the following combinations in an interview situation.
Individual to individual: This is a situation where one interviewer interviews one interviewee. It is a one-to-one situation.
Individual to group: This is a situation where one interviewer interviews a group of interviewees. For example, panel discussions, even on television.
Group to individual: This is a situation where a group of interviewers interviews an interviewee. For example, selection committee interviews.
Group to group: This is a situation where a group of interviewers interviews another group. For example, a parent–teacher meeting. Such situations are also seen in certain TV interviews.
All the interview situations explained here can be used in research situations depending upon the requirement of the research studies.

Functions of Interviews
Interviews serve some specific functions according to the demand of the situation. They may be broadly classified as follows:
• Social
• Diagnostic/clinical
• Research
The interview serves a social purpose in the sense that it can cover various social issues which may be found in a society. Most of the newspaper interviews belong to this category.
The interview serves diagnostic purposes when it is conducted for clinical purposes. This may be used in a hospital/clinical situation, a paramedical situation, a school situation where a counsellor is trying to understand the problems of students, or any other situation where a diagnostic purpose is being served with remediation in view. All the information elicited will serve the purpose of remediation.
The research purpose is also fulfilled by interviews, as a lot of information is elicited as part of the qualitative information which is normally used in those researches where ethnographic methods are used, in which participant observation is used chiefly, apart from interviews.

Preparation for Interview in Research
Planning and preparation for the interview in educational research assumes significance and needs to be understood carefully. Certain steps to be followed in the planning and preparation of an interview schedule are as follows.
Define and profile the clientele to be interviewed: It is very important to define who the people to be interviewed are. It is also necessary to have background information about the people who are going to be interviewed.
Work out the participant composition of the interview: As discussed earlier, the participant composition may be one to one, one to group, group to one or group to group. This also needs to be planned in advance so that in different situations different strategies may be adopted. This requires careful planning.
List out the materials required along with the interview schedule: Depending upon the dimensions, participants and modes, certain materials that are required to be used in the interview need to be listed out. This can also prompt the researchers in planning the kind of questions that are to be asked and recording the responses.
Identify different items of the interview schedule: Based on the already identified domains, various items of the interview need to be identified. For example, the length of questions, precision, language used, shifting from one set of issues to another, and beginning and ending anchor items need to be carefully planned.
Further, planning items for structured, semi-structured and unstructured interviews will differ for obvious reasons. That needs to be understood.
Prepare the proposal for execution, recording and analysing the responses: Equally important is planning for the execution, recording and analysis of responses.
Once planning is over, the interview schedule is ready for use. Let us know how an interview can be successfully executed.

Execution of an Interview
The following tips will facilitate a researcher in executing the interview smoothly.
• Try to develop rapport with the respondents and put them at ease so that the respondents speak freely.
• Once you are sure that the respondents are ready for the interview, start with anchor questions smoothly in such a way that the respondents feel that the real interview has already begun.
• Listen patiently to all their opinions and enable the respondents to be as natural as they can.
• Show keenness towards the views expressed and try to get the most out of every question.
• Keep the direction of the interview in your hand, avoid irrelevant conversation, and try to keep the respondents on track.
• Do not jump from one question to the other unless the earlier question is answered fully to the satisfaction of the respondents.
• Let your response-recording mechanism not come in the way of the respondents responding to the questions.
• Repeat the question slowly and clearly in case the respondents have not understood it properly.
• Keep your pace, pause and intonation at par with the respondents' abilities. If not, the whole purpose may get defeated.

Recording and Reporting of Responses
In present-day society, the use of electronic equipment has arrived as a boon. Probably the simplest thing one can do is to record the responses using a tape recorder. If one does not have that facility, one can manually record the responses, maybe with the help of another person. But it must be noted that no activity must affect the attention and tempo of the respondent.

Schedule Method
There are several methods of data collection and of approaching informants with a view to obtaining their viewpoint on a social problem and finding its solution. The schedule method is one of the most important methods for the study of social problems. It is close to the questionnaire method in many respects, but the major difference between the two is that in the questionnaire method the respondents fill in the questionnaires themselves, whereas in the schedule method an investigator assists the informants and gives them necessary clarifications as and when required. The two methods, in many respects, are different in so far as the collection of data is concerned.
What is a schedule? A schedule is like a questionnaire, which contains a set of questions. These questions are required to be replied to by the respondents with the help of an investigator. According to Thomas Carson Macormic, 'the schedule is nothing more than a list of questions which it seems necessary to test the hypothesis or hypotheses'. Goode and Hatt have said that 'schedule is the name usually applied to a set of questions which are asked and filled in by the investigator in a face to face situation with another person'. As per G. A. Lundberg, 'the schedule is a device for isolating one element at a time, thus intensifying our observation'. C. A. Moser has said of the schedule that, 'since it is handled by the investigator, it can be a fairly formal document in which efficiency of field handling rather than attractiveness is the major operative consideration in design'. Then we come to a definition of the schedule given by Bogardus, who says that 'a schedule is a form of abbreviated questions which the interviewer carries with himself and fills out as he proceeds with his enquiry'.
From all these definitions it becomes clear that a schedule is a list of questions formulated and presented with the specific purpose of testing an assumption or hypothesis. Since in the schedule method an interviewer is always present and can also provide stimuli, the success of the schedule is linked with the ability and performance of the interviewer. Similarly, since questions are asked and replies are noted, the depth to which a problem is probed depends on the interviewer who carries the schedule. A schedule is thus a formal document for maintaining uniformity in questioning, and it is not always essential that it must be beautifully printed on attractive paper.
Aims and purposes of the schedule: Whether it is the questionnaire or the schedule method, obviously the main aim is to collect data for a research project in an objective manner. Since the investigators put the questions and the informants give replies, all of these cannot be memorized. A schedule helps in recording what cannot be memorized. Since all the information is available in writing, tabulation and analysis of the data collected becomes easy, because the information to be analysed is available on the schedule. Another purpose of the schedule is that it delimits and specifies the object of enquiry, because in this method questions are asked about a specific subject and information is collected about that alone.
Types of schedules: Although schedules are of different types, the aim is to collect data. The types of schedules are as follows:
Observation schedule: This is a type of schedule in which questions are put on a specific topic about which the investigators want to collect data and information. The questions are absolutely pointed ones, and the investigators collect information as well as simultaneously observe the reactions of the respondents. On the basis of these observations, if necessary, they also put certain additional questions to clarify the position. In this case, the respondents can be individuals or groups of individuals, and the schedule is filled under certain specific conditions. Many a time, this method is adopted to verify information already collected.
Rating schedule: In social research, rating schedules are used when information is to be collected about attitudes, opinions, preferences, inhibitions and other similar elements, and their values are to be assessed and the value of each is required to be measured. These prove very useful when factors that are responsible for a phenomenon are to be measured. Different scales of measurement are to be constructed for evaluation.
Document schedule: In this, the whole study is based on certain documents, e.g., studies which deal with the writing of history and so on. With the help of these documents certain questions are asked about the life history of a person, and on the basis of the replies received efforts are made to construct the life history. It is felt that more than the required material should be collected so that some material can be kept in reserve for future use. In this, quite a large number of relevant records are consulted. In this type of schedule, those terms are used which frequently occur in the documents, otherwise some confusion may arise. The documents which are frequently consulted include autobiographies, diaries, case histories and government records.
Institutional survey schedule: In every society there are some specialized institutions and agencies. These schedules are used for collecting information about them. The nature and complexity of the institution decide the size of the schedule. Obviously, the more complex the working of the institution, the more bulky will be the size of the schedule. This schedule is also helpful in studying both traditional and immediate problems of an institution.
Interview schedule: This is used for testing and collecting data and also for the collection of supplementary data. The interviewers take the schedule with them, interview the respondents and fill in the forms. Usually in this method certain standardized questions are asked by the interviewer.
Characteristics of good schedule: Every schedule cannot be a good questions and evaluation, more particularly when the informants are
one and for a good schedule it is desirable that it should possess certain fully aware that they are under no obligation to answer the questions
qualities. These are the following. being put to them by the investigators.
Accurate communication: From accurate communication we mean If at all the respondents decide to respond to the questions, they
that the questions in the schedule should be so worded that there is no will be interested in replying to such questions, which are directly
gap in what is asked by the investigator and what is understood by the connected with the study. They may resent replying to such question
respondent. If the respondent understands exactly what is being asked by with which the study is not directly linked. They may even decline
the investigator, then we can say that there is accurate communication. replying such questions and can become repulsive also. Such questions
It is, therefore, most desirable that the questions being asked should be should only be put when it is seen that additional information being
very clear and not ambiguous. These should be very short and precise called for is necessary and that the informants are in a mood to cooperate
so that respondents do not take a very long time in understanding them. and respond.
The questions should be closely inter-linked with each other and it Although, all sorts of questions can be put in a schedule, select only
should appear that whole information is being sought in a rational those that are essential and proper, which can be analysed and subjected
manner and that one question is following by the other in a natural to statistical tests. In case tabulation of questions is not possible, then
sequence. It is very desirable that all the questions should be worded there is no use in collecting information on such questions.
and put in such a way that the respondents feel attracted to give their In order to have some checks and counter-checks, as well as to
suggestions. have in-depth information, it is better to sub-divide each question so
No question should be included in the schedule, which has no direct that the informants do not feel bore while replying and their faults and
bearing with the subject matter under study. Accordingly it is essential faltering can also be checked.
that before pulling the questions in the schedule, whole literature dealing In a schedule, idiomatic, technical, ambiguous, indefinite-
with the subject matter should be clearly and carefully studied. It is imaginative and private terms should be avoided because it is usually
also desirable that those persons who are experts on the subject should difficult for the investigators to clarify these and much of subjectivity
be consulted and their views obtained. Before finalizing the schedule gets introduced in the replies, which are recorded. When the same
the investigators may prepare a list of primary and secondary effects terms are differently understood both by the respondents as well as the
and also a list of tertiary effects, and should approach the affected investigators then the reply will become unreliable and undependable
persons in that order. and the whole study will become a futile attempt.
In order to get accurate response it is better to prepare the No questions should be included and asked, which develops a
schedule in a scientific way and also in a way that the respondents sense of shame in the respondents, on which they are dependent for
feel inspired to give correct information. The questions should not replying or on which they have no information and it is expected of
be of such a nature that while replying the respondents get bored. them to go and collect information from others.
Similarly informants will not like to give reply to a question, which While collecting information investigators should always leave
injures their feelings. In fact, after such a question has been put an impression that they are enjoying responses and being benefited by
to them, they will decline to cooperate with the investigators and the impression being provided by them, no matter whether they are
refuse to respond to remaining part of the schedule and questions actually enjoying the responses or not.
contained in that. A very conscious approach in this regard should, The investigators should never try to impose themselves on the
therefore, be adopted. respondents and never forget that the latter is under no obligation to
The investigators should not put such questions in which there respond to their question. Therefore, they should never behave arrogantly
is element of subjectivity because the people dislike both subjective and at no stage suffer from superiority complex.
The investigators should allow their respondents to respond to to field workers. In so far as part I, namely prefatory, is concerned it
questions without any intervention from them. They should intervene should include the name of the survey and surveyor with their full
when their help is needed or when some stimuli is needed. address, name of the sponsoring agency, reference number, name of
Need for pre-testing: As in the case of questionnaire, it is essential the respondents along with their age, sex, education, profession and
that the whole schedule should be pre-tested. For this, a sample should address.
be picked out of the universe and tested. Needless to say that this sample Second part should cover the title, sub-title and columns of each
should be a representative. Once the defects of the questions in the question.
schedule have come to light these should be removed and questions Third part of the schedule should give clear instructions to the
should be modified based on this. It should, however, not be taken field workers. It should include as to what is to be covered and what
for granted that modified schedule is perfect in all respects and ready is not to be covered. Similarly, it should state as to what is meant by
for use. This should again be used on representative sample and the each term included in the questions. Other relevant instructions are
process should be repeated till it is clear that there is communication to be clearly given so that element of subjectivity and arbitration is
accuracy and the informant understands tile question in the same sense removed to a considerable extent?
in which investigator is putting that. Difference between schedule and questionnaire: There is difference
Of course, in a schedule method the investigators’ aid is available between a schedule and a questionnaire. The basic difference is that a
to informant, but the latter may not need the help of the former and may questionnaire is mailed and there is no investigator to help the informant
like to respond to the questions without investigators’ help. But all the in filling the same, i.e., questions included in the questionnaire, whereas in
more it is essential and also convenient that in the schedule good paper the case of a schedule the investigators personally take the questionnaire
should be used. Its size should not be unwieldy. There should not be with them to their respondents. They put the questions and fill the replies
too many folds. There should be sufficient space for noting down all themselves. Wherever necessary they also helps the respondents in
the information that is provided by the informant. There should also be filling tile questionnaire and provide necessary clarifications if needed.
good space for making additional notes about what the investigators In the case of schedule the investigators also provide stimulus to the
observe during the course of discussions with their respondents. As in respondents which is not the case with a questionnaire. These are only
the case of questionnaire it will always be better if the investigators do a few differences. There are many other points of differences between
not take more than half an hour with the respondents in finishing their the two. Some of these are as follows:
work. There should be sufficient marginal space. In between the two Direct method of collecting data: In the questionnaire, investigators
questions there should be sufficient space, so that the information supplied do not go to the field. They mail the questionnaire and get the replies
does not get mixed up. It is always better if the schedule is got printed while sitting at home. Thus, they may or may not visit the field at all.
instead of getting it typed or cyclostyled, because a printed schedule In this way questionnaire is an indirect method of collecting data. On
always attracts the respondents and also increases reliability of the study. the other hand, in a schedule the investigators go to the field. They
If the study is to be carried out in an area where the people are meet their informants, put questions to them and personally help them
not highly qualified, there is no harm if pictures are also used in the in filling up the questionnaire and so on. Thus, it is a direct method
schedule. In fact, in many cases use of pictures attracts people and of data collection.
they feel tempted to reply to the questions with eagerness. The use Difference in nature of study: In questionnaire method all such
of pictures in some cases also helps in understanding even complex studies can be undertaken in which vast area is to be covered and the
problems quickly and promptly. nature of information does not require going deep into the issue. It is
Organizations of schedule: A schedule should be organized in three because questionnaires are to be mailed and mailing can be done in
broad categories: (1) prefatory, (2) main schedule, and (3) direction any part of the world. On the other hand, this is not possible in the
case of a schedule because the investigators are required personally all these things count, yet not to that extent, because the respondents
to go and get the desired information collected. They can, therefore, themselves are to use the schedule and if that is not as good as in
cover a small area, but being personally present in the field can seek appearance as the questionnaire, to some extent even that can be
in-depth information from their respondents. tolerated.
Importance of covering letter: As against schedule, in a questionnaire, Gap between investigator and respondent: In the case of schedule
covering latter plays an important role. Since the investigators are not there is no wide gap between the investigator and the respondent, because
present before the respondent and the former cannot seek any clarification they are face to face with each other. They help each other in filling
from the latter, they have to get every information from the covering up replies to the questions and clear each other’s doubts. On the other
letter, i.e., why the information is being collected, who is seeking and hand, in the case of questionnaire method there is a wide gap between
at whose instance and so on. In fact, who, why and what are to be the investigator and the respondent. It is because they have not met
covered in the covering letter. Even each word in the covering letter, each other and there is every possibility that throughout the study they
directly or indirectly, positively or negatively affects the attitude and even may not meet at all. Of course, this helps the study as a whole.
behaviour of the respondents. Difference in mortality rate: Usually, mortality rate is very high in
On the other hand, covering letter does not haves that much questionnaire method. The respondents, due to various reasons, do not
significance in so far as schedule is concerned because the investigators feel tempted to return the questionnaires in spite of several reminders.
are present to clarify every point that is raised by the informant. In Even if the replies are received, in many cases it is found that either
some cases, they may not even like to spend their time in reading what the answers have not been given in the proper way or many questions
is written in a covering letter. have been left unreplied. Some times, the respondents do not reply
Difference in reliability of information: Obviously, information to such questions in which they are not interested or which according
collected through schedule method is more important than the one to them require time or labour that are too close to their personal life.
collected through questionnaire. It is because in the former case the There being no method of persuasion the investigators have to satisfy
investigators themselves go to the field and clarify all doubts. They themselves with the information, which they receive. In many cases
know it fully well that the information which they have recorded is mortality rate is as high 70 per cent, which practically frustrates the
first hand and unbiased and as such is most reliable. On the other hand, purpose of mailing questionnaire.
in the case of questionnaire, the investigators have to depend on the But the position is different in the case of schedule method. The
information sent to them. They do not know whether that is right or investigators take the schedule themselves, collect information, persuade
wrong. Hence, it is less reliable. In questionnaire method, there is no the informants to reply to even such questions in which they are not
method to verify whether the information supplied by the respondents directly interested and carry all the information with them. Mortality
are right or wrong, whereas in schedule method it is possible. Similarly, rate in this case is very low and arises only when the respondents are
if anything is left out in the case of questionnaire, it is difficult to get that not available, not cooperative or not inclined to give information due
completed without loss of time. In the case of schedule, it is, however, to various other reasons about questions put to them.
possible to get the incomplete information completed. Difference in coverage: In so far as schedule method is concerned
Difference in format: In the case of questionnaire format counts a even such persons who have no high standard of education can be
lot. It is well known that if the paper used is good, printing is attractive, covered, because investigators are present to give all clarifications.
spacing is proper, there are not many folding, handling is convenient, Moreover, in schedule method the questions are not very complex.
ink does not spread on the paper and the letter requesting for sending of But in questionnaire method only such persons can be covered who
information is appealing, then there is every possibility of respondents have obtained a particular standard of education and can well answer
replying soon. On the other hand, in the case of schedule, although the questions without the help of any investigator.
Difference in obligation: In questionnaire method, the respondents In the word of Barr, Davis and Johnson (1953), ‘rating is a term
are not under any obligation to answer the questions. They may or may applied to the expression of opinion or judgement regarding some
not return the questionnaire sent to them. Moreover, they are also free situation, object or character (person)’. These opinions are usually
to write whatever they like and can reply even confidential questions expressed on a scale or by categories of values, either quantitatively
with confidence because nobody is present right before them. On the or qualitatively.
other hand, in so far as schedule method is concerned the respondents The rating scale procedures are the most popular, widely used and
are under obligation to reply because the investigators are right before easy to administer among all research procedures that depend upon
them to fill the schedule. In their presence the respondents may be human judgement. They are used in the evaluation of individuals,
hesitant to some extent while replying to such questions. Thus, they their reactions and in the psychological evaluation of stimuli. They
may not frankly come out with their views. may be used to describe the behaviour of individuals, the activities
Difference in use in sampling method: Schedule method can of an entire group, the changes in the situation surrounding them,
easily and safely be used in sampling method of research. Whoever or many other types of data. They are also used to record quantified
is covered in sample is under obligation to reply because of the observations of a social situation. A number of rating techniques have
presence of the investigators. Moreover, the replies received are also been developed, which help the observers to ascribe numerical values
complete. On the other hand, sampling method may or may not be or ratings to their judgements of behaviour. There are seven types of
successful in questionnaire method because questionnaires mailed rating scales:
may not come back and even if received back the replies may be • Descriptive rating
incomplete or vague.
• Percentage of group scale
Problem of rapport: In schedule method, there is serious problem of
rapport. If somehow investigators are in a position to establish rapport • Numerical scales
with the respondents they shall not be in a position to get information • Graphic scales
from them and their project will not react to its logical conclusions. But • Standard scales
in questionnaire method, there is no importance for rapport because • Rating by cumulative points
the investigators and respondents are not facing each other.
• Forced choice ratings
Rating Scale
Construction of Rating Scales
Rating scale refers to a scale with a set of points, which describe
various dimensions of an attribute being observed. It ascertains the Main steps for construction of rating scales are as follows:
degree, intensity or frequency of a variable. To construct such a scale, • The knowledge of general rules.
the investigators have to identify the factors to be measured, place unit • The judges who will do the ratings.
or categories on a scale to differentiate varying degree of that factor
• The phenomenon to be rated.
and describe these units in some manner.
No established rule governs the number of units that should be • The continuum along which they will be rated, i.e., the type of
placed on a scale, but having too few categories tends to produce rating scale.
crude measure that have little meaning and having too many categories The researcher after considering there steps will decide the type
makes it difficult for the rater to discriminate between one step and of rating scale. The scale may take any number of different forms;
the next on the scale. it may be simply a series of numbers, a graduated line, quantitative
Rating means the judgment of one person by another. terms such as good and poor, a series of named attributes peculiar to
each scale or a series of carefully worded descriptions of statements • Variety: The use of the same terms in all or many of the cues may
representing different degrees of each aspect to be rated. fail to differentiate them sufficiently. Vary the language used at
The procedure may consist of statements which describe various different scale levels.
forms of individual’s behaviour, such as talkativeness, accurate but • Objectivity: Cues with implications of good or bad, worthy or
very deliberate, works well without supervision and so on. Sometimes, unworthy, desirable or undesirable should generally be avoided.
specimens of work, i.e., handwriting representing various levels of
• Uniqueness: The cues for each trait should be unique to that trait.
merit, for example, may be placed on a continuum according to values
Avoid using cues of a very general character such as ‘excellent’,
determined by a jury.
‘superior’, ‘average’, ‘poor’ and the like.
General Rules There are no hard and fast rules concerning the number of steps
or scale divisions to be used in a rating scale. In general, five to seven
There are certain main points, which an investigator or researcher
point scales are seen to serve adequately. With willing, motivated,
should keep in view while constructing a rating scale.
serious and cooperative raters, much finer divisions of the scales
A trait to be rated should be given a trait name and definition.
prove profitable.
Guilford (1954) has suggested the following rules for defining and
describing the trait: Types of Rating Scales
• Traits should be described univocally, objectively and specifically. The following are the different types of rating scales.
• Each trait should refer to a single type of activity.
Descriptive Rating
• A trait that is to be rated should not be a composite of a number
of traits that vary independently. In descriptive rating, the raters put a check () in the blank before
the characteristic or trait is described in a phrase. In order to judge the
• Traits should be grouped according to the accuracy with which
pupil’s initiative, for example, the rater may be asked to tick mark the
they can be rated.
most befitting description out of the following:
• In describing traits, avoid the use of general terms such as ‘very’,
• Shows marked originality
‘extreme’, ‘average’ or ‘excellent’.
• Willing to take initiative
• Finally, do not use scales for traits.
A rating scale should make use of good ‘cues’. Guilford (1954) • Quite inventive
on the basis of a study of cues by Chambney (1941) has listed six • On the whole unenterprising.
requirements for good cues: • Very dependent on others.
• Clarity: Use of short statements, in simple and unambiguous
terminology. Percentage of Group Scales
• Relevance: The cue should be consistent with trait name and its Here, the rater is asked to give the percentage of the group that
definition as well as with other cues. possesses the trait on which the individual is rated, for example, for
rating the honesty of an individual, the rater may check one of the
• Precision: A good cue applies to a point or a very short range on
following:
the continuum. There should be no doubt about its position among
other cues and if possible it should not overlap them in quantitative • Falls in the top 1 per cent
meaning. • Falls in the top 10 per cent but not in the top 1 per cent
• Falls in the top 25 per cent but not in the top 10 per cent
• Falls in the top 50 per cent but not in the top 25 per cent
• Falls in the lower half, but not in the bottom 25 per cent
• Falls in the bottom 25 per cent but not in the bottom 10 per cent
• Falls in the bottom 10 per cent but not in the bottom 1 per cent
• Falls in the bottom 1 per cent
Numerical Scales
In a numerical scale, numbers are assigned to each trait. Here, a
sequence of defined numbers is supplied to the rater. The rater assigns
to each stimulus to be rated an appropriate number in line with these
definitions or descriptions. One example of such a scale, for rating
colour combinations in pictures, is as follows:
• Extremely pleasant
• Moderately pleasant
• Mildly pleasant
• Indifferent
• Mildly unpleasant
• Moderately unpleasant
• Extremely unpleasant
In this type of rating it has been seen that the rater usually avoids
terminal categories. Here, the rater would tend to avoid categories 1
and 7, and thus the range of ratings gets shortened. To avoid this
shortcoming it is suggested to expand the scale beyond the categories
which the researchers want to include in the scale. If the researchers
want an effective scale of five points, they may make use of two
additional categories so that the desired dispersion of the five-point
rating is achieved. Hence, they should have a seven-point scale for an
effective five-point scale.
In some numerical scales, the raters are not provided with numbers
which they have to use in making judgements. They have to report in
terms of descriptive cues and then the researchers assign numbers to
them. For example, while rating performance in a drama, the cues may be
the following: extremely poor, very poor, poor, average, good, very
good and extremely good. To these cues the numbers 1–7 may be assigned
by the researchers.
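The assignment of numbers to such descriptive cues can be carried out
mechanically once the correspondence is fixed. The following minimal
sketch in Python is an illustration added for clarity only; the names
CUE_VALUES and summarise_ratings are hypothetical, not part of any
prescribed procedure.

    # Illustrative only: map the drama-rating cues to the numbers 1 to 7
    # and average the judgements of several raters.
    CUE_VALUES = {
        "extremely poor": 1, "very poor": 2, "poor": 3, "average": 4,
        "good": 5, "very good": 6, "extremely good": 7,
    }

    def summarise_ratings(cues):
        """Convert descriptive cues to numbers and return their mean."""
        scores = [CUE_VALUES[cue.lower()] for cue in cues]
        return sum(scores) / len(scores)

    # Three hypothetical judges rate the same performance.
    print(summarise_ratings(["good", "very good", "average"]))  # 5.0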
Graphic Scales
In this scale a straight line is shown, vertically or horizontally,
with various cues to help the rater. The line is either segmented in
units or it is continuous. If the line is segmented, the number of
parts can be varied. An example of such a scale is shown in Figure 10.1.
How effective was the presentation of lessons in the class by the
teacher?
Figure 10.1: Graphic scales
The graphic scales are simple and easy to administer. They are
interesting to the raters and require little added motivation. The
raters can fill them quickly as such scales do not bother them with
numbers. The graphic scale provides opportunity for as fine a
discrimination as that of which the raters are capable, and the
fineness of scoring can be as great as desired.
Standard Scales
In standard scales, a set of standards is presented to the rater. The
standards are usually objects of the same kind to be rated, with
pre-established scale values. In its best form, this type is like that
of the scales for judging the quality of handwriting. The scales of
handwriting provide several standard specimens that have been spread
over a common scale by the methods of equal-appearing intervals or of
pair comparisons. With the help of these standard specimens, a few
samples of handwriting can be equated to one of the standards.
Man to man scale is another example of standard scale. In this scale,
the individuals are asked to rate the ratees by comparing them with the
persons mentioned on the scale and assuming the ratee's position. For
example, A, B, C, D and E are persons who have already been rated as
very persistent, not easily stopped by anyone, works quite steadily,
somewhat changeable and gives up easily.
Example
Is he generally a persistent person?
This type of scale was originally developed for use in connection with
military personnel and is of little use elsewhere. Moreover, because of
the subjectivity element, the use of this type of scale is very limited.
Rating by Cumulated Points
The 'Checklist method' and the 'Guess-who technique' belong to this
category of rating. The common feature of rating by cumulated points is
in the method of scoring. The rating score for an object or individual
is the sum or average of the weighted or unweighted points.
Checklist methods are applicable in the evaluation of the performance
of personnel in a job. This method was used by Hartshorne and May
(1929) for evaluating the character of children. A list of 80 trait
names describing some favourable and unfavourable qualities like cruel,
co-operative, thoughtful, humane and greedy was prepared. Each rater
has to check every term in the list that applies to a child. The
weights of +1 and −1 were assigned to every favourable and unfavourable
trait, respectively. The algebraic sum of the weights was the child's
total score.
The Guess-who technique was developed by Hartshorne and May (1929) for
use particularly with child raters. In this technique, the students are
asked to read each descriptive statement presented to them and then to
write down the name of the student who best fits that description. The
students may use more than one name for each statement, and they are
also allowed to use their own name. Examples of statements used in this
kind of technique are as follows:
• There is a person who is disliked by others and has lots of enemies.
• There is a person who is always doing little things to make others
happy.
The score for each student is the sum of the number of times one is
chosen for each descriptive statement. If the positive statements
indicate socially desirable qualities and the negative statements
indicate undesirable ones, the total score of the child will be the
algebraic sum. For example, if a student is mentioned positively nine
times and negatively three times, the score will be 9 + (−3) = 6. If
there are several positive and negative statements for a number of
behavioural attributes, it is possible to get a score for each
attribute for each child of the group. These scores are useful in the
study of individual roles and serve as a measure of reputation.
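The scoring rule just described is a simple algebraic sum and can be
restated in a few lines. The sketch below is an illustration only; the
function name guess_who_score is hypothetical.

    # Illustrative only: the Guess-who score is the algebraic sum of
    # positive and negative nominations, as in 9 + (-3) = 6 above.
    def guess_who_score(positive_mentions, negative_mentions):
        return positive_mentions - negative_mentions

    print(guess_who_score(9, 3))  # 6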
Forced-choice Ratings
In the forced-choice rating method the rater is asked, not to say
whether the ratee has a certain trait or to say how much of a trait the
ratee has, but to say essentially whether the ratee has more of one
trait than of another of a pair.
In the construction of a 'forced-choice rating', descriptions are
obtained concerning persons who are recognized as being at the highest
and lowest extremes of the performance continuum for the particular
group to be rated. Descriptions are analysed into simple behaviour
qualities, stated in very short sentences or by trait names, which are
known as elements.
These elements are used to construct items, and then a 'discrimination
value' and a 'preference value' are determined for each element. In
forming an item, elements are paired. Two terms or statements with
about the same high preference values are paired, one of which is valid
and the other is not.
Two terms or statements with about equally low preference values are
also paired, one being valid and the other not. Two pairs of terms or
statements, one pair with high preference value and the other with low
preference value, are combined in a tetrad to form an item. An example
of a tetrad given by Guilford (1954) is as follows:
• Careless Error of Leniency
• Serious-minded The raters would not like to run down their own people by giving
• Energetic them low ratings. The result is that high ratings are given in almost all
cases, such raters are called ‘easy raters’. Some raters become aware
• Snobbish
of the failure of easy rating and consequently rate individuals lower
The rater is asked to react to each tetrad as an item, saying which
than they should. Such raters are called ‘hard raters’.
one of the four best fits the ratee and which one is least appropriate.
The leniency error refers to a general and constant tendency
The tool is tried out in a sample for which there is an outside criterion
for a rater to rate too high or too low for whatever reasons. When
for the purpose of validating the responses. Then the ‘discriminating
rating is too high, the constant error is one of positive leniency. On
responses’ are determined and ‘differential weights’ are assigned to
the other hand, the constant error is one of negative leniency when
each item. Taylor and Wherry (1952) found that leniency error gets
rating is too low. The positive leniency error is most common and an
reduced in forced-choice rating when compared with graphic ratings.
arrangement of ‘cues’ given by Guilford (1964) may prove helpful
Advantages of Rating Scales to counteract it.
In this example, only one unfavourable ‘cue’ is given and most of
The main advantages of rating scales are following:
the ranges are given to degrees of favourable report. The researchers
• They are commonly employed in judging contests of various kinds evidently anticipate a mean rating somewhere near the cue good and
such as speaking, declamation contests and music competitions. a distribution symmetrical about that point.
• They have been put to extensive uses in the filed of rating teaching
and teachers. Halo Error
• They are also used for testing the validity of many objective Halo means a tendency to rate in terms of general impressions
instruments like paper-pencil inventories of personality. about the rates formed on the basis of some previous performances.
• They are also used for personality ratings, sociological surveys, For example, one tends to rate a person with a pleasing personality high
school appraisals including appraisal of courses, practices and on traits like initiative and loyalty also. Halo effect appears frequently
programmes. when the raters have to rate a number of factors on some of which they
have no evidence for judgement.
• They are advantageous in several other ways, that is, they are
Guilford (1954) suggests that the practice of rating one trait at a
helpful in:
time on all rates, facilitated by having one trait per page rather than
(i) Writing reports to parents one rate per page, and the practice of the forced-choice technique may
(ii) Filling out admission blanks for colleges be used to counteract the halo effect.
(iii) Finding out student needs
Error of Central Tendency
(iv) Making recommendations to employers
There is a tendency in some raters to rate most of ratees near the
(v) Supplementing other sources of understanding about the child
mid point of the scale. They would like to put most of the rates as
(vi) Their stimulating effect upon the individuals who are rated average. It is more common among the individuals who are unknown
to the raters. To counteract the error, greater differences in meaning
Limitations of Rating Scales
may be introduced between ‘cues’ near the ends of the scale than
Rating scales have several limitations, some of them are discussed between ‘cues’ near the centre.
as follows:
Logical Error Achievement Test
Such an error occurs when the characteristics or the traits to be Education or teaching is an activity and its effect is learning
rated is misunderstood. It is due to the fact that judges are likely to or modification of mental or conative behaviour. This effect in
give similar rating for traits, which they feel logically related to each education has to be evaluated or measured, i.e., it is to be discovered
other. Guilford suggests that this error can be avoided by calling for as what and how much the child has learnt out of one subject or
rating based on judgement of objectively observable actions rather one situation.
than abstract and semantically overlapping traits.
Old Concept of Evaluation
Contrast Error In the old system of education evaluation took the form of school
This error is due to the tendency of the raters to rate others in examinations. They are better termed as essay-type of examinations. It
the opposite direction from themselves in a trait. For example, in a was believed that intelligence and other esoteric qualities of personality
study the raters were asked to rate individuals in the trait of ‘need were expressed through writing answers in the form of essays. How
for orderliness’. It was seen that the raters who themselves were much a student had achieved in a particular subject could also be
high in orderliness tended to see others as being less orderly than gauged from these essays in that subject. Thus, the score of these
they were, and the raters low in orderliness tended to see others as
being more orderly than they were. This error can be avoided to
some extent by making the raters aware of the existence of such
a phenomenon. the progress of students.
Rating scale is the type of inquiry, which is devised and
administered for the purpose of securing judgements of certain persons Demerits of Traditional Examinations
about certain limited aspects, individuals, groups or performances. Soon it was discovered that these essay-types were not accurate
They measure the degree or amount of the indicated judgements. tests of student’s ability. Numerous researches were carried out as a
Construction of rating scale involves knowledge of general rules, result of which following demerits were revealed.
the judges who will do the ratings, the phenomena to be rated and
finally the type of rating scale, i.e., the continuum along which the Unreliability
phenomena will be rated. An instrument is reliable if it measures exactly the same amount
The researchers should also consider the limitations and advantages of everything. In order to be reliable its results must be the same on
of a particular type of rating before finally deciding about the type every occasion it is used or same when used by different persons in a
of rating scale for their study. The main types of rating scales are single case. Our essay-type examinations do not give such results. It
descriptive rating, numerical scale, the graphic scale, the percentage has been found that different examiners have scored the same answer
of group scale, standard scale, rating by cumulated points and forced- differently. The range of marks varied sometimes from 30 to 80 in the
choice ratings. same essay. Hence, this type of assessment is unreliable.
Rating scales have been put to uses in the field of rating teaching
and teachers, personality ratings, school appraisals, judging contests Validity
of various kinds such as speaking, declamation contests and music A test or an examination is valid when it measures the same trait,
competitions. The rating scales suffer from many errors and limitations, quality or function for which it is made. The traditional examinations
namely, leniency error, the halo error, the error of central tendency, were meant for measuring education achievement in different school
the logical error and the contrast error. subjects. But, it has been found that they are not able to single out the
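Where ratings from several raters are available, a rough screening for
some of these errors can be attempted by inspecting each rater's mean
and spread. The sketch below is illustrative only; the thresholds and
the function name screen_rater are assumptions, not established
cut-off points.

    # Illustrative only: flag raters whose mean rating sits far from the
    # scale midpoint (possible leniency) or whose ratings barely spread
    # out (possible central tendency). Thresholds are arbitrary examples.
    from statistics import mean, stdev

    def screen_rater(ratings, scale_min=1, scale_max=7):
        midpoint = (scale_min + scale_max) / 2
        width = scale_max - scale_min
        flags = []
        if mean(ratings) > midpoint + 0.25 * width:
            flags.append("possible positive leniency")
        elif mean(ratings) < midpoint - 0.25 * width:
            flags.append("possible negative leniency")
        if stdev(ratings) < 0.1 * width:
            flags.append("possible central tendency")
        return flags

    print(screen_rater([6, 7, 5, 7, 6, 7]))  # ['possible positive leniency']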
knowledge achieved and are sometimes tests of mere rote memory which • Trying out the test
cannot be the function of school examination. Hence, they are invalid. • Evaluating the test
There are other demerits also. They have become unwieldy and are Planning of Attainment Test. While planning the attainment test
difficult to be organized. They have distorting effect on the curriculum. one has to record the functional effects and the behavioural changes
The questions set in the examinations are generally selective. The desired to be produced by a particular subject. It means that they have
students prepare only important portions and leave others. Thus, the to thrash out various aims of teaching that subject. Then, an analysis of
whole curriculum is distorted and its purpose is defeated. If they are the courses prescribed in that subject is to be made. What is expected
reliable, their results must be the same on every occasion they are used from the child with regard to these courses and aims is very carefully
or same when used by different persons in a single case, also highly determined. For each class or grade a certai n level of achievement or
subjective as the scores are influenced by the whims, opinions and standard of knowledge and skill is expected. This has do be kept in
interest of the examiners. mind. Planning must take into consideration the class for which the
test is being constructed.
The New Concept
Preparing Test. After above exploration, the test is prepared. In
As a result of these demerits of the old examination system and drafting the test questions one has to see that the courses of the subject
emerging science of psychology, various attempts have been made to make for that class are well represented. For example, memory items, grasping
the tests of achievements objective, reliable and valid. These attempts have of the subject and interpretations, all should be tested by the questions.
resulted into construction of objective tests, which are called achievement Care should be taken while testing discrimination, judgment, intellectual
test, or attainment tests in various subjects. In foreign countries, their use and emotional attitudes, appreciations and application of knowledge.
has become very popular and wide. In India, an attempt is being made Questions should be appropriate to the material to be included. In the
to supplement the prevalent examinations with these tests. first draft there should be as many as double the number of questions
to be retained in the final test. For example, if you want to keep 100
Attainment Test
questions in the final test, there must be about 200 questions in the
Achievement tests measure one’s acquisition of knowledge in a first draft. Questions should vary in difficulty from very easy to very
particular subject. In a technical sense, they measure the functional difficult to suit all students’ ability. As to difficulty, the test should
effect and behavioural changes produced by a school subject within be such as the average boy should be able to make about 50 of the
child’s personality. Teaching of language has a definite function, which possible scores. Easier items should be placed at the beginning and
is to produce such mental and behavioural changes as acquisition of difficult ones at the end. The test is prepared in two forms: 100 items
vocabulary and its use (verbal behaviour), fluency in speech and in test A and 100 items in Test B.
expression (linguistic behaviour), improvement in writing (expressional Try-out. This is the stage at which one tries to know how good
behaviour), grasping and understanding (receptive behaviour), refinement the test is. At this stage, they have to make a selection of most suitable
of feeling and emotions (affective behaviour). Similarly, each subject questions and reject others, which are unsuitable. For this purpose
has a functional effect, which has to be located, defined and assessed. the test is administered to about 200 or 300 boys representative of all
Construction. In the construction of a psychology test four steps grades of abilities. It is felt that there may be chances for the students
are involved. This is so with the attainment test also. Following are to guess the answers of the questions. Hence to eliminate this effect a
the four steps: correction formula is applied.
• Planning the test
• Preparing the test
S = R − W / (N − 1)
in which S is the score corrected for guessing,
R is the number of right responses, W is the number of wrong aspects: general and statistical. The first aspect means that it is relative
responses and N is the number of responses for each time. to some standards. The statistical aspect means that a test is said to be
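Reading the denominator as N − 1, the correction can be written as a
short function. The sketch below is an illustration only; the function
name corrected_score and the sample figures are hypothetical.

    # Illustrative only: correction for guessing, S = R - W / (N - 1),
    # with N the number of response options for each item.
    def corrected_score(right, wrong, options):
        return right - wrong / (options - 1)

    # 40 right and 12 wrong on five-option items.
    print(corrected_score(40, 12, 5))  # 40 - 12/4 = 37.0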
standardized if its standard deviation (SD) is not more than one third
Evaluating Test of the average, which means that the questions of the test are genuine
This stage decides which question should be retained in the test and and suitable for 68 per cent of the sample population.
which should not. For this purpose, an item-analysis of the test is made. Every test is standardized with regard to its material, method and
Note: For detailed study of this, students are advised to study results. To standardize its material the constructed test is administered
‘measurement and evaluation’. to a large number of students and each question or item is analysed
After this, reliability and validity of the whole test are found out. For in the same way as it was done in try-out process. Standardization
reliability purpose any of the following three methods may be adopted. means that in the second try-out necessary changes are made in the
• Repetition of the same test on different occasions and finding out instructions required for administering the test. The results of the test
the correlation between two sets of scores. This is called test-retest are standardized through the following methods.
method. (i) Mean and SD method
• Use of two parallel forms of the same test, i.e., two forms are (ii) Percentile method
prepared of the same test and correlation is found out between (iii) Age-basis method
their scores. The mean score of the whole population is calculated and the
• Split-half method: In this, test is split into two equal halves and position of each testee, then, can be calculated in terms of SD on the
correlation is found out between their scores. plus or minus side of the average. This is called SD method. In the
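In practice the split-half correlation is usually stepped up to the
full test length with the Spearman-Brown formula, although that formula
is not stated here. The sketch below assumes that correction and uses
hypothetical score lists; statistics.correlation requires Python 3.10
or later.

    # Illustrative only: split-half reliability with the Spearman-Brown
    # step-up from the half-test correlation to the full-length estimate.
    from statistics import correlation  # Python 3.10+

    def split_half_reliability(first_half, second_half):
        r_half = correlation(first_half, second_half)
        return 2 * r_half / (1 + r_half)

    odd_item_scores = [10, 14, 9, 16, 12]
    even_item_scores = [11, 13, 10, 15, 13]
    print(round(split_half_reliability(odd_item_scores, even_item_scores), 2))  # about 0.98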
Validity of the test is found by getting the correlation between the percentile method, percentiles are calculated instead of mean and SD.
scores of the test and some outside criterion such as school marks or If one students score is 62.3 and we want to know the relative position
teachers’ estimates of the students. in the population, we have to find which percentile values is 62.3 or
The time is also fixed for the test. Time taken by each test is near about this. If it is 60th percentile, it means that the student is better
noted and the average of all is calculated. This is the standard time than 60 of the whole group.
for the test. In this way the test is standardized. Following are the chief
What type of questions is framed in the test? For example, simple characteristics of a standardized test.
recall type, completion type, multiple-choice type, true and false type
or matching type. Characteristics of Standardized Test
A standardized test is more valid. It means that it measures only the
Standardized Test same thing for which it is meant. Unlike the traditional examinations in
Tests are of two types: teacher-made and standardized tests. There which speed of writing, guess, chance and linguistic ability are many
is a difference between these two. A teacher-made test is one that is factors that influence the scores of the student, no such factors remain
prepared by teachers for their pupils, but is not standardized. It goes operative and effective in the standardized test.
through the steps of planning, preparing, trying out and evaluation, They are very reliable. They give the same results on repeated
but stops before standardization. application. This is due to the fact that they include a large number of
Standardization of a test goes beyond the try-out process. At this questions fairly representative of the content.
stage the test is administered to a very large population of students, for They have a meaning and a value. Each testee’s score can be
example, 2000, 3000 or more. The concept of standardization has two easily compared with the average which is calculated taking about
2000 or 3000 students. If a boy’s score is 65.4 and the test says that This feature is especially clear in multiple-choice questions, although
this is a score of the pupil of 13+ who has an educational age of 15+, the recognition of the correct response choice is also the major response
then the boy will be considered to have had an educational quotient given in true-false, matching rearrangement and other variants. As
(EQ) of about 115 at the chronological age of 13+. EQ is calculated in the
same way as IQ by dividing the educational age by chronological age complex thinking processes, reasoning, evaluation of arguments and
and multiplying by 100. the application of knowledge to new situation. Moreover, in objective
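The calculation can be written out as a short function. The sketch
below uses the hypothetical name educational_quotient and the figures
from the example; an educational age of 15 at a chronological age of 13
gives an EQ of about 115.

    # Illustrative only: EQ = (educational age / chronological age) * 100.
    def educational_quotient(educational_age, chronological_age):
        return 100 * educational_age / chronological_age

    print(round(educational_quotient(15, 13)))  # about 115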
questions, each item requires much less of the examinee’s time than
5.3. NATURE OF ACHIEVEMENT TESTS does a typical essay question.
Surpassing all other types of standardized tests in sheer number, In summary, objective items have largely replaced essay questions
achievement tests are designed to measure the effects of a specific in standardized testing programs, not only because of time restriction
programme of instruction or training. They generally represent a terminal in test scoring, but also and more importantly because they provide
evaluation of the individual’s status on the completion of training. broader subject matter coverage, yield more reliable and valid scores
The emphasis in such tests is on what the individual can do at and are fairer to individual. Easy writing should be encouraged and
the time. They measure the effects of learning. The course-oriented developed primarily as an instructional procedure to foster clear, correct
achievement test covers narrowly defined technical skills or factual and effective communication in all content areas.
information. A test in English vocabulary or television maintenance
is an example of this category. Uses of Achievement Tests
The broadly oriented achievement test commonly used today is Many roles that objective-type achievement tests can play in the
to asses the attainment of major long-term educational goals. Here we educational process have long been recognized. Achievement test scores
find tests focusing on the understanding and application of scientific are used in deciding which grade a student is suitable for. They also
principles, the interpretation of literature and the appreciation of art; constitute an important feature of remedial teaching programme. In this
still broader in orientation are tests of basic cognitive skills that affect connection, they are useful both in the identification of students with
the individual’s performance in a wide variety of activities. They may special educational disabilities and in the measurement of progress in
be such as reading comprehension and arithmetic computation. the course of remedial work.
At the broadest level, we find achievement test designed to measure The periodic administration of achievement tests serves to facilitate
the effects of education on logical thinking, critical evaluation of learning. They reveal the weakness in part learning, give direction
conclusions problem-solving techniques and imagination. to subsequent learner and motivate the learner. They also provide a
means for adapting instruction to individual needs. Teaching can be
Essay-type Questions Versus Objective Questions most fruitful when it meets the learner at whatever stage they happen
The traditional school examination began as a set of questions to to be. Ascertaining what individuals are already able to do and what
be answered either orally or in writing. In either case, the examinee they already know about a subject is thus a necessary first step for
composed and formulated the response. The term ‘essay question’ has effective teaching.
to be used broadly to cover all free response questions, including not The growth of all testing programmes points to the increasing use
only those demanding a lengthy essay, but also those requiring the of test results as a basis for planning what is to be taught to a class as
examinee to produce a short answer or to work out the solution for a
mathematical problem. Objective questions, by contrast, call for the
choice of a correct answer out of several responses provided for each Further example of the role of achievement tests in the teaching
question. process can be found in connection with criterion-referenced testing,
individually tailored instructional systems, mastery learning and computer-aided learning procedures.

Finally, achievement tests may be employed in the evaluation and improvement of teaching and in the formulation of educational goals. They can provide information on the adequacy with which essential contents and skills are actually being taught. They can likewise indicate how much of the course content is retained and for how long. Moreover, public demands for educational accountability require proper use of well-constructed achievement tests to assess the results of the educational process.

Construction of Achievement Tests

The following are the main steps in the construction of achievement tests:
• Planning the test
• Administration of the test for pre-try-out
• Try-out testing for item analysis, which includes:
(i) Sample for try-out testing
(ii) Instructions to the testees
(iii) Time limit
(iv) Scoring
(v) Item analysis of data
(vi) Item selection for the final draft
(vii) Reliability and validity of the test
(viii) The final form of the test
(ix) Time limit for the final test

• Planning the test: The test constructor who plunges directly into item writing is likely to produce a lopsided test. Without an advance plan, some areas of the syllabus will be over-represented while others may remain untouched. A test constructed without a blue-print is likely to be overloaded with relatively impertinent and less important material. Many of the criticisms of objective tests stem from the common overemphasis of rote memory and trivial details in poorly constructed tests.

To guard against imbalances and disproportions in coverage of the syllabus, test specifications should be drawn up before items are prepared. For drawing up test specifications the test constructor should study two types of literature critically:
(i) literature relating to test construction, and
(ii) syllabi and university or board examination question papers in the subject areas, which help the test constructor to decide the weightage to be given to each independent topic in the syllabus.

Thus, the test constructor will prepare a blue-print by taking into consideration the relative importance of the content and also the amount of time spent in giving instruction on each category by the teachers. This blue-print will then be discussed with 20–25 teachers dealing with the subject in the various institutions.
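The blue-print, or table of specifications, described above is essentially a two-way grid of content topics against instructional objectives, with the number of items in each cell fixed in advance. The following is a minimal illustrative sketch in Python; the topic names, objective categories and item counts are hypothetical placeholders, not values taken from the text.

```python
# Illustrative blue-print (table of specifications); all topic names,
# objective categories and item counts below are hypothetical.
blueprint = {
    "Topic A": {"knowledge": 6, "understanding": 8, "application": 6},
    "Topic B": {"knowledge": 8, "understanding": 10, "application": 8},
    "Topic C": {"knowledge": 6, "understanding": 8, "application": 4},
    "Topic D": {"knowledge": 4, "understanding": 4, "application": 3},
}

total_items = sum(sum(cells.values()) for cells in blueprint.values())
for topic, cells in blueprint.items():
    n = sum(cells.values())
    # weightage of each topic as a percentage of the whole test
    print(f"{topic:8s} {n:3d} items  ({100 * n / total_items:4.1f}% weightage)")
print(f"Total    {total_items:3d} items")
```

Laying the plan out this way makes any over- or under-representation of a topic visible before a single item is written.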
• Administration of the test for pre-try-out: The number of items in the test draft should be nearly one and a half times or double the number required in the final test. Items with a wide range of difficulty should be constructed, and the instructions to be given to the testees should be framed. The typed draft of the test should be submitted to various teachers teaching the subject, the supervisor and experts with long experience of test construction, for frank opinion and criticism.

Many false assumptions, slips and oversights are corrected in this process. Then 50 cyclostyled or photocopied copies of the test are administered to 50 students of the class for which the test is to be constructed, and the answers are checked with the help of the scoring key. A few further modifications will come to light during this stage, called the pre-try-out stage. The modifications are then made, and the tests are printed and administered to a sample selected for try-out testing.

• Try-out testing for item analysis: The next step is to select a sample for try-out testing.

(i) Sample for try-out testing: A truly representative sample of 400 testees will be selected by following an appropriate technique of sampling. As the test constructor needs 371 scripts for the item analysis, about 400 scripts are taken to keep enough margin for discarding spoilt ones.

(ii) Instructions to the testees: The test constructor has to write comprehensive instructions to be printed on the title page of the try-out test. The instructions should be self-explanatory, yet how to answer matching or multiple-choice items may be explained orally with the help of examples taken from the daily life of the testees. The oral instructions may be as follows:
1. Do not discuss anything with your neighbours.
2. Do not make unnecessary haste to finish the test.
3. Please see that no item is left out. You have to answer all the items. At the end we shall check whether you have answered all the items.
4. Please go through the written instructions carefully before you start your work.

(iii) Time limit: There should be no time limit for taking the try-out test. The test is to be administered to all the testees of the sample of 400 students selected for try-out testing. The test is to be taken back from the testees only when all of them, except three or four, have completed it in the class or section.

(iv) Scoring: The scoring is to be done with the help of the scoring key prepared by the test constructor, on the basis of one mark for a correct answer and an outright zero for an incorrect one.

(v) Item analysis of data: After the try-out testing and scoring, the test constructor has to take 371 answer sheets, deleting the rest at random. After that:
• All the 371 answer sheets will be arranged in descending order, from the highest-scoring paper at the top to the lowest-scoring paper at the bottom.
• From the above pile, the upper 100 papers, which form the 'upper group', and the lower 100 papers, which form the 'lower group', will be taken. Thus, of the 371 trainees, the 27 per cent (100 scripts) making the highest scores constitute the superior group, and the 27 per cent (100 scripts) making the lowest scores comprise the inferior group. Only these two top and bottom piles are taken into consideration for computing the internal consistency discrimination index and the difficulty value; the middle 46 per cent of the papers will be kept aside.

After the formation of the two groups, the number of correct responses to an item in each group will be found and tabulated. These numbers naturally show the percentage of correct responses for each item in both the groups, as each group comprises 100 testees. The percentages are easily converted into proportions.

Difficulty Value

The average of the proportions of correct responses to an item in the two end groups is taken as an estimate of the difficulty value of that particular item. The formula for calculating the difficulty value (dv) of each item is

dv = (Pu + Pl) / 2

where dv = difficulty value of the item, Pu = proportion of correct responses to the item from the upper group, and Pl = proportion of correct responses to the item from the lower group.

5.4. INTERNAL CONSISTENCY

The relationship between the total scores derived from a test and the item scores is referred to as the internal consistency discrimination index of an item. The internal consistency discrimination index of each item will be found by reading the biserial coefficient of correlation between the item and the total score from J.C. Flanagan's abac. Flanagan's abac was designed for use when the middle 46 per cent of the examinees on total score have been eliminated and each tail contains 27 per cent.

The proportion passing the item in the upper criterion group is read from the ordinate and the proportion passing the item in the lower criterion group from the abscissa, and the value of the coefficient is read at the intersection of the perpendiculars.

After determining the difficulty value and the internal consistency discrimination index of each item as discussed above, a list will be drawn up for all items of the test showing the discrimination value and difficulty value of each item.
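The arithmetic of the item analysis described above can be sketched in a few lines of Python. The difficulty value follows the formula dv = (Pu + Pl) / 2; for the discrimination index the sketch reports the simple upper-lower difference Pu - Pl, a common stand-in for the biserial coefficient that the text reads from Flanagan's abac (it is not a reproduction of the abac). The data are hypothetical and NumPy is assumed.

```python
import numpy as np

def item_analysis(scores, tail=0.27):
    """Upper/lower-group item analysis of 0/1 item scores.

    scores : (n_examinees, n_items) array of item scores (1 = correct).
    Returns dv = (Pu + Pl) / 2 and the upper-lower discrimination Pu - Pl.
    """
    totals = scores.sum(axis=1)             # total score of each examinee
    order = np.argsort(totals)              # ascending order of total scores
    k = int(round(tail * len(totals)))      # size of each 27 per cent tail
    lower, upper = order[:k], order[-k:]

    p_u = scores[upper].mean(axis=0)        # proportion correct, upper group
    p_l = scores[lower].mean(axis=0)        # proportion correct, lower group
    return (p_u + p_l) / 2, p_u - p_l       # difficulty value, discrimination

# Hypothetical try-out data: 371 retained scripts, 60 items.
rng = np.random.default_rng(0)
scores = (rng.random((371, 60)) < 0.6).astype(int)
dv, disc = item_analysis(scores)
```

With 371 scripts, each 27 per cent tail works out to 100 papers, matching the figures used above; the resulting values can then be screened against the retention criteria discussed in the next subsection.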
Item Selection for Final Draft

The items for the final test will be selected on the basis of the following criteria:

• Internal consistency: With regard to the internal consistency discrimination index, or item validity, Garrett (1967) says, 'as a general rule, items with validity indices of 0.20 or more are regarded as satisfactory'. According to Thorndike (1949), 'an item with a validity coefficient as high as 0.25 usually represents an outstandingly valid item'. Keeping these in view, it may be decided to retain only items having an internal consistency of 0.25 and above. The higher the value, the better it is to retain the item.

• Difficulty value: It is desirable to select most of the items of medium difficulty and a few of higher and lower difficulty values. Lindeman (1971) writes, 'some easy items should be included in a test in order to encourage the students of low ability. Some difficult items should be included to challenge the abler students. However, in the interest of constructing a measuring instrument of maximum quality and utility, most items included should be in the middle range of difficulty'.

A bivariate scatter-diagram will be prepared for the test, placing each item in the appropriate column and row according to its difficulty value and discrimination index, respectively. Then, items will be selected keeping the above criteria of dv and rb in view.

Final Form of Test

After selecting items for the final test, re-arrange them in accordance with the principles laid down by experts. It is desirable that items should be re-arranged from easy to difficult in the final form, i.e., the easiest item at Serial No. 1 and the most difficult item as the last item. On the cover page of the test, the standardized instructions for the testees will be printed, as in the case of the try-out. The scoring key for the final test will also be prepared.

The time limit for the final test will be fixed after administering the test to a section of a class for which the test is developed. The time taken by 90 per cent of the students to complete the test will provide the time limit for the final draft.

5.5. RELIABILITY AND VALIDITY OF ACHIEVEMENT TEST

The reliability coefficient of the test may be found by the split-half (odd-even) method. It should be above 0.80; the higher the reliability coefficient, the better it is. The validity of achievement tests is often taken for granted, because they are constructed keeping in view the weightage of the different portions of the syllabi.
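The split-half (odd-even) estimate mentioned above can be sketched as follows. The correlation between odd-item and even-item half-scores is stepped up to full test length with the usual Spearman-Brown correction; that correction step is standard practice rather than something spelled out in the text, and the 0/1 score matrix is assumed to be the one obtained at the try-out stage.

```python
import numpy as np

def split_half_reliability(scores):
    """Odd-even split-half reliability with the Spearman-Brown step-up.

    scores : (n_examinees, n_items) array of 0/1 item scores.
    """
    odd_half = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_half, even_half)[0, 1]
    return 2 * r_half / (1 + r_half)          # Spearman-Brown correction

# rel = split_half_reliability(scores)
# A value below 0.80 would signal that the test needs revision.
```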
The view that the validity of achievement tests may be taken for granted is supported by Guilford (1954), who says that 'there are some measures whose validity is taken for granted; for example, an achievement test is formulated by analysis of curriculum and textbooks and by the pooled judgement of recognized authorities in the field. Under these circumstances a well-constructed test may constitute the best available measure of the criterion; in a sense the test itself defines the function it is to measure. Such tests may be described as self-defining'.

Content validation procedure is commonly used in evaluating achievement tests. This involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured. That domain must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. Therefore, content must be broadly defined to include the major objectives.

Among the various types of tests used in schools, achievement tests are the commonest. They propose to measure the present level of performance of individuals or groups in academic learning. They also propose to measure how much students have learnt as a result of instruction.

Achievement test scores are used in assigning grades to students. They are utilized for evaluating courses of study or the efficiency of teachers and teaching methods. Achievement tests may be standardized or non-standardized. Objective-type achievement tests are designed to measure the effects of education on logical thinking, critical evaluation of conclusions, problem-solving techniques and imagination.

In standardized achievement tests, objective items have largely replaced essay items because they provide broader subject-matter coverage, yield more reliable and more valid scores and are fairer to individuals.

In the construction of achievement tests the main steps are planning the test, administration of the test for pre-try-out, try-out, item analysis and item selection for the final draft. The reliability of achievement tests is found by the split-half (odd-even) method.

The reliability coefficient for an achievement test should be very high, i.e., above 0.80. Content validity is the main type of validity needed for achievement tests. Content validation involves the systematic coverage of a large content area of the syllabi in correct proportions. Content must be broadly defined to include the major objectives of instruction.

• Evaluate Yourself
(i) Enlist the tools of research in brief.
(ii) What are the characteristics of a standardized test?
(iii) Explain the reliability and validity of a test.
(iv) Evaluate a good questionnaire.

• Checklist: Meaning: It is a type of questionnaire in the form of a set of items, which the respondent is asked to check. Characteristics: This tool systematizes and facilitates the recording of observations and helps to ensure the consideration of all important aspects of the object or act observed.

It is an important tool in normative surveys, case histories, studies of behaviour and educational appraisal studies. The list of items in the checklist may be continuous or divided into groups of related items. It may be administered by post or otherwise. It records facts, not judgments.

• Construction of Checklist: The following points must be taken into consideration while constructing a checklist.
• An intensive survey of the literature should be made to determine the type of checklist to be used in an investigation.
• Terms used should be clearly defined.
• Items should be complete and relevant.
• Items should be arranged in a logical or psychological order.
• Related items should be grouped together.
• Checklists prepared and used for educational research by various investigators may be examined closely.
• Items should be arranged in such a way that they are discriminative in quality.

Four common styles of constructing checklists: Homer Kempfer suggests the following common styles of arrangement.

In the first arrangement, all items found in a situation are to be checked. For example, a subject may be asked to check (/) the blank beside each activity undertaken in a school:
• Games and sports
• NCC training
• ACC training
• Scouting
• Gardening
• Dramatics
• Musicals
• Debates

In the second form, the respondents are asked to check with a 'yes' or 'no', or are asked to encircle or underline the response to the items given.
Example:
1. Does your school have a house system? yes/no
2. Do you observe the open-shelf system in your school library? yes/no

In the third type, items are positive statements with checks (/) to be marked in a column on the right.
Example:
(i) One half of the students of this school are girls.
(ii) The school works as a community centre.

In the fourth style, items are embedded in sentences and the appropriate words can be checked, underlined or encircled.
Example:
(a) Staff meetings are held: fortnightly, monthly, quarterly, irregularly.
(b) The dramatic club meets for 90–119/120–149/150 and above minutes on 1-2-3-4-5-6-7 days per week.

Analysis and interpretation of checklist data: In the analysis and interpretation of checklist data, the same procedure as is followed for questionnaire responses holds good. It consists of counting frequencies, calculating percentages and averages, and computing means, medians and coefficients of correlation as and when needed. Keeping in view the limitations of the tool and of the respondents, conclusions should be arrived at carefully and judiciously.
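A minimal sketch, in Python, of the frequency and percentage tabulation just described; the two items echo the yes/no examples above, and the response data are hypothetical.

```python
from collections import Counter

# Hypothetical yes/no responses from ten schools to two checklist items.
responses = {
    "Does your school have a house system?":
        ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"],
    "Do you observe the open-shelf system in your school library?":
        ["no", "no", "yes", "no", "yes", "no", "yes", "no", "no", "yes"],
}

for item, answers in responses.items():
    freq = Counter(answers)                       # frequency of each response
    pct_yes = 100 * freq["yes"] / len(answers)    # percentage answering 'yes'
    print(f"{item}\n  yes: {freq['yes']}, no: {freq['no']}  ({pct_yes:.0f}% yes)")
```

The same counts feed directly into the percentages, averages and correlation coefficients mentioned above when deeper analysis is needed.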
• Inventories: These are sets of questions listed for the individual to answer; like questionnaires, they are paper and pencil tests. Personality inventories may be classified into four types:
• Those that assess specific traits (e.g., confidence).
• Those that evaluate adjustment to several aspects of the environment (e.g., home, school, community).
• Those that classify subjects into clinical groups (e.g., psychopathic personality).
• Those that screen subjects into two or three groups (e.g., psychosomatic disorder vs normal).

In personality inventories an effort is made to estimate the presence and strength of each specified trait through a number of items representing a variety of situations in which the individual's generalized mode of responding may be sampled.

Some outstanding personality inventories are as follows:

Bell adjustment inventory: It has two forms, one for adults and the other for students. It has 223 items and measures four categories: (1) home, (2) health, (3) social and (4) emotional adjustment, with 36 items each; reliability is 0.80–0.90.

Bernreuter personality inventory: Consists of 125 items; measures (1) neuroticism, (2) self-sufficiency, (3) extroversion, (4) dominance, (5) sociability and (6) lack of self-confidence. Reliability is 0.80–0.95; it is used with students of grades 9 to 16 and also with adults. Items: 'Do people ever come to you for advice?'

Minnesota Multiphasic Personality Inventory (MMPI): Consists of 550 items and is used with persons of 16 years or above. Every item or statement is printed on a separate card and the subject sorts the cards into three groups: true, false, cannot say. The items fall under 26 heads, e.g., family, general health, attitude, religion, phobias and delusions, and the items are grouped in traits. Its main use is clinical and diagnostic. Administration and scoring require training. It has high predictive validity, and there are nine clinical scales in the inventory. Items: 'I wish I could be as happy as others seem to be' (true/false/cannot say); 'I believe I am being plotted against.' The MMPI appeared in 1940 and its first manual in 1943. It is available in individual (card) and group (booklet) forms.

Allport and Allport: A-S Reaction Study for men and women. The study has 33 items for men and 34 items for women. Situations are presented verbally.

California test of personality: There are five levels: (1) primary, (2) elementary, (3) intermediate, (4) secondary and (5) adult; reliability is 0.80–0.94. Items: 'Do you find that a good many people are mean?'

• Evaluation of Personality Inventories
• Reliability is high, but validity is inadequate.
• Items are sometimes very ambiguous.
• We do not know any norm for ideal adjustment or behaviour.
• They have very low diagnostic value.
• They are useful in the study of group trends, in differentiating between groups of adjusted and maladjusted, rather than between individuals.

• Summary
• Research tools are of many kinds and employ distinctive ways of describing and quantifying the data.
• A research scholar should be familiar with the nature, merits and limitations of these tools and should also attempt to learn how to construct and use them properly.
• Among the various types of tests used in educational institutions, achievement tests are the commonest.
• Achievement tests may be standardized or non-standardized.
• In the construction of achievement tests the main steps are planning, administration for pre-try-out, try-out, item analysis and item selection for the final draft.
• The reliability of an achievement test is estimated by the split-half method.
• The construction and procedure of research tools depend on the situation and nature of the research.

• Keywords
• Achievement Test: A test designed to measure a person's knowledge, skills, and so on in a given area at a particular time.
• Discrimination Index: A measure of the ability of an item in a test to discriminate between students of high and low ability.
• Instruction: Means of providing knowledge, skills, and so on.
• Item: A single component in a test.
• Item Difficulty: The extent to which an item is easy or difficult for the group tested; in this chapter it is estimated as the average of the proportions of correct responses in the upper and lower groups.
• Halo: A tendency to rate in terms of general impressions about the ratee formed on the basis of some previous performances.
• Criterion: A characteristic or measurement with which other characteristics or measurements are compared.
• Grade: A label representing an evaluation.
• Generosity Error: A general and constant tendency of a rater to rate all ratees high.
• Contrast Error: Error due to the tendency of raters to rate others in the opposite direction from themselves in a trait.
• Rating: The judgement of one person by another.