Test Construction, Writing and Validation
Decide the population and the sample: This means
answering the question ‘what is the group on which you
will develop the test?’ For example, college students may
constitute the population, and 50 male students and 50
female students chosen from Kyambogo University
graduate courses may constitute the sample.
Characteristics of good/well-written items:
Items should be situational in nature,
Should be of moderate length,
Should be of moderate difficulty and
Should not use technical and culturally biased words
and phrases.
Examples of some poorly written items:
I think leadership involves nurturance and quid pro quo:
a) Agree
b) Disagree
c) Often
d) Sometimes
‘A leader is a dealer in hope … merchant in faith …
manager of dreams … leader is a person who aligns
people along the contours of common sympathies for the
realization of shared goals.’
a) Yes
b) No
c) Can not say.
Examples of some well-written items:
I take an initiative and I am willing to take appropriate
levels of risk to achieve my goal:
a) Yes
b) No
c) Undecided
I am proactive while dealing with people:
a) Yes
b) No
c) Undecided
2) Preparing the preliminary draft of the test,
selecting test items: Only after the test has been
definitely and systematically planned does the tester
prepare its preliminary (tryout) form. First, he/she
assembles items that represent the subject matter of the
test: some are selected from available standardized or
previously constructed tests, and others he/she writes on
the basis of experience. The collected items are then
arranged in an organized way according to the objectives
and the subject matter, and all the item types required
for the test are assembled into a pre-tryout form.
3. Avoid exceptionally long items. Long items are often
confusing or misleading.
4. Keep the level of reading difficulty appropriate for
those who will complete the scale.
5. Avoid “double-barreled” items that convey two or
more ideas at the same time. For example, consider an
item that asks the respondent to agree or disagree with the
statement, “I vote Democratic because I support social
programs.” There are two different statements with which
the person could agree: “I vote Democratic” and “I support
social programs.”
6. Consider mixing positively and negatively worded
items. Sometimes, respondents develop the “acquiescence
response set.” This means that the respondents will
tend to agree with most items. To avoid this bias, you can
include items that are worded in the opposite direction. For
example, a depression scale might pair "I felt depressed"
with the reverse-worded "I felt hopeful about the future".
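Before items are summed, a negatively worded item such as "I felt hopeful about the future" on a depression measure must be reverse-scored so that higher scores always mean the same thing. A minimal sketch in Python (the 1-5 scale range is an assumption for illustration):

```python
def reverse_score(score, scale_min=1, scale_max=5):
    """Flip a response on a reverse-worded item so that a higher
    number always indicates more of the measured trait."""
    return scale_max + scale_min - score

# A respondent who marks 5 ("strongly agree") on "I felt hopeful
# about the future" is expressing LOW depression, so the item
# contributes a score of 1 to the depression total.
print(reverse_score(5))  # 1
```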
Some test constructors prefer to include only one type of
item, because by doing so, the administration becomes easy.
For including different types of items, separate directions
have to be given. The inclusion of only one type of item
not only saves time but also makes the process easier.
Although the use of various types of items complicates the
test, different item types measure different kinds of
abilities, and certain test constructors prefer to keep up
interest by using various item types that elicit different
kinds of responses.
Examples of Test Item Types/Formats
The Dichotomous Format: This offers two alternatives
for each item. Usually a point is given for the selection of
one of the alternatives. The most common example of
this format is the true-false examination. This test
presents testees with a series of statements. The testee’s
task is to determine which statements are true and
which are false.
The advantages of true-false items include their obvious
simplicity, ease of administration, and quick scoring.
Another attractive feature is that true-false items
require absolute judgment. The test taker must declare
one of the two alternatives. However, there are also
disadvantages. For example, true-false items encourage
students to memorize material, making it possible for
students to perform well on a test that covers materials
they do not really understand. Furthermore, “truth” often
comes in shades of gray, and true-false tests do not allow
test takers the opportunity to show they understand this
complexity. Also, the mere chance of getting any item
correct is 50%. Thus, to be reliable, a true-false test
must include many items. Overall, dichotomous items
tend to be less reliable, and therefore less precise than
some of the other item formats.
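The 50% chance level is why a reliable true-false test must include many items: the probability of reaching a given score by guessing alone falls sharply as the test grows. A quick binomial sketch (the item counts are illustrative):

```python
from math import comb

def p_score_at_least(n_items, n_correct, p_chance=0.5):
    """Probability of getting at least n_correct items right
    by pure guessing, where each guess succeeds with p_chance."""
    return sum(comb(n_items, k) * p_chance**k * (1 - p_chance)**(n_items - k)
               for k in range(n_correct, n_items + 1))

# Chance of scoring 70% or better by guessing alone:
print(p_score_at_least(10, 7))    # ~0.17 on a 10-item test
print(p_score_at_least(100, 70))  # well under 0.001 on a 100-item test
```

Lengthening the test makes a lucky passing score vanishingly unlikely, which is one concrete sense in which more dichotomous items mean more reliability.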
The polytomous format (sometimes called
polychotomous) resembles the dichotomous format
except that each item has more than two alternatives.
Typically, a point is given for the selection of one of the
alternatives, and no point is given for selecting any other
choice. Because it is a popular method of measuring
academic performance in large classes, the multiple-
choice examination is the polytomous format you have
likely encountered most often. Multiple-choice tests are
easy to score, and the probability of obtaining a correct
response by chance is lower than it is for true-false items.
A major advantage of this format is that it takes little
time for test takers to respond to a particular item
because they do not have to write. Thus, the test can
cover a large amount of information in a relatively short
time. When taking a multiple-choice examination, you
must determine which of several alternatives is “correct.”
Incorrect choices are called distractors.
It is worthwhile to consider some of the issues in the
construction and scoring of multiple-choice tests. First,
how many distractors should a test have? Psychometric
theory suggests that adding more distractors should
increase the reliability of the items. However, in practice,
adding distractors may not actually increase the
reliability because it is difficult to find good ones. The
reliability of an item is not enhanced by distractors that
no one would ever select. Studies have shown that it is
rare to find items for which more than three or four
distractors operate efficiently. Ineffective distractors
actually may hurt the reliability of the test because they
are time-consuming to read and can limit the number of
good items that can be included in a test. A review of the
problems associated with selecting distractors suggests
that it is usually best to develop three or four good
distractors for each item. Well-chosen distractors are an
essential ingredient of good items. Most multiple-choice
tests have followed the suggestion of four or five
alternatives.
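Each alternative added lowers the probability of a correct guess, which is the theoretical reason extra distractors should raise precision. A minimal sketch of the chance success rate as options are added:

```python
def chance_success(n_options):
    """Probability of guessing an item correctly when every
    option is equally attractive to a test taker who does not know
    the answer."""
    return 1 / n_options

for k in (2, 3, 4, 5):
    print(k, chance_success(k))  # 2 -> 0.5, 3 -> 0.333..., 4 -> 0.25, 5 -> 0.2
```

In practice, as the text notes, distractors that no one would ever select do not reduce the effective chance rate, so the real guessing probability stays above 1/k unless the distractors are plausible.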
Common Problems in Multiple Choice Item Writing
Unfocused Stem: The stem should include the
information necessary to answer the question. Test
6
takers should not need to read the options to figure out
what question is being asked.
Negative Stem: Whenever possible, the stem should
exclude negative terms such as "not" and "except".
Window Dressing: Information in the stem that is
irrelevant to the question or concept being assessed is
considered “window dressing” and should be avoided.
Unequal Option Length: The correct answer and the
distractors should be about the same length.
Negative Options: Whenever possible, response options
should exclude negatives such as "not".
Clues to the Correct Answer: Test writers sometimes
inadvertently provide clues by using vague terms such as
might, may, and can. Particularly in the social sciences
where certainty is rare, vague terms may signal that the
option is correct.
Heterogeneous Options: The correct option and all of
the distractors should be in the same general category.
The Likert Format: In this popular format, respondents
indicate the degree to which they agree or disagree with an
attitude statement, for example from "strongly disagree" to
"strongly agree".
Some studies have demonstrated that the Likert format
is superior to methods such as the visual analogue scale
for measuring complex coping responses. Others have
challenged the appropriateness of using traditional
parametric statistics to analyze Likert responses because
the data are at the ordinal rather than the interval level.
Nevertheless, the Likert format is familiar and easy to
use. It is likely to remain popular in personality and
attitude tests.
Examples of Likert Scale Items
The following is a list of statements. Please indicate your
most appropriate choice by circling your answer to the right
of the statement.
Five-choice format with a neutral point:

Some leaders can be trusted                                     SD  D  N  A  SA
I am confident that I will achieve my life goals                SD  D  N  A  SA
I am comfortable talking to my parents about personal problems  SD  D  N  A  SA

Six-choice format without a neutral point:

Some leaders can be trusted                                     SD  MoD  MiD  MiA  MoA  SA
I am confident that I will achieve my life goals                SD  MoD  MiD  MiA  MoA  SA
I am comfortable talking to my parents about personal problems  SD  MoD  MiD  MiA  MoA  SA
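To score such a scale, the anchors are coded numerically and summed across items. A minimal sketch for the five-choice format (the 1-5 coding is conventional, not specified above):

```python
# Conventional numeric codes for the five-choice anchors
ANCHOR_VALUES = {"SD": 1, "D": 2, "N": 3, "A": 4, "SA": 5}

def likert_total(responses):
    """Sum the numeric codes of a respondent's circled anchors."""
    return sum(ANCHOR_VALUES[r] for r in responses)

# One respondent's answers to the three items above:
print(likert_total(["A", "SA", "N"]))  # 4 + 5 + 3 = 12
```

Any reverse-worded items would need their codes flipped before summing, as discussed earlier for negatively worded items.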
The Category Format: This format resembles the Likert
format but uses a greater number of choices. Most people
are familiar with 10-point rating systems because we are regularly
asked questions such as, “On a scale from 1 to 10, with
1 as the lowest and 10 as the highest, how would you
rate your new boyfriend in terms of attractiveness?”
Doctors often ask their patients to rate their pain on a
scale from 1 to 10, where 1 is little or no pain and 10 is
intolerable pain. A category scale need not have exactly
10 points; it can have either more or fewer categories.
Although the 10-point scale is common in psychological
research and everyday conversation, controversy exists
regarding when and how it should be used.
Value assessment process (scoring): At this stage, the
tester should also devise the scoring procedure. For this,
he/she will have to determine what kind of score (weighted
score or standard score) should be given. Even if the
responses are simply 'yes' or 'no', he/she will have to
make clear, with the help of a scoring stencil, what a
'yes' on, say, Item No. 1 would mean.
Pilot study: After the construction of the test and its
pre-tryout form, an effort is made to evaluate the test for
its quality, validity and reliability, and to delete the
unnecessary items. Therefore, prior to the construction
of the final form, it is essential to test the pre-tryout
form. This is also known as pilot study. This is done for
the following purposes:
By this check, the weak and erroneous items,
those with double meanings, uncertain items,
inadequate items, those with incomplete meaning,
very difficult and very simple items should be
deleted from the test.
The test objectives should be reflected in all the
selected items in order to ensure the validity of
every individual item.
To indicate the actual number of items included in
the final form of the test.
To express or bring out the shortcomings of the
responses of the testee and the tester.
To determine the inter-item correlations and, thus,
prevent overlap in item content.
To arrange all the items of the test in sub-parts.
To determine the test instructions, related
precautions and groups to be affected.
To know the actual limit of the final form of the
test.
To determine the value assessment of the test, a
correction for guessing may be applied:

S = R − W/(N − 1)

Where:
• S = score corrected for guessing
• R = number of correct responses
• W = number of wrong responses
• N = total number of response alternatives available per item
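These variables define the classical correction-for-guessing formula, S = R − W/(N − 1), which penalizes each wrong answer by the expected gain from blind guessing. A minimal sketch:

```python
def corrected_score(r, w, n_options):
    """Score corrected for guessing: correct answers minus a
    penalty of w / (n_options - 1) for the wrong ones."""
    return r - w / (n_options - 1)

# A pure guesser on 40 four-option items expects 10 right and
# 30 wrong; the correction drives the expected score to zero:
print(corrected_score(10, 30, 4))  # 10 - 30/3 = 0.0
```

The design intent is that someone who answers entirely at random gains nothing, while a test taker with real knowledge keeps most of the credit.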
Item difficulty: The items of a test should be neither so
easy that they are correctly solved by all the group
members nor so difficult that they are solved by none.
Hence, for the evaluation of the test, study of its
difficulty level is the first step.
Steps used to calculate the optimal item difficulty (for a
four-option item, chance performance is 25%) include:
Step 1: Find half of the difference between 100% success
and chance performance:
(100 − 25)/2 = 75/2 = 37.5
Step 2: Add this value to the probability of performing
correctly by chance:
37.5 + 25 = 62.5
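The two steps generalize to any number of response options: the optimal difficulty sits halfway between the chance level and 100%. A sketch:

```python
def optimal_difficulty(n_options):
    """Halfway point (in percent correct) between chance
    performance and perfect performance."""
    chance = 100 / n_options
    return (100 - chance) / 2 + chance

print(optimal_difficulty(4))  # four options: (100 - 25)/2 + 25 = 62.5
print(optimal_difficulty(2))  # true-false:   (100 - 50)/2 + 50 = 75.0
```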
Item discrimination: Usually, a test should include
those items which can differentiate/discriminate between
extreme groups such as the upper scoring and lower
scoring groups. Items with zero or negative index value
are not included. The views and criticism of those who
take the test can be considered. According to their
suggestions, the language of an item can be changed or
an item may be discarded.
Finding the item discrimination index
Step 1. Identify a group of students who have done well
on the test, for example, those in the 67th percentile and
above. Also identify a group that has done poorly, for
example, those in the 33rd percentile and below.
Step 2. Find the proportion of students in the high group
and the proportion of students in the low group who got
each item correct.
Step 3. For each item, subtract the proportion of correct
responses for the low group from the proportion of
correct responses for the high group. This gives the item
discrimination index (di).
Example:
Item   % correct, top third (Pt)   % correct, bottom third (Pb)   di = Pt − Pb
1                 89                           34                       55
2                 76                           36                       40
3                 97                           45                       52
4                 98                           95                        3
5                 56                           74                      -18
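The three steps reduce to one subtraction per item. A sketch using the example data above (percentages as given):

```python
def discrimination_index(p_top, p_bottom):
    """di = proportion correct in the high-scoring group minus
    proportion correct in the low-scoring group."""
    return p_top - p_bottom

# (top-third % correct, bottom-third % correct) for items 1-5 above
items = [(89, 34), (76, 36), (97, 45), (98, 95), (56, 74)]
d = [discrimination_index(t, b) for t, b in items]
print(d)  # [55, 40, 52, 3, -18]

# Items with a zero or negative index (here item 5) would be
# discarded, since the low group outperformed the high group.
flagged = [i + 1 for i, di in enumerate(d) if di <= 0]
print(flagged)  # [5]
```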
Preparing the final draft of the test: At
this stage, all the important aspects should be properly
organized because the reliability and validity of the test
depend on the final format of the test.
The race of the tester. Because of concern about bias,
the effects of the tester’s race have generated
considerable attention. Some parents feel that their
children should not be tested by anyone except a
member of their own race.
Stereotype threat. Being evaluated by others can be
very threatening. Most people worry about how well they
will perform on tests. The situation may be made worse
for groups victimized by negative stereotyping. Test
takers may face a double threat. First, there is personal
concern about how one will be evaluated and whether
they will do well on the test. For people who come from
groups haunted by negative stereotypes, there may be a
second level of threat. As a member of a stereotyped
group, there may be extra pressure to disconfirm
inappropriate negative stereotypes. For example, some
people hold the inaccurate belief that women have less
mathematical aptitude than men. Studies have shown
that women underperform on difficult mathematics
tests, but not on easy ones. When men and women are
told they are taking a test that captures gender
differences in test performance, men score higher than
equally qualified women.
Language of test takers. The amount of linguistic
demand can put non-English speakers at a
disadvantage. Even for tests that do not require verbal
responses, it is important to consider the extent to which
test instructions assume that the test taker understands
English. Some of the new standards concern testing
individuals with different linguistic backgrounds. The
standards emphasize that some tests are inappropriate
for people whose knowledge of the language is
questionable. For example, the validity and reliability of
tests for those who do not speak English are suspect.
Translating tests is difficult, and it cannot be assumed
that the validity and reliability of the translation are
comparable to the English version.
Training of test administrators. Different assessment
procedures require different levels of training. Many
behavioral assessment procedures require training and
evaluation but not a formal degree or diploma.
Psychiatric diagnosis is sometimes obtained using the
Structured Clinical Interview for DSM-V (SCID). Typical
SCID users are licensed psychiatrists or psychologists
with additional training on the test. There are no
standardized protocols for training people to administer
complicated tests such as the Wechsler Adult Intelligence
Scale-Revised (WAIS-R).
Expectancy effects. Beliefs held by people
administering and scoring tests might also get translated
into inaccurate test scores. A well-known line of research
in psychology has shown that data sometimes can be
affected by what an experimenter expects to find.
Effects of reinforcing responses. Because
reinforcement affects behavior, testers should always
administer tests under controlled conditions. Several
studies have shown that reward can significantly affect
test performance. For example, incentives can help
improve performance on IQ tests for specific subgroups
of children. Many studies have shown that children will
work quite hard to obtain praise such as, “You are doing
well”. The potency of reinforcement requires that test
administrators exert strict control over the use of
feedback. Because different test takers give different
responses, one cannot ensure that the advantages
resulting from reinforcement will be the same for all
people. As a result, most test manuals and interviewer
guides insist that no feedback be given.
Computer-Assisted Test Administration. Computer
technology affects many fields, including testing and test
administration. Today, virtually all educational
institutions and most households enjoy access to the
Internet. This easy access has caused test administration
on computers to blossom. Interactive testing involves the
17
presentation of test items on a computer terminal or
personal computer and the automatic recording of test
responses. The computer can also be programmed to
instruct the test taker and to provide instruction when
parts of the testing procedure are not clear.
Mode of administration. A variety of studies have
considered the difference between self-administered
measures and those that are administered by a tester or
a trained interviewer. Studies on health, for example,
have shown that measures administered by an
interviewer are more likely to show people in good health
than are measures that are self-completed. Another
study showed that measures administered via telephone
yielded higher health scores than those that required
people to fill out the questionnaires on their own. Most
studies show that computer administration leads to more
accurate results. Men, in particular, may be more likely
to offer random responses using the older paper-and-
pencil versions of tests. Even though mode of
administration has only small effects in most situations,
it should be constant within any evaluation of patients.
Subject variables. A final variable that may be a serious
source of error is the state of the subject. Motivation and
anxiety can greatly affect test scores. For example, many
college students suffer from a serious debilitating
condition known as test anxiety. Such students often
have difficulty focusing attention on the test items and
are distracted by other thoughts such as, “I am not doing
well” or “I am running out of time”. It may seem obvious
that illness affects test scores. When you have a cold or
the flu, you might not perform as well as when you are
feeling well. Many variations in health status affect
performance in behavior and in thinking. Some
populations need special consideration. For example, the
elderly may do better with individual testing sessions,
even for tests that can be administered to groups.
a) Coding,
b) Analysis and
c) Interpretation