
Test Construction, Writing and Validation

 Test construction means the final act of choosing the most appropriate items or questions that are to be included in a test.
Steps Involved in Test Construction
 Generally, for all psychological and educational test
construction, the following five steps are used:
1) Planning the test:
 The first task of a test constructor is to produce the outline of the desired test, that is, the plan of the test. For this purpose, the subject, medium, administration procedure, sample, population, and so on are established, and age, sex, educational qualification, mother tongue, rural/urban background, socio-economic status and other environmental factors must be considered. The particular mental or behavioural characteristics to be measured should be clearly stated before test construction is undertaken.
 Thus, a test cannot be constructed without a purpose. The test constructor sets the purpose of the test, which must be clear, relevant and in tune with the behaviour of the testees.
Practical Tips for Planning a Test
 Specify the objective of your test clearly.
 Specify the method of meeting these objectives.
 Specify the theoretical background involved in the use
and measurement of constructs.
 Review of the literature: What is the previous research on
the construct? If previous research exists then in what
way is your work an extension to earlier work?
 Give operational definition of the construct: This means
providing an objective and measurable definition of the
construct.

 Decide the population and the sample: This means
answering the question ‘what is the group on which you
will develop the test?’ For example, college students may
constitute the population, and 50 male students and 50
female students chosen from Kyambogo University
graduate courses may constitute the sample.
Characteristics of good/well-written items:
 Items should be situational in nature,
 Should be of moderate length,
 Should be of moderate difficulty and
 Should not use technical and culturally biased words
and phrases.
Examples of some poorly written items:
 I think leadership involves nurturance and quid pro quo:
a) Agree
b) Disagree
c) Often
d) Sometimes
 ‘A leader is a dealer in hope … merchant in faith …
manager of dreams … leader is a person who aligns
people along the contours of common sympathies for the
realization of shared goals.’
a) Yes
b) No
c) Cannot say.
Examples of some well-written items:
 I take an initiative and I am willing to take appropriate
levels of risk to achieve my goal:
a) Yes
b) No
c) Undecided
 I am proactive while dealing with people:
a) Yes
b) No
c) Undecided
2) Preparing the preliminary draft of the test:
 Selecting test items: Only after definite and organized planning of the test does the tester prepare its preliminary tryout form. The tester gathers items in several ways: constructing them on the basis of his or her own experience, selecting them from available standardized tests, or adapting them from other sources that represent the subject matter of the test. The collected items are arranged in an organized way on the basis of the objectives and the subject matter, and all types of items required for the test are assembled in a pre-tryout form.

NOTE1: Writing test items can be difficult. DeVellis (2016)


provided several simple guidelines for item writing. Here are
six of them:
1. Define clearly what you want to measure. To do this,
use substantive theory as a guide and try to make items as
specific as possible.
2. Generate an item pool. Theoretically, all items are
randomly chosen from a universe of item content. In
practice, however, care in selecting and developing
items is valuable. Avoid redundant items. In the initial
phases, you may want to write three or four items for each
one that will eventually be used on the test or scale.

3. Avoid exceptionally long items. Long items are often
confusing or misleading.
4. Keep the level of reading difficulty appropriate for
those who will complete the scale.
5. Avoid “double-barreled” items that convey two or
more ideas at the same time. For example, consider an
item that asks the respondent to agree or disagree with the
statement, “I vote Democratic because I support social
programs.” There are two different statements with which
the person could agree: “I vote Democratic” and “I support
social programs.”
6. Consider mixing positively and negatively worded items. Sometimes, respondents develop an "acquiescence response set": they tend to agree with most items regardless of content. To avoid this bias, you can include items that are worded in the opposite direction. For example, in asking about depression, you might include both "I felt depressed" and "I felt hopeful about the future".

NOTE 2: Times change, and tests can get outdated. When writing items, you need to be sensitive to ethnic and cultural differences.
 Number of test items: To make the test interesting for the testees, there should invariably be more than one item included in it. Therefore, the testers are confronted with the problem of selecting the test items and presenting them with appropriate response criteria. For example, the type of items to be included in the test should be determined, such as true/false, yes/no, recall, multiple-choice, comparative parallel type, and so on. The choice of item format affects the reliability of the resulting scores.
 Type of test items/format: Most testers emphasize the inclusion of a single type of item for the full test, because by doing so, the administration becomes easy. Including different types of items requires separate directions for each. The inclusion of only one type of item not only saves time but also makes the process easier. Although using various types of items complicates the test, different item types measure different types of abilities, and certain test constructors prefer to keep up the interest of testees by using various types of items that elicit different kinds of responses.
Examples of Test Item Types/Formats
 The Dichotomous Format: This offers two alternatives
for each item. Usually a point is given for the selection of
one of the alternatives. The most common example of
this format is the true-false examination. This test
presents testees with a series of statements. The testee’s
task is to determine which statements are true and
which are false.
 The advantages of true-false items include their obvious
simplicity, ease of administration, and quick scoring.
Another attractive feature is that true-false items
require absolute judgment. The test taker must declare
one of the two alternatives. However, there are also
disadvantages. For example, true-false items encourage
students to memorize material, making it possible for
students to perform well on a test that covers materials
they do not really understand. Furthermore, “truth” often
comes in shades of gray, and true-false tests do not allow
test takers the opportunity to show they understand this
complexity. Also, the mere chance of getting any item
correct is 50%. Thus, to be reliable, a true-false test
must include many items. Overall, dichotomous items
tend to be less reliable, and therefore less precise than
some of the other item formats.
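To see concretely why a short true-false test is unreliable, consider the probability that a pure guesser reaches a given score by chance alone. The following Python sketch computes this from the binomial distribution; the 70% passing cutoff is an assumption chosen purely for illustration:

```python
import math

def p_pass_by_chance(n_items: int, cutoff: float = 0.70, p: float = 0.5) -> float:
    """Probability that pure guessing scores at or above the cutoff proportion."""
    k_min = math.ceil(cutoff * n_items)
    return sum(math.comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(k_min, n_items + 1))

print(p_pass_by_chance(10))   # ~0.17: a guesser "passes" a 10-item test fairly often
print(p_pass_by_chance(100))  # ~0.00004: guessing almost never passes a 100-item test
```

Lengthening the test drives the chance of a spuriously high score toward zero, which is why dichotomous tests need many items to be reliable.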
 The polytomous format (sometimes called
polychotomous) resembles the dichotomous format
except that each item has more than two alternatives.
Typically, a point is given for the selection of one of the alternatives, and no point is given for selecting any other
choice. Because it is a popular method of measuring
academic performance in large classes, the multiple-
choice examination is the polytomous format you have
likely encountered most often. Multiple-choice tests are
easy to score, and the probability of obtaining a correct
response by chance is lower than it is for true-false items.
A major advantage of this format is that it takes little
time for test takers to respond to a particular item
because they do not have to write. Thus, the test can
cover a large amount of information in a relatively short
time. When taking a multiple-choice examination, you
must determine which of several alternatives is “correct.”
Incorrect choices are called distractors.
 It is worthwhile to consider some of the issues in the
construction and scoring of multiple-choice tests. First,
how many distractors should a test have? Psychometric
theory suggests that adding more distractors should
increase the reliability of the items. However, in practice,
adding distractors may not actually increase the
reliability because it is difficult to find good ones. The
reliability of an item is not enhanced by distractors that
no one would ever select. Studies have shown that it is
rare to find items for which more than three or four
distractors operate efficiently. Ineffective distractors
actually may hurt the reliability of the test because they
are time-consuming to read and can limit the number of
good items that can be included in a test. A review of the
problems associated with selecting distractors suggests
that it is usually best to develop three or four good
distractors for each item. Well-chosen distractors are an
essential ingredient of good items. Most multiple-choice
tests have followed the suggestion of four or five
alternatives.
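A practical way to check whether distractors "operate efficiently" is to tabulate how often each option is chosen; a distractor that nobody selects adds reading time without adding reliability. A minimal Python sketch, using hypothetical item responses:

```python
from collections import Counter

# Hypothetical responses of 12 test takers to one multiple-choice item;
# option "b" is keyed as correct, and "a", "c", "d" are distractors.
responses = ["b", "a", "b", "c", "b", "b", "a", "b", "c", "b", "b", "b"]

counts = Counter(responses)
for option in "abcd":
    share = counts.get(option, 0) / len(responses)
    print(f"option {option}: chosen by {share:.0%} of test takers")
# Option "d" is never chosen, so it is an ineffective distractor
# and a candidate for revision or removal.
```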
Common Problems in Multiple Choice Item Writing
 Unfocused Stem: The stem should include the
information necessary to answer the question. Test takers should not need to read the options to figure out
what question is being asked.
 Negative Stem: Whenever possible, the stem should exclude negative terms such as "not" and "except".
 Window Dressing: Information in the stem that is
irrelevant to the question or concept being assessed is
considered “window dressing” and should be avoided.
 Unequal Option Length: The correct answer and the
distractors should be about the same length.
 Negative Options: Whenever possible, response options should exclude negatives such as "not".
 Clues to the Correct Answer: Test writers sometimes
inadvertently provide clues by using vague terms such as
might, may, and can. Particularly in the social sciences
where certainty is rare, vague terms may signal that the
option is correct.
 Heterogeneous Options: The correct option and all of
the distractors should be in the same general category.

 The Likert Format: A scale using the Likert format consists of items such as "I am afraid of heights." Instead
of asking for a yes-no reply, five alternatives are offered:
strongly disagree, disagree, neutral, agree, and strongly
agree. In some applications, six options are used to avoid
allowing the respondent to be neutral. The six responses
might be strongly disagree, moderately disagree, mildly
disagree, mildly agree, moderately agree, and strongly
agree.
 Scoring requires that any negatively worded items be reverse-scored; the responses are then summed (a scoring sketch follows the example items below). This format is especially popular in measurements of attitude. For example, it allows researchers to determine how much people endorse statements such as "The government should not allow pregnant girls in schools."

 Some studies have demonstrated that the Likert format
is superior to methods such as the visual analogue scale
for measuring complex coping responses. Others have
challenged the appropriateness of using traditional
parametric statistics to analyze Likert responses because
the data are at the ordinal rather than at an interval level. Nevertheless, the Likert format is familiar and easy to
use. It is likely to remain popular in personality and
attitude tests.
 Examples of Likert Scale Items
The following is a list of statements. Please indicate your
most appropriate choice by circling your answer to the right
of the statement.
Five-choice format with a neutral point:

Some leaders can be trusted                          SD  D  N  A  SA
I am confident that I will achieve my life goals     SD  D  N  A  SA
I am comfortable talking to my parents about
personal problems                                    SD  D  N  A  SA

Six-choice format without a neutral point:

Some leaders can be trusted                          SD  MoD  MiD  MiA  MoA  SA
I am confident that I will achieve my life goals     SD  MoD  MiD  MiA  MoA  SA
I am comfortable talking to my parents about
personal problems                                    SD  MoD  MiD  MiA  MoA  SA
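The scoring rule mentioned above, reverse-score negatively worded items and then sum, can be sketched in Python as follows. The item names, the reverse-keyed set and the responses are all hypothetical:

```python
# Responses coded 1 (strongly disagree) to 5 (strongly agree)
# in the five-choice Likert format.
responses = {"item1": 4, "item2": 2, "item3": 5}
reverse_keyed = {"item2"}  # hypothetical: item2 is negatively worded
MAX_POINTS = 5

# A reverse-keyed response r becomes (MAX_POINTS + 1) - r, so 2 -> 4.
total = sum((MAX_POINTS + 1 - r) if item in reverse_keyed else r
            for item, r in responses.items())
print(total)  # 4 + 4 + 5 = 13
```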

 The Category Format: A technique that is similar to the Likert format but that uses an even greater number of choices is the category format. Most people are familiar
with 10-point rating systems because we are regularly
asked questions such as, “On a scale from 1 to 10, with
1 as the lowest and 10 as the highest, how would you
rate your new boyfriend in terms of attractiveness?”
Doctors often ask their patients to rate their pain on a
scale from 1 to 10, where 1 is little or no pain and 10 is
intolerable pain. A category scale need not have exactly
10 points; it can have either more or fewer categories.
Although the 10-point scale is common in psychological
research and everyday conversation, controversy exists
regarding when and how it should be used.

 Validation of test items: After collecting the items of the
test and determining its item types, it is essential for the
tester to evaluate the items for the preliminary tryout of
the test. First, the collected items should be sent to two or three specialists in the field for their views. Each item should have the same response format. In the case of multiple-choice items, the wrong responses (distractors) should also be considered. Besides the clarity of the wording, the usefulness of the items, the sufficiency of the test material, the form of the test and the arrangement of the items should also be reviewed. The specialists should be informed of the age, education and other important characteristics of the target group, so that improvements can be made on the basis of their suggestions.
 Instructions for testees and test administrators: After the preliminary tryout is amended, the tester should write the instructions separately for the testees and the test administrators. The tester should divide the instructions into two parts: (a) ordinary instructions covering the form of the test and a description of its objectives, and (b) detailed special instructions relating to the test, which should be clear and understandable. The usefulness of the instructions can be checked by having testees solve the practice items, as they will have to do in the final real test.

 Value assessment process (scoring): At the same stage, the tester should also devise the value-assessment (scoring) process. For this, he or she will have to determine what score (weighted score or standard score) should be given. Even if the responses are simple yes/no answers, the tester will have to make clear, with the help of a scoring stencil, what a response of 'yes' to, say, Item No. 1 would mean.
 Pilot study: After the construction of the test and its
pre-tryout form, an effort is made to evaluate the test for
its quality, validity and reliability, and to delete the
unnecessary items. Therefore, prior to the construction
of the final form, it is essential to test the pre-tryout
form. This is also known as pilot study. This is done for
the following purposes:
 By this check, the weak and erroneous items,
those with double meanings, uncertain items,
inadequate items, those with incomplete meaning,
very difficult and very simple items should be
deleted from the test.
 The test objectives should be reflected in all the
selected items in order to ensure the validity of
every individual item.
 To indicate the actual number of items included in
the final form of the test.
 To express or bring out the shortcomings of the
responses of the testee and the tester.
 To determine the inter-item correlations and, thus, prevent overlap in item content (see the sketch after this list).
 To arrange all the items of the test in sub-parts.
 To determine the test instructions, related
precautions and groups to be affected.
 To know the actual limit of the final form of the
test.

 To determine the value assessment of the test.
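As an illustration of the inter-item correlation check listed above, the following sketch computes the correlation matrix for a small set of hypothetical pilot responses using numpy; very high off-diagonal correlations flag items whose content overlaps:

```python
import numpy as np

# Hypothetical pilot data: rows are test takers, columns are items
# (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

# Pearson correlations between items (columns).
r = np.corrcoef(scores, rowvar=False)
print(np.round(r, 2))
# Items 1 and 4 correlate perfectly in this sample, suggesting redundant
# content; one of them could be dropped from the final form.
```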

3) Trying out the preliminary draft of the test:
 For the pre-tryout, the test is usually administered to 15 to 20 per cent of the total population, with the objective of finding out its main shortcomings and removing them. Thus, the test is administered to a small group, through which many aspects of the test are estimated and the main shortcomings are removed. However, this administration does not allow any individual item analysis. In this process of evaluation, first, the sample for which the test is being constructed is determined and, for its evaluation, the test is administered to a representative sample of that group. The time to be taken for the administration of the test is also determined. If the test has a parallel form, then the difference in the time taken to administer the two forms should be recorded. In the same way, the instructions of the test should be devised briefly and uniformly, so that they are easily and clearly understood by the testees. Besides the instructions given to the subjects, any insufficiency of the test and practice items should also be noted. At the same time, an easy and familiar scoring process should be used. If, under some circumstances, the possibility of guessing or the effect of situational factors exists, then the following correction formula should be adopted:

S = R − W / (N − 1)

Where:
• S = score corrected for guessing
• R = number of correct responses
• W = number of wrong responses
• N = number of response alternatives for each item

Note: Omitted responses are not included; they provide neither credit nor penalty.

Example: Suppose that your roommate randomly filled out the answer sheet to your psychology test. The test had 100 items, each with four choices. By chance, her expected score would be 25 correct. Let's assume that she got exactly that, though in practice the exact value may vary, because 25 is only the average random score. The expected score corrected for guessing would be:

S = R − W / (N − 1)
S = 25 − 75 / (4 − 1)
S = 25 − 75 / 3
S = 25 − 25 = 0
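The same correction can be packaged as a small function; this is a direct transcription of the formula above:

```python
def corrected_score(n_correct: int, n_wrong: int, n_choices: int) -> float:
    """Correction for guessing: S = R - W / (N - 1).

    Omitted items are excluded: they earn neither credit nor penalty.
    """
    return n_correct - n_wrong / (n_choices - 1)

# The roommate example: 100 four-choice items answered at random.
print(corrected_score(25, 75, 4))  # 0.0
```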

4) Evaluating the test:
 After the tryout, the test is evaluated. A test that is appropriate for measuring specific variables will provide the best results. The evaluation of the test is done on the following criteria:
 Item difficulty: For a test that measures achievement or ability, item difficulty is defined as the proportion of people who succeed on a particular test item. The higher the proportion of people who get the item correct, the easier the item. The extremes of the scale set obvious limits: an item that is answered correctly by 100% of the respondents offers little value because it does not discriminate among individuals, and the same is true of an item that no one answers correctly. Usually, items with a 50% difficulty level are considered appropriate. A test should not include items that are either so easy that they are correctly solved by all the group members or so difficult that they are solved by none. Hence, for the evaluation of the test, study of its difficulty level is the first step.
Steps used to calculate the optimal item difficulty (for four-choice items, chance performance is 25%):
 Step 1: Find half of the difference between 100% success and chance performance:
(100 − 25) / 2 = 75 / 2 = 37.5
 Step 2: Add this value to the probability of performing correctly by chance:
37.5 + 25 = 62.5
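The two steps generalize to any number of response options; a minimal sketch:

```python
def optimal_difficulty(n_choices: int) -> float:
    """Halfway point between chance performance and 100% success."""
    chance = 1.0 / n_choices
    return chance + (1.0 - chance) / 2

print(optimal_difficulty(4))  # 0.625 for four-choice items
print(optimal_difficulty(2))  # 0.75 for true-false items
```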
 Item discrimination: Usually, a test should include
those items which can differentiate/discriminate between
extreme groups such as the upper scoring and lower
scoring groups. Items with zero or negative index value
are not included. The views and criticism of those who
take the test can be considered. According to their
suggestions, the language of an item can be changed or
an item may be discarded.
Finding the item discrimination index
 Step 1. Identify a group of students who have done well on the test, for example, those in the 67th percentile and above. Also identify a group that has done poorly, for example, those in the 33rd percentile and below.
 Step 2. Find the proportion of students in the high group
and the proportion of students in the low group who got
each item correct.
 Step 3. For each item, subtract the proportion of correct
responses for the low group from the proportion of
correct responses for the high group. This gives the item
discrimination index (di).
Example:

Item   % correct, top third   % correct, bottom third   di = Pt − Pb
1      89                     34                         55
2      76                     36                         40
3      97                     45                         52
4      98                     95                          3
5      56                     74                        −18

 In this example, items 1, 2, and 3 appear to discriminate reasonably well. Item 4 does not discriminate well
because the level of success is high for both groups; it
must be too easy. Item 5 appears to be a bad item
because it is a “negative discriminator.” This sometimes
happens on multiple-choice examinations when
overprepared students find some reason to disqualify the
response keyed as correct.
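The three-step procedure maps directly onto a few lines of code; the proportions below are taken from the table above, expressed as decimals:

```python
def discrimination_index(p_top: float, p_bottom: float) -> float:
    """di = proportion correct in the top group minus the bottom group."""
    return p_top - p_bottom

# (proportion correct, top third; proportion correct, bottom third)
items = {1: (0.89, 0.34), 2: (0.76, 0.36), 3: (0.97, 0.45),
         4: (0.98, 0.95), 5: (0.56, 0.74)}
for item, (p_top, p_bottom) in items.items():
    print(item, round(discrimination_index(p_top, p_bottom), 2))
# Item 5 yields -0.18: a negative discriminator that should be revised.
```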

 Standardization: For evaluation, the test can be
compared with some other standardized test meant for
the same purpose.
 Reliability: The reliability of the test is also to be
determined. Low reliability indicates that the result of the
test cannot be relied upon. Different methods are
available to determine test reliability.
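One widely used internal-consistency method is Cronbach's alpha; the sketch below is just one of the "different methods" the text alludes to, shown with hypothetical pilot data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical pilot data: rows are test takers, columns are Likert items.
data = np.array([[4, 5, 4],
                 [2, 3, 2],
                 [5, 5, 4],
                 [3, 2, 3]])
print(round(cronbach_alpha(data), 2))  # ~0.91 for this tiny sample
```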

5) Construction of the final draft of the test:
 The final test construction starts after initially testing
and evaluating the test at different levels. Usually, the
final test includes those items which are valid and have
appropriate difficulty level. Instructions for the testees
are given properly and clearly, so that the test can be
used in a scientific manner. The limits (the range of
scores) and the scoring method are also determined. At
this stage, all the important aspects should be properly
organized because the reliability and validity of the test
depend on the final format of the test.

Test Item Analysis
 The effectiveness and usefulness of any test depends upon the qualities of the items that are included in it. The quality of the resulting test score rests on the validity and reliability of the items and on the intercorrelations between them. Hence, to make the test more effective, the test constructor should study, one by one, all the items which are to be included in it. This process is known as item analysis. In other words, in item analysis, every item of the test is studied individually to see what number or percentage of persons in a group actually attempted or correctly solved it.

Factors to Consider in Test Administration
 In the actual application of tests, we must consider many
other potential sources of error, including the testing
situation, tester characteristics, and test-taker
characteristics.
 The relationship between examiner and test taker.
Rapport creation during test administration enhances
performance of test takers on the test. This implies that a
positive relationship should be created during test
administration. Studies also demonstrate that familiarity with the test taker, and perhaps preexisting notions about the test taker's ability, can either positively or negatively bias test results. In most testing situations, examiners should be aware that their interaction with test takers can influence the results. They should also keep in mind that subtle cues given by the test administrator can affect the test taker's level of performance.

 The race of the tester. Because of concern about bias,
the effects of the tester’s race have generated
considerable attention. Some parents feel that their
children should not be tested by anyone except a member of their own race.
 Stereotype threat. Being evaluated by others can be
very threatening. Most people worry about how well they
will perform on tests. The situation may be made worse
for groups victimized by negative stereotyping. Test
takers may face a double threat. First, there is personal
concern about how one will be evaluated and whether
they will do well on the test. For people who come from
groups haunted by negative stereotypes, there may be a
second level of threat. As a member of a stereotyped
group, there may be extra pressure to disconfirm
inappropriate negative stereotypes. For example, some
people hold the inaccurate belief that women have less
mathematical aptitude than men. Studies have shown that women underperform on difficult mathematics tests, but not on easy tests. When men and women are
told they are taking a test that captures gender
differences in test performance, men score higher than
equally qualified women.
 Language of test takers. The amount of linguistic
demand can put non-English speakers at a
disadvantage. Even for tests that do not require verbal
responses, it is important to consider the extent to which
test instructions assume that the test taker understands
English. Some of the new standards concern testing
individuals with different linguistic backgrounds. The
standards emphasize that some tests are inappropriate
for people whose knowledge of the language is
questionable. For example, the validity and reliability of
tests for those who do not speak English are suspect.
Translating tests is difficult, and it cannot be assumed
that the validity and reliability of the translation are
comparable to the English version.

 Training of test administrators. Different assessment
procedures require different levels of training. Many
behavioral assessment procedures require training and
evaluation but not a formal degree or diploma.
Psychiatric diagnosis is sometimes obtained using the Structured Clinical Interview for DSM-5 (SCID). Typical
SCID users are licensed psychiatrists or psychologists
with additional training on the test. There are no
standardized protocols for training people to administer
complicated tests such as the Wechsler Adult Intelligence
Scale-Revised (WAIS-R).
 Expectancy effects. Beliefs held by people
administering and scoring tests might also get translated
into inaccurate test scores. A well-known line of research
in psychology has shown that data sometimes can be
affected by what an experimenter expects to find.
 Effects of reinforcing responses. Because
reinforcement affects behavior, testers should always
administer tests under controlled conditions. Several
studies have shown that reward can significantly affect
test performance. For example, incentives can help
improve performance on IQ tests for specific subgroups
of children. Many studies have shown that children will
work quite hard to obtain praise such as, “You are doing
well”. The potency of reinforcement requires that test
administrators exert strict control over the use of
feedback. Because different test takers give different
responses, one cannot ensure that the advantages
resulting from reinforcement will be the same for all
people. As a result, most test manuals and interviewer
guides insist that no feedback be given.
 Computer-Assisted Test Administration. Computer
technology affects many fields, including testing and test
administration. Today, virtually all educational
institutions and most households enjoy access to the
Internet. This easy access has caused test administration
on computers to blossom. Interactive testing involves the
presentation of test items on a computer terminal or
personal computer and the automatic recording of test
responses. The computer can also be programmed to
instruct the test taker and to provide instruction when parts of the testing procedure are not clear.
 Mode of administration. A variety of studies have
considered the difference between self-administered
measures and those that are administered by a tester or
a trained interviewer. Studies on health, for example,
have shown that measures administered by an
interviewer are more likely to show people in good health
than are measures that are self-completed. Another
study showed that measures administered via telephone
yielded higher health scores than those that required
people to fill out the questionnaires on their own. Most
studies show that computer administration leads to more
accurate results. Men, in particular, may be more likely
to offer random responses using the older paper-and-
pencil versions of tests Even though mode of
administration has only small effects in most situations,
it should be constant within any evaluation of patients
 Subject variables. A final variable that may be a serious
source of error is the state of the subject. Motivation and
anxiety can greatly affect test scores. For example, many
college students suffer from a serious debilitating
condition known as test anxiety. Such students often
have difficulty focusing attention on the test items and
are distracted by other thoughts such as, “I am not doing
well” or “I am running out of time”. It may seem obvious
that illness affects test scores. When you have a cold or
the flu, you might not perform as well as when you are
feeling well. Many variations in health status affect
performance in behavior and in thinking. Some
populations need special consideration. For example, the elderly may do better with individual testing sessions, even for tests that can be administered to groups.

a) Coding,
b) Analysis and
c) Interpretation

