Unit 3 - Psychological Testing
History of testing
Types of tests
Achievement tests- designed to measure what people have already learned (skills and knowledge). Not often used by psychologists.
Ability tests- ability testing focuses on the question of what people can do when they
are at their very best. In other words, ability tests are designed to measure capacity or
potential rather than actual achievement.
o Most of the tests of these sorts are called tests of intelligence or tests of
aptitude.
Personality tests- designed to measure characteristic patterns of behaviour such as attitudes, interests, etc.
Apgar test- baby’s first test conducted immediately after birth. It is a quick,
multivariate assessment of heart rate, respiration, muscle tone, reflex irritability, and
color. The total Apgar score (0-10) helps determine the need for any immediate
medical attention.
Psychological assessment-
o Measurement
Correct/incorrect item responses- tests (intelligence, aptitude, etc.).
Responses not scored as correct/incorrect- questionnaires, inventories.
o Non-measurement
Interviews, observations, etc.- the responses serve as samples of behaviour.
Unstructured questionnaires/checklists- likewise treated as behaviour samples.
Reliability
A good test should be highly reliable. This means that the test should give similar
results even though different testers administer it, different people score it, different
forms of the test are given, and the same person takes the test at two or more different
times.
It is the consistency of the test/measure.
In actual practice, psychological tests are never perfectly reliable. One reason is that
real, meaningful changes do occur in individuals over time.
The best intelligence tests usually yield reliability coefficients of .90 or higher. Almost all personality tests have lower reliability than this, partly because of the instability of the attitudes and feelings they are designed to measure.
To improve reliability, ensure that the test is administered and scored according to a truly standardized procedure.
As the length of a test increases, both its reliability and validity tend to increase.
Test-retest reliability- also called temporal consistency. The same test is administered more than once (at least twice) to the same sample after a period of time (at least 2-8 weeks). The correlation between the two sets of scores should have an r value greater than or equal to 0.30. Stability is tested.
o Time sampling error occurs here when too much or too little time is allowed between administrations.
o Practice effects can occur.
Parallel/Alternate form reliability- also known as equivalent form reliability. Different/revised versions of the same test are compared. It can be defined as the agreement between measures obtained by assessing the same phenomenon in the same sample group with more than one form of the test. The earlier form should already be standardized. It can be done with or without a time gap. The correlation between the two forms should have an r value greater than or equal to 0.30. Stability is tested.
o Content sampling error (the two forms may in effect be very different tests) and time sampling error can occur.
Inter-rater reliability- used for projective or open-ended tests. Also called inter-scorer reliability. A procedure for quantitatively determining the level of agreement between 2 or more observers or judges. Equivalence is assessed. At least 80% agreement should be present. Agreement is measured by Cohen's kappa.
Internal consistency (homogeneity)- assessed using the item-total correlation, split-half reliability, the Kuder-Richardson coefficient (equivalent to Cronbach's alpha but only for dichotomous items), the Spearman-Brown correction (which adjusts the estimate for test length) and Cronbach's/coefficient alpha (the average of all possible split-half reliabilities).
Split-half reliability/odd-even reliability- the test is divided into 2 parts and the two halves are correlated with each other. There are many ways of doing the split.
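A minimal computational sketch of these reliability indices, assuming NumPy, a 2-D examinees-by-items score matrix, and illustrative function names:
```python
import numpy as np

def split_half_spearman_brown(scores):
    """Split-half (odd-even) reliability with the Spearman-Brown
    correction, which adjusts the half-test correlation to full
    test length. scores: rows = examinees, columns = item scores."""
    scores = np.asarray(scores, dtype=float)
    odd = scores[:, 0::2].sum(axis=1)    # odd-numbered items
    even = scores[:, 1::2].sum(axis=1)   # even-numbered items
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)     # Spearman-Brown correction

def cronbach_alpha(scores):
    """Coefficient alpha, the average of all possible split-half
    reliabilities; for 0/1 (dichotomous) items this equals KR-20."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    p_obs = np.mean(a == b)  # observed agreement
    p_exp = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
    return (p_obs - p_exp) / (1 - p_exp)
```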
Validity
The test must really measure what it has been designed to measure.
Assessing the validity of any test requires careful selection of appropriate criterion
measures.
We are usually satisfied with validities of .50 or .60; validities of .30 or .40 are common.
One reason that validity coefficients are lower than reliability coefficients is that the
reliability of a test sets limits on how valid the test can be. A test that cannot give us
reliable scores from one testing to the next is not likely to show dependable
correlations with any validity-criterion measure either.
On the other hand, high reliability is no guarantee that a test is valid.
Face validity- most common, associated with the highest level of subjectivity. A superficial level of accuracy: the test appears, by the look of it, to measure what it claims to measure.
Construct validity- relates to the assessment of how suitable a measurement tool is for measuring the phenomenon being studied. The truest form of validity. The construct is formed either by observation or by theory, and it is this construct that we need to validate.
o Convergent validity- one theoretical construct is compared with another
related one on the same sample. Checks whether it measures the construct
only. Here 0.30 or more correlation should be present.
o Divergent validity/discriminant validity- one construct is compared with an unrelated construct in the same sample to check whether the questionnaire is measuring something else. No correlation should be present.
Content validity- whether the test correctly measures what you want to measure, covering all its aspects.
Sampling validity- similar to content validity. Ensures that the measure covers the full breadth of the research area.
Criterion-related validity- involves comparing test results with an outcome. The test meets external criteria that have already been set and also correlates with other measures of the same construct.
o Predictive validity- predicts a consequence/outcome; validity is checked more than once on the same population with a time gap.
o Concurrent validity- validity is checked more than once on the same population with no time gap.
Specificity- an index of how well a measure avoids picking out those who do not have the target condition (how few false positives there are). High specificity is desired in a diagnostic tool.
Sensitivity- an index of how well a measure picks out those patients who have the target condition (how few false negatives there are). High sensitivity is desired in a screening tool.
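A small sketch of how the two indices are computed from a 2 x 2 decision table (the counts here are made up for illustration):
```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN): proportion of true cases picked out
    (few false negatives). Specificity = TN / (TN + FP): proportion of
    non-cases correctly passed over (few false positives)."""
    return tp / (tp + fn), tn / (tn + fp)

# A tool that catches 90 of 100 true cases and clears 160 of 200 non-cases:
print(sensitivity_specificity(tp=90, fn=10, tn=160, fp=40))  # (0.9, 0.8)
```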
Norms
Norms are sets of scores obtained by representative groups of people for whom the test
is intended. The scores obtained by these groups provide a basis for interpreting any
individual’s score.
Assessing intelligence
Stanford Binet intelligence test- developed by Binet and Simon to identify mentally
retarded children in French schools.
o Lewis Terman of Stanford University produced an English-language version in 1916.
o Binet devised his test by age levels because he observed that mentally retarded children seemed to think like younger non-retarded children.
o Within these scales, the tasks at each level are those which average children of that age should find moderately difficult. Children are tested only at levels near their own age.
o For testing purposes, the highest level at which all items are passed by a given
child is that child’s basal age.
o Starting with that basal age, the tester adds additional credits for each item the
child passes until the child reaches a ceiling age- that is the lowest level at
which all items within the level are failed.
o Binet and Terman worked from a notion of intelligence as an overall ability
related to abstract reasoning and problem solving.
o Intelligence quotient- the MA/CA ratio, proposed by William Stern in 1912.
o IQ = (MA/CA) × 100
o A ratio IQ is not useful with adults because mental age does not increase in a
rapid orderly fashion after the middle teens.
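A one-line sketch of Stern's ratio IQ (the ages here are illustrative):
```python
def ratio_iq(mental_age, chronological_age):
    """Stern's ratio IQ: IQ = (MA / CA) * 100."""
    return mental_age / chronological_age * 100

# An 8-year-old performing at the level of an average 10-year-old:
print(ratio_iq(mental_age=10, chronological_age=8))  # 125.0
```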
Wechsler tests-
o David Wechsler developed a family of tests for people at various age levels. It includes the Wechsler Adult Intelligence Scale (WAIS), WAIS-R (1981), the Wechsler Preschool and Primary Scale of Intelligence (WPPSI, 1967), and the Wechsler Intelligence Scale for Children (WISC-R, 1974).
o The subtests can be grouped into two categories, verbal and performance.
o Wechsler devised the deviation IQ. It is a type of standard score- that is an IQ
expressed in standard deviation units.
o Wechsler's tests yield 3 different deviation IQs: one for the verbal subtests, another for the performance subtests, and a third full-scale IQ.
o Standard score z = (X − M)/SD, where X is the individual's score, M is the mean, and SD is the standard deviation.
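A sketch of the deviation IQ as a standard score rescaled to an IQ metric (assuming the conventional mean of 100 and SD of 15; the raw numbers are illustrative):
```python
def deviation_iq(x, mean, sd, iq_mean=100, iq_sd=15):
    """Standard score z = (X - M) / SD, re-expressed on an IQ scale."""
    z = (x - mean) / sd
    return iq_mean + iq_sd * z

# A raw score one SD above the norm group's mean maps to IQ 115:
print(deviation_iq(x=60, mean=50, sd=10))  # 115.0
```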
Process oriented assessment of intellectual development
o Ina Uzgiris and J. McV. Hunt (1975) developed a set of 6 developmental
scales intended to measure “progressive levels of cognitive organization” in
the first 2 years of life. It was based on Piaget’s theory.
o It was designed to capture 6 different processes of cognitive development, all
occurring within what Piaget labeled the sensorimotor stage.
o They did not standardize their scales because the focus is on where a given infant stands in a sequential process of development, not on how the infant compares with other babies of the same age.
Nonverbal tests (adults)- Raven's Progressive Matrices, Cattell's Culture Fair Intelligence Test (CCFIT). All are culturally fair.
Nonverbal tests for children- Coloured Progressive Matrices, NNAT (Naglieri Nonverbal Ability Test). All are culturally fair.
Performance tests- Koh's block design test, Alexander pass-along test, Bhatia's battery of performance tests, cube construction test.
For children: the Draw-a-Person test.
Personality assessment
Neuropsychological tests
Creativity test
Assess novel, original thinking and the capacity to find unusual or unexpected
solutions, especially for vaguely defined problems.
Torrance test of creative thinking (TTCT). It consists of 2 sections- verbal and figural.
Remote associates test (RAT) developed by Mednick and Mednick.
Test construction
Bean defines an item as "a single task or question that usually cannot be broken down into any smaller units."
Item writing- an item must have the following characteristics.
o There should be no ambiguity regarding its meaning.
o Should not be too easy or difficult.
o Should have discriminatory power, that is, it must clearly distinguish between those who possess a trait and those who do not.
o Should only measure the significant aspects of knowledge or understanding.
o Should not encourage guesswork.
o Should not be such that its meaning is dependent upon another item and/or it
can be answered by referring to another item.
Studies have revealed that usually 25-30 dichotomous items are needed to reach a reliability coefficient as high as 0.80, whereas 15-20 items are needed to reach the same reliability when multipoint items are used.
An item writer should always write almost twice the number of items to be retained
finally.
Items are of two types-
o Essay item or free answer item- the examinee relies upon memory and past associations and answers the question in his or her own words, in whatever manner he or she likes. Most appropriate for measuring higher mental processes. They are of two types.
Short answer type.
Long answer type or extended answer essay type.
o There are 2 general methods of scoring the responses of essay items-
Sorting method- the answer sheets of the examinees are first sorted into different groups according to the quality of the answers. The scorer then assigns the weightage to be given to each group, gives that weightage to each answer accordingly, and finally adds the weightages to constitute a total score. It is done quickly, and the chances of erratic marking are reduced considerably.
Point score method- well suited to score short answer types. A
grading key consisting of the correct answers and the points to be
assigned is prepared beforehand by the scorer.
o Objective item or new type of item- when there is only one fixed correct
answer. All objective items can be divided into two broad categories-
Supply type- when the examinee has to write down the correct answer
on his own. They are divided into two main categories.
Unstructured short answer type- given in a question form.
Completion item or fill in item- present in the form of an
incomplete sentence.
Selection type- the examinee has to select or identify the correct answer from a few given options. Nunnally refers to such items as identification items.
They are of different types-
Two-alternative item- two answers are provided from which
the examinee is required to select the one which he thinks is
correct.
Multiple choice item- the most common, effective and flexible of all objective item formats. Also known as the polytomous or polychotomous format.
Matching item- items on the left column are to be paired with
the items on the right column.
o Important methods of scoring objective test items-
Overlay key method- a cut-out key is prepared, that is, a window stencil which displays the answers to the items on each page.
Strip key method- the correct answers to the items are printed vertically on a strip of paper or cardboard.
Item analysis- a set of procedures applied to obtain indices of the truthfulness (or validity) of items. It demonstrates how effectively a given test item functions within the total test.
The main objectives of item analysis are-
o It provides an index of the difficulty value to each item.
o It indicates the discrimination value of each item. This is k/a item validity.
o Indicates effectiveness of the distractors in multiple choice questions. Also k/a
distractor analysis.
o Also indicates why a particular item in the test has not functioned effectively and how it might be modified so that its functional significance can be increased.
Power test- the examinee is allowed sufficient time to answer all items of the test. Thus, the emphasis here is upon measuring the ability (or power) of the examinee and not speed.
Item difficulty- the difficulty value of an item is the proportion or percentage of examinees who answer the item correctly.
o The maximum number of possible discriminations between examinees on any item is 50 × 50 = 2500 (i.e., with 100 examinees, 50 passing × 50 failing). This occurs when an item is answered correctly by 50% of the examinees and incorrectly by the other 50%.
o The proportion passing an item is inversely related to the difficulty of an item.
o The higher the proportion or percentage of getting the items right (higher the
index of difficulty), the easier the item and vice versa.
o Test items must have a normal distribution with respect to indices of
difficulty.
o Moderate difficulty indices are preferred because they signal the maximum
variance.
o As the index increases or decreases away from 0.5, the variance of the item gradually decreases, that is, its ability to make comparisons between those who pass and those who fail decreases.
o There are two important methods of determining the difficulty value of an
item-
Method of judgement- difficulty is determined on the basis of the judgement of experts.
Empirical method- also k/a statistical method.
For a dichotomous item, the index of difficulty can be determined by the formula p = R/N, where p = difficulty index, R = number of examinees who pass the item, and N = total number of examinees.
The index of difficulty can also be determined from extreme portions of the group of examinees: p = [R(U) + R(L)] / [N(U) + N(L)], where R(U) and R(L) are the numbers of examinees answering correctly in the upper and lower groups, and N(U) and N(L) are the numbers of examinees in the upper and lower groups.
Another way, when two equal extreme groups have been set up, is to average the two proportions: convert each group's count into a proportion (number correct divided by the group total) and then average the two.
o A test which consists of items having difficulty (D) values close to 0.5 is k/a a peaked test.
o The D value affects the reliability coefficient and also the item-total score correlation.
o The formula for correcting the total score for chance success is S = R − W/(K − 1), where S = corrected score, R = number of correct responses, W = number of incorrect responses, and K = number of response options in the item. (A computational sketch of these difficulty indices appears after this list.)
o Best index of the item intercorrelation is the phi-coefficient.
It is high when items are nearly equal in difficulty.
Such tests have high reliability.
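A minimal sketch of the difficulty indices above (function names and counts are illustrative):
```python
def difficulty_index(n_correct, n_total):
    """p = R / N: proportion of examinees answering the item correctly."""
    return n_correct / n_total

def difficulty_extreme_groups(r_u, n_u, r_l, n_l):
    """p = [R(U) + R(L)] / [N(U) + N(L)], using only the two extreme groups."""
    return (r_u + r_l) / (n_u + n_l)

def corrected_score(r_correct, w_wrong, k_options):
    """Correction for chance success: S = R - W / (K - 1)."""
    return r_correct - w_wrong / (k_options - 1)

# 40 of 60 upper-group and 20 of 60 lower-group examinees pass:
print(difficulty_extreme_groups(40, 60, 20, 60))               # 0.5
# 30 right and 10 wrong on 5-option items: S = 30 - 10/4 = 27.5
print(corrected_score(r_correct=30, w_wrong=10, k_options=5))
```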
Item discrimination- also known as the item validity index. The ability of an item to make a discrimination or distinction between superior and inferior examinees.
Discriminatory power or validity (V) may be defined as the extent to which success
and failure on the item indicate the possession of the trait or achievement being
measured.
o Positively discriminating items- the proportion or percentage of correct answers is higher in the upper group. Only such items are retained after item analysis.
o Negatively discriminating items- proportion or percentage of correct answers
is lower in the upper group.
o Nondiscrimination items- percentage or proportion of correct answers is
equal or approximately equal in both the upper and lower groups.
There are 2 common ways of determining the index of discrimination-
o Test of significance of the difference between two proportions or percentages- examinees are divided, preferably into 2 equal groups, on the basis of the total score; we use the upper 27% and the lower 27%. Critical ratios can be applied. If the difference turns out to be significant, the item is accepted.
Guilford (1954) has recommended the use of the chi-square test as a measure of the index of discrimination when there are equal numbers of examinees in both extreme groups: χ² = N[p(U) − p(L)]² / 4pq.
N = total number of examinees; p(U) and p(L) refer to the proportions of examinees passing in the upper and lower groups respectively; p is the arithmetic mean of p(U) and p(L); q = 1 − p.
Chi-square can only be used with large samples.
Net D index of discrimination- an unbiased index of the absolute difference in the number of discriminations made between the upper group and the lower group; it is proportional to the net discrimination made by the item between the two groups. Formula: V = R(U)/N(U) − R(L)/N(L), or V = [R(U) − R(L)]/N(U) when N(U) = N(L).
V = net D; R(U) and R(L) are the numbers of examinees giving correct answers in each group; N(U) and N(L) are the numbers of examinees in each group.
If V is negative, the item is dropped.
Values of V above 0.40 are thought to discriminate well. (A computational sketch of these discrimination indices appears after this list.)
o Correlation techniques-
Item-total correlation or internal consistency- each item is correlated
against the internal criterion of the total score. It tells how well the
item is measuring the function which the test itself is measuring.
Best index of discrimination.
For multipoint items, use the product-moment correlation; for two-alternative responses (dichotomous items), use biserial or point-biserial r; when the total score is also dichotomous, use tetrachoric r or the phi-coefficient.
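A sketch of the discrimination indices described above (the counts are illustrative; the point-biserial is computed as a Pearson correlation of 0/1 item scores with total scores):
```python
import numpy as np

def net_d(r_u, n_u, r_l, n_l):
    """Net D: V = R(U)/N(U) - R(L)/N(L). Negative items are dropped;
    values above 0.40 are thought to discriminate well."""
    return r_u / n_u - r_l / n_l

def guilford_chi_square(p_u, p_l, n_total):
    """Guilford's index for equal extreme groups:
    chi^2 = N * (p(U) - p(L))^2 / (4 * p * q), with p = mean of p(U), p(L)."""
    p = (p_u + p_l) / 2
    q = 1 - p
    return n_total * (p_u - p_l) ** 2 / (4 * p * q)

def item_total_correlation(item_scores, total_scores):
    """Point-biserial r between a dichotomous item and the total score."""
    return np.corrcoef(item_scores, total_scores)[0, 1]

# Upper group 45/60 correct, lower group 21/60 correct:
print(net_d(45, 60, 21, 60))                 # 0.40 -> acceptable item
print(guilford_chi_square(0.75, 0.35, 120))  # ~19.4 -> significant
```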
Distractor analysis- examines the effectiveness of the distractors or foils, that is, how well the incorrect options attract responses.
o To be called good, a distractor must be chosen by more examinees in the lower group.
o If a distractor is chosen by more examinees in the upper group, it is poor and should be rewritten or modified.
o Nonfunctional distractors- those which contribute nothing to the test.
o If the extreme-group method is not used, the formula for evaluating distractors is: expected number of persons choosing each distractor = (number of persons answering the item incorrectly) / (number of distractors).
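A sketch of this expected-count rule (the numbers are illustrative):
```python
def expected_distractor_count(n_incorrect, n_distractors):
    """Expected choosers per distractor = incorrect answers / distractors."""
    return n_incorrect / n_distractors

# 60 examinees miss a 5-option item (4 distractors): expect ~15 per distractor.
# A distractor chosen far less often is nonfunctional; one favoured by the
# upper group should be rewritten.
print(expected_distractor_count(n_incorrect=60, n_distractors=4))  # 15.0
```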
Speed test: tests that emphasize the speed of the examinee's responses.
o Index of difficulty- p = R/N(R), where p is the index of difficulty, R is the number of examinees who gave correct answers, and N(R) is the number of examinees who actually reached that item.
o The formula for the corrected index of difficulty of an item in a speed test is Pc = [R − W/(K − 1)] / (N − HR), where Pc = corrected index of difficulty, R = number of correct answers, W = number of incorrect answers, K = number of response options in the item, N = total number of examinees, and HR = number of examinees who could not reach the item within the time limit. (A sketch of both speed-test indices appears after this list.)
o Index of discrimination: items are selected not on the basis of item-total
correlation but on the basis of index of difficulty as well as upon ideal time
experimentally determined for the test.
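A sketch of the two speed-test difficulty indices (counts illustrative):
```python
def speed_difficulty(r_correct, n_reached):
    """p = R / N(R): difficulty among examinees who reached the item."""
    return r_correct / n_reached

def speed_corrected_difficulty(r_correct, w_wrong, k_options, n_total, hr):
    """Pc = [R - W/(K - 1)] / (N - HR), where HR is the number of
    examinees who never reached the item within the time limit."""
    return (r_correct - w_wrong / (k_options - 1)) / (n_total - hr)

# 100 examinees, 20 never reach the item; of the 80 who do, 56 are right
# and 24 wrong on a 4-option item:
print(speed_difficulty(56, 80))                        # 0.70
print(speed_corrected_difficulty(56, 24, 4, 100, 20))  # 0.60
```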
Factors affecting index of difficulty and index of discrimination-
o Learning experience or previous experience of examinee
o Complex or ambiguous answers
o The nature of response alternatives (multiple choice, alternatives)
Item characteristic curve (ICC) and item response theory-
o ICC- a graphic representation of the probability of giving the correct answer to an item as a function of the level of the attribute assessed by the test. Used to illustrate discriminating power and item difficulty.
The steepness or slope conveys information about the discriminating power of an item.
When the item-total correlation is positive, the slope of the ICC is positive, and vice versa.
When the item-total correlation is near zero, the slope is near zero (flat).
The position of the ICC curve gives an indication of the difficulty of each item.
Difficult items- rise on the right-hand side of the plot.
Easier items- rise on the left-hand side of the plot.
o Item response theory- concerned with understanding how individual differences in the attributes of examinees affect their behaviour when confronted with a specific item.
Also known as latent trait theory or item characteristic curve theory.
This theory states that the probability of a particular response to a test
item is a joint function of one or more characteristics of the individual
respondents and one or more characteristics of the test item.
According to the IRT each item on a test has an independent item
characteristic curve that describes the probability of getting each
particular item right or wrong, given a certain ability level of the
examinee.
It provides measures that are generally sample invariant.
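The notes do not specify a particular IRT model, but a common illustration is the two-parameter logistic ICC, where a controls the slope (discrimination) and b the position on the ability axis (difficulty):
```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL ICC: P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# An easy item (b = -1) rises on the left; a hard item (b = +1) on the right:
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(icc_2pl(theta, 1.5, -1.0), 2),
          round(icc_2pl(theta, 1.5, 1.0), 2))
```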
Attitude scales