Unit 3: Psychological Testing

 A psychological test is a structured technique used to generate a carefully selected sample of behavior. This sample is used in turn to make inferences about the psychological attributes of the person tested, such as intelligence or self-esteem.
 Some tests involve open-ended situations with standard stimuli (such as a set of pictures); these are often used to elicit individualistic responses.
 Other tests involve very structured situations in which the range of possible responses
is narrow, and the answers are either right or wrong.
 Tests are standard ways of generating samples of people’s behavior. Their special
value lies in the fact that they are-
o Uniform- the procedures are specified precisely so that different testers will
follow the same steps every time they administer the test. This means that the
test performances of different people (or of the same person tested at different
times) can be compared directly.
o Objective- the rules for scoring are spelled out, like the rules for test
administration. Thus, the subjective input of the individual tester is minimized,
and personal biases are kept under control.
o Interpretable- makes test scores meaningful to the psychologist.

History of testing

 Galton set up a psychometric laboratory in London at the International Health Exhibition in 1884.
 It was later transferred to the London Museum, where it was maintained for 6 years.
 Various anthropometric and psychometric measures were arranged on a long table at one side of a narrow room.
 James McKeen Cattell studied the new experimental psychology with both Wundt and Galton before settling at Columbia University, where, for 26 years, he was the undisputed dean of American psychology.
 With Wundt, he did a series of painstakingly elaborate RT studies, measuring with
great precision the fractions of a second presumably required for different mental
reactions.
 Cattell (1890) coined the term mental test in his famous paper entitled "Mental Tests and Measurements."
 Binet invented the first modern intelligence test in 1905.

Types of tests

 Achievement tests- designed to measure what people have learned-skills. Not often
used by psychologists.
 Ability tests- ability testing focuses on the question of what people can do when they
are at their very best. In other words, ability tests are designed to measure capacity or
potential rather than actual achievement.
o Most of the tests of these sorts are called tests of intelligence or tests of
aptitude.
 Personality tests- they are designed to measure characteristic patterns such as
attitudes, interests etc.
 Apgar test- baby’s first test conducted immediately after birth. It is a quick,
multivariate assessment of heart rate, respiration, muscle tone, reflex irritability, and
color. The total Apgar score (0-10) helps determine the need for any immediate
medical attention.
 Psychological assessment-
o Measurement
 Correct/incorrect item responses – tests (intelligence, aptitude etc.).
 Not using correct/ incorrect responses- questionnaires, inventories.
o Non measurement
 Interviews, observations etc.- sample test
 Unstructured other questionnaires/checklists- sample test

Reliability

 A good test should be highly reliable. This means that the test should give similar results even when different testers administer it, different people score it, different forms of the test are given, or the same person takes the test at two or more different times.
 It is the consistency of the test/measure.
 In actual practice, psychological tests are never perfectly reliable. One reason is that
real, meaningful changes do occur in individuals over time.
 The best intelligence tests usually yield reliability coefficients of .90 or higher. Almost all personality tests have lower reliability than this; this may be due partly to the instability of the things they are designed to measure, such as attitudes and feelings.
 To improve reliability, ensure that the test was administered and scored by a truly
standard procedure.
 As the length of a test increases, both reliability and validity tend to increase.
 Test-retest reliability- also called temporal consistency. The same test is administered at least twice to the same sample after a period of time (at least 2-8 weeks). The correlation between the two administrations should have an r value of 0.30 or greater. Stability is tested.
o Time-sampling error can occur here, when too much or too little of a time gap is given.
o Practice effects can also occur.
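The test-retest coefficient is simply a Pearson correlation between the two administrations. A minimal sketch, with made-up scores for six examinees:

```python
# Illustrative sketch: test-retest reliability as the Pearson correlation
# between two administrations of the same test (scores are hypothetical).
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

time1 = [12, 15, 9, 20, 17, 11]   # first administration
time2 = [13, 14, 10, 19, 18, 12]  # same people, a few weeks later
r = pearson_r(time1, time2)
print(round(r, 3))  # 0.975 for this made-up data: highly stable
```

People who scored high the first time score high the second time, so the coefficient is near 1; a coefficient near 0 would indicate poor stability.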
 Parallel/Alternate-form reliability- also known as equivalent-form reliability. Different or revised versions of the same test are compared. It can be defined as a measure obtained by assessing the same phenomenon in the same sample group via more than one form of the test. The earlier form should already be standardized. It is done with or without a time gap. The correlation between the two forms should have an r value of 0.30 or greater. Stability is tested.
o Content sampling error (the risk of correlating two effectively different tests) and time-sampling error can occur.
 Inter-rater reliability- used in projective or open-ended tests. Also called inter-scorer reliability. A process for quantitatively determining the level of agreement between 2 or more observers or judges. Equivalence is assessed. At least 80% agreement should be present. Agreement is measured by Cohen's kappa.
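Cohen's kappa corrects raw percentage agreement for the agreement expected by chance. A sketch with hypothetical pass/fail ratings from two scorers:

```python
# Sketch of Cohen's kappa for two raters (codes are hypothetical).
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each rater's marginals.
def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_e = sum((rater_a.count(c) / n) * (rater_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # 0.75: raw agreement is 7/8, chance is 0.5
```

Note that the two raters agree on 7 of 8 cases (87.5%), but kappa is lower because half that agreement could have arisen by chance.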
 Internal consistency (homogeneity)- assessed using item-total correlation, split-half reliability, the Kuder-Richardson coefficient (equivalent to Cronbach's alpha but only for dichotomous items), the Spearman-Brown correction (which adjusts for test length), and Cronbach's/coefficient alpha (the average of all possible split-half reliabilities).
 Split-half reliability/odd-even reliability- the test is divided into 2 parts, which are then correlated with each other. There are many ways of doing the split.
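The odd-even split can be sketched as follows; the Spearman-Brown formula r_full = 2r / (1 + r) corrects the half-test correlation back up to full test length. Item responses here are invented:

```python
# Sketch: odd-even split-half reliability with the Spearman-Brown
# correction, on made-up dichotomous (0/1) item data.
from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# rows = examinees, columns = items scored 0/1 (hypothetical)
scores = [
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
]
odd  = [sum(row[0::2]) for row in scores]  # items 1, 3, 5, 7
even = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, 8
r_half = pearson_r(odd, even)
r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown: double the length
print(round(r_half, 3), round(r_full, 3))  # roughly 0.748 0.856
```

The corrected coefficient is higher than the half-test correlation, which illustrates the note above that reliability rises with test length.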

Validity

 The test must really measure what it has been designed to measure.
 Assessing the validity of any test requires careful selection of appropriate criterion
measures.
 We are usually satisfied with validities of .50 or .60; validities of .30 and .40 are common.
 One reason that validity coefficients are lower than reliability coefficients is that the
reliability of a test sets limits on how valid the test can be. A test that cannot give us
reliable scores from one testing to the next is not likely to show dependable
correlations with any validity-criterion measure either.
 On the other hand, high reliability is no guarantee that a test is valid.
 Face validity- the most common form, associated with the highest level of subjectivity. A superficial level of accuracy: the test appears, on inspection, to measure what it claims to measure.
 Construct validity- relates to the suitability of the measurement tool for measuring the phenomenon being studied. The truest form of validity. The construct is formed either by observation or from theory, and it is this construct that we need to validate.
o Convergent validity- one theoretical construct is compared with another related one on the same sample, to check that the test measures that construct. A correlation of 0.30 or more should be present.
o Divergent/discriminant validity- one construct is compared with an unrelated construct on the same sample to check whether the questionnaire is measuring something else. No correlation should be present.
 Content validity- whether what you want to measure is measured correctly by the test, including all of its aspects.
 Sampling validity- similar to content validity. Ensures that the measure's coverage within the research area is broad.
 Criterion-related validity- involves comparing test results with an outcome. The test meets external criteria that are already set and also correlates with other measures of the same construct.
o Predictive validity- predicts a consequence/outcome; validity is checked more than once on the same population, with a time gap.
o Concurrent validity- validity is checked more than once on the same population, with no time gap.
 Specificity- an index of how well the measure avoids picking out those who do not have the target condition (how few false positives there are). High specificity is desired in a diagnostic tool.
 Sensitivity- an index of how well the measure picks out those patients who have the target condition (how few false negatives there are). High sensitivity is desired in a screening tool.
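In confusion-matrix terms, the two indices can be sketched as follows (the counts are hypothetical):

```python
# Sketch: sensitivity and specificity from 2x2 confusion counts.
# Sensitivity = TP/(TP+FN): few false negatives -> good for screening.
# Specificity = TN/(TN+FP): few false positives -> good for diagnosis.
tp, fn = 45, 5    # people with the condition: correctly / wrongly classified
tn, fp = 90, 10   # people without it: correctly / wrongly classified

sensitivity = tp / (tp + fn)   # 45/50 = 0.9
specificity = tn / (tn + fp)   # 90/100 = 0.9
print(sensitivity, specificity)  # 0.9 0.9
```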

Norms

 Norms are set of scores obtained by representative groups of people for whom the test
is intended. The scores obtained by these groups provide a basis for interpreting any
individual’s score.

Assessing intelligence

 Stanford-Binet intelligence test- developed by Binet and Simon to identify mentally retarded children in French schools.
o Lewis Terman of Stanford University produced the English-language version in 1916.
o Binet organized his test by age levels, because he observed that mentally retarded students seemed to think like non-retarded children of a younger age.
o Within these scales, the tasks at each level are those which average children of that age find moderately difficult. Children are tested only at levels near their own age.
o For testing purposes, the highest level at which all items are passed by a given
child is that child’s basal age.
o Starting with that basal age, the tester adds additional credits for each item the
child passes until the child reaches a ceiling age- that is the lowest level at
which all items within the level are failed.
o Binet and Terman worked from a notion of intelligence as an overall ability
related to abstract reasoning and problem solving.
o Intelligence quotient- the MA/CA ratio, proposed by William Stern in 1912.
o IQ = (MA / CA) × 100
o A ratio IQ is not useful with adults because mental age does not increase in a
rapid orderly fashion after the middle teens.
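The ratio IQ formula, as a one-line sketch (ages are hypothetical):

```python
# Sketch of the ratio IQ: IQ = (mental age / chronological age) * 100.
def ratio_iq(mental_age, chronological_age):
    return mental_age / chronological_age * 100

print(ratio_iq(10, 8))  # 125.0: an 8-year-old performing like an average 10-year-old
print(ratio_iq(8, 8))   # 100.0: performing exactly at age level
```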
 Wechsler tests-
o David Wechsler developed a family of tests for people at various age levels. It includes the Wechsler Adult Intelligence Scale (WAIS), the WAIS-R (1981), the Wechsler Preschool and Primary Scale of Intelligence (WPPSI, 1967), and the Wechsler Intelligence Scale for Children, revised as the WISC-R (1974).
o The subtests can be grouped into two categories, verbal and performance.
o Wechsler devised the deviation IQ. It is a type of standard score- that is an IQ
expressed in standard deviation units.
o Wechsler’s tests yield 3 different deviation IQs, one for the verbal subtests,
another for the performance subtests and a third full scale IQ.
o Standard score = (X − M) / SD, where X is the individual's score, M is the mean, and SD is the standard deviation.
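A sketch of the standard-score computation; the rescaling to a mean of 100 and SD of 15 follows the usual Wechsler convention, which these notes do not spell out and is assumed here:

```python
# Sketch: z = (X - M) / SD, then a deviation IQ that rescales z to
# mean 100, SD 15 (the conventional Wechsler scaling - an assumption).
def z_score(x, mean, sd):
    return (x - mean) / sd

def deviation_iq(x, mean, sd):
    return 100 + 15 * z_score(x, mean, sd)

# hypothetical raw score 130, norm-group mean 100, norm-group SD 20
print(z_score(130, 100, 20))       # 1.5: 1.5 SDs above the norm-group mean
print(deviation_iq(130, 100, 20))  # 122.5
```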
 Process oriented assessment of intellectual development
o Ina Uzgiris and J. McV. Hunt (1975) developed a set of 6 developmental
scales intended to measure “progressive levels of cognitive organization” in
the first 2 years of life. It was based on Piaget’s theory.
o It was designed to capture 6 different processes of cognitive development, all
occurring within what Piaget labeled the sensorimotor stage.
o They did not standardize their scale because they focus on where a given
infant is in relation to a sequential process of development, not how the infant
compares to other babies of the same age.
 Nonverbal tests (adults)- Raven's Progressive Matrices, Cattell's Culture Fair Intelligence Test (CCFIT). All are culturally fair.
 Nonverbal tests for children- Colored Progressive Matrices, NNAT. All are culturally fair.
 Performance tests- Koh's block design test, Alexander pass-along test, Bhatia's performance battery, cube construction test.
 For children: the Draw-a-Person test.

Testing for aptitudes

 We use intelligence tests to give us a broad assessment of intellectual capacity, and aptitude tests to measure the more specialized abilities required in specific occupations and activities.
 Scholastic aptitudes- if we are trying to predict success in academic training, we
speak of scholastic aptitude. Examples are SAT, GRE, MAT etc.
 Vocational aptitudes and interests- many tests are intended for specific jobs, for example, tests of mechanical ability or psychomotor tests. Counsellors often administer test batteries, combinations of tests covering a wide spectrum of abilities. Some vocational interest tests are the Strong-Campbell Vocational Interest Inventory and the Kuder Occupational Interest Survey.

Personality assessment

 Personality testing typically aims for a representative picture of individuals as they usually are. It does not involve levels of success or even "right" or "wrong" answers; its objective is not to gauge how successful a person will be, but rather what the person is usually like (in thoughts, feelings, and behavior patterns).
 Pen and paper tests
o Questionnaires- ask questions or give simple statements to be marked yes or no, true or false. Some questionnaires offer people the option of answering "doubtful" or "uncertain." This type of pen-and-paper personality test was first used widely during WW1 to weed out emotionally unstable draftees.
o Minnesota Multiphasic Personality Inventory (MMPI)- it contains 566
statements, or items, for people to answer about themselves.
 The authors used an empirical approach to test construction. They ignored the content of people's answers, choosing instead to match the patterns of people's answers to the patterns shown by "criterion" groups of people with known characteristics of special interest.
 MMPI also has several validity scales. With these, deceptiveness and
attempts to create especially good or bad impressions can often be
detected.
 To interpret the test, the psychologist looks at the total profile, not just
the separate scales.
 It has also spawned new personality tests- tests built in whole or in part
from MMPI items. Like Manifest Anxiety scale, California
psychological inventory etc.
o The 16 Personality Factor Questionnaire (16PF)- given by Raymond Cattell and his associates. They began with a list of 4500 adjectives applicable to human behavior and then reduced the list to 170.
 He used factor analysis to identify groupings, or factors, among the items.
 The 16 factors thus identified were said by Cattell to reflect key characteristics, or source traits, of the human personality.
 Forced choice rating questionnaire.
 Projective methods- these personality tests are deliberately designed to evoke highly
individual responses.
o They call for the test taker to respond to stimuli such as inkblots or pictures
but provide few guidelines as to what the response should be.
o The scoring procedures are also less structured; the interpreter must often rely
heavily on a subjective evaluation of the responses.
o Projective tests are based on projective hypothesis, derived from Freud’s
personality theory. The basic idea is that the way people respond to a vague or
ambiguous situation is often a projection of their underlying feelings and
motives.
o A related assumption about projective tests is that the test taker responds to the
relatively unstructured test stimuli in ways that give meaning to the stimuli,
and that much of that meaning comes from within the person responding.
o Thus, these tests are intended to provide access to unconscious impulses and
other aspects of personality of which the test taker themselves may not be
aware.
o The Rorschach Inkblot Technique- Rorschach produced a set of 10 inkblots. The blots, some black and white, some multicolored, appear on separate cards. Subjects are presented with the cards, one at a time, and asked questions such as "What might this be?" or "What does this remind you of?"
o After writing down as many answers as the subject cares to give for each blot,
the tester goes back through the set, asking the subject for more details,
including what it was about the blot that determined the subject’s response.
o The first phase of the test is called the free-association phase; the second phase
is called the inquiry.
o The content and style of responses become grist for the interpreter’s mill in the
subjective aspect of scoring.
o Thematic Apperception Test (TAT)- developed by Christiana Morgan and Henry Murray in 1938. It is based on Murray's theory of needs. It is designed to ferret out people's basic needs by having them tell stories. To guide story production, the tester presents a series of pictures and asks the subject to make up a story about what is happening, what went before, what is going to happen, and what the people involved are thinking and feeling.
 It includes a standard set of 30 pictures, but not all are administered to any one subject.
 The test is built on the assumption that people’s stories reveal
important aspects of their needs and self-perceptions as well as their
views about “significant others” in their lives.

Neuropsychological tests

 Cognitive assessment is useful to test for cognitive impairment- a deficiency in knowledge, thought processes, or judgement. Psychiatrists often perform cognitive testing during the mental status examination (MSE).
 Neuropsychological examination systematically assesses functioning in the realms of
attention and concentration, memory, language, spatial skills, sensory and motor
abilities as well as executive functioning and emotional status.
 The ideal tool would have high sensitivity, specificity, and positive predictive value,
take minimal time to conduct, and be easy to administer and score.
 Clock drawing test (CDT)- simplest test to administer. The patient is given a blank
sheet of paper and asked to draw a large circle, then to write numbers inside the circle
so that it resembles the face of a clock. Once this is completed, the patient is
instructed to “draw the hands on the clock to read ten past eleven.”
o There are multiple scoring systems for the CDT, points given for accuracy of
placement of the numbers and the size and position of the hands.
o Lower scores generally indicate greater impairment.
o It has also been shown to be highly predictive of driver safety.
 Battery approach- this approach typically includes a large variety of tests that
measure most cognitive domains as well as sensory and motor skills.
o Halstead-Reitan Neuropsychological test battery (HRNTB)- a set of tests
used to diagnose and localize brain damage by providing a comprehensive
assessment of cognitive functioning.
 It was the first neuropsychological battery and was published in 1947.
 The battery includes five core subtests given by Halstead-
 Category test
 Tactual performance test (the Seguin-Goddard formboard is used)
 Seashore rhythm test
 Speech sound perception test
 Finger tapping test.
 It includes 5 optional tests given by Reitan-
 Trail making test (attention and working memory)
 Reitan Indiana aphasia screening test
 Reitan-Klove sensory perceptual examination
 Grip strength test
 Lateral dominance examination
 Reitan removed the critical flicker frequency test and the time sense test due to lower validity.
 This battery checks for language, attention, motor dexterity, sensory-
motor integration, abstract thinking, and memory.
o Luria-Nebraska Neuropsychological battery (LNNB)- published in 1980. It
was designed to provide information useful in the diagnosis and treatment of
brain damage or dysfunction.
 It consists of 269 items, each of which may be scored on a 2- or 3-
point scale. A score of 0 indicates normal performance. High scores
indicate abnormal performance.
 It is divided into 11 content scales. Here raw scores are converted into
t-scores.
 We have motor, rhythm, tactile, visual, receptive speech, expressive
speech, writing, reading, arithmetic, memory, and intellectual
processing scales.
 In addition, there are 3 derived scales: pathognomonic, left hemisphere
scale and right hemisphere scale.
o MSE
 Mini-mental status examination- takes about 15 minutes. Used as
screening tool. It includes assessment of attention, orientation,
registration and recall/short term memory, language and visuospatial
construction.
 Maximum score is 30 points, with impairment suspected in
subjects whose score is 25 or lower.
o Montreal Cognitive Assessment- executive functioning is also seen here
along with other domains of MMSE. Maximum score is 30 points, with
impairment suspected in subjects whose score is 25 or lower.
o Bender Visual Motor Gestalt test (1938)
 the test is used to evaluate visual motor maturity, to screen for
developmental disorders, or to assess neurological function or brain
damage.
 It consists of 9 figures, each on its own 3 × 5 inch card. It takes 10-20 minutes. The subject is shown each figure and asked to copy it onto a piece of paper. The results are scored based on accuracy and other characteristics.
 The figures in the test were derived from the work of the famous
Gestalt psychologist Wertheimer.
 It measures perceptual motor skills, perceptual motor development and
gives an indication of neurological intactness.
 Bender 2 contains 16 figures.
o Wisconsin card sorting test- at the beginning of the test, the patient is confronted with 4 stimulus cards that differ from one another in the form, color, and number of symbols they display. Each time the patient learns a new sorting principle, the principle is changed. Patients with frontal lobe damage often continue to sort on the basis of one sorting principle (perseveration).

Creativity test

 Assess novel, original thinking and the capacity to find unusual or unexpected solutions, especially for vaguely defined problems.
 Torrance Test of Creative Thinking (TTCT). It consists of 2 sections- verbal and figural.
 Remote Associates Test (RAT), developed by Mednick and Mednick.

Levels of measurement (or measurement scales)

 Nominal or classificatory scale of measurement- the lowest level of measurement. Numbers are used to name, identify, or classify persons, objects, etc.; they are not really scales, and their purpose is to name objects.
o Members of any two groups are never equivalent but all members of any one
group are always equivalent.
o Statistical operations- frequency, percentage, proportion, mode and coefficient
of contingency.
 Ordinal scale of measurement- there is property of magnitude but not of equal
intervals or an absolute zero. Numbers denote the rank orders of objects or the
individuals.
o Statistical operations- median, percentile, rank correlation coefficient and all
of the nominals.
 Interval or equal-interval scale of measurement- the unit of measurement is equal and constant, meaning numerically equal distances on the scale indicate equal differences in the property being measured.
o The ratios of magnitudes are meaningless.
o Zero is not true but arbitrary.
o Psychological tests and inventories use this scale.
o Statistical operations- arithmetic mean, Pearson r and others.
o The coefficient of variation cannot be applied. The reason is that the coefficient of variation is a ratio of the standard deviation to the arithmetic mean. The standard deviation is fixed on the measurement scale because it is not affected by any shift in the zero point, but the mean varies whenever the zero point shifts. When the mean is affected, the coefficient of variation is also affected. It is therefore not advisable to calculate it from interval measurements.
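The argument can be demonstrated directly: shift the arbitrary zero of an interval scale and the SD is unchanged, while the mean, and hence the coefficient of variation, is not (data are invented):

```python
# Sketch: why the coefficient of variation (SD / mean) is unstable on an
# interval scale - moving the arbitrary zero changes the mean but not
# the SD, so CV changes too (the scores are hypothetical).
from statistics import mean, pstdev

scores = [10, 12, 14, 16, 18]
shifted = [x + 100 for x in scores]   # same data, zero point moved

print(pstdev(scores) == pstdev(shifted))        # True: SD is unaffected
print(round(pstdev(scores) / mean(scores), 3))  # 0.202: CV on the original scale
print(round(pstdev(shifted) / mean(shifted), 3))  # 0.025: a very different CV
```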
 Ratio scale of measurement- it has an absolute or true zero point. The ratio of any two numbers is independent of the unit of measurement and can therefore be meaningfully interpreted.
o All statistical operations can be applied.

Test construction

 Bean defines an item as "a single task or question that usually cannot be broken down into any smaller units."
 Item writing- an item must have the following characteristics:
o There should be no ambiguity regarding its meaning.
o It should not be too easy or too difficult.
o It should have discriminatory power, that is, it must clearly distinguish between those who possess a trait and those who do not.
o It should measure only the significant aspects of knowledge or understanding.
o It should not encourage guesswork.
o Its meaning should not depend upon another item, and it should not be answerable by referring to another item.
 Studies have revealed that usually 25-30 dichotomous items are needed to have a
reliability coefficient as high as 0.80 whereas 15-20 items are needed to reach the
same reliability when multipoint items are used.
 An item writer should always write almost twice the number of items to be retained
finally.
 Items are of two types-
o Essay item or free answer item- examinee relies upon his memory and past
associations to answer the questions in a few words only. They can answer in
whatever manner they like. Most appropriate to measure higher mental
processes. They are of two types.
 Short answer type.
 Long answer type or extended answer essay type.
o There are 2 general methods of scoring the responses of essay items-
 Sorting method- the answer sheets of the examinees are first sorted
into different groups according to the fairness of the answers. After
this, the scorer assigns the weightage to be given to each group and
accordingly, he gives weightage to each answer and finally adds it to
constitute a total score. It is done quickly and chances for erratic
marking are reduced considerably.
 Point score method- well suited to score short answer types. A
grading key consisting of the correct answers and the points to be
assigned is prepared beforehand by the scorer.
o Objective item or new type of item- when there is only one fixed correct
answer. All objective items can be divided into two broad categories-
 Supply type- when the examinee has to write down the correct answer
on his own. They are divided into two main categories.
 Unstructured short answer type- given in a question form.
 Completion item or fill in item- present in the form of an
incomplete sentence.
 Selection type- the examinee has to select or identify the correct answer from a few given options. Nunnally refers to such items as identification items. They are of different types-
 Two-alternative item- two answers are provided from which
the examinee is required to select the one which he thinks is
correct.
 Multiple choice item- the most common, effective, and flexible of all objective item formats. Also known as the polytomous or polychotomous format.
 Matching item- items on the left column are to be paired with
the items on the right column.
o Important methods of scoring objective test items-
 Overlay key method- a cut out key is prepared, that is, a window which
may display the answer to item on each page.
 Strip key method- the correct answer to each item is printed vertically
on a strip of paper on a cardboard.
 Item analysis- a set of procedures applied to obtain indices of the truthfulness (or validity) of items. It demonstrates how effectively a given test item functions within the total test.
 The main objectives of item analysis are-
o It provides an index of the difficulty value of each item.
o It indicates the discrimination value of each item. This is known as item validity.
o It indicates the effectiveness of the distractors in multiple choice questions. This is known as distractor analysis.
o It also indicates why a particular item in the test has not functioned effectively and how it might be modified so that its functional significance can be increased.
 Power test- examinee is allowed sufficient time for answering all items of the test.
Thus, emphasis here is upon measurement of the ability (or power) of the examinee
and not the speed.
 Item difficulty- the difficulty value of an item is the proportion or percentage of examinees who answer the item correctly.
o The maximum number of possible discriminations between examinees on any item is 50 × 50 = 2500. This occurs when an item is answered correctly by 50% of the examinees and wrongly by 50%.
o The proportion passing an item is inversely related to the difficulty of the item.
o The higher the proportion or percentage getting the item right (the higher the index), the easier the item, and vice versa.
o Test items must have a normal distribution with respect to indices of
difficulty.
o Moderate difficulty indices are preferred because they yield the maximum variance.
o As the index moves away from 0.5 in either direction, the variance of the item gradually decreases, that is, its ability to discriminate between those who pass and those who fail decreases.
o There are two important methods of determining the difficulty value of an item-
 Method of judgement- difficulty is determined on the basis of the judgement of experts.
 Empirical method- also known as the statistical method.
 For a dichotomous item, the index of difficulty can be determined by the formula p = R/N.
 p = difficulty index, R = number of examinees who pass the item, N = total number of examinees.
 The index of difficulty can also be determined from extreme portions of the group of examinees: p = (R_u + R_L) / (N_u + N_L).
 R_u is the number of examinees answering correctly in the upper group; R_L, in the lower group; N_u is the number of examinees in the upper group; N_L, in the lower group.
 Another way, when two equal extreme groups have been set up, is to average the two proportions to get D: first convert each group's count into a proportion (number correct divided by the group total), then average the two.
o A test which consists of items having D (difficulty) values close to 0.5 is known as a peaked test.
o The D value affects the reliability coefficient and also the item-total score correlation.
o The formula for correcting the total score for chance success is S = R − W/(K − 1); S = corrected score, R = number of correct responses, W = number of incorrect responses, K = number of response options or choices in the item.
o The best index of item intercorrelation is the phi coefficient.
 It is high when items are nearly equal in difficulty.
 Such tests have high reliability.
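The difficulty and guessing-correction formulas above can be sketched with hypothetical counts:

```python
# Sketch of the item-difficulty index and the correction for chance
# success (all counts are hypothetical).
def difficulty_index(num_correct, num_examinees):
    """p = R / N: proportion of examinees answering the item correctly."""
    return num_correct / num_examinees

def difficulty_from_extreme_groups(r_upper, r_lower, n_upper, n_lower):
    """p = (R_u + R_L) / (N_u + N_L), using upper and lower groups only."""
    return (r_upper + r_lower) / (n_upper + n_lower)

def corrected_score(right, wrong, options):
    """S = R - W / (K - 1): total score corrected for guessing."""
    return right - wrong / (options - 1)

print(difficulty_index(60, 100))  # 0.6: a moderately easy item
print(round(difficulty_from_extreme_groups(20, 10, 27, 27), 3))  # 0.556
print(corrected_score(40, 12, 5))  # 40 - 12/4 = 37.0
```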
 Item discrimination- also known as the item validity index. It is the capacity of an item to distinguish between superior and inferior examinees.
 Discriminatory power or validity (V) may be defined as the extent to which success or failure on the item indicates possession of the trait or achievement being measured.
o Positively discriminating items- the proportion or percentage of correct answers is higher in the upper group. Only such items are retained after item analysis.
o Negatively discriminating items- proportion or percentage of correct answers
is lower in the upper group.
o Nondiscriminating items- the percentage or proportion of correct answers is equal or approximately equal in both the upper and lower groups.
 There are 2 common ways of determining the index of discrimination-
o Test of significance of the difference between two proportions or
percentages- here examinees are divided preferably into 2 equal groups on the
basis of the total score. We use upper 27% and lower 27%. Critical ratios can
be applied. If the difference turns out to be significant, the item is accepted.
 Guilford (1954) has recommended the use of the chi-square test as a measure of the index of discrimination when there are equal numbers of examinees in both extreme groups: χ² = N(p(u) - p(L))² / 4pq.
 N=number of total examinees; p(u) and p(L) refer to the proportion of
examinees passing in the upper and lower group respectively; p is the
arithmetic mean of p(u) and p(L); q is 1-p.
 Chi-square can only be used with large samples.
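Guilford's chi-square index can be sketched directly from the formula (the group sizes and pass proportions below are hypothetical):

```python
def chi_square_discrimination(N, p_u, p_l):
    """Chi-square index: N * (p(u) - p(L))**2 / (4 * p * q),
    where p is the arithmetic mean of p(u) and p(L), and q = 1 - p."""
    p = (p_u + p_l) / 2
    q = 1 - p
    return N * (p_u - p_l) ** 2 / (4 * p * q)

# Hypothetical: 100 examinees in total; 0.8 pass in the upper group, 0.4 in the lower.
print(round(chi_square_discrimination(100, 0.8, 0.4), 2))  # 16.67
```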
 Net D index of discrimination- an unbiased index of absolute
difference in the number of discriminations made between the upper
group and the lower group-it is proportional to the net discrimination
made by the item between the two groups. Formula: V = R(u)/N(u) - R(L)/N(L), or V = (R(u) - R(L))/N(u), because N(u) = N(L).
 V = net D; R is the number of examinees who gave correct answers; N is the number of examinees in a group.
 If V is negative, the item is dropped.
 Values of V above 0.40 are thought to discriminate well.
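The net D index above can be sketched as follows (the counts are hypothetical):

```python
def net_d(R_u, R_l, N_u, N_l):
    """V = R(u)/N(u) - R(L)/N(L); with equal groups this equals (R(u) - R(L))/N(u)."""
    return R_u / N_u - R_l / N_l

# Hypothetical extreme groups of 27 each: 22 correct in the upper group, 10 in the lower.
v = net_d(22, 10, 27, 27)
print(round(v, 2))  # 0.44 -- above 0.40, so the item discriminates well
```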
o Correlation techniques-
 Item-total correlation or internal consistency- each item is correlated
against the internal criterion of the total score. It tells how well the
item is measuring the function which the test itself is measuring.
 Best index of discrimination.
 For multipoint items- use product-moment correlation; two alternative
responses (dichotomous items)- use biserial or point biserial r; when
total score is also dichotomous- tetrachoric r or phi-coefficients are
used.
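For a dichotomous item, the item-total (point-biserial) correlation is simply a Pearson r between the 0/1 item scores and the total scores; a sketch with hypothetical data:

```python
from statistics import mean, pstdev

def point_biserial(item, totals):
    """Pearson r between a dichotomous (0/1) item and the total score."""
    mx, my = mean(item), mean(totals)
    cov = mean((x - mx) * (y - my) for x, y in zip(item, totals))
    return cov / (pstdev(item) * pstdev(totals))

# Hypothetical data: high scorers tend to pass the item, low scorers to fail it.
item = [1, 1, 1, 0, 0, 1, 0, 0]
totals = [9, 8, 7, 6, 5, 7, 4, 3]
print(round(point_biserial(item, totals), 2))  # 0.86: the item discriminates well
```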
 Distractor analysis- examines the effectiveness of the distractors or foils, i.e., how well the incorrect options attract examinees.
o To be called a good distractor, an option must be chosen by more examinees in the lower group than in the upper group.
o If a distractor is chosen by more examinees in the upper group, it is poor and should be rewritten or modified.
o Nonfunctional distractors- distractors that contribute nothing to the test.
o If the extreme-group method is not used, the expected frequency for each distractor is calculated as: number of persons expected to choose each distractor = number of persons answering the item incorrectly / number of distractors.
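This expected-frequency check can be sketched with hypothetical numbers:

```python
def expected_per_distractor(n_incorrect, n_distractors):
    """Expected choices per distractor = incorrect answers / number of distractors."""
    return n_incorrect / n_distractors

# Hypothetical: 30 examinees missed a 4-distractor item; expect about 7.5 choices each.
# A distractor chosen far less often than this may be nonfunctional.
print(expected_per_distractor(30, 4))  # 7.5
```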
 Speed test: tests that emphasize the speed of the examinee's responses.
o Index of difficulty- p=R/N(r). p is the index of difficulty; R is the number of
examinees who gave correct answers; N (r) is the number of examinees who
actually reached that item.
o The formula for the corrected index of difficulty of an item in a speed test is: Pc = (R - W/(K - 1)) / (N - HR).
o Pc= corrected proportion of the index of difficulty, R refers to number of
correct answers; W refers to the number of incorrect answers; K is the number
of response options in the item; N is the number of total examinees; HR is the
number of examinees who could not reach the item within the time limit.
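A minimal sketch of the speed-test correction, again with hypothetical numbers:

```python
def corrected_speed_difficulty(R, W, K, N, HR):
    """Pc = (R - W / (K - 1)) / (N - HR) for an item in a speed test."""
    return (R - W / (K - 1)) / (N - HR)

# Hypothetical: 45 correct, 15 wrong, 5 options, 100 examinees, 25 never reached the item.
print(corrected_speed_difficulty(45, 15, 5, 100, 25))  # (45 - 3.75) / 75 = 0.55
```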
o Index of discrimination: items are selected not on the basis of item-total
correlation but on the basis of index of difficulty as well as upon ideal time
experimentally determined for the test.
 Factors affecting index of difficulty and index of discrimination-
o Learning experience or previous experience of examinee
o Complex or ambiguous answers
o The nature of response alternatives (multiple choice, alternatives)
 Item characteristic curve (ICC) and item response theory-
o ICC- graphic representation of the probability of giving the correct answer to
an item as a function of the level of the attributes assessed by the test. Used to
illustrate discrimination power and item difficulty.
 Steepness or slope conveys information about the discriminating power
of an item.
 Item-total correlation is positive- slope of ICC is positive and
vice versa.
 When it is near zero- the slope is near zero, flat.
 Position of ICC curve gives indication about the difficulty of each
item.
 Difficult items- rise on the right-hand side of the plot.
 Easier items- rise on the left-hand side of the plot.
 [Figure: ICCs of several items plotted together; the curve rising farthest to the left (blue) is the easiest item, and the one rising farthest to the right (green) is the most difficult.]
o Item response theory- concerned with how individual differences in the attributes of examinees affect their behavior when confronted with a specific item.
Also known as latent trait theory or item characteristic curve theory.
 This theory states that the probability of a particular response to a test
item is a joint function of one or more characteristics of the individual
respondents and one or more characteristics of the test item.
 According to the IRT each item on a test has an independent item
characteristic curve that describes the probability of getting each
particular item right or wrong, given a certain ability level of the
examinee.
 It provides measures that are generally sample invariant.
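IRT models typically express the ICC as a logistic function; a sketch of the common three-parameter logistic (3PL) model, with hypothetical parameter values:

```python
import math

def icc_3pl(theta, a, b, c):
    """P(theta) = c + (1 - c) / (1 + exp(-a * (theta - b))).
    a = discrimination (slope), b = difficulty (position), c = guessing (lower asymptote)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# For an average examinee (theta = 0), an easy item (b = -1) rises on the left
# and yields a higher probability of success than a difficult item (b = +1).
print(round(icc_3pl(0.0, 1.5, -1.0, 0.2), 3))  # 0.854
print(round(icc_3pl(0.0, 1.5, 1.0, 0.2), 3))   # 0.346
```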
 Comparison of classical test theory (CTT) and item response theory (IRT)-
o Model: CTT- linear, X = T + E, where X is the observed score, T the true score and E the error score (T and E are assumed to be unrelated). IRT- nonlinear.
o Level: CTT- test. IRT- item.
o Assumptions: CTT- weak (easy to meet with test data). IRT- strong (more difficult to meet with data).
o Item-ability relationship: CTT- not specified. IRT- item characteristic functions.
o Ability: CTT- test scores or estimated true scores are reported. IRT- ability scores are reported.
o Invariance of item and person statistics: CTT- both are sample dependent. IRT- sample independent.
o Item statistics: CTT- p and r. IRT- b, a and c (the three-parameter model) plus the corresponding item information functions.
Attitude scales
 Semantic differential scale- an attitude measuring device developed by Osgood et al. in 1957. It grew out of the mediational theory of learning.
 It is a collection of subscales in which absolute ratings of concepts are done. The term
‘concept’ refers to the object which is to be rated. It does not permit a comparative
rating of responses.
 The purpose of SD is to measure the various facets of the meaning of the concept
which are represented by adjectives. The reason why adjectives are selected to convey
the meaning of the concept is that in our day-to-day language, most of our ideas are
appropriately communicated only through adjectives.
 There are 3 facets of meaning- denotation, connotation and association.
o Denotation- description of objects in terms of their physical properties.
o Connotation- sentiments and feelings of persons about an object.
o Association- objects which come to the mind of the person when he has heard
or seen a particular word.
 The semantic differential scale mainly measures the connotative facet of meaning.
 The concept is usually rated on a seven-point scale having bipolar adjectives at the
two extremes.
 To form an SD scale, the first step is to choose a concept. In doing so, two primary considerations are important.
o The concept must be the object or stimulus which can elicit different responses
from individuals.
o The concept must be relevant to the problem being investigated.
 Next step is to choose the scales or adjective pairs. Again, there are 2 considerations.
o It must be decided by the investigator as to what factors should represent the
adjective pairs.
 There are 3 kinds of factors- evaluation (good-bad), potency (strong-weak), and activity (hot-cold). These clusters of adjectives represent 3 dimensions of meaning along which a concept can be measured; together, these dimensions are called the semantic space.
 E (evaluation) is the strongest. In the study of attitude or value, it is
wise to include only the scales of E factor.
o The pair of adjectives must be relevant to the concept being rated.
 In the final preparation of SD scale, each concept is written on a separate sheet of
paper with the same set of scales and the subject is asked to rate the concepts as
he/she sees them.
 Likert’s scale- also known as the method of summated ratings; it was developed by Likert in 1932. The main steps involved in Likert’s method are-
o A large number of multiple-choice type statements, usually with 5 alternatives such as strongly agree, agree, undecided, disagree and strongly disagree, are collected by the investigator.
o Such statements are administered to a group of subjects who respond to each
item by indicating which of the given 5 alternatives they agree with.
o Each response alternative is assigned a different weight, and every answered item is scored with the weight of the chosen alternative.
o Then a total score for each subject is found by adding the weights earned by
him on each item. That is why this method is called the method of summated
ratings.
o Finally, the selection of items is done through item analysis.
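The summation step can be sketched as follows (the 5-to-1 weights are the conventional ones for a favorable statement; reverse them for unfavorable statements):

```python
# Conventional weights for a favorable statement (hypothetical example).
WEIGHTS = {"strongly agree": 5, "agree": 4, "undecided": 3,
           "disagree": 2, "strongly disagree": 1}

def summated_score(responses):
    """Total attitude score = sum of the weights earned on each item."""
    return sum(WEIGHTS[r] for r in responses)

print(summated_score(["strongly agree", "agree", "undecided", "agree"]))  # 16
```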
 Stapel scale: developed by Jan Stapel, it is presented vertically with an adjective in the middle and 5 data points above and 5 data points below.
o It is a slight modification of semantic differential scale.
o It is a unipolar rating scale with 10 categories numbered from -5 (very
inaccurate) to +5 (very accurate), without a neutral point (0).
o Used when it becomes difficult to find bipolar adjectives.
o The higher the positive score the better the adjective describes the object and
vice versa.
 Computer based psychological testing-
o Refers to a test that is administered and scored on a computer.
o Responses are electronically recorded or assessed.
o Computerized adaptive testing (CAT) is a special type of computer-based
testing, which uses a set of statistical procedures that make it possible to tailor
the test for the individual being assessed based on their responses, which
allows for more accurate and efficient measurement.
o CBTI- computer based test interpretation.
o Empirically based programs are known as actuarial assessment programs
which use statistical analysis, linear regression equation, Bayesian rules.
o Clinically based programs are also known as automated response assessment.
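The adaptive idea behind CAT can be illustrated with a toy sketch (not an operational algorithm): each response shifts the ability estimate, and the next item administered is the unused one whose difficulty is closest to that estimate.

```python
def next_item(ability, difficulties, used):
    """Pick the unused item whose difficulty is closest to the current ability estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - ability))

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical item difficulties
ability, used = 0.0, set()
for correct in [True, True, False]:         # hypothetical response pattern
    i = next_item(ability, difficulties, used)
    used.add(i)
    ability += 0.5 if correct else -0.5     # crude fixed-step update, for illustration
print(round(ability, 1), sorted(used))      # 0.5 [2, 3, 4]
```

A real CAT would estimate ability with an IRT model and select items by information, but the tailoring logic is the same.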
 Applications of psychological testing in various settings-
o Organizational- work-related testing, pre-employment testing, performance appraisal tools. SIOP (Society for Industrial and Organizational Psychology).
o Education- assess-monitoring, evaluation.
o Military- describing, explaining, predicting and modifying military behaviors, for military personnel and their families. Defence Institute of Psychological Research (DIPR), AFMC for military psychology.