Psych Assessment
DEFINITIONS
Testing is a term used to refer to everything from the administration of a test to the interpretation of a test score (Cohen & Swerdlik, 2018).
Assessment acknowledges that tests are only one type of tool used by professional assessors and that a test's value is intimately linked to the knowledge, skill, and experience of the assessor (Cohen & Swerdlik, 2018).
Psychological Assessment - the gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished through the use of tools such as tests, interviews, case studies, behavioral observation, and specially designed apparatuses and measurement procedures (Cohen & Swerdlik, 2018).
Psychological Testing - the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior (Cohen & Swerdlik, 2018).
A test is a measurement device or technique used to quantify behavior or aid in the understanding and prediction of behavior (Kaplan & Saccuzzo, 2018).
A psychological test or educational test is a set of items that are designed to measure characteristics of human beings that pertain to behavior (Kaplan & Saccuzzo, 2018).

TYPES OF TESTS
Tests that can be given to only one person at a time are known as individual tests (Kaplan & Saccuzzo, 2018).
A group test, by contrast, can be administered to more than one person at a time by a single examiner (Kaplan & Saccuzzo, 2018).
• REPUBLIC ACT 9258 or the Guidance and Counseling Act of 2004
Crafted and designed to professionalize the practice of guidance and counseling in the Philippines.

• Guidance Counselor: A natural person who has been registered and issued a valid Certificate of Registration and a valid Professional Identification Card by the PRB of Guidance and Counseling and PRC in accordance with RA 9258 and who, by virtue of specialized training, performs for a fee, salary, or other forms of compensation the functions of guidance and counseling.

Frequency Table
Frequency table - an ordered listing of the number of individuals having each of the different values for a particular variable.
It is called a frequency table because it shows how frequently (how many times) each score was used. A frequency table makes the pattern of numbers easy to see.
You can also use a frequency table to show the number of scores for each value (that is, for each category) of a nominal variable.
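To make the idea concrete, here is a minimal Python sketch of building a frequency table; the scores and the 0-10 rating variable are invented for illustration.

    from collections import Counter

    # Hypothetical scores on a 0-10 rating scale
    scores = [7, 8, 8, 7, 3, 1, 6, 9, 3, 8, 3, 7, 10, 3, 8]

    frequency_table = Counter(scores)      # value -> how many times it occurs
    print("Value  Frequency")
    for value in sorted(frequency_table, reverse=True):
        print(f"{value:>5}  {frequency_table[value]:>9}")

Listing the values in order (here, highest to lowest) is what makes the pattern of the numbers easy to see.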
Grouped Frequency Table
Sometimes there are so many possible values that an ordinary frequency
table is too awkward to give a simple picture of the scores.
The solution is to make groupings of values that include all values in a
certain range.
interval - range of values in a grouped frequency table that are
grouped together. (For example, if the interval size is 10, one of the
intervals might be from 10 to 19.)
grouped frequency table - frequency table in which the number of individuals (frequency) is given for each interval of values.
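A minimal sketch of the same idea in Python, using the text's interval size of 10; the scores are invented for illustration.

    from collections import Counter

    # Hypothetical scores spread too widely for an ordinary frequency table
    scores = [96, 83, 75, 75, 68, 62, 61, 59, 53, 47, 46, 41, 38, 36, 21, 14]

    interval_size = 10                     # e.g., one interval runs from 10 to 19
    # Map each score to the lower bound of its interval, then count per interval
    intervals = Counter((score // interval_size) * interval_size for score in scores)
    for start in sorted(intervals, reverse=True):
        print(f"{start}-{start + interval_size - 1}: {intervals[start]}")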
Histograms
A graph is another good way to make a large group of scores easy to
understand. A picture may be worth a thousand words, but it is also
sometimes worth a thousand numbers.
Histogram - a barlike graph of a frequency distribution in which the values are plotted along the horizontal axis and the height of each bar is the frequency of that value; the bars are usually placed next to each other without spaces, giving the appearance of a city skyline.
When you have a nominal variable, the histogram is called a bar graph. Since the values of a nominal variable are not in any particular order, you leave a space between the bars.
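A plotting library would normally draw the graph, but a text-only sketch conveys the idea; the scores are invented.

    from collections import Counter

    scores = [3, 5, 5, 6, 6, 6, 7, 7, 8, 9]   # hypothetical interval-scale scores
    freq = Counter(scores)

    # One row per value; the run of X's plays the role of the bar's height
    for value in range(min(scores), max(scores) + 1):
        print(f"{value:>2} | {'X' * freq[value]}")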
Frequency Distribution
A frequency distribution shows the pattern of frequencies over the various values.
A frequency table or histogram describes a frequency distribution because each shows the pattern or shape of how the frequencies are spread out, or "distributed." Psychologists also describe this shape in words.
unimodal distribution - frequency distribution with one value clearly having a larger frequency than any other.
bimodal distribution - frequency distribution with two approximately equal frequencies, each clearly larger than any of the others.
multimodal distribution - frequency distribution with two or more high frequencies separated by a lower frequency; a bimodal distribution is the special case of two high frequencies.
rectangular distribution - frequency distribution in which all values have approximately the same frequency.
symmetrical distribution - distribution in which the pattern of frequencies on the left and right side are mirror images of each other.
skewed distribution - distribution in which the scores pile up on one side of the middle and are spread out on the other side; distribution that is not symmetrical.
Figure: (a) approximately symmetrical, (b) skewed to the right (positively skewed), and (c) skewed to the left (negatively skewed).
floor effect - situation in which many scores pile up at the low end of a distribution (creating skewness to the right) because it is not possible to have any lower score.
ceiling effect - situation in which many scores pile up at the high end of a distribution (creating skewness to the left) because it is not possible to have a higher score.
A distribution that is skewed to the right is also called positively skewed. A distribution skewed to the left is also called negatively skewed.

Normal and Kurtotic Distribution
NORMAL CURVE - specific, mathematically defined, bell-shaped frequency distribution that is symmetrical and unimodal; distributions observed in nature and in research commonly approximate it.
Psychologists also describe a distribution in terms of whether the middle of the distribution is particularly peaked or flat. The standard of comparison is a bell-shaped curve. In psychology research and in nature generally, distributions often are similar to this bell-shaped standard, called the normal curve.
Kurtosis is how much the shape of a distribution differs from a normal curve in terms of whether its curve in the middle is more peaked or flat than the normal curve (DeCarlo, 1997). Kurtosis comes from the Greek word kyrtos, "curve."
KURTOSIS - extent to which a frequency distribution deviates from a normal curve in terms of whether its curve in the middle is more peaked or flat than the normal curve.

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Measure of Central Tendency
CENTRAL TENDENCY
The central tendency of a distribution refers to the middle of the group of scores.
Measures of central tendency refer to the set of measures that reflect where on the scale the distribution is centered.
Three measures of central tendency: mean, mode, and median. Each measure of central tendency uses its own method to come up with a single number describing the middle of a group of scores.
The MEAN is the most commonly used measure of central tendency.

THE MEAN
▪ MEAN - arithmetic average of a group of scores; sum of the scores divided by the number of scores.
Outlier - a score with an extreme value (very high or very low) in relation to the other scores in the distribution.

MODE
The MODE is another measure of central tendency. The mode is the most common single value in a distribution.
mode = value with the greatest frequency in a distribution
It can also be defined simply as the most common score, that is, the score obtained from the largest number of subjects. Thus, the mode is that value of X that corresponds to the highest point on the distribution.
In a perfectly symmetrical unimodal distribution, the mode is the same as the mean. However, what happens when the mean and the mode are not the same? In that situation, the mode is usually not a very good way of describing the central tendency of the scores in the distribution.
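A quick illustration with Python's statistics module, using a made-up symmetrical, unimodal set of scores:

    import statistics

    scores = [2, 3, 3, 4, 4, 4, 5, 5, 6]      # symmetrical and unimodal

    mean = statistics.mean(scores)             # sum of scores / number of scores
    mode = statistics.mode(scores)             # value with the greatest frequency
    print(mean, mode)                          # both 4: here they coincide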
MEDIAN
Another alternative to the mean is the MEDIAN. If you line up all the scores from lowest to highest, the middle score is the median.
When you have an even number of scores, the median can be between two different scores. In that situation, the median is the average (the mean) of those two scores.
The median is the score that corresponds to the point at or below which 50% of the scores fall when the data are arranged in numerical order. By this definition, the median is also called the 50th percentile.

USES OF THE MEAN, MEDIAN, AND MODE
The mode is a score that actually occurred, whereas the mean and sometimes the median may be values that never appear in the data.
The mode also has the obvious advantage of representing the largest number of people, and it has the advantage of being applicable to nominal data, which is not true of the median or the mean.
Disadvantage: the mode depends on how we group our data. Another disadvantage is that it may not be particularly representative of the entire collection of numbers.
The major advantage of the median, which it shares with the mode, is that it is unaffected by extreme scores.
The median is the preferred measure of central tendency when the data are ordinal scores. The median is also preferred when interval or ratio scores form a very skewed distribution.
Computing the mean is appropriate whenever getting the "average" of the scores makes sense. Therefore, do not use the mean when describing nominal data. Likewise, do not compute the mean with ordinal scores. The mean describes interval or ratio data.
Always compute the mean to summarize a normal or approximately normal distribution: The mean is the mathematical center of any distribution, and in a normal distribution, most of the scores are located around this central point. Therefore, the mean is an accurate summary and provides an accurate address for the distribution.
Only when the distribution is symmetric will the mean and the median be equal, and only when the distribution is symmetric and unimodal will all three measures be the same.
The mean will inaccurately describe a skewed (nonsymmetrical) distribution. The solution is to use the median to summarize a skewed distribution.
REMEMBER: Use the mean to summarize normal distributions of interval or ratio scores; use the median to summarize skewed distributions.

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – Measures of Variability
VARIABILITY
Researchers also want to know how spread out the scores are in a distribution. This shows the amount of variability in the distribution.
Computing a measure of variability is important because without it a measure of central tendency provides an incomplete description of a distribution. The mean, for example, only indicates the central score and where the most frequent scores are.
Think of the variability of a distribution as the amount of spread of the scores around the mean. In other words, how close or far from the mean are the scores in a distribution?
There are three measures of the variability of a group of scores: the range, the variance, and the standard deviation.
▪ Measures of variability communicate three related aspects of the data:
1st - the opposite of variability is consistency.
2nd - measures of variability indicate how spread out the scores and the distribution are.
3rd - a measure of variability tells us how accurately the measure of central tendency describes the distribution.
REMEMBER: Measures of variability communicate the differences among the scores, how consistently close to the mean the scores are, and how spread out the distribution is.

RANGE
One way to describe variability is to determine how far the lowest score is from the highest score. The descriptive statistic that indicates the distance between the two most extreme scores in a distribution is called the range.
Range = Highest score – Lowest score
The range does communicate the spread in the data. However, the range is a rather crude measure: it involves only the two most extreme scores, so it is based on the least typical and often least frequent scores. Therefore, we usually use the range as our sole measure of variability only with nominal or ordinal data.

VARIANCE
The variance of a group of scores is one kind of number that tells you how spread out the scores are around the mean. To be precise, the variance is the average of each score's squared difference from the mean.
Mathematically, the distance between a score and the mean is the difference between them, which is the amount that a score deviates from the mean. Thus, a score's deviation indicates how far it is spread out from the mean.
Of course, some scores will deviate by more than others, so it makes sense to compute something like the average amount the scores deviate from the mean. Let's call this the "average of the deviations." The larger the average of the deviations, the greater the variability.
The more spread out the distribution is, the larger the variance, because being spread out makes the deviation scores bigger. If the deviation scores are bigger, the squared deviation scores and the average of the squared deviation scores (the variance) are also bigger.
The variance is rarely used as a descriptive statistic. This is because the
variance is based on squared deviation scores, which do not give a very
easy-to-understand sense of how spread out the actual, non-squared
scores are.
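The "average of the squared deviations" can be spelled out step by step. This sketch uses invented scores and checks the result against the standard library:

    import statistics

    scores = [2, 4, 4, 6, 9]                   # invented scores
    mean = statistics.mean(scores)             # 5

    deviations = [s - mean for s in scores]    # how far each score is from the mean
    squared = [d ** 2 for d in deviations]     # squaring removes the +/- signs
    variance = sum(squared) / len(scores)      # average squared deviation = 5.6
    print(variance, statistics.pvariance(scores))   # both 5.6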
STANDARD DEVIATION
The most widely used number to describe the spread of a group of
scores is the standard deviation. The standard deviation is simply the
square root of the variance.
The measure of variability that more directly communicates the
“average of the deviations” is the standard deviation.
There are two steps in figuring the standard deviation:
❶ Figure the variance.
❷ Take the square root.
The standard deviation is the positive square root of the variance.
REMEMBER: The standard deviation indicates the "average deviation" from the mean, the consistency in the scores, and how far scores are spread out around the mean.
REMEMBER: The variance and standard deviation are two measures of variability that indicate how much the scores are spread out around the mean. We use the variance and the standard deviation to describe how different the scores are from each other.

BASIC STATISTICAL CONCEPTS IN PSYCH ASSESSMENT – NORMAL DISTRIBUTION
REMEMBER: Approximately 34% of the scores in a normal distribution are between the mean and the score that is 1 standard deviation from the mean.
If you start with a normal distribution and move scores from both the center and the tails into the shoulders, the curve becomes flatter and is called platykurtic. This is where the central portion of the distribution is much too flat.
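The 34% figure can be checked directly from the mathematically defined normal curve:

    from statistics import NormalDist

    # Area under the standard normal curve between the mean (z = 0) and one
    # standard deviation above the mean (z = 1)
    area = NormalDist().cdf(1) - NormalDist().cdf(0)
    print(round(area, 4))                      # 0.3413, i.e., roughly 34%

By symmetry, roughly 68% of scores in a normal distribution fall within 1 standard deviation of the mean in either direction.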
Basic Concepts
VARIABLE - a condition or characteristic that can have different values. In short, it can vary.

(Figure: Traits, States, Constructs, and Overt Behavior.)
Trait - defined as "any distinguishable, relatively enduring way in which one individual varies from another" (Guilford, 1959, p. 6).
States also distinguish one person from another but are relatively less enduring (Chaplin et al., 1988).
Construct - an informed, scientific concept developed or constructed to describe or explain behavior.
Overt behavior refers to an observable action or the product of an observable action, including test- or assessment-related responses.
Traits that manifest in observable behavior are presumed to depend not only on the strength of the trait in the individual but also on the nature of the situation.

Assumption #2: Psychological Traits and States Can Be Quantified and Measured
Test developers and researchers have many different ways of looking at and defining the same phenomenon (the trait term "aggressive," for example, can be defined in many different ways).
Test developers must ensure the most appropriate test items are included in the assessment, based on how the trait term is defined. Test developers must also ensure appropriate ways to score the test and interpret the results.
Cumulative Scoring - the assumption that the more the test taker responds in a particular direction, as keyed by the test manual as correct or consistent with a particular trait, the higher that test taker is presumed to be on the targeted ability or trait.

Assumption #3: Test-Related Behavior Predicts Non-Test-Related Behavior
The tasks in some tests mimic the actual behaviors that the test user is attempting to understand. However, such tests yield only a sample of the behavior that can be expected to be emitted under non-test conditions.
The obtained sample of behavior is typically used to make predictions about future behavior, such as the work performance of a job applicant.
Psychological tests may also be used not to predict behavior but to postdict it - that is, to aid in the understanding of behavior that has already taken place.

Assumption #4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
Ꙭ Competent test users understand a great deal about the tests they use. They understand how a test was developed, the circumstances under which it is appropriate to administer the test, how the test should be administered and to whom, and how the test results should be interpreted.

ERROR VARIANCE
The component of a test score attributable to sources other than the trait or ability measured.
o Potential sources of error variance:
• Assessees themselves are sources of error variance
• Assessors, too, are sources of error variance
• Measuring instruments themselves are another source of error variance
Measurement professionals tend to view error as simply an element in the process of measurement.
Classical Test Theory (CTT, also referred to as true score theory) - the assumption is made that each test taker has a true score on a test that would be obtained but for the action of measurement error.
A model of measurement based on item response theory (IRT) is an alternative. However, whether CTT, IRT, or some other model of measurement is used, the model must have a way of accounting for measurement error.

Assumption #6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
One source of fairness-related problems is the test user who attempts to use a particular test with people whose background and experience are different from the background and experience of the people for whom the test was intended.
Today all major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual. However, despite the best efforts of many professionals, fairness-related questions and problems do occasionally arise.

Assumption #7: Testing and Assessment Benefit Society
In a world WITHOUT TESTS or other assessment procedures:
personnel might be hired on the basis of nepotism (favoritism/bias) rather than documented merit
teachers and school administrators could subjectively place children in different types of special classes simply because that is where they believed the children belonged
there would be a great need for instruments to diagnose educational difficulties in reading and math and point the way to remediation
there would be no instruments to diagnose neuropsychological impairments
there would be no practical way for the military to screen thousands of recruits with regard to many key variables

Reliability
What is a Good Test?
• the criteria for a good test would include clear instructions for administration, scoring, and interpretation
• a good test offers economy in the time and money it takes to administer, score, and interpret it
• the technical criteria that assessment professionals use to evaluate the quality of tests refer to the psychometric soundness of tests
• two key aspects of psychometric soundness are reliability and validity
RELIABILITY
• the criterion of reliability involves the consistency of the measuring tool
• the precision with which the test measures and the extent to which error is
present in measurements.
• the perfectly reliable measuring tool consistently measures in the same
way.
Reliability Coefficient
• A reliability coefficient is an index of reliability, a proportion that indicates
the ratio between the true score variance on a test and the total variance.
Concept of Reliability
• Error refers to the component of the observed test score that does not have
to do with the test taker’s ability.
• If we use X to represent an observed score, T to represent a true score, and
E to represent error, then the fact that an observed score equals the true
score plus error may be expressed as follows:
X = T + E
• If σ² represents the total observed variance, σ²tr the true variance, and σ²e the error variance, then the relationship of the variances can be expressed as:
σ² = σ²tr + σ²e
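A hedged illustration of this decomposition: simulating true scores and independent errors (all numbers invented) shows reliability emerging as the ratio of true variance to total variance.

    import random
    import statistics

    random.seed(0)
    n = 1000
    true_scores = [random.gauss(100, 15) for _ in range(n)]    # T
    errors = [random.gauss(0, 5) for _ in range(n)]            # E
    observed = [t + e for t, e in zip(true_scores, errors)]    # X = T + E

    reliability = statistics.pvariance(true_scores) / statistics.pvariance(observed)
    print(round(reliability, 2))   # near 15**2 / (15**2 + 5**2) = 0.90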
• The term reliability refers to the proportion of the total variance attributed to true variance.
• The greater the proportion of the total variance attributed to true variance, the more reliable the test.

Measurement Error
• refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured.

Reliability Estimates
Ꙭ Test-Retest Reliability
One way of estimating the reliability of a measuring instrument is by using the same instrument to measure the same thing at two points in time. This approach to reliability evaluation is called the test-retest method, and the result of such an evaluation is an estimate of test-retest reliability.
Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
The test-retest measure is appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time, such as a personality trait.
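In practice this is a correlation between two administrations. A minimal sketch with invented scores (statistics.correlation requires Python 3.10+):

    from statistics import correlation   # available in Python 3.10+

    # Hypothetical scores from the same eight people at two points in time
    time1 = [12, 19, 25, 31, 34, 40, 45, 52]
    time2 = [14, 17, 27, 30, 37, 39, 47, 50]

    r = correlation(time1, time2)         # Pearson r between the administrations
    print(round(r, 3))                    # close to 1.0: scores rank consistently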
By determining the reliability of one half of a test, a test developer can use the Spearman–Brown formula to estimate the reliability of the whole test. Because a whole test is twice as long as half a test, n becomes 2 in the Spearman–Brown formula for the adjustment of split-half reliability.
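The Spearman–Brown formula itself is rSB = nr / (1 + (n − 1)r). A small sketch; the 0.70 half-test reliability is an invented example:

    def spearman_brown(r_half, n=2.0):
        # Estimate whole-test reliability from part-test reliability; n is the
        # factor by which the test is lengthened (2 when going from half to whole)
        return (n * r_half) / (1 + (n - 1) * r_half)

    print(round(spearman_brown(0.70), 3))   # 0.824: the full test is more reliable

Note the direction of the adjustment: lengthening a test with comparable items raises the estimated reliability.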
Other Methods of Estimating Internal Consistency
• Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test.
• An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test. Tests are said to be homogeneous if they contain items that measure a single trait.
Coefficient alpha
• Developed by Cronbach (1951) and subsequently elaborated on by others (such as Kaiser & Michael, 1975; Novick & Lewis, 1967), coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula.
• Coefficient alpha is appropriate for use on tests containing nondichotomous items.
• Coefficient alpha is the preferred statistic for obtaining an estimate of internal consistency reliability.
• Coefficient alpha is widely used as a measure of reliability, in part because it requires only one administration of the test.

Average proportional distance (APD)
• Rather than focusing on similarity between scores on items of a test, the APD is a measure that focuses on the degree of difference that exists between item scores.
• The average proportional distance method is defined as a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
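One common way to compute coefficient alpha is from item variances and the variance of total scores; this is a sketch of that textbook formulation (not the only one), with invented ratings:

    import statistics

    def cronbach_alpha(items):
        # items[i][p] = score of person p on item i; every person answers every item
        k = len(items)
        sum_item_var = sum(statistics.pvariance(item) for item in items)
        totals = [sum(person) for person in zip(*items)]   # each person's total score
        return (k / (k - 1)) * (1 - sum_item_var / statistics.pvariance(totals))

    # Invented ratings: 3 nondichotomous items answered by 5 people
    items = [[4, 5, 3, 5, 2],
             [3, 5, 4, 4, 2],
             [4, 4, 3, 5, 1]]
    print(round(cronbach_alpha(items), 2))   # about 0.92 for these made-up data

Only one administration of the test is needed, which is part of why alpha is so widely used.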
Keep in Mind!
• Measures of reliability are estimates, and estimates are subject to error. The precise amount of error inherent in a reliability estimate will vary with various factors, such as the sample of test takers from which the data were drawn.
• A reliability index published in a test manual might be very impressive. However, keep in mind that the reported reliability was achieved with a particular group of test takers.

Using and Interpreting a Coefficient of Reliability
• How high should the coefficient of reliability be? "On a range relative to the purpose and importance of the decisions to be made on the basis of scores on the test."
• Reliability is a mandatory attribute in all tests we use. However, we need more of it in some tests, and we will admittedly allow for less of it in others.

VALIDITY
What is a Good Test?
• technical criteria that assessment professionals use to evaluate the quality of tests refer to the psychometric soundness of tests
• two key aspects are reliability and validity

Validity
• a judgment or estimate of how well a test measures what it purports to measure in a particular context.
• a judgment based on evidence about the appropriateness of inferences drawn from test scores.
• No test or measurement technique is "universally valid" for all time, for all uses, with all types of test taker populations.
• Rather, tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage.
Validation
• the process of gathering and evaluating evidence about validity.
• both the test developer and the test user may play a role in the validation of a test for a specific purpose.
• Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test.
1. Content Validity
This is a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
It describes a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
In the interest of ensuring content validity, test developers strive to include key components of the construct targeted for measurement, and exclude content irrelevant to the construct targeted for measurement.
Content validity is the degree to which the items in the measurement scale represent all aspects of the variable being measured.
Content validity is not evaluated numerically; it is judged by the researcher.
2. Criterion-Related Validity
This is a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
It is a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest - the criterion.
Characteristics of a Criterion:
1. An adequate criterion is relevant, meaning it is pertinent or applicable to the matter at hand.
2. Criterion-related validity reflects the degree to which a measurement instrument is related to an independent measure of the relevant criterion.
3. Criterion-related validity can be evaluated by computing the multiple correlation (R) between the test and criterion performance.
4. An adequate criterion measure must also be valid for the purpose for which it is being used.
5. A criterion is also uncontaminated; that is, the criterion measure itself is not based, even in part, on the predictor measures.

Incremental Validity
The degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.

3. Construct Validity
This is a measure of validity that is arrived at by executing a comprehensive analysis of
a. how scores on the test relate to other test scores and measures, and
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.
It is the degree to which the measurement instrument measures the theoretical constructs that it was intended to measure. Construct validity can be evaluated through factor analysis.
Construct validity is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct. A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.

4. Face Validity
Face validity relates more to what a test appears to measure to the person being tested than to what the test actually measures. It is a judgment concerning how relevant the test items appear to be. Stated another way, if a test definitely appears to measure what it purports to measure "on the face of it," then it could be said to be high in face validity.
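As a sketch, a simple criterion-related (predictive) validity coefficient is the correlation between test scores and later criterion scores; all numbers here are invented:

    from statistics import correlation

    # Invented data: a selection test and later job-performance ratings (criterion)
    test_scores = [55, 62, 70, 71, 78, 84, 90]
    performance = [2.9, 3.1, 3.4, 3.2, 3.9, 4.1, 4.4]

    validity_coefficient = correlation(test_scores, performance)
    print(round(validity_coefficient, 2))   # high r = strong criterion-related evidence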
Multiple-Choice Item
• An item written in a multiple-choice format has three elements:
(1) a stem,
(2) a correct alternative or option, and
(3) several incorrect alternatives or options variously referred to as distractors or foils.

Matching Item
• In a matching item, the test taker is presented with two columns: premises on the left and responses on the right.
• The test taker's task is to determine which response is best associated with which premise.
• For very young test takers, the instructions will direct them to draw a line from one premise to one response. Test takers other than young children are typically asked to write a letter or number as a response.

Completion Item
• requires the examinee to provide a word or phrase that completes a sentence, as in the following example:
The standard deviation is generally considered the most useful measure of __________.

Short-Answer Item
• It is desirable for short-answer items to be written clearly enough that the test taker can respond succinctly - that is, with a short answer. Example:
What descriptive statistic is generally considered the most useful measure of variability?
Essay Item
• a test item that requires the test taker to respond to a question
by writing a composition, typically one that demonstrates recall of
facts, understanding, analysis, and/or interpretation.
• Example of an essay item:
Compare and contrast definitions and techniques of
classical and operant conditioning. Include examples of
how principles of each have been applied in clinical as
well as educational settings.
Test Tryout
• Having created a pool of items from which the final version of the test will be developed, the test developer will try out the test.
• The test should be tried out on people who are similar in critical respects to the people for whom the test was designed.
• Equally important are questions about the number of people on whom the test should be tried out. An informal rule of thumb is that there should be no fewer than 5 subjects, and preferably as many as 10, for each item on the test. In general, the more subjects in the tryout, the better.
• The test tryout should be executed under conditions as identical as possible to the conditions under which the standardized test will be administered. All instructions, and everything from the time limits allotted for completing the test to the atmosphere at the test site, should be as similar as possible.

Item Analysis
• Among the tools test developers might employ to analyze and select items are:
■ an index of the item's difficulty
■ an index of the item's reliability
■ an index of the item's validity
■ an index of item discrimination

Item-difficulty index
• An index of an item's difficulty is obtained by calculating the proportion of the total number of test takers who answered the item correctly.
• A lowercase italic "p" (p) is used to denote item difficulty, and a subscript refers to the item number (so p1 is read "item difficulty index for item 1").
• The value of an item-difficulty index can theoretically range from 0 (if no one got the item right) to 1 (if everyone got the item right).
Example: If 50 of the 100 examinees answered item 2 correctly, then the item-difficulty index for this item would be p2 = 50/100 = .5
Note that the larger the item-difficulty index, the easier the item. Because p refers to the percent of people passing an item, the higher the p for an item, the easier the item.
• An index of the difficulty of the average test item for a particular test can be calculated by averaging the item-difficulty indices for all the test's items. This is accomplished by summing the item-difficulty indices for all test items and dividing by the total number of items on the test.
• In a true–false item, the probability of guessing correctly on the basis of chance alone is 1/2, or .50. Therefore, the optimal item difficulty is halfway between .50 and 1.00, or .75. In general, the midpoint representing the optimal item difficulty is obtained by summing the chance success proportion and 1.00 and then dividing the sum by 2: optimal p = (chance success proportion + 1.00) / 2.
• For a five-option multiple-choice item, the probability of guessing correctly on any one item on the basis of chance alone is equal to 1/5, or .20. The optimal item difficulty is therefore (.20 + 1.00) / 2 = .60.
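A sketch of these difficulty calculations with an invented 0/1 response matrix:

    # Invented 0/1 scoring: responses[i][p] = 1 if person p answered item i correctly
    responses = [
        [1, 1, 1, 0, 1, 1, 1, 1, 0, 1],   # item 1
        [1, 0, 1, 0, 1, 0, 0, 1, 0, 0],   # item 2
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],   # item 3
    ]

    p = [sum(item) / len(item) for item in responses]   # p1 = .8, p2 = .4, p3 = .2
    average_difficulty = sum(p) / len(p)                # about .47 for the whole test

    chance = 1 / 5                                      # five-option multiple choice
    optimal = (chance + 1.00) / 2                       # .60
    print(p, round(average_difficulty, 2), optimal)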
Item-discrimination index
The item-discrimination index, d, compares the proportion of high scorers (the upper, or U, group) with the proportion of low scorers (the lower, or L, group) who answered the item correctly.
The higher the value of d, the greater the number of high scorers answering the item correctly.
A negative d-value on a particular item is a red flag because it indicates that low-scoring examinees are more likely to answer the item correctly than high-scoring examinees. This situation calls for some action such as revising or eliminating the item.
The highest possible value of d is +1.00. This value indicates that all members of the U group answered the item correctly whereas all members of the L group answered the item incorrectly.
If the same proportion of members of the U and L groups pass the item, then the item is not discriminating between test takers at all, and d will be equal to 0.
The higher the value of d, the more adequately the item discriminates the higher-scoring from the lower-scoring test takers.
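A minimal sketch of d, assuming equal-sized upper and lower scoring groups (the group sizes and pass counts are invented):

    def discrimination_index(u_pass, l_pass, group_size):
        # d = (U - L) / n: U and L are the numbers of upper- and lower-group
        # members who passed the item; n is the size of each group
        return (u_pass - l_pass) / group_size

    print(discrimination_index(9, 3, 10))    # 0.6 -> discriminates well
    print(discrimination_index(2, 7, 10))    # -0.5 -> red flag: revise or eliminate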
Item reliability index
• The item-reliability index provides an indication of the internal consistency of a test (Figure 8–4); the higher this index, the greater the test's internal consistency.
• This index is equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score.

Factor analysis and inter-item consistency
• A statistical tool useful in determining whether items on a test appear to be measuring the same thing(s) is factor analysis.
• Through the judicious use of factor analysis, items that do not appear to be measuring what they were designed to measure can be revised or eliminated. If too many items appear to be tapping a particular area, the weakest of such items can be eliminated.

▪ Having balanced all these concerns, the test developer comes out of the revision stage with a better test.
▪ The next step is to administer the revised test under standardized conditions to a second appropriate sample of examinees.
▪ On the basis of an item analysis of data derived from this administration of the second draft of the test, the test developer may deem the test to be in its finished form.
▪ Once the test is in finished form, the test's norms may be developed from the data, and the test will be said to have been "standardized" on this (second) sample.