Cohen-Based Summary of Psychological Testing and Assessment
Standard Scores
Standard Score: a raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation
- Used for comparison
Z-score
Conversion of a raw score into a number indicating how many standard deviation units the raw score lies above or below the mean of the distribution.
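As a formula (standard notation, not spelled out in these notes): z = (X − mean) / SD. For example, a raw score of 60 on a test with mean 50 and SD 10 converts to z = (60 − 50) / 10 = +1.0, i.e., one standard deviation above the mean.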
CHAPTER 5: RELIABILITY

RELIABILITY
- Dependability and consistency in measurement
- Error implies that there will always be some inaccuracy in our measurements
- Tests that are relatively free of measurement error are deemed reliable
- Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research
- Reliability coefficient: an index that indicates the ratio between the true score variance on a test and the total variance
- HISTORY OF RELIABILITY:
  o Charles Spearman (1904): The Proof and Measurement of Association between Two Things
  o Then Thorndike
  o Item response theory has taken advantage of computer technology to advance psychological measurement significantly
  o Based on Spearman's ideas
- CLASSICAL TEST THEORY: X = T + E (observed score = true score + error)
  o Assumes that each person has a true score that would be obtained if there were no errors in measurement
  o The difference between the true score and the observed score results from measurement error
  o The assumption is that errors of measurement are random
  o Basic sampling theory tells us that the distribution of random errors is bell-shaped
    - The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors
  o Classical test theory assumes that the true score for an individual will not change with repeated applications of the same test
  o Variance: the standard deviation squared; useful because it can be broken into components:
    - True variance: variance from true differences, assumed to be stable
    - Error variance: variance from random, irrelevant sources
- Measurement error: refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured
  o Random error: a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
    - This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores
  o Systematic error: a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
    - Does not affect score consistency; the error is predictable and fixable
- Reliability: the proportion of the total variance attributed to true variance
  o The greater the proportion of total variance attributed to true variance, the more reliable the test
- Standard error of measurement: because we assume that the distribution of random errors is the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error
  o The standard error of measurement tells us, on average, how much a score varies from the true score
  o The standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement
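A minimal worked sketch of the last point (illustrative, not from the text): with observed-score standard deviation SD and reliability r, SEM = SD × √(1 − r).

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r): average amount an observed score varies from the true score."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical scale with SD = 15 and reliability = .89
print(round(standard_error_of_measurement(15, 0.89), 2))  # ~4.97 points
```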
SOURCES OF ERROR VARIANCE
- TEST CONSTRUCTION
  o Item sampling or content sampling: refers to variation among items within a test as well as to variation among items between tests
  o The extent to which a testtaker's score is affected by the content sampled on a test, and by the way the content is sampled (that is, the way in which the item is constructed), is a source of error variance
- TEST ADMINISTRATION
  o May influence the testtaker's attention or motivation
  o Environment variables, testtaker variables, examiner variables (level of professionalism)
- TEST SCORING AND INTERPRETATION
  o Computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences
  o However, other tools of assessment still require scoring by trained personnel
  o If subjectivity is involved in scoring, then the scorer can be a source of error variance
  o Despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally are still confronted by situations where an examinee's response lies in a gray area

TEST-RETEST RELIABILITY
- Also known as time-sampling reliability
- Obtained by correlating pairs of scores from the same group on two different administrations of the same test (a computational sketch follows this section)
- Appropriate for measuring something that is relatively stable over time
- Sources of error variance:
  o Passage of time: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
  o Coefficient of stability: the estimate obtained when the interval between testings is greater than 6 months
- Consider the possibility of a carryover effect: occurs when the first testing session influences scores from the second session
- If something affects all the testtakers equally, then the results are uniformly affected and no net error occurs
- Practice effects may occur; practice can also affect tests of manual dexterity
- The time interval between testing sessions must be selected and evaluated carefully
- Poor test-retest correlations do not always mean that a test is unreliable; they may instead suggest that the characteristic under study has changed

PARALLEL-FORMS OR ALTERNATE-FORMS RELIABILITY
- Compares two equivalent forms of a test that measure the same attribute
- The two forms should be equally constructed (format, etc.)
- When two forms of the test are available, one can compare performance on one form versus the other: equivalent-forms or parallel-forms reliability
- Coefficient of equivalence: the degree of relationship between various forms of a test, evaluated by means of an alternate-forms coefficient
- Parallel forms: for each form of the test, the means and variances of observed test scores are equal
- Alternate forms: different versions of a test that have been constructed so as to be parallel
- (1) Two test administrations with the same group are required
- (2) Test scores may be affected by factors such as motivation, etc.
- Problem: developing a new version of a test is burdensome
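A minimal computational sketch (hypothetical data, not from the text; `statistics.correlation` requires Python 3.10+). The same correlation applies to alternate-forms reliability, with the second form's scores in place of the retest scores.

```python
# Test-retest reliability: the Pearson correlation between two administrations
# of the same test to the same group.
from statistics import correlation

time1 = [12, 15, 9, 20, 17, 11, 14, 18]   # hypothetical scores, first administration
time2 = [13, 14, 10, 19, 18, 10, 15, 17]  # same examinees, second administration

# With an interval greater than 6 months, this estimate is the coefficient of stability.
print(round(correlation(time1, time2), 3))
```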
INTERNAL CONSISTENCY
- How well does each item measure the content or construct under consideration?
- How consistent the items are with one another
- Used when a test is administered once
- If all items on a test measure the same construct, then the test has good internal consistency
- Estimated via split-half reliability, the Kuder-Richardson formulas, or Cronbach's alpha

SPLIT-HALF RELIABILITY
- Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once
- Useful when it is impractical to assess reliability with two tests or to administer a test twice
- The results of one half of the test are compared with the results of the other
- Rules in splitting a test into halves:
  o Do not divide the test in the middle, because doing so would lower the reliability
  o Different amounts of anxiety and differences in item difficulty should also be considered
  o Randomly assign items to one or the other half of the test, or
  o Use the odd-even system, where one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items
- To correct for half-length, apply the Spearman-Brown formula, which estimates what the correlation between the two halves would have been if each half had been the length of the whole test (a computational sketch follows this section)
  o Also used if a test user wishes to shorten a test
  o Used to determine the number of items needed to attain a desired level of reliability
- Reliability increases as test length increases
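A minimal computational sketch (hypothetical data layout, not from the text): odd-even subscores are correlated, then corrected to full length with Spearman-Brown, r_SB = 2r / (1 + r).

```python
# Odd-even split-half reliability with the Spearman-Brown correction.
from statistics import correlation  # Python 3.10+

def split_half_reliability(item_scores: list[list[int]]) -> float:
    """item_scores[i] = list of item scores for examinee i (hypothetical layout)."""
    odd = [sum(person[0::2]) for person in item_scores]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in item_scores]  # items 2, 4, 6, ...
    r_halves = correlation(odd, even)
    return 2 * r_halves / (1 + r_halves)  # Spearman-Brown: estimate for the full-length test

scores = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0], [1, 0, 1, 1]]
print(round(split_half_reliability(scores), 3))
```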
KUDER-RICHARDSON FORMULAS (KR-20 AND KR-21)
- The Kuder-Richardson technique simultaneously considers all possible ways of splitting the items
- The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see p. 114)
- KR-21 uses an approximation of the sum of the pq products based on the mean test score

CRONBACH'S ALPHA
- Cronbach developed a formula that estimates the internal consistency of tests in which the items are not scored 0 or 1: a more general reliability estimate, which he called coefficient alpha
- Computed from the sum of the individual item variances (a computational sketch follows this section)
  o The most general method of finding estimates of reliability through internal consistency
- Domain sampling: define a domain that represents a single trait or characteristic; each item is an individual sample of this general characteristic
- Factor analysis deals with the situation in which a test apparently measures several different characteristics
  o Good for the process of test construction
- Alpha is the most widely used measure of reliability because it requires only one administration of the test
- Ranges from 0 to 1: "bigger is always better"
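A minimal computational sketch (hypothetical data layout, not from the text): alpha = k/(k−1) × (1 − Σ item variances / total-score variance). With 0/1 items, each item variance equals pq, so the same code yields KR-20.

```python
# Coefficient alpha (reduces to KR-20 for dichotomous items).
from statistics import pvariance

def coefficient_alpha(item_scores: list[list[float]]) -> float:
    """item_scores[i] = item responses for examinee i (hypothetical layout)."""
    k = len(item_scores[0])                                    # number of items
    item_vars = [pvariance([p[j] for p in item_scores]) for j in range(k)]
    total_var = pvariance([sum(p) for p in item_scores])       # variance of total scores
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

scores = [[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 1]]
print(round(coefficient_alpha(scores), 3))
```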
Other Methods of Estimating Internal Consistency
- Inter-item consistency: the degree of correlation among all the items on a scale
  o A measure of inter-item consistency is calculated from a single administration of a single form of a test
  o An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test
  o Tests are said to be homogeneous if they contain items that measure a single trait
- Homogeneity: the degree to which a test measures a single factor
- Heterogeneity: the degree to which a test measures different factors
  o Ex: a test that assesses knowledge only of 3-D television repair skills (homogeneous) vs. a general electronics repair test (heterogeneous)
- The more homogeneous a test is, the more inter-item consistency it can be expected to have
- Test homogeneity is desirable because it allows relatively straightforward test-score interpretation
  o Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested
  o Testtakers with the same score on a heterogeneous test may have quite different abilities
- However, a homogeneous test is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality

HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
- Homogeneous items yield a higher degree of reliability

DYNAMIC VS. STATIC CHARACTERISTICS
- Dynamic: a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
- Static: a trait, state, or ability that is relatively unchanging

RESTRICTION OR INFLATION OF RANGE
- If the range of scores is restricted, the reliability coefficient tends to be lower
- If the range is inflated, the reliability coefficient tends to be higher

SPEED TESTS VS. POWER TESTS
- Speed test: homogeneous, easy items, but a short time limit
- Power test: fewer but more complex items

CRITERION-REFERENCED TESTS
- Provide an indication of where a testtaker stands with respect to some variable or criterion
- Tend to contain material that has been mastered in hierarchical fashion
- Scores tend to be interpreted in pass-fail terms
- Traditional measures of reliability depend on the variability of the test scores: how different the scores are from one another

The Domain Sampling Model
- Considers the problems created by using a limited number of items to represent a larger and more complicated construct
- Our task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of true ability
- Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test to the variance of the long-run true score
- Reliability can be estimated from the correlation of the observed test score with the true score

Measures of Inter-Scorer Reliability
- In some types of tests under some conditions, the score may be more a function of the scorer than of anything else
- Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
- Coefficient of inter-scorer reliability: a coefficient of correlation used to determine the degree of consistency among scorers in the scoring of a test
- The kappa statistic is the best method for assessing the level of agreement among several observers (a computational sketch follows)
  o Indicates the actual agreement as a proportion of the potential agreement following correction for chance agreement
  o Cohen's kappa: 2 raters
  o Fleiss' kappa: 3 or more raters
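A minimal computational sketch of Cohen's kappa (hypothetical ratings, not from the text): kappa = (p_observed − p_chance) / (1 − p_chance).

```python
# Cohen's kappa: chance-corrected agreement between two raters.
from collections import Counter

def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
    n = len(rater1)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n   # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_chance = sum(c1[cat] * c2[cat] for cat in c1) / n**2    # agreement expected by chance
    return (p_obs - p_chance) / (1 - p_chance)

r1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
r2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(r1, r2), 3))  # 0.667
```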
Generalizability Theory
- Based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation
- Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation, or universe, leading to a specific test score
- This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration
- According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained
- Universe score: the test score obtained; analogous to a true score in the true score model
- Cronbach suggested that tests be developed with the aid of a generalizability study followed by a decision study
- Generalizability study: examines how generalizable scores from a particular test are if the test is administered in different situations
  o How much of an impact different facets of the universe have on the test score
  o Ex: is the test score affected by group as opposed to individual administration?
- Coefficients of generalizability: represent the influence of particular facets on the test score; similar to reliability coefficients in the true score model
- Decision study: developers examine the usefulness of test scores in helping the test user make decisions
- The decision study is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use
CHAPTER 6: VALIDITY
The Concept of Validity
- Validity: as applied to a test, a judgment or estimate of how well a test measures what it purports to measure in a particular context
  o A judgment based on evidence about the appropriateness of inferences drawn from test scores
  o The validity of a test must be shown from time to time to account for culture and advancement
- Inference: a logical result or deduction
- Validity of tests and test scores is characterized as "acceptable" or "weak"
- Validation: the process of gathering and evaluating evidence about validity
  o Test user and testtaker both have roles in the validation of a test
  o Test users may conduct their own validation studies: these may yield insights regarding a particular population of testtakers as compared to the norming sample (in the manual)
  o Local validation studies: absolutely necessary when a test user plans to alter in some way the format, instructions, language, or content of the test
- Types of validity (trinitarian view): not mutually exclusive; all contribute to a unified picture of a test's validity (critique: the approach is fragmented and incomplete)
  o Content validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test
  o Criterion-related validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
  o Construct validity: a measure of validity arrived at by executing a comprehensive analysis of (the umbrella validity: every other variety of validity falls under it):
    - How scores on the test relate to other test scores and measures
    - How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
- Strategies: ways of approaching the process of test validation
  o Content validation strategies
  o Criterion-related validation strategies
  o Construct validation strategies
- Face Validity
  o Face validity: relates more to what a test appears to measure to the person being tested than to what the test actually measures
  o A judgment concerning how relevant the test items appear to be, usually made by the testtaker, not the test user
  o Lack of face validity = lack of confidence in the perceived effectiveness of the test, which decreases the testtaker's motivation/cooperation; the test may still be useful
- Content Validity
  o Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
    - Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test
  o Test blueprint: the structure of the evaluation; a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
    - Behavior observation is a technique frequently used in test blueprinting
  o The quantification of content validity
    - Important in employment settings, where tests are used to hire and promote
    - One method gauges agreement among raters or judges regarding how essential a particular item is (C. H. Lawshe): "Is the skill or knowledge measured by this item: essential, useful but not essential, or not necessary to the performance of the job?"
    - Content validity ratio (CVR): CVR = (ne − N/2) / (N/2), where ne = number of panelists stating "essential" and N = total number of panelists
    - The CVR is calculated for each item (a worked sketch follows this section)
  o Culture and the relativity of content validity
    - Tests are thought of as either valid or invalid
    - What constitutes historical fact depends to some extent on who is writing the history
    - Cultural relativity
    - Politics (political correctness)
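A minimal worked sketch of the CVR (hypothetical panel data, not from the text):

```python
# Lawshe's content validity ratio: CVR = (n_e - N/2) / (N/2), computed per item.
def content_validity_ratio(ratings: list[str]) -> float:
    """ratings: one panelist judgment per entry ('essential', 'useful', 'not necessary')."""
    n = len(ratings)                                  # N, total number of panelists
    n_e = sum(r == "essential" for r in ratings)      # panelists rating the item essential
    return (n_e - n / 2) / (n / 2)

# 8 of 10 hypothetical panelists rate an item essential -> CVR = (8 - 5) / 5 = 0.6
print(content_validity_ratio(["essential"] * 8 + ["useful"] * 2))
```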
Criterion-Related Validity
- Criterion-related validity: a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest (the measure of interest being the criterion)
- 2 types:
  o Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)
  o Predictive validity: an index of the degree to which a test score predicts some criterion measure
- What Is a Criterion?
  o Criterion: a standard on which a judgment or decision may be based; the standard against which a test or a test score is evaluated (in criterion-related validity)
  o Characteristics of a criterion:
    - Relevant: pertinent or applicable to the matter at hand
    - Valid (for the purpose for which it is being used)
    - Uncontaminated. Criterion contamination: term applied to a criterion measure that has been based, at least in part, on predictor measures
- Concurrent Validity
  o Test scores are obtained at about the same time as the criterion measures; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity
  o Indicates the extent to which test scores may be used to estimate an individual's present standing on a criterion
  o Once the validity of the inference from the test scores is established, the test offers a faster, less expensive way to arrive at a diagnosis or a classification decision
  o The concurrent validity of a test can be explored with respect to another test
    - Prior research must have satisfactorily demonstrated the first test's validity
    - The first test then serves as the validating criterion
- Predictive Validity
  o Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place
    - Intervening event: training, experience, therapy, medication, etc.
    - Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test (how accurately scores on the test predict some criterion measure)
  o Ex: SAT score and freshman GPA
  o Judgments of criterion-related validity are based on 2 types of statistical evidence: the validity coefficient and expectancy data
    - Validity coefficient: a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure (a computational sketch follows)
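A minimal computational sketch (hypothetical data, not from the text): a validity coefficient is simply the correlation between test scores (the predictor) and criterion scores.

```python
# Predictive validity estimate: correlation between test scores and a later criterion.
from statistics import correlation  # Python 3.10+

test_scores = [510, 640, 580, 700, 450, 620, 550, 680]   # hypothetical entrance test
freshman_gpa = [2.6, 3.4, 3.0, 3.7, 2.3, 3.1, 2.9, 3.5]  # criterion obtained later

print(round(correlation(test_scores, freshman_gpa), 3))
```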
    - The validity coefficient (continued)
      o Ex: the Pearson correlation coefficient (r) is used to determine the validity between 2 measures
      o Affected by restriction or inflation of range
        - Is the range of scores employed appropriate to the objective of the correlational analysis?
      o There are no set rules regarding how high or low the validity coefficient should or could be for a test to be valid
    - Incremental validity
      o Arises when more than one predictor is used
      o Incremental validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
    - Expectancy data
      o Expectancy data: provide information that can be used in evaluating the criterion-related validity of a test
      o A score obtained on an expectancy test or table gives the likelihood that a testtaker will score within some interval of scores on a criterion measure ("passing", "acceptable", etc.)
      o Expectancy table: shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion
        - May be created from a scatterplot
        - Shows relationships
      o Expectancy chart: a graphic representation of an expectancy table
        - The higher the initial rating, the greater the probability of job or academic success
      o Taylor-Russell Tables: provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection
        - Selection ratio: the relationship between the number of people to be hired and the number of people available to be hired
        - Base rate: the percentage of people hired under the existing system for a particular position
        - The relationship between predictor and criterion must be linear
      o Naylor-Shine Tables: use the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
  o Decision theory and test utility
    - Base rate: the extent to which a particular trait, behavior, characteristic, or attribute exists in the population
    - Hit rate: the proportion of people a test accurately identifies as possessing or exhibiting a particular trait (a computational sketch follows this list)
    - Miss rate: the proportion of people the test fails to identify as having or not having a particular attribute
      o False positive (Type I error): the test predicts that the testtaker possesses the attribute when he or she actually does not. Ex: scored above the cutoff and was hired, but failed at the job
      o False negative (Type II error): the test predicts that the testtaker does not possess the attribute when he or she actually does. Ex: scored below the cutoff and was not hired, but would have been successful at the job
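A minimal computational sketch of these decision outcomes (hypothetical scores and cutoff, not from the text):

```python
# Classify selection outcomes against a cutoff, then compute the hit rate.
def selection_outcomes(scores, succeeded, cutoff):
    """scores: test scores; succeeded: whether each person actually has the attribute."""
    hits = false_pos = false_neg = 0
    for score, ok in zip(scores, succeeded):
        predicted = score >= cutoff           # test says "possesses the attribute"
        if predicted == ok:
            hits += 1                         # true positive or true negative
        elif predicted:
            false_pos += 1                    # predicted yes, actually no (Type I)
        else:
            false_neg += 1                    # predicted no, actually yes (Type II)
    return hits / len(scores), false_pos, false_neg

hit_rate, fp, fn = selection_outcomes(
    [82, 75, 60, 90, 55, 70], [True, False, False, True, True, False], cutoff=65)
print(hit_rate, fp, fn)  # 0.5, 2 false positives, 1 false negative
```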
- Construct Validity
  o Construct validity: a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct
    - Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior
      o Ex: intelligence, depression, motivation, personality, etc.
      o Unobservable, presupposed (underlying) traits that a test developer invokes to describe test behavior or criterion performance
    - Viewed as the unifying concept for all validity evidence
  o Evidence of Construct Validity
    - Various techniques of construct validation provide evidence that:
      o The test is homogeneous, measuring a single construct
      o Test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation (as theoretically predicted)
      o Test scores obtained after some event or the passage of time differ from pretest scores (as theoretically predicted)
      o Test scores obtained by people from distinct groups vary (as theoretically predicted)
      o Test scores correlate with scores on other tests (as theoretically predicted)
    - Evidence of homogeneity
      o Homogeneity: how uniform a test is in measuring a single concept
      o Evidence: correlations between subtest scores and total test scores
      o Item-analysis procedures have been used in the quest for test homogeneity
      o Desirable but not necessary; contributes no information about how the construct being measured relates to other constructs
    - Evidence of changes with age
      o If a test purports to measure a construct that changes over time, then the test scores, too, should show progressive changes to be considered a valid measure of the construct
      o Does not in itself provide information about how the construct relates to other constructs
    - Evidence of pretest-posttest changes
      o Can be evidence of construct validity
      o Some typical intervening experiences responsible for changes in test scores are formal education, therapy or medication, and other life experiences
    - Evidence from distinct groups / method of contrasted groups
      o Method of contrasted groups: one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group
      o Rationale: if a test is a valid measure of a particular construct, then groups of people presumed to differ with respect to that construct should have correspondingly different test scores
    - Convergent evidence
      o Evidence for the construct validity of a particular test may converge from a number of sources, such as other tests or measures designed to assess the same or a similar construct
      o Convergent evidence: scores on the test undergoing construct validation correlate highly, in the predicted direction, with scores on older, more established, already validated tests designed to measure the same or a similar construct
    - Discriminant evidence
      o Discriminant evidence: a validity coefficient showing little relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated
      o Provides evidence of construct validity
      o Multitrait-multimethod matrix: "two or more traits" and "two or more methods"; the matrix or table that results from correlating variables (traits) within and between methods
    - Factor analysis
      o Factor analysis: shorthand term for a class of mathematical procedures designed to identify factors, or specific variables that are typically attributes, characteristics, or dimensions on which people may differ
      o Frequently used as a data-reduction method in which several sets of scores and the correlations between them are analyzed
      o Exploratory factor analysis: procedures for estimating and extracting factors from the data
      o Confirmatory factor analysis: researchers test the degree to which a hypothetical factor model fits the actual data (note: the original notes attach this definition to exploratory factor analysis, but it describes the confirmatory variety)
        - Factor loading: conveys information about the extent to which the factor determines the test score or scores
        - Complex procedures
- Validity, Bias, and Fairness
  o Test Bias
    - Bias: a factor inherent in a test that systematically prevents accurate, impartial measurement
    - Technical means exist to identify and remedy bias (mathematically)
    - Bias implies systematic variation
    - Rating error
      o Rating: a numerical or verbal judgment (or both) that places a person or an attribute along a continuum identified by a scale of numerical or word descriptions, known as a rating scale
      o Rating error: a judgment resulting from the intentional or unintentional misuse of a rating scale
      o Leniency error (generosity error): an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading
      o Severity error: the rater's ratings tend systematically toward the harsh or negative end of the scale
      o Central tendency error: the rater exhibits a general and systematic reluctance to give ratings at either the positive or negative extreme (the original notes fold this definition into severity error; it is standardly a separate error)
      o One way to overcome restriction-of-range rating errors is to use rankings: a procedure that requires the rater to measure individuals against one another instead of against an absolute scale
        - The rater is forced to select a 1st, 2nd, 3rd, and so on
      o Halo effect: for some raters, some ratees can do no wrong
        - The tendency to give a particular ratee a higher rating than he or she objectively deserves
      o Criterion data may also be influenced by the rater's knowledge of the ratee's race, gender, etc.
  o Test Fairness
    - Issues of fairness tend to be more difficult to resolve and involve values
    - Fairness: the extent to which a test is used in an impartial, just, and equitable way
    - Sources of misunderstanding:
      o Discrimination
      o A group not included in the standardization sample
      o Performance differences between identified groups

Relationship Between Reliability and Validity
- A test should not correlate more highly with any other variable than it correlates with itself
- A modest correlation between the true scores on two traits may be missed if the test for each of the traits is not highly reliable
- We can have reliability without validity
  o It is impossible to demonstrate that an unreliable test is valid
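A standard psychometric bound that quantifies the first bullet above (a textbook result, not stated explicitly in these notes): the validity coefficient between test X and criterion Y cannot exceed the square root of the product of their reliabilities,

r_xy ≤ √(r_xx × r_yy)

For example, if r_xx = .81 and r_yy = .64, then r_xy can be at most √(.81 × .64) = .9 × .8 = .72, no matter how strongly the underlying traits are related.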
CHAPTER 7: UTILITY
Utility: the usefulness or practical value of testing to improve efficiency

Factors that Affect a Test's Utility
- Psychometric Soundness
  o The reliability and validity of a test
  o These give us the practical value of the scores (reliability and validity)
  o They tell us whether decisions are cost-effective
  o A valid test is not always a useful test, especially if testtakers do not follow test directions
- Costs
  o Economic and noneconomic
  o Ex.) using a less expensive and therefore less stringent application process for airline personnel
- Benefits
  o Profits, gains, advantages
  o Ex.) a more stringent hiring policy yields more productive employees
  o Ex.) maintaining a successful academic environment at a university

Cut Scores (reference points used to classify or select testtakers)
  o Relative cut score: set based on norm-related considerations rather than on the relationship of test scores to a criterion
    - Also called a norm-referenced cut score
    - Ex.) the top 10% of test scores get A's
  o Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification
    - Also called an absolute cut score
  o Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of categorizing testtakers
    - Ex.) having cut scores that mark an A, B, C, etc., all measuring the same predictor
  o Multiple hurdles: success requires an individual to complete many tasks, with elimination at each level
    - Ex.) written application, then group interview, then personal interview, etc.
  o Compensatory model of selection: the assumption is made that high scores on one attribute can compensate for low scores on another attribute (a sketch follows this section)
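A minimal sketch contrasting the last two selection models (hypothetical predictors, cutoffs, and weights, not from the text):

```python
# Two ways to combine predictors when making selection decisions.
def passes_multiple_cut_scores(scores: dict[str, float], cutoffs: dict[str, float]) -> bool:
    """Multiple cut scores / hurdles style: every predictor must clear its own cutoff."""
    return all(scores[k] >= cutoffs[k] for k in cutoffs)

def compensatory_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Compensatory model: a weighted sum lets a high score offset a low one."""
    return sum(weights[k] * scores[k] for k in weights)

applicant = {"interview": 55, "written_test": 90}
print(passes_multiple_cut_scores(applicant, {"interview": 60, "written_test": 70}))  # False
print(compensatory_score(applicant, {"interview": 0.5, "written_test": 0.5}))        # 72.5
```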
Advantages of Individual Tests
- Provide information beyond the test score
- Allow the examiner to observe behavior in a standard setting
- Allow individualized interpretation of test scores

Advantages of Group Tests
- Are cost-efficient
- Minimize professional time for administration and scoring
- Require less examiner skill and training
- Have more objective and more reliable scoring procedures
- Have especially broad application

Overview of Group Tests
Characteristics of Group Tests
- Characterized as paper-and-pencil or booklet-and-pencil tests, because the only materials needed are a printed booklet of test items, a test manual, a scoring key, an answer sheet, and a pencil
- Computerized group testing is becoming more popular
- Most group tests are multiple choice, though some are free response
- Group tests outnumber individual tests
  o One major difference among them is whether the test is primarily verbal, nonverbal, or a combination
- Group test scores can be converted to a variety of units
- Those who use the results of group tests must assume that the subject was cooperative and motivated
  o Many subjects are tested at a time
  o Subjects record their own responses
  o Subjects are not praised for responding
  o There are no safeguards
  o Low scores on group tests are often difficult to interpret

Selecting Group Tests
- The test user need never settle for anything but well-documented and psychometrically sound tests

Using Group Tests
- The best group tests are as reliable and well standardized as the best individual tests
- Validity data for some group tests are weak, meager, or contradictory

Use Results with Caution
- Never consider scores in isolation or as absolutes
- Be careful using tests for prediction
- Avoid overinterpreting test scores

Be Especially Suspicious of Low Scores
- Assume that subjects understand the purpose of testing, want to succeed, and are equally rested and free of stress

Consider Wide Discrepancies a Warning Signal
- May reflect emotional problems or severe stress

When in Doubt, Refer
- With low scores, discrepancies, etc., refer the subject to a trained professional for individual testing

Group Tests in the Schools: Kindergarten Through 12th Grade
- The purpose of these tests is to measure educational achievement in schoolchildren

Achievement Tests versus Aptitude Tests
- Achievement tests attempt to assess what a person has learned following a specific course of instruction
  o Evaluate the product of a course of training
  o Validity is determined primarily by content-related evidence
- Aptitude tests attempt to evaluate a student's potential for learning rather than how much a student has already learned
  o Evaluate the effects of unknown and uncontrolled experiences
  o Validity is judged primarily by the test's ability to predict future performance
- Intelligence tests measure general ability
- These three types of tests are highly interrelated

Group Achievement Tests
- Stanford Achievement Test: one of the oldest of the standardized achievement tests widely used in school systems
  o Well-normed and criterion-referenced, with psychometric documentation
- Another is the Metropolitan Achievement Test, which measures achievement in reading by evaluating vocabulary, word recognition, and reading comprehension
- Both of these are reliable and normed on large samples

Group Tests of Mental Abilities (Intelligence)
Kuhlmann-Anderson Test (KAT), 8th Edition
- The KAT is a group intelligence test with 8 separate levels covering kindergarten through 12th grade
- Items are primarily nonverbal at lower levels, requiring minimal reading and language ability
- Suited to young children and those who might be handicapped in following verbal procedures
- Scores can be expressed in verbal, quantitative, and total scores
- Scores at other levels can be expressed as percentile bands: like a confidence interval, a band provides the range of percentiles that most likely represents a subject's true score
- Good construction, standardization, and other psychometric qualities
- Good validity and reliability
- Potential for use and adaptation for non-English-speaking individuals, or even other countries, needs to be explored

Henmon-Nelson Test (H-NT)
- A test of mental abilities
- 2 sets of norms available: one based on raw-score distributions by age, the other on raw-score distributions by grade
- Reliabilities in the .90s
- Helps predict future academic success quickly
- Does NOT consider multiple intelligences

Cognitive Abilities Test (COGAT)
- Good reliability
- Provides three separate scores: verbal, quantitative, and nonverbal
- Item selection is superior to the H-NT in terms of selecting minority, culturally diverse, and economically disadvantaged children
- Can be adapted for use outside the US
- No cultural bias
- Each of the subtests requires 32-34 minutes of actual working time, which the manual recommends spreading out over 2-3 days
- Standard age scores averaged some 15 points lower for African American students on the verbal and quantitative batteries

Summary of K-12 Group Tests
- All are sound, viable instruments

College Entrance Tests
- SAT Reasoning Test, Cooperative School and College Ability Tests, and American College Test

SAT Reasoning Test
- Most widely used college entrance test
- Used by 1000+ private and public institutions
- Renorming of the SAT did not alter the standing of test takers relative to one another in terms of percentile rank
- New scoring (2400) is likely to reduce interpretation errors, as interpreters can no longer rely on comparisons with older versions
- 45 minutes longer: 3 hours and 45 minutes to administer
- May disadvantage students with disabilities such as ADD
- Verbal section now called "critical reading", with a focus on reading comprehension
- Math section eliminated much of the basic grammar-school math questions
- Weakness: poor predictive power regarding the grades of students who score in the middle ranges
- Little doubt that the SAT predicts first-year college GPA