Cohen Based Summary
CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT
TESTING AND ASSESSMENT
Roots can be found in early twentieth-century France.
1905: Alfred Binet published a test designed to help place Paris schoolchildren.
WWI: the military used tests to screen large numbers of recruits quickly for intellectual and emotional problems.
WWII: the military depended even more on tests to screen recruits for service.

PSYCHOLOGICAL TESTING vs. PSYCHOLOGICAL ASSESSMENT
DEFINITION
  Testing: process of measuring psychology-related variables by means of devices/procedures designed to obtain a sample of behavior.
  Assessment: gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished through the use of tools.
OBJECTIVE
  Testing: to obtain some gauge, usually numerical in nature.
  Assessment: to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation.
PROCESS
  Testing: may be individual or group.
  Assessment: typically individualized.
ROLE OF EVALUATOR
  Testing: the tester is not key to the process; may be substituted.
  Assessment: the assessor is key in the process of selecting tests as well as in drawing conclusions.
SKILL OF EVALUATOR
  Testing: requires technician-like skills.
  Assessment: typically requires an educated selection of tools and skill in evaluation.
OUTCOME
  Testing: typically yields a test score.
  Assessment: entails a logical problem-solving approach to answer the referral question.

3 FORMS OF ASSESSMENT:
1. COLLABORATIVE PSYCHOLOGICAL ASSESSMENT: assessor and assessee work as partners from initial contact through final feedback.
2. THERAPEUTIC PSYCHOLOGICAL ASSESSMENT: self-discovery and new understandings are encouraged throughout the assessment process.
3. DYNAMIC PSYCHOLOGICAL ASSESSMENT: follows a model of (a) evaluation, (b) intervention, (c) evaluation. Provides a means for evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation.

Tools of Psychological Assessment
A. The Test (a measuring device or procedure)
  1. psychological test: a device or procedure designed to measure variables related to psychology (intelligence, personality, aptitude, interests, attitudes, or values)
  2. format: refers to the form, plan, structure, arrangement, and layout of test items as well as to related considerations such as time limits.
    a) also refers to the form in which a test is administered (pen and paper, computer, etc.). Computers can generate scenarios.
    b) the term is also used to denote the form or structure of other evaluative tools and processes, such as the guidelines for creating a portfolio work sample
  3. Ways that tests differ from one another:
    a) administrative procedures
      (1) some tests require an active, knowledgeable test administrator
        (a) some test administration involves demonstration of tasks
        (b) usually one-on-one
        (c) trained observation of the assessee's performance
      (2) for some tests the administrator does not even have to be present
        (a) usually administered to larger groups
        (b) test takers complete tasks independently
    b) scoring and interpretation procedures
      (1) score: a code or summary statement, usually (but not necessarily) numerical in nature, that reflects an evaluation of performance on a test, task, interview, or some other sample of behavior
      (2) scoring: the process of assigning such evaluative codes/statements to performance on tests, tasks, interviews, or other behavior samples.
      (3) different types of score:
        (a) cut score: a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications.
          (i) sometimes reached without any formal method, e.g., teachers who "eyeball" what is passing and what is failing.
      (4) who scores it
        (a) self-scored by the testtaker
        (b) computer
        (c) trained examiner
    c) psychometric soundness/technical quality
      (1) psychometrics: the science of psychological measurement.
        (a) psychometric soundness refers to how consistently and how accurately a psychological test measures what it purports to measure.
      (2) utility: refers to the usefulness or practical value that a test or other tool of assessment has for a particular purpose.
B. The Interview: method of gathering information through direct communication involving reciprocal exchange
  1. in a face-to-face interview, the interviewer takes note of
    a) verbal language
    b) nonverbal language
      (1) body language and movements
      (2) facial expressions in response to the interviewer
      (3) the extent of eye contact
      (4) apparent willingness to cooperate
    c) how the interviewee is dressed
      (1) neat vs. sloppy vs. inappropriate
  2. in a phone interview, the interviewer takes note of
    a) changes in the interviewee's voice pitch
    b) long pauses
    c) signs of emotion in response
  3. ways that interviews differ:
    a) length, purpose, and nature
    b) may be conducted to help make diagnostic, treatment, selection, and other decisions
  4. panel interview
    a) an interview conducted with one interviewee and more than one interviewer
C. The Portfolio
  1. files of work products: paper, canvas, film, video, audio, etc.
  2. samples of one's abilities and accomplishments
D. Case History Data: records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
  1. sheds light on an individual's past and current adjustment as well as on events and circumstances that may have contributed to any changes in adjustment
  2. provides information about neuropsychological functioning prior to the occurrence of a trauma or other event that results in a deficit.
  3. gives insight into current academic and behavioral standing
  4. useful in making judgments about future class placements
  5. case history study: a report or illustrative account concerning a person or an event that was compiled on the basis of case history data
    a) might shed light on how one individual's personality and a particular set of environmental conditions combined to produce a successful world leader.
    b) work on groupthink, a social psychological phenomenon, contains rich case history material on collective decision making that did not always result in the best decisions.
E. Behavioral Observation: monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions.
  1. often used as a diagnostic aid in various settings: inpatient facilities, behavioral research laboratories, classrooms.
  2. naturalistic observation: behavioral observation that takes place in a naturally occurring setting (as opposed to a research laboratory) for the purpose of evaluation and information-gathering.
  3. in practice tends to be used most frequently by researchers in settings such as classrooms, clinics, prisons, etc.
F. Role-Play Tests
  1. role play: acting an improvised or partially improvised part in a simulated situation.
  2. role-play test: tool of assessment wherein assessees are directed to act as if they were in a particular situation. Assessees are then evaluated with regard to their expressed thoughts, behaviors, abilities, etc.
G. Computers as Tools
  1. local processing (LP): on-site computerized scoring, interpretation, or other conversion of raw test data; contrast with central processing and teleprocessing.
  2. central processing (CP): computerized scoring, interpretation, or other conversion of raw data that is physically transported from the same or other test sites; contrast with local processing and teleprocessing.
  3. teleprocessing: computerized scoring, interpretation, or other conversion of raw test data sent over telephone lines by modem from a test site to a central location for computer processing; contrast with central processing and local processing.
  4. simple score report: a type of scoring report that provides only a listing of scores
  5. extended scoring report: a type of scoring report that provides a listing of scores AND statistical data.
  6. interpretive report: a formal or official computer-generated account of test performance presented in both numeric and narrative form and including an explanation of the findings
    a) the three varieties of interpretive report are
      (1) descriptive
      (2) screening
      (3) consultative
    b) some contain relatively little interpretation and simply call attention to certain high, low, or unusual scores that need to be focused on.
    c) consultative report: a type of interpretive report designed to provide expert and detailed analysis of test data that mimics the work of an expert consultant.
    d) integrative report: a form of interpretive report of psychological assessment, usually computer-generated, in which data from behavioral, medical, administrative, and/or other sources are integrated
  7. CAPA: computer-assisted psychological assessment (the assistance is to the test user, not the test taker)
    a) enables test developers to create psychometrically sound tests using complex mathematical procedures and calculations.
    b) enables test users to construct tailor-made tests with built-in scoring and interpretive capabilities.
    c) Pros:
      (1) test administrators have greater access to potential test users because of the global reach of the internet.
      (2) scoring and interpretation of test data tend to be quicker than for paper-and-pencil tests
      (3) costs associated with internet testing tend to be lower than costs associated with paper-and-pencil tests
      (4) the internet facilitates the testing of otherwise isolated populations, as well as people with disabilities for whom getting to a test center might prove a hardship.
      (5) greener: conserves paper, shipping materials, etc.
    d) Cons:
      (1) test client integrity
        (a) refers to the verification of the identity of the test taker when a test is administered online
        (b) also refers to the sometimes varying interests of the test taker vs. those of the test administrator. The test taker might have access to notes, aids, internet resources, etc.
        (c) internet testing is only testing, not assessment
  8. CAT: computerized adaptive testing: an interactive, computer-administered test-taking process wherein items presented to the test taker are based in part on the test taker's performance on previous items
    a) EX: on a computerized test of academic abilities, the computer might be programmed to switch from testing math skills to English skills after three consecutive failures on math items.
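The CAT example above (switching from math items to English items after three consecutive failures) can be sketched in a few lines. This is only an illustration of that one switching rule, not how any published adaptive test works: the function name `administer`, the item bank layout, and the `answer_fn` callback are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def administer(
    item_bank: Dict[str, List[str]],
    answer_fn: Callable[[str, str], bool],
    max_consecutive_failures: int = 3,
) -> List[Tuple[str, str, bool]]:
    """Present items section by section (hypothetical helper).

    Moves on to the next section as soon as the test taker fails
    `max_consecutive_failures` items in a row, so later items depend
    on performance on earlier ones.
    """
    administered = []
    for section, items in item_bank.items():
        failures_in_a_row = 0
        for item in items:
            correct = answer_fn(section, item)
            administered.append((section, item, correct))
            # A correct answer resets the streak; a failure extends it.
            failures_in_a_row = 0 if correct else failures_in_a_row + 1
            if failures_in_a_row >= max_consecutive_failures:
                break  # switch to the next section
    return administered


# Made-up item bank and response pattern, for illustration only.
bank = {
    "math": ["m1", "m2", "m3", "m4", "m5", "m6"],
    "english": ["e1", "e2"],
}
passed_items = {"m1", "e1", "e2"}
session = administer(bank, lambda section, item: item in passed_items)
for row in session:
    print(row)
```

In this made-up session the test taker passes m1, fails m2 through m4, and is then switched to the English items, so m5 and m6 are never presented.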
H. Other Tools
  1. DVD: how would you respond to the events that take place in the video?
    a) sexual harassment in the workplace
    b) responding to various types of emergencies
    c) diagnosis/treatment plan for clients on videotape
  2. thermometers, biofeedback, etc.

TEST DEVELOPER
  The ones who create tests.
  They conceive, prepare, and develop tests. They also find a way to disseminate their tests, by publishing them either commercially or through professional publications such as books or periodicals.
TEST USER
  They select or decide to take a specific test off the shelf and use it for some purpose. They may also participate in other roles, e.g., as examiners or scorers.
TEST TAKER
  Anyone who is the subject of an assessment.
  Test takers may vary on a continuum with respect to numerous variables, including:
  o The amount of anxiety they experience and the degree to which test anxiety might affect the results
  o The extent to which they understand and agree with the rationale of the assessment
  o Their capacity and willingness to cooperate
  o Amount of physical pain/emotional distress they are experiencing
  o Amount of physical discomfort
  o Extent to which they are alert and wide awake
  o Extent to which they are predisposed to agreeing or disagreeing when presented with stimulus statements
  o The extent to which they have received prior coaching
  o The importance they attribute to portraying themselves in a good light
  Psychological autopsy: reconstruction of a deceased individual's psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee

TYPES OF SETTINGS
EDUCATIONAL SETTING
  o achievement test: evaluation of accomplishments or the degree of learning that has taken place, usually with regard to an academic area.
  o diagnosis: a description or conclusion reached on the basis of evidence and opinion through a process of distinguishing the nature of something and ruling out alternative conclusions.
  o diagnostic test: a tool used to make a diagnosis, usually to identify areas of deficit to be targeted for intervention
  o informal evaluation: a typically nonsystematic, relatively brief, and "off the record" assessment leading to the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial context and not subject to the same ethics or standards as evaluation by a professional
CLINICAL SETTING
  o tools of assessment are used to help screen for or diagnose behavior problems
  o group testing is used primarily for screening: identifying those individuals who require further diagnostic evaluation.
COUNSELING SETTING
  o schools, prisons, and governmental or privately owned institutions
  o ultimate objective: the improvement of the assessee in terms of adjustment, productivity, or some related variable.
GERIATRIC SETTING
  o quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support.
BUSINESS AND MILITARY SETTINGS
GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING

How are Assessments Conducted?
  protocol: the form, sheet, or booklet on which a testtaker's responses are entered.
  o the term might also be used to refer to a description of a set of test- or assessment-related procedures, as in the sentence, "the examiner dutifully followed the complete protocol for the stress interview"
  rapport: working relationship between the examiner and the examinee

ASSESSMENT OF PEOPLE WITH DISABILITIES
  Define who requires alternate assessment, how such assessments are to be conducted, and how meaningful inferences are to be drawn from the data derived from such assessments.
  Accommodation: adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs.
    Ex.: translating a test into Braille and administering it in that form.
  Alternate assessment: evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods
  Four variables to consider in deciding which of the many different types of accommodation should be employed:
  o The capabilities of the assessee
  o The purpose of the assessment
  o The meaning attached to test scores
  o The capabilities of the assessor

REFERENCE SOURCES
  TEST CATALOGUES: contain brief descriptions of the tests
  TEST MANUALS: detailed information
  REFERENCE VOLUMES: one-stop shopping; provide detailed information for each test listed, including test publisher, author, purpose, intended test population, and test administration time
  JOURNAL ARTICLES: contain reviews of the test
  ONLINE DATABASES: most widely used bibliographic databases

TYPES OF TESTS
  INDIVIDUAL TEST: given to only one person at a time
  GROUP TEST: administered to more than one person at a time by a single examiner
  ABILITY TESTS:
  o ACHIEVEMENT TESTS: refer to previous learning (ex. spelling)
  o APTITUDE/PROGNOSTIC TESTS: refer to the potential for learning or acquiring a specific skill
  o INTELLIGENCE TESTS: refer to a person's general potential to solve problems
  PERSONALITY TESTS: refer to overt and covert dispositions
  o OBJECTIVE/STRUCTURED TESTS: usually self-report; require the subject to choose between two or more alternative responses
  o PROJECTIVE/UNSTRUCTURED TESTS: the subject responds to an ambiguous stimulus (inkblot, photo, etc.) and is assumed to project his or her own needs, fears, hopes, and motivations onto it
  o INTEREST TESTS
CHAPTER 2: HISTORICAL, CULTURAL AND LEGAL/ETHICAL
A HISTORICAL PERSPECTIVE
ANTIQUITY TO THE 19TH CENTURY
  Tests and testing programs first came into being in China, as early as 2200 B.C.E.
  Testing was instituted as a means of selecting who, of many applicants, would obtain government jobs (civil service).
  Job applicants were tested on proficiency in endeavors such as music, archery, knowledge and skill, etc.
GRECO-ROMAN WRITINGS
  Deficiency in some bodily fluid believed to be a factor influencing personality
  Hippocrates and Galen
MIDDLE AGES
  "World of evilness"
RENAISSANCE
  Christian von Wolff: anticipated psychology as a science and psychological measurement as a specialty within that science
CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
  His work spurred interest in individual differences; tests were later designed to measure these individual differences in ability and personality among people.
  "On the Origin of Species": chance variation in species would be selected or rejected by nature according to adaptivity and survival value.
  "Survival of the fittest"
FRANCIS GALTON
  Explored and quantified individual differences between people.
  Classified people "according to their natural gifts"
  Displayed his anthropometric laboratory
KARL PEARSON
  Developed the product-moment correlation technique.
  His work can be traced directly to Galton's
WILHELM MAX WUNDT
  First experimental psychology laboratory, at the University of Leipzig
  Focused more on how people were similar to, not different from, each other.
JAMES MCKEEN CATTELL
  Individual differences in reaction time
  Coined the term "mental test"
CHARLES SPEARMAN
  Originated the concept of test reliability and built the mathematical framework for the statistical technique of factor analysis
VICTOR HENRI
  Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure higher mental processes
EMIL KRAEPELIN
  Early experimenter with the word association technique as a formal test
LIGHTNER WITMER
  "Little-known founder of clinical psychology"
  Founded the first psychological clinic in the U.S.
PSYCHE CATTELL
  Daughter of James McKeen Cattell
  Cattell Infant Intelligence Scale (CIIS) & Measurement of Intelligence in Infants and Young Children
RAYMOND CATTELL
  Believed in the lexical approach to defining personality, which examines human languages for descriptors of personality dimensions
20TH CENTURY
- Birth of the first formal tests of intelligence
- Testing shifted to be of more understandable relevance/meaning
A. THE MEASUREMENT OF INTELLIGENCE
  o Binet created the first intelligence test to identify intellectually disabled schoolchildren in Paris (individually administered)
  o The Binet-Simon test has been revised repeatedly
  o Group intelligence tests emerged with the need to screen the intellect of WWI recruits
  o David Wechsler: designed a test to measure adult intelligence
    For him, intelligence is the global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment.
    Wechsler-Bellevue Intelligence Scale
    Wechsler Adult Intelligence Scale (WAIS): revised several times, and the Wechsler tests extended the age range of testtakers from young children through senior adulthood.
B. THE MEASUREMENT OF PERSONALITY
  o The field of psychology was seen as too test oriented
  o Clinical psychology was synonymous with mental testing
  o ROBERT WOODWORTH: developed a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits
    To disguise the true purpose of the test, the questionnaire was labeled the Personal Data Sheet
    After the war he developed a version for civilian use, the Woodworth Psychoneurotic Inventory: the first widely used self-report test of personality
  o Self-report test:
    Advantages:
      Respondents are the best qualified to report on themselves
    Disadvantages:
      Poor insight into self
      One might honestly believe something about oneself that isn't true
      Unwillingness to report seemingly negative qualities
  o Projective test: the individual is assumed to project onto some ambiguous stimulus (inkblot, photo, etc.) his or her own unique needs, fears, hopes, and motivations
    Ex.) Rorschach inkblots
C. THE ACADEMIC AND APPLIED TRADITIONS

Culture and Assessment
Culture: "the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people"

Evolving Interest in Culture-Related Issues
  Goddard tested immigrants and found most to be feebleminded
    - invalid; the tests overestimated mental deficiency, even in native English speakers
  Led to the nature-nurture debate about what intelligence tests actually measure
  Needed to "isolate" the cultural variable
  Culture-specific tests: tests designed for use with people from one culture, but not from another
    - minorities still scored abnormally low
    ex.) loaf of bread vs. tortillas
  Today tests undergo many steps to ensure they are suitable for the population they are intended for
    - testtakers' reactions are taken into account

Some Issues Regarding Culture and Assessment
Verbal Communication
  o Examiner and examinee must speak the same language
  o Especially tricky when infrequently used vocabulary or unusual idioms are employed
  o A translator may lose nuances in translation or give unintentional hints toward the more desirable answer
  o Also requires understanding of the culture
Nonverbal Communication and Behavior
  o Differs between cultures
  o Ex.) meaning of not making eye contact
  o Body movement could even have a physical cause
  o Psychoanalysis: Freud's theory of personality and psychological treatment, which holds that symbolic significance is assigned to many nonverbal acts.
  o Timed tests are a problem in cultures not obsessed with speed
  o Lack of speaking could reflect reverence for elders
Standards of Evaluation
  o Acceptable roles for women differ across cultures
  o "judgments as to who might be the best employee, manager, or leader may differ as a function of culture, as might judgments regarding intelligence, wisdom, courage, and other psychological variables"
  o must ask "how appropriate are the norms or other standards that will be used to make this evaluation?"

Tests and Group Membership
  ex.) a 5'4" minimum height for police officers excludes groups with shorter average stature
  ex.) Jewish lifestyle not well suited for corporate America
  affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal opportunity in education and employment for all
  Psychology, tests, and public policy

Legal and Ethical Considerations
Code of professional ethics: defines the standard of care expected of members of a given profession.

The Concerns of the Public
  Beginning in World War I, fear that tests were only testing the ability to take tests
  Legislation
  o Minimum competency testing programs: formal testing programs designed to be used in decisions regarding various aspects of students' educations
  o Truth-in-testing legislation: state laws to provide testtakers with a means of learning the criteria by which they are being judged
  Litigation
  o The Daubert ruling made federal judges the gatekeepers in determining what expert testimony is admitted
  o This superseded the Frye standard, which admitted only scientific testimony that had won general acceptance in the scientific community.

The right to be informed of test findings
  o Formerly, test administrators were told to give participants only positive information
  o No realistic information was required
  o They were to tell test takers as little as possible about the nature of their performance on a particular test, so that the examinee would leave the test session feeling pleased and satisfied.
  o Test takers also have the right to know what recommendations are being made as a consequence of the test data
The right to privacy and confidentiality
  o Privacy right: "recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions"
  o Privileged information: information protected by law from being disclosed in a legal proceeding. Protects clients from disclosure in judicial proceedings. Privilege belongs to the client, not the psychologist.
  o Confidentiality: concerns matters of communication outside the courtroom
    Safekeeping of test data: it is not good policy to maintain all records in perpetuity
The right to the least stigmatizing label
  o The standards advise that the least stigmatizing labels should always be assigned when reporting test results.
Standard Scores
Standard Score: raw score that has been converted from one scale to another
scale, where the latter has arbitrarily set mean and standard deviation
-used for comparison
Z-score
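A z-score is the standard score with mean 0 and standard deviation 1: it expresses a raw score as its distance from the mean in standard deviation units, z = (X - mean) / SD. The sketch below is illustrative only. The function names and the eight raw scores are made up, and it assumes the standard deviation is computed over the whole group (population formula) rather than estimated from a sample.

```python
from statistics import mean, pstdev
from typing import List


def z_scores(raw_scores: List[float]) -> List[float]:
    """Convert raw scores to z-scores (mean 0, SD 1)."""
    m = mean(raw_scores)
    sd = pstdev(raw_scores)  # assumption: population SD of this group
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(x - m) / sd for x in raw_scores]


def rescale(z: float, new_mean: float, new_sd: float) -> float:
    """Move a z-score onto another scale with an arbitrarily set mean and SD."""
    return new_mean + new_sd * z


raw = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up raw scores: mean 5, SD 2
zs = z_scores(raw)
# Same scores on a scale with mean 50 and SD 10 (the T-score convention).
ts = [rescale(z, 50, 10) for z in zs]
print(zs)
print(ts)
```

Because every score is expressed relative to the same mean and standard deviation, scores from tests with different raw-score scales become directly comparable, which is the "used for comparison" point above.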
CHAPTER 4: OF TESTS AND
Some Assumptions About Psychological Testing and Assessment
- Assumption 1: Psychological Traits and States Exist
  o Trait: any distinguishable, relatively enduring way in which one individual varies from another
  o States: distinguish one person from another but are relatively less enduring
    The trait term that an observer applies, as well as the strength or magnitude of the trait presumed present, is based on observing a sample of behavior
  o Trait and state definitions also refer to individual variation: comparisons are made with respect to the hypothetical average person
  o Samples of behavior:
    Direct observation
    Analysis of self-report statements
    Paper-and-pencil test answers
  o Psychological trait covers a wide range of possible characteristics; ex:
    Intelligence
    Specific intellectual abilities
    Cognitive style
    Psychopathology
  o Controversy regarding how psychological traits exist
    Psychological traits exist only as constructs: informed, scientific concepts developed or constructed to describe or explain behavior
      We can't see, hear, or touch constructs; we infer their existence from overt behavior: an observable action or the product of an observable action, including test- or assessment-related responses
  o Traits are not expected to be manifested in behavior 100% of the time
    There seems to be rank-order stability in personality traits: relatively high correlations between trait scores at different time points
  o Whether and to what degree a trait manifests itself depends on the strength and nature of the situation
- Assumption 2: Psychological Traits and States Can Be Quantified and Measured
  o Once it is acknowledged that psychological traits and states exist, the specific traits and states to be measured need to be defined
    What types of behaviors are assumed to be indicative of the trait?
    The test developer has to provide test users with a clear operational definition of the construct under study
  o After the construct is defined, the test developer considers types of item content that would provide insight into it
    Ex: behaviors that are indicative of a particular trait
  o Should all questions be weighted the same?
    Weighting the comparative value of a test's items comes about as the result of a complex interplay among many factors:
      Technical considerations
      The way a construct has been defined (for a particular test)
      The value society (and the test developer) attaches to the behaviors evaluated
  o Need to find appropriate ways to score the test and interpret results
    Cumulative scoring: the test score is presumed to represent the strength of the targeted ability, trait, or state
      The more the testtaker responds in a particular direction (as keyed by the test manual), the higher the testtaker is presumed to be on the targeted trait or ability
- Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
  o The objective of a test is to provide some indication of some aspects of the examinee's behavior
    Tasks on some tests mimic the actual behaviors that the test user is attempting to understand
  o Obtained behavior is usually used to predict future behavior
  o Could also be used to postdict behavior, to aid in the understanding of behavior that has already taken place
  o Tools of assessment such as a diary or case history data might be of great value in such an evaluation
- Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
  o Competent test users understand a lot about the tests they use
    How the test was developed
    Circumstances under which it is appropriate to administer the test
    How the test should be administered and to whom
    How results should be interpreted
  o They understand and appreciate the limitations of the tests they use
- Assumption 5: Various Sources of Error Are Part of the Assessment Process
  o Everyday error = mistakes and miscalculations
  o Assessment error = a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
  o Error variance: component of a test score attributable to sources other than the trait or ability measured
    Assessees themselves are sources of error variance
  o Classical test theory (CTT)/true score theory: the assumption is made that each testtaker has a true score on a test that would be obtained but for the action of measurement error
- Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
  o Court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner
    Publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual
  o Fairness-related problems/questions:
    The testtaker's culture is different from that of the people for whom the test was intended
    Politics
- Assumption 7: Testing and Assessment Benefit Society
  o Many critical decisions are based on testing and assessment procedures

WHAT'S A "GOOD TEST"?
- Criteria
  o Clear instructions for administration, scoring, and interpretation
- Reliability
  o A "good test"/measuring tool is reliable
    Involves consistency: the precision with which the test measures and the extent to which error is present in measurements
    Unreliable measurement needs to be avoided
- Validity
  o A test is considered valid if it does indeed measure what it purports to measure
  o If there is controversy over the definition of a construct, then the validity is sure to be criticized as well
  o Questions regarding validity focus on the items that collectively make up the test
    Do the items adequately sample the range of areas needed to measure the construct?
    Individual items contribute to or take away from the test's validity
  o Validity may also be questioned on grounds related to the interpretation of test results
- Other Considerations
  o A "good test" is one that trained examiners can administer, score, and interpret with minimum difficulty
    Useful
    Yields actionable results that will ultimately benefit individual testtakers or society at large
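The cumulative scoring model described under Assumption 2 can be illustrated with a short sketch: each response that matches the keyed direction adds one point, and a higher total is read as more of the targeted trait. The function name, the item labels, and the scoring key below are hypothetical; a real key would come from the test manual, and this version assumes every item carries the same weight.

```python
from typing import Dict


def cumulative_score(responses: Dict[str, bool], key: Dict[str, bool]) -> int:
    """Count the responses given in the keyed direction (equal weights assumed)."""
    return sum(
        1 for item, answer in responses.items()
        if item in key and key[item] == answer
    )


# Hypothetical key: the response to each item that counts toward the trait.
scoring_key = {"q1": True, "q2": False, "q3": True, "q4": True}
# One made-up testtaker's true/false responses.
answers = {"q1": True, "q2": True, "q3": True, "q4": False}

print(cumulative_score(answers, scoring_key))  # q1 and q3 match the key -> 2
```

Unequal item weights, which the notes raise as an open question, would replace the constant 1 with a per-item weight.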
CHAPTER 4: OF TESTS AND
o When the purpose of a test is to compare the performance of a testtaker with the performance of other testtakers, a good test contains adequate norms (normative data)
Normative data provides standard with which results measured can be compared
NORMS
- Norm-referenced testing and assessment: method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker’s score and comparing it to scores of a group of testtakers
- Meaning of individual score is relative to other scores on the same test
- Norms (scholarly context): usual, average, normal, standard, expected or typical
- Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores
- Normative sample: group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers
o Yields a distribution of scores
- Norming: refers to the process of deriving norms; particular type of norm derivation
o Race norming: controversial practice of norming on the basis of race or ethnic background
- Norming a test can be very expensive; user norms/program norms: consist of descriptive statistics based on a group of testtakers in a given period of time rather than norms obtained by formal sampling methods
- Sampling to Develop Norms
- Standardization: process of administering a test to a representative sample of testtakers for the purpose of establishing norms
o Standardized when has clear, specified procedures
- Sampling
o Developer targets defined group as population test designed for
All have at least one common, observable characteristic
o To obtain distribution of scores:
Test administered to everyone in targeted population
Administer test to a sample of the population
Sample: portion of universe of people deemed to be representative of whole population
Sampling: process of selecting the portion of universe deemed to be representative of whole

o STANDARD ERROR OF THE DIFFERENCE – estimate how large a difference between two scores should be before the difference is considered statistically significant
- Developing norms for a standardized test
o Establish a standard set of instructions and conditions under which the test is given; this makes scores of the normative sample more comparable with scores of future testtakers
o Once all data are collected and analyzed, test developer will summarize data using descriptive statistics (measures of central tendency and variability)
Test developer needs to provide precise description of standardization sample itself
Descriptions of normative samples vary widely in detail
Tracking
- Comparisons are usually with people of the same age
- Children at the same age level tend to go through different growth patterns
- Pediatricians must know the child’s percentile within a given age group
- This tendency to stay at about the same level relative to one’s peers is known as tracking (ie height and weight)
- Diets may alter this “track”
- Faults: some believe there is an analogy between the rates of physical growth and the rates of intellectual growth
o Some say that children learn at different rates
o This system discriminates against some children

TYPES OF NORMS
o Classification of norms ex: age, grade, national, local, percentile, etc.
o PERCENTILES
Median= 2nd quartile: the point at or below which 50% of the scores fell and above which the remaining 50% fell
Might wish to divide distribution of scores into deciles (instead of quartiles): 10 equal parts
The Xth percentile is equal to the score at or below which X% of scores fall
Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score
Percentage correct: refers to the distribution of raw scores (number of items that were answered correctly) multiplied by 100 and divided by the total number of items *not same as percentile
Percentile is a converted score that refers to a percentage of testtakers
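A minimal sketch, not from the source, contrasting percentage correct with a percentile as the two are defined above. The 40-item test and the normative scores are made-up illustration values, and the function names are hypothetical.

```python
def percentage_correct(num_correct, num_items):
    """Items answered correctly, multiplied by 100 and divided by total items."""
    return num_correct * 100 / num_items


def percentile_rank(raw_score, norm_scores):
    """Percentage of people in the normative sample scoring below raw_score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    return below * 100 / len(norm_scores)


# Hypothetical normative sample: raw scores of 10 testtakers on a 40-item test.
norm_scores = [12, 15, 15, 18, 20, 22, 25, 27, 28, 30]
raw = 25

pct_correct = percentage_correct(raw, 40)      # refers to items
pct_rank = percentile_rank(raw, norm_scores)   # refers to testtakers
print(pct_correct, pct_rank)
```

With these made-up numbers, a raw score of 25 is 62.5% correct but falls at the 60th percentile: the first figure describes items, the second describes testtakers.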
o Subgroups within a defined population may differ with respect to some characteristics and it is sometimes essential to have these differences proportionately represented in sample
Stratified sampling: sample reflects statistics of whole population; helps prevent sampling bias and ultimately aids in interpretation of findings
Purposive sampling: arbitrarily select sample we believe to be representative of population
Incidental/convenience sampling: sample that is convenient or available for use
Very exclusive (contain exclusionary criteria)
- TYPES OF STANDARD ERROR:
o STANDARD ERROR OF MEASUREMENT – estimate the extent to which an observed score deviates from a true score
o STANDARD ERROR OF ESTIMATE – In regression, an estimate of the degree of error involved in predicting the value of one variable from another
o STANDARD ERROR OF THE MEAN – a measure of sampling error

Percentiles are an easily calculated, popular way of organizing test-related data
Using percentiles with normal distribution: real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle (worsens with highly skewed data)
o AGE NORMS
Age-equivalent scores/age norms: indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered
Age norm tables for physical characteristics
“Mental” age vs. physical age (need to identify mental age)
o GRADE NORMS
Grade norms: designed to indicate the average test performance of testtakers in a given school grade
Developed by administering the test to representative samples of children over a range of consecutive grades
Mean or median score for children at each grade level is calculated
Great intuitive appeal
Do not provide info as to the content or type of items that a student could or could not answer correctly
Developmental norms: (ex: grade norms and age norms) term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life
o NATIONAL NORMS
National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted
o NATIONAL ANCHOR NORMS
Many different tests purporting to measure the same human characteristics or abilities
National anchor norms: equivalency tables for scores on tests that purport to measure the same thing
Could provide the tool for comparisons
Provides stability to test scores by anchoring them to other test scores
Begins with the computation of percentile norms for each test to be compared
Equipercentile method: equivalency of scores on different tests is calculated with reference to corresponding percentile scores
o SUBGROUP NORMS
Normative sample can be segmented by any criteria initially used in selecting subjects for sample
Subgroup norms: result of segmentation; more narrowly defined
o LOCAL NORMS
Local norms: provide normative info with respect to the local population’s performance on some test

CORRELATION
Degree and direction of correspondence between two things.
Correlation coefficient (r) – expresses a linear relationship between two continuous variables
o Numerical index that tells us the extent to which X and Y are “co-related”
Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X
Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa
No correlation: the variables are not related
Ranges from -1 to 1
Correlation does not imply causation.
o Ie weight, height, intelligence

PEARSON r
Pearson Product Moment Correlation Coefficient
Devised by Karl Pearson
Used when the relationship between the two variables is linear and both variables are continuous
Coefficient of Determination (r²) – indication of how much variance is shared by the X and the Y variables
SPEARMAN RHO
Rank order correlation coefficient
Developed by Charles Spearman
Used when the sample size is small and when both sets of measurements are in ordinal form (ranking form)
BISERIAL CORRELATION
expresses the relationship between a continuous variable and an artificial dichotomous variable
o If the dichotomous variable had been true then we would use the point biserial correlation
o When both variables are dichotomous and at least one of the dichotomies is true, then the association between them can be estimated using the phi coefficient
o If both dichotomous variables are artificial, we might use a special correlation coefficient – tetrachoric correlation
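A minimal sketch, not from the source, of the correlation coefficients named in these notes. The data are made up, the function names are hypothetical, and only the standard library is used. It computes Pearson r, the coefficient of determination, and Spearman rho as Pearson r on ranks (ties share the average rank). It also computes a point-biserial value, using the standard fact that it equals Pearson r when the true dichotomy is coded 0/1.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product moment correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return ss_xy / sqrt(ss_x * ss_y)


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Rank order correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Made-up scores for five testtakers.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
passed = [0, 0, 1, 1, 1]  # hypothetical true dichotomy (fail = 0, pass = 1)

r = pearson_r(x, y)
r_squared = r ** 2            # proportion of variance shared by X and Y
rho = spearman_rho(x, y)
r_pb = pearson_r(passed, y)   # point-biserial via 0/1 coding
print(round(r, 3), round(r_squared, 3), round(rho, 3), round(r_pb, 3))
```

Biserial, phi, and tetrachoric coefficients are left out of the sketch because the notes do not give their formulas.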
Typically developed by test users themselves
- Fixed Reference Group Scoring Systems
o Norms provide context for interpreting meaning of a test score
o Fixed reference group scoring system: distribution of scores obtained on the test from one group of testtakers (fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test
Ex: SAT (first administered in 1926)
NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
- Way to derive meaning from test score is to evaluate test score in relation to other scores on same test (Norm-referenced)
- Criterion-referenced: derive meaning from a test score by evaluating it on the basis of whether or not some criterion has been met
o Criterion: a standard on which a judgment or decision may be based
- Criterion-referenced testing and assessment: method of evaluation and way of deriving meaning from test scores by evaluating an individual’s score with reference to a set standard (ex: to drive must pass driving test)
o Derives from values and standards of an individual or organization
o Also called Domain/content-referenced testing and assessment
o Critique: if followed strictly, important info about individual’s performance relative to others can be potentially lost
Culture and Inference
- Culture is a factor in test administration, scoring and interpretation
- Test user should do research in advance on test’s available norms to check how appropriate it is for targeted testtaker population
o Helpful to know about the culture of the testtaker

CORRELATION AND INFERENCE
REGRESSION
analysis of relationships among variables for the purpose of understanding how one variable may predict another
SIMPLE REGRESSION: one IV (X) and one DV (Y)
- Regression line: defined as the best-fitting straight line through a set of points in a scatter diagram
o Found by using the principle of least squares, which minimizes the squared deviation around the regression line
Primary use: To predict one score or variable from another
Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the SEE.
MULTIPLE REGRESSION: The use of more than one score to predict Y.
Regression coefficient: (b) slope of the regression line
o Ratio of the sum of squares for the covariance to the sum of squares for X
o Sum of squares is defined as the sum of the squared deviations around the mean
o Covariance is used to express how much two measures covary, or vary together
Slope describes how much change is expected in Y each time X increases by one unit
Intercept (a) is the value of Y when X is 0
o The point at which the regression line crosses the Y axis
THE BEST-FITTING LINE
The difference between the observed and predicted score (Y-Y’) is called the residual
The best-fitting line is most appropriately found by squaring each residual
Best-fitting line is obtained by keeping these squared residuals as small as possible
o Principle of least squares
Correlation is a special case of regression in which the scores for both variables are in standardized, or Z, units
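A minimal sketch, not from the source, of the simple regression ideas in these notes: the slope b as the sum of squares for the covariance over the sum of squares for X, the intercept a, residuals (Y-Y’), and the sum of squared residuals that least squares keeps as small as possible. The data and function names are made up. The last lines refit the line on Z scores to illustrate that the slope then equals the correlation coefficient.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept a and slope b for Y' = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = ss_xy / ss_x            # slope: change in Y per one-unit change in X
    a = mean_y - b * mean_x     # intercept: value of Y' when X is 0
    return a, b


def z_scores(values):
    """Standardize to mean 0 and standard deviation 1 (population SD)."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Made-up scores for five testtakers.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
predicted = [a + b * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # Y - Y'
sum_sq_residuals = sum(e ** 2 for e in residuals)

# Same fit in Z units: the slope becomes r and the intercept is (numerically) 0.
a_z, b_z = fit_line(z_scores(x), z_scores(y))
print(round(a, 3), round(b, 3), round(sum_sq_residuals, 3), round(b_z, 3))
```

Any other straight line through these five points would give a larger sum of squared residuals than the fitted one.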
In correlation, the intercept is always 0
Pearson product moment correlation coefficient is a ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable
Testing the Statistical Significance of a Correlation Coefficient
- Begin with the null hypothesis that there is no relationship between variables
- Null hypothesis rejected if there is evidence that the association between two variables is significantly different from 0
- t distribution is not a single distribution, but a family of distributions, each with its own degrees of freedom
- Degrees of freedom are defined as the sample size minus 2, or N-2
- Two-tailed test

How to Interpret a Regression Plot
- Regression plots are pictures that show the relationship between variables
- Common use of correlation is to determine the criterion validity evidence for a test, or the relationship between a test score and some well-defined criterion
- With no test info, the best guess is the middle level of enjoyableness because it is the one observed most frequently – normative because it uses info gained from representative groups
- Using the test as a predictor is not as good as perfect prediction, but it is still better than using the normative info
- A regression line such as in 3.9 shows that the test score tells us nothing about the criterion beyond the normative info

- Third variable, ie poor social adjustment, causes both TV viewing and aggression
- External influence is the third variable
Restricted Range
- Correlation and regression use variability on one variable to explain variability on a second variable
- Restricted range problem: correlation requires variability; if the variability is restricted, then significant correlations are difficult to find
Multivariate Analysis
- Multivariate analysis considers the relationship among combinations of three or more variables
General Approach
- Linear combination of variables is a weighted composite of the original variables
- Y’ = a + b1X1 + … + bkXk
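A minimal sketch, not from the source, of the linear combination Y’ = a + b1X1 + … + bkXk from the General Approach notes: a predicted score built as a weighted composite of several predictors. The intercept, weights, and predictor scores are made-up illustration values, and the function name is hypothetical. In practice the weights would be estimated from data rather than chosen by hand.

```python
def predict_composite(intercept, weights, predictors):
    """Return Y' = a + b1*X1 + ... + bk*Xk for one testtaker."""
    if len(weights) != len(predictors):
        raise ValueError("need exactly one weight per predictor")
    return intercept + sum(b * x for b, x in zip(weights, predictors))


# Hypothetical model with k = 2 predictors.
a = 1.0                 # intercept
b = [0.5, 2.0]          # weights b1, b2
scores = [4, 3]         # this testtaker's X1, X2

y_predicted = predict_composite(a, b, scores)
print(y_predicted)
```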
All share interactionism: complex concept by which heredity and environment are presumed to interact and influence the development of one’s intelligence
Factor-analytic theories: focus is squarely on identifying the ability(ies) deemed to constitute intelligence
Information-processing theories: focus is on identifying the specific mental processes that constitute intelligence.

Factor-Analytic Theories of Intelligence:
Charles Spearman: pioneered new techniques to measure intercorrelations between tests.
o Existence of a general intellectual ability factor (g) that is tapped by all other mental abilities.
g representing the portion of the variance that all intelligence tests have in common and the remaining portions of the variance being accounted for either by specific components (s) or by error components (e)
The greater the g in a test, the better the test was thought to predict overall intelligence

Measuring Intelligence

Types of Tasks Used in Intelligence Test
Infants: test sensorimotor, interviews with parents
Older child: verbal and performance abilities
Mental Age: index that refers to chronological age equivalent to one’s test performance
Adults: retention of general information, quantitative reasoning, expressive language and memory, and social judgment
Theory in Intelligence Test Development and Interpretation
Wechsler made a dichotomous test (Performance and Verbal), but advocated multifaceted definition
Thorndike: intelligence = social, concrete, abstract
Putting theories into tests is extremely hard

Intelligence: Some Issues:
Nature vs. Nurture
Currently believed to be mix of two
CHAPTER 9: INTELLIGENCE AND ITS
Preformationism: all structures, including intelligence, are preformed at birth and can’t be improved upon
Led to predeterminism: one’s abilities are predetermined by genetic
inheritance and no learning or intervention can enhance it
Interactionist: people inherit certain intellectual potential
o There’s a limit to genetic abilities (i.e. can’t ever have x-ray vision)
The Stability of Intelligence
Stable pretty much throughout one’s adult life
Cognitive abilities seem to decline with age
The Construct Validity of Tests of Intelligence
Having construct validity requires having unified understanding of
what intelligence is
Very difficult. Spearman says it’s one thing, Guilford says it’s many
Thorndike approach is sort of compromise
o Look for one central factor with three additional factors
representing social, concrete, and abstract intelligences
Other Issues
Flynn effect: IQ scores seem to rise every year, but not coupled with
rise in “true intelligence”
Personality
o High IQ: Need for achievement, competition, curiosity,
confidence, emotional stability etc.
o Low IQ: passivity, dependence, maladjustment
o Temperament (used to describe infants)
Gender
o Men usually outscore in visual spatialization tasks and
intelligence scores
o Women tend to outscore in language-skill tasks
o But differences can be bridged
Family Environment
o Divorce can have negative effects
o Begins with “maternal effects” in womb
Culture
o Provides specific models for thinking, acting and feeling
o Assumed that if cultural factors can be controlled then
differences between cultural groups will be lessened
o Assumed that culture can be removed by the reliance on
exclusively nonverbal tasks
Tend not to be very good at predicting success
in various academic and business settings
o Culture loading: the extent to which a test
incorporates the vocabulary, concepts, traditions,
knowledge and feelings associated with a particular
culture
o No test can be culture free
o Culture-fair intelligence test: test/assessment process
designed to minimize the influence of culture with regard
to various aspects of evaluation procedure
o Another approach called for culture-specific intelligence tests
Ex.) BITCH (Black Intelligence Test of Cultural Homogeneity) measured streetwiseness
Lacked predictive validity and useful, practical information
CHAPTER 10: TESTS OF
Other Measures of Intelligence
Tests Designed for Individual Administration
Kaufman Adolescent and Adult Intelligence Test
Kaufman Brief Intelligence Test
Kaufman Assessment Battery for Children
Away from information processing and towards a distinction between sequential and simultaneous processing
Tests Designed for Group Administration
Group Testing in the Military
o WWI need for government to test intelligence as means of differentiating “unfit” and “exceptionally superior ability”

The Stanford-Binet Intelligence Scales
First to have detailed administration and scoring instructions
First American test to test IQ
First to use alternate items (an item that can be used in place of another)
Lacked minority group representation
Ratio IQ = (mental age / chronological age) x 100
Deviation IQ/test composite: performance of one individual compared to the performance of others of the same age. Has mean of 100 and standard deviation of 16
Age scale: items grouped by age
Point scale: items organized by category
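A minimal sketch, not from the source, of the two IQ definitions given for the Stanford-Binet in these notes. The ratio IQ follows the formula as written. The deviation IQ is sketched under the assumption that it is obtained by standardizing a score against same-age peers and rescaling to mean 100 and standard deviation 16 (the Wechsler tests in these notes use 15). The function names and example numbers are hypothetical.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ = (mental age / chronological age) x 100."""
    return mental_age / chronological_age * 100


def deviation_iq(raw_score, age_group_mean, age_group_sd, iq_mean=100, iq_sd=16):
    """Standing relative to same-age peers, rescaled to the IQ metric."""
    z = (raw_score - age_group_mean) / age_group_sd
    return iq_mean + iq_sd * z


# Hypothetical child: mental age 10, chronological age 8.
example_ratio = ratio_iq(10, 8)

# Hypothetical testtaker scoring one SD above the mean of the same-age group.
example_sb = deviation_iq(60, age_group_mean=50, age_group_sd=10)
example_wechsler = deviation_iq(60, age_group_mean=50, age_group_sd=10, iq_sd=15)
print(example_ratio, example_sb, example_wechsler)
```

The same relative standing (one standard deviation above same-age peers) maps to 116 on a scale with SD 16 and to 115 on a scale with SD 15, which is why scores from the two scales are not directly interchangeable.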
The Stanford-Binet Intelligence Scales: Fifth Edition o Army Alpha Test: to army recruits who could read.
Measures fluid intelligence, crystallized knowledge, quantitative Included general information questions, analogies, and
knowledge, visual-processing, and short-term (working) memory scrambled sentences to reassemble
Utilizes adaptive testing: testing individually tailored to testtakers o Army Beta Test: to foreign or illiterate recruits,
to ensure that items are neither too difficult (frustrating) nor too included mazes, coding, and picture completion.
easy (false hope) o After the war, the alpha and beta test were used
Examiner establishes rapport with testtaker, then administers rampantly, and oftentimes misused
routing test to direct, route examinee to test items most likely at o Screening tools: instrument or procedure used to
optimal level of difficulty identify a particular trait or constellation of traits
Teaching items: show testtaker what is expected, how to do it. o ASVAB (Armed Services Vocational Aptitude Battery):
o Can be used for qualitative assessment, but not scoring administered to prospective recruits or high school
Subtests for verbal and nonverbal tests share same name, but students looking for career guidance
involve different tasks 5 career areas: clerical, electronics,
Floor: lowest level of items on subtest mechanical, skill-technical, and combat
Ceiling: highest-level item of subtest operations
Basal level: base-level criterion that must be met for testing on Group Testing in Schools
the subtest to continue o Useful in developing child’s profile- but cannot be sole
Ceiling level is met when testtaker fails certain number of items in indicator
a row. Test discontinues here. o Groups of 10-15
Scores: raw standard composite o Starting in Kindergarten
Extra-test behavior: behavioral observation o Also called traditional group testing, because more
The Wechsler Tests modern forms can utilize computer. These more aptly
-commonality between all versions: all yield deviation IQ’s with mean of 100 called individual testing
and standard deviation of 15 Measures of Specific Intellectual Abilities
Widely used intelligence tests only test a sampling of the many attributable factors aiding in intelligence
Ex.) Creativity
o Commonly thought to be composed of originality, fluency, flexibility, and elaboration
o If the focus is too heavily on whether an answer is correct, doesn’t allow for creativity
o Achievement tests require convergent thinking: deductive reasoning process that entails recall and consideration of facts as well as a series of logical judgments to narrow down solutions and eventually arrive at one solution
o Divergent thinking: a reasoning process in which thought is free to move in many different directions, making several solutions possible
Associated words, uses of rubber band etc.
Test-retest reliability for some of these tests is near unacceptable

Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV)
Core subtest: administered to obtain a composite score
Supplemental/Optional Subtest: provides additional clinical information or extends the number of abilities or processes sampled.
Yields four index scores: a Verbal Comprehension Index, a Working Memory Index, a Perceptual Reasoning Index, and a Processing Speed Index
The Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV)
Process score: index designed to help understand how testtakers process various kinds of information
WISC-IV compared to the SB5
The Wechsler Preschool and Primary Scale of Intelligence-Third Edition (WPPSI-III)
New scale for children under 6
First major intelligence test which adequately sampled total population of the United States
Subtests labeled core, supplemental, or optional
Wechsler, Binet, and the Short Form
Short form: test that has been abbreviated in length to reduce
time needed to administer, score and interpret
used with caution, only for screening
provide only estimates
reducing the number of items usually reduces reliability and thus
validity
Wechsler Abbreviated Scale of Intelligence
The Wechsler Test in Perspective
Factor Analysis
o Exploratory factor analysis: summarizing data when
we are not sure how many factors are present in
our data
o Confirmatory factor analysis: used to test highly specific hypotheses
CHAP.11: Other Individual Tests of Ability in Education and Special
Education Bayley Scales of Infant and Toddler Development – Third Edition (BSID-III)
- Base assessments on normative maturational developmental data
Alternative Individual Ability Tests Compared with the Binet and Wechsler - Designed for infants between 1 and 42mths
Scales - Assesses development across 5 domains: cognitive, language,
- None of these are clearly superior from a psychometric motor, socioemotional, and adaptive
standpoint - Motor scale: assumes that later mental functions depend on
- Some less stable, most more limited in their documented validity motor development
- Compare poorly to Binet and Wechsler on all accounts - Excellent standardization
- They don't rely on a verbal response as much as the B and W - Generally positive reviews
- Just use pointing or Yes/No responses, thus do not depend on the - Strong internal consistency
complex integration of visual and motor functioning - More validity studies needed
- Contain a performance scale or subscale - Widely used in research – children with Down syndrome,
- Their specificity often limits the range of functions or abilities that pervasive developmental disorders, cerebral palsy, language
they can measure impairment, etc
- Because they are designed for special populations, some - Most psychometrically sound test of its kind
alternatives can be administered totally without the - Predictive though?
verbal instructions Cattell Infant Intelligence Scale (CIIS)
- Based on normative developmental data
Specific Individual Ability Tests - Downward extension of Stanford-Binet scale for 2-30mth olds
- Earliest individual tests typically designed for specific purposes or - Similar to Gesell scale
populations - Rarely used today
- One of the first – Seguin Form Board Test – in 1800s – produced - Sample is primarily based on children of parents from lower and
only a single score middle classes and therefore does not represent the general
o Used primarily to evaluate mentally retarded population
adults and emphasized speed and performance - Unchanged for 60yrs
- After, the Healy-Fernald Test was developed as an exclusively - Psychometrically unsatisfactory
nonverbal test for adolescent delinquents
- Knox developed a battery of performance tests for non-English Major Tests for Young Children
adult immigrants to the US – administered without language; McCarthy Scales of Children’s Abilities (MSCA)
speed not emphasized - Measure ability in children between 2-8yrs
- These early individual tests designed for specific populations, - Present a carefully constructed individual test of human ability
produced a single score, and had nonverbal performance - Meager validity
scales - Produces a pattern of scores as well as a variety of composite
- Could be administered without verbal instructions and used with scores
children as well as adults - General cognitive index (GCI): standard score with a mean of 100
Infant Scales and a standard deviation of 16
- Where mental retardation or developmental delays are o Index reflects how well the child has integrated prior
suspected, these tests can supplement observation, genetic learning experiences and adapted them to the
testing, and other medical procedures demands of the scales
Brazelton Neonatal Assessment Scale (BNAS) - Relatively good psychometric properties
- Individual test for infants between 3days and 4weeks - Reliability coefficients in the low .90s
- Purportedly provides an index of a newborn’s competence - In research studies
- Favorable reviews - Good validity? Good assessment tool
- Considerable research base Kaufman Assessment Battery for Children - Second Edition (KABC-II)
- Wide use as a research tool and as a diagnostic tool for special - Individual ability test for children between 3-18yrs
purposes - 18 subtests in 5 global scales called sequential processing,
- Commonly used scale for the assessment of neonates simultaneous processing, learning, planning, and knowledge
- Drawbacks: - Intended for psychological, clinical, minority-group, preschool,
o No norms are available and neuropsychological assessment as well as research
o More research is needed concerning the meaning and - Sequential-simultaneous distinction
implication of scores o Sequential processing refers to a child’s ability to solve
o Poorly documented predictive and construct validity problems by mentally arranging input in sequential or
o Test-retest reliability leaves much to be desired serial order
Gesell Developmental Schedules (GDS) o Simultaneous processing refers to a child’s ability to
- Infant intelligence measures synthesize info from mental wholes in order to solve a
- Used as a research tool by those interested in assessing infant problem
intellectual development after exposure to mercury, diagnoses of - Nonverbal measure of ability too
abnormal brain formation in utero and assessing infants with - Well constructed and psychometrically sound
autism - Not much evidence of (good) validity
- Children of 2.3mth to 6.3yrs - Poorer predictive validity for school achievement – smaller
- Obtains normative data concerning various stages in maturation differences between whites and minorities
- Individual’s developmental quotient (DQ) is determined - Test suffers from a noncorrespondence between its definition and
according to a test score, which is evaluated by assessing the its measurement of intelligence
presence or absence of behavior associated with maturation
- Provides an intelligence quotient like that of the Binet General Individual Ability Tests for Handicapped and Special Populations
o DQ = (developmental age / chronological age) x 100 Columbia Mental Maturity Scale – Third Edition (CMMS)
- But, falls short of acceptable psychometric standards - Purports to evaluate ability in normal and variously handicapped
- Standardization sample not representative of the population children from 3-12yrs
- No reliability or validity - Requires neither a verbal response nor fine motor skills
- Does appear to help uncover subtle deficits in infants
- Requires subject to discriminate similarities and differences by Illinois Test of Psycholinguistic Abilities (ITPA-3)
indicating which drawing does not belong on a 6-by-9inch card - Assumes that failure to respond correctly to a stimulus can result
containing 3-5 drawings not only from a defective output system but also from a defective
- Multiple choice input or information-processing system
- Standardization sample is impressive - Stage 1: info must first be received by the senses before it can be
- Vulnerable to random error analyzed
- Reliable instrument that is useful in assessing ability in many - Stage 2: info is analyzed or processed
people with sensory, physical, or language handicaps - Stage 3: with processed info, individual must make a response
- Good screening device - Theorizes that the child may be impaired in one or more specific
Peabody Picture Vocabulary Test – Fourth Edition (PPVT-IV) sensory modalities
- 2-90yrs - 12 subtests that measure individual’s ability to receive visual,
- multiple choice tests that require subject to indicate Yes/No in auditory, or tactile input independently of processing and output
some manner factors
- Instructions administered aloud (not for the deaf) - purports to help isolate the specific site of a learning disability
- Purports to measure hearing or receptive vocabulary, presumably - For children 2-10yrs
providing a nonverbal estimate of verbal intelligence - Early versions hard to administer and no reliability or validity
- Can be done in 15mins, requires no reading ability - Now, with revisions, ITPA-3 psychometrically sound measure of
- Good reliability and validity children’s psycholinguistic abilities
- Should never be used as a substitute for a Wechsler or Binet IQ Woodcock-Johnson III
- Important component in a test battery or used as a screening - Evaluates learning disabilities
device - Designed as a broad-range individually administered test to be
- Easy to administer and useful for variety of groups used in educational settings
- BUT: Tendency to underestimate IQ scores, and problems - Assesses general intellectual ability, specific cognitive abilities,
inherent in the multiple-choice format are bad scholastic aptitude, oral language, and achievement
Leiter International Performance Scale – Revised (LIPS-R) - Based on the CHC three-stratum theory of intelligence
- Strictly a performance scale - Compares child’s score on cognitive ability with score on
- Aims at providing a nonverbal alternative to the Stanford-Binet achievement – can evaluate possible learning problems
scale for 2-18yr olds - Relatively good psychometric properties
- For research, and clinical settings, where it is still widely utilized to - For learning disability tests, three conclusions seem warranted:
assess the intellectual function of children with pervasive o 1. Test constructors appear to be responding to the
developmental disorders same criticisms that led to changes in the Binet and
- Purports to provide a nonverbal measure of general Wechsler scales and ultimately to the development of
intelligence by sampling a wide variety of functions from the KABC
memory to nonverbal reasoning o 2. Much more empirical and theoretical research is
- Can be applied to the deaf and language-disabled needed
- Untimed o 3. Users of learning disabilities tests should take great
- Good validity pains to understand the weaknesses of these
Porteus Maze Test (PMT) procedures and not overinterpret results
- Popular but poorly standardized nonverbal performance measure Visiographic Tests
of intelligence - Require a subject to copy various designs
- Individual ability test Benton Visual Retention Test – Fifth Edition (BVRT-V)
- Consists of maze problems (12) - Tests for brain damage are based on the concept of psychological
- Administered without verbal instruction, thus used for a variety of deficit, in which a poor performance on a specific task is related to
special populations or caused by some underlying deficit
- Needs restandardization - Assumes that brain damage easily impairs visual memory ability
Testing Learning Disabilities - For individuals 8yrs+
- Major concept is that a child average in intelligence may fail - Consists of geometric designs briefly presented and then removed
in school because of a specific deficit or disability that - Computerized version developed
prevents learning Bender Visual Motor Gestalt Test (BVMGT)
- Federal law entitles every eligible child with a disability to a free - Consists of 9 geometric figures that the subject is simply asked to
appropriate public education and emphasizes special education copy
and related services designed to meet his or her unique needs - By 9yrs, any child of normal intelligence can copy the figures with
and prepare them for further education, employment, and only one or two errors
independent living - Errors occur for people whose mental age is less than 9 or who have
- To qualify, child must have a disability and educational brain damage, nonverbal learning disabilities, or emotional
performance affected by it problems
- Educators today can find other ways to determine when a child - Questionable reliability
needs extra help Memory-for-Designs (MFD) Test
- Process called Response to Intervention (RTI): premise is that - Drawing test that involves perceptual-motor coordination
early intervening services can prevent academic failure for many - Used for people 8-60yrs
students with learning difficulties - Good split-half reliability
- Signs of a learning problem: - Needs validity documentation
o Disorganization - All these tests criticized because of their limitations in reliability
o Careless effort and validity documentation
o Forgetfulness - Good as screening devices though
o Refusal to do schoolwork or homework Creativity: Torrance Tests of Creative Thinking (TTCT)
o Slow performance - Measurement of creativity underdeveloped in psychological
o Poor attention testing
o Moodiness - Creativity: ability to be original, to combine known facts in new
ways, or to find new relationships between known facts
- Evaluating this a possible alternative to IQ
- Creativity tests in early stages of development
- Torrance tests separately measure aspects of creative thinking - Avoid overinterpreting test scores
such as fluency, originality, and flexibility Be Especially Suspicious of Low Scores
- Does not match the Binet and Wechsler scales in terms - Assume that subjects understand purpose of testing, want to succeed,
of standardization, reliability, or validity and are equally rested/free of stress
- Unbiased indicator of giftedness Consider Wide Discrepancies a Warning Signal
- Inconsistent tests, but available data reflect the tests’ merit and - May reflect emotional problems or severe stress
fine potential When in Doubt, Refer
Individual Achievement Tests: Wide Range Achievement Test-4 (WRAT-4) - With low scores, discrepancies, etc, refer the subject for individual
- Achievement tests measure what the person has actually acquired testing
or done with that potential - Get trained professional
- Discrepancies between IQ and achievement have traditionally Group Tests in the Schools: Kindergarten Through 12th Grade
been the main defining feature of a learning disability - Purpose of tests is to measure educational achievement in
- Most achievement tests are group tests schoolchildren
- WRAT-4 purportedly permits an estimate of grade-level Achievement Tests versus Aptitude Tests
functioning in word reading, spelling, math computation, and - Achievement tests attempt to assess what a person has learned
sentence comprehension following a specific course of instruction
- Used for children 5yrs+ o Evaluate the product of a course of training
- Easy to administer o Validity is determined primarily by content-related evidence
- Problems: - Aptitude tests attempt to evaluate a student’s potential for learning
o Inaccuracy in evaluating grade-level reading ability rather than how much a student has already learned
o Not proven as psychometrically sound o Evaluate effects of unknown and uncontrolled experiences
o Validity is judged primarily on its ability to predict future
CHAP 12: Standardized Tests in Education, Civil Service, and the Military performance
- Intelligence test measures general ability
- When justifying the use of group standardized tests, test users often - These three tests are highly interrelated
have problems defining what exactly they are trying to predict, or what Group Achievement Tests
the test criterion is - Stanford Achievement Test one of the oldest of the standardized
Comparison of Group and Individual Ability Tests achievement tests widely used in school system
- Individual tests require a single examiner for a single subject - Well-normed and criterion-referenced, with psychometric
o Examiner provides instructions documentation
o Subject responds, examiner records response - Another one is the Metropolitan Achievement Test, which measures
o Examiner evaluates response achievement in reading by evaluating vocab, word recognition, and
o Examiner takes responsibility for eliciting a maximum reading comprehension
performance - Both of these are reliable and normed on big samples
o Scoring requires considerable skill Group Tests of Mental Abilities (Intelligence)
- Those who use the results of group tests must assume that the subject Kuhlmann-Anderson Test (KAT) – 8th Edition
was cooperative and motivated - KAT is a group intelligence test with 8 separate levels
o Many subjects tested at a time covering kindergarten through 12th grade
o Subjects record own responses - Items are primarily nonverbal at lower levels, requiring minimal reading
o Subjects not praised for responding and language ability
o Low scores on group tests often difficult to interpret - Suited to young children and those who might be handicapped in
o No safeguards following verbal procedures
Advantages of Individual Tests - Scores can be expressed in verbal, quantitative, and total scores
- Provide info beyond the test score - Scores at other levels can be expressed as percentile bands: like a
- Allow the examiner to observe behavior in a standard setting confidence interval; provides the range of percentiles that most likely
- Allow individualized interpretation of test scores represent a subject’s true score
Advantages of Group Tests - Good construction, standardization, and other excellent psychometric
- Are cost-efficient qualities
- Minimize professional time for administration and scoring - Good validity and reliability
- Require less examiner skill and training - Potential for use and adaptation for non-English-speaking individuals or
- Have more objective and more reliable scoring procedures even countries needs to be explored
- Have especially broad application Henmon-Nelson Test (H-NT)
Overview of Group Tests - Of mental abilities
Characteristics of Group Tests - 2 sets of norms available:
- Characterized as paper-and-pencil or booklet-and-pencil tests because o one based on raw score distributions by age, the other
only materials needed are a printed booklet of test items, a test on raw score distributions by grade
manual, scoring key, answer sheet, and pencil - reliabilities in the .90s
- Computerized group testing becoming more popular - helps predict future academic success quickly
- Most group tests are multiple choice – some free response - does NOT consider multiple intelligences
- Group tests outnumber individual tests Cognitive Abilities Test (COGAT)
o One major difference is whether the test is primarily verbal, - Good reliability
nonverbal, or combination - Provides three separate scores though: verbal, quantitative, and
- Group test scores can be converted to a variety of units nonverbal
Selecting Group Tests - Item selection is superior to the H-NT in terms of selecting minority,
- Test user need never settle for anything but well-documented and culturally diverse, and economically disadvantaged children
psychometrically sound tests - Can be adopted for use outside the US
Using Group Tests - No cultural bias
- As reliable and well standardized as the best individual tests - Each of the subtests requires 32-34 minutes of actual working time,
- Validity data for some group tests are weak/meager/contradictory which the manual recommends to be spread out over 2-3 days
Use Results with Caution - Standard age scores averaged some 15pts lower for African
- Never consider scores in isolation or as absolutes American students on the verbal and quantitative batteries
- Be careful using tests for prediction
Summary of K-12 Group Tests - Definite overall decline in verbal scores while quantitative and
- All are sound, viable instruments analytical scores are gradually rising