Cohen-Based Summary of Psychological Testing & Assessment

Psychological testing and assessment involve gathering and integrating psychology-related data for evaluative purposes. The roots of testing can be traced to early twentieth-century France and to military use during World Wars I and II. There are three main forms of assessment: collaborative, therapeutic, and dynamic psychological assessment. Tools of assessment include tests, which are measuring devices designed to evaluate variables such as intelligence and personality, and interviews, which gather information through direct communication. Tests are evaluated on their psychometric soundness and technical quality.

Bachelor of Science in Psychology (University of San Jose-Recoletos)


CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT


TESTING AND ASSESSMENT
• Roots can be found in early twentieth-century France: in 1905, Alfred Binet published a test designed to help place Paris schoolchildren
• WWI: the military used testing to screen large numbers of recruits quickly for intellectual and emotional problems
• WWII: the military depended even more on tests to screen recruits for service

PSYCHOLOGICAL ASSESSMENT vs. PSYCHOLOGICAL TESTING
• DEFINITION
o Assessment: the gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished with the aid of tools
o Testing: the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior
• OBJECTIVE
o Assessment: to answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation
o Testing: to obtain some gauge, usually numerical in nature
• PROCESS
o Assessment: typically individualized
o Testing: may be individualized or group
• ROLE OF EVALUATOR
o Assessment: key in the process of selecting tests as well as in drawing conclusions
o Testing: the tester is not key to the process and may be substituted
• SKILL OF EVALUATOR
o Assessment: typically requires an educated selection of tools and skill in evaluation
o Testing: requires technician-like skills
• OUTCOME
o Assessment: entails a logical problem-solving approach to answer the referral question
o Testing: typically yields a test score

3 FORMS OF ASSESSMENT:
1. COLLABORATIVE PSYCHOLOGICAL ASSESSMENT – assessor and assessee work as partners from initial contact through final feedback
2. THERAPEUTIC PSYCHOLOGICAL ASSESSMENT – self-discovery and new understandings are encouraged throughout the assessment process
3. DYNAMIC PSYCHOLOGICAL ASSESSMENT – follows the model (a) evaluation, (b) intervention, (c) evaluation; provides a means for evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation

Tools of Psychological Assessment
A. The Test (a measuring device or procedure)
1. psychological test: a device or procedure designed to measure variables related to psychology (intelligence, personality, aptitude, interests, attitudes, or values)
2. format: refers to the form, plan, structure, arrangement, and layout of test items, as well as to related considerations such as time limits
a) also refers to the form in which a test is administered (pen and paper, computer, etc.); computers can generate scenarios
b) the term is also used to denote the form or structure of other evaluative tools and processes, such as the guidelines for creating a portfolio work sample
3. Ways that tests differ from one another:
a) administration procedures
(1) some test administrations require an active, knowledgeable test administrator
(a) may involve demonstration of tasks
(b) usually one-on-one
(c) trained observation of the assessee's performance
(2) some test administrations don't require the administrator to be present
(a) usually administered to larger groups
(b) test takers complete tasks independently
b) scoring and interpretation procedures
(1) score: a code or summary statement, usually (but not necessarily) numerical in nature, that reflects an evaluation of performance on a test, task, interview, or some other sample of behavior
(2) scoring: the process of assigning such evaluative codes or statements to performance on tests, tasks, interviews, or other behavior samples
(3) different types of score:
(a) cut score: a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications
(i) sometimes reached without any formal method — "eyeballed," as when teachers decide what is passing and what is failing
(4) who scores it:
(a) self-scored by the testtaker
(b) computer
(c) trained examiner
c) psychometric soundness/technical quality
(1) psychometrics: the science of psychological measurement; psychometric soundness refers to how consistently and how accurately a psychological test measures what it purports to measure
(2) utility: refers to the usefulness or practical value that a test or other tool of assessment has for a particular purpose
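A minimal Python sketch (hypothetical scores and cut score, not from the text) of how a judgment-derived cut score divides a set of raw scores into two classifications:

```python
# Minimal sketch: applying a judgment-derived cut score to divide
# raw test scores into two classifications (pass/fail).

def classify(scores, cut_score):
    """Return a pass/fail label for each raw score, relative to the cut score."""
    return {name: ("pass" if s >= cut_score else "fail")
            for name, s in scores.items()}

raw_scores = {"A": 72, "B": 55, "C": 88, "D": 61}  # hypothetical raw scores
print(classify(raw_scores, cut_score=60))
# {'A': 'pass', 'B': 'fail', 'C': 'pass', 'D': 'pass'}
```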
B. The Interview: a method of gathering information through direct communication involving reciprocal exchange
1. a face-to-face interviewer takes note of:
a) verbal language
b) nonverbal language
(1) body language and movements
(2) facial expressions in response to the interviewer
(3) the extent of eye contact
(4) apparent willingness to cooperate
c) how the interviewee is dressed
(1) neat vs. sloppy vs. inappropriate
2. an interviewer over the phone takes note of:
a) changes in the interviewee's voice pitch
b) long pauses
c) signs of emotion in response
3. ways that interviews differ:
a) length, purpose, and nature
b) used to help make diagnostic, treatment, selection, and other decisions
4. panel interview: an interview conducted with one interviewee and more than one interviewer
C. The Portfolio
1. files of work products: paper, canvas, film, video, audio, etc.
2. samples of one's abilities and accomplishments
D. Case History Data: records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee
1. sheds light on an individual's past and current adjustment, as well as on events and circumstances that may have contributed to any changes in adjustment
2. provides information about neuropsychological functioning prior to the occurrence of a trauma or other event that results in a deficit
3. gives insight into current academic and behavioral standing
4. useful in making judgments about future class placements
5. case history study: a report or illustrative account concerning a person or an event, compiled on the basis of case history data
a) might shed light on how one individual's personality and a particular set of environmental conditions combined to produce a successful world leader
b) groupthink: work on a social psychological phenomenon; contains rich case history material on collective decision making that did not always result in the best decisions
E. Behavioral Observation: monitoring the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions
1. often used as a diagnostic aid in various settings: inpatient facilities, behavioral research laboratories, classrooms
2. naturalistic observation: behavioral observation that takes place in a naturally occurring setting (as opposed to a research laboratory) for the purpose of evaluation and information gathering
3. in practice, tends to be used most frequently by researchers in settings such as classrooms, clinics, and prisons
F. Role-Play Tests
1. role play: acting an improvised or partially improvised part in a simulated situation
2. role-play test: a tool of assessment wherein assessees are directed to act as if they were in a particular situation; assessees are then evaluated with regard to their expressed thoughts, behaviors, abilities, etc.
G. Computers as Tools
1. local processing: on-site computerized scoring, interpretation, or other conversion of raw test data; contrast with CP and teleprocessing
2. central processing: computerized scoring, interpretation, or other conversion of raw test data that is physically transported from the same or other test sites; contrast with LP and teleprocessing
3. teleprocessing: computerized scoring, interpretation, or other conversion of raw test data sent over telephone lines by modem from a test site to a central location for computer processing; contrast with CP and LP
4. simple score report: a type of scoring report that provides only a listing of scores
5. extended scoring report: a type of scoring report that provides a listing of scores AND statistical data
6. interpretive report: a formal or official computer-generated account of test performance presented in both numeric and narrative form and including an explanation of the findings
a) the three varieties of interpretive report are (1) descriptive, (2) screening, and (3) consultative
b) some contain relatively little interpretation and simply call attention to certain high, low, or unusual scores that need to be focused on
c) consultative report: a type of interpretive report designed to provide expert and detailed analysis of test data that mimics the work of an expert consultant
d) integrative report: a form of interpretive report of psychological assessment, usually computer-generated, in which data from behavioral, medical, administrative, and/or other sources are integrated
7. CAPA: computer-assisted psychological assessment (assistance to the test user, not the test taker)
a) enables test developers to create psychometrically sound tests using complex mathematical procedures and calculations
b) enables test users to construct tailor-made tests with built-in scoring and interpretive capabilities
c) Pros:
(1) test administrators have greater access to potential test users because of the global reach of the internet
(2) scoring and interpretation of test data tend to be quicker than for paper-and-pencil tests
(3) costs associated with internet testing tend to be lower than costs associated with paper-and-pencil tests
(4) the internet facilitates the testing of otherwise isolated populations, as well as people with disabilities for whom getting to a test center might prove a hardship
(5) greener: conserves paper, shipping materials, etc.
d) Cons:
(1) test-client integrity
(a) refers to the verification of the identity of the test taker when a test is administered online
(b) also refers to the sometimes conflicting interests of the test taker vs. those of the test administrator; the test taker might have access to notes, aids, internet resources, etc.
(c) internet testing is only testing, not assessment
8. CAT: computerized adaptive testing: an interactive, computer-administered test-taking process wherein the items presented to the test taker are based in part on the test taker's performance on previous items
a) EX: on a computerized test of academic abilities, the computer might be programmed to switch from testing math skills to English skills after three consecutive failures on math items (see the sketch below)
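A minimal Python sketch of the adaptive rule in the example above; the item pools and answer pattern here are hypothetical, and real CAT systems select items by statistical item parameters rather than a simple counter:

```python
# Minimal sketch (hypothetical item pools): switch from math items to
# English items after three consecutive failures on math items.

def adaptive_test(math_items, english_items, answers, max_fails=3):
    """Present math items until `max_fails` consecutive misses, then switch."""
    presented, consecutive_fails = [], 0
    for item in math_items:
        presented.append(item)
        if not answers.get(item, False):   # False = testtaker missed the item
            consecutive_fails += 1
            if consecutive_fails >= max_fails:
                break                      # abandon math, move to English
        else:
            consecutive_fails = 0
    presented.extend(english_items)        # continue with the other domain
    return presented

math = ["m1", "m2", "m3", "m4"]
english = ["e1", "e2"]
answers = {"m1": True, "m2": False, "m3": False, "m4": False}
print(adaptive_test(math, english, answers))
# ['m1', 'm2', 'm3', 'm4', 'e1', 'e2']
```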
H. Other Tools
1. DVD/video: how would you respond to the events that take place in the video?
a) sexual harassment in the workplace
b) responding to various types of emergencies
c) diagnosis/treatment plan for clients on videotape
2. thermometers, biofeedback equipment, etc.

TEST DEVELOPER
• Creates tests: conceives, prepares, and develops them
• Also finds a way to disseminate the tests, by publishing them either commercially or through professional publications such as books or periodicals

TEST USER
• Selects or decides to take a specific test off the shelf and use it for some purpose; may also participate in other roles, e.g., as examiner or scorer

TEST TAKER
• Anyone who is the subject of an assessment
• Test takers vary on a continuum with respect to numerous variables, including:
o the amount of test anxiety they experience and the degree to which that anxiety might affect the results
o the extent to which they understand and agree with the rationale of the assessment
o their capacity and willingness to cooperate
o the amount of physical pain or emotional distress they are experiencing
o the amount of physical discomfort
o the extent to which they are alert and wide awake
o the extent to which they are predisposed to agreeing or disagreeing when presented with stimulus statements
o the extent to which they have received prior coaching
o the importance they may attribute to portraying themselves in a good light
• Psychological autopsy: the reconstruction of a deceased individual's psychological profile on the basis of archival records, artifacts, and interviews previously conducted with the deceased assessee

TYPES OF SETTINGS
• EDUCATIONAL SETTING
o achievement test: an evaluation of accomplishment or the degree of learning that has taken place, usually with regard to an academic area
o diagnosis: a description or conclusion reached on the basis of evidence and opinion, through a process of distinguishing the nature of something and ruling out alternative conclusions
o diagnostic test: a tool used to make a diagnosis, usually to identify areas of deficit to be targeted for intervention
o informal evaluation: a typically nonsystematic, relatively brief, and "off the record" assessment leading to the formation of an opinion or attitude, conducted by any person in any way for any reason, in an unofficial context and not subject to the same ethics or standards as evaluation by a professional
• CLINICAL SETTING
o these tools are used to help screen for or diagnose behavior problems
o group testing is used primarily for screening: identifying those individuals who require further diagnostic evaluation
• COUNSELING SETTING
o schools, prisons, and governmental or privately owned institutions
o ultimate objective: the improvement of the assessee in terms of adjustment, productivity, or some related variable
• GERIATRIC SETTING
o quality of life: in psychological assessment, an evaluation of variables such as perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social support
• BUSINESS AND MILITARY SETTINGS
• GOVERNMENTAL AND ORGANIZATIONAL CREDENTIALING

How Are Assessments Conducted?
• protocol: the form, sheet, or booklet on which a testtaker's responses are entered
o the term might also be used to refer to a description of a set of test- or assessment-related procedures, as in the sentence "the examiner dutifully followed the complete protocol for the stress interview"
• rapport: the working relationship between the examiner and the examinee

ASSESSMENT OF PEOPLE WITH DISABILITIES
• Define who requires alternate assessment, how such assessments are to be conducted, and how meaningful inferences are to be drawn from the data derived from such assessments
• Accommodation – the adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs
o ex.: translating a test into Braille and administering it in that form
• Alternate assessment – an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made for the assessee or by means of alternative methods
• Consider these four variables when deciding which of many different types of accommodation should be employed:
o the capabilities of the assessee
o the purpose of the assessment
o the meaning attached to test scores
o the capabilities of the assessor

REFERENCE SOURCES
• TEST CATALOGUES – contain brief descriptions of tests
• TEST MANUALS – detailed information
• REFERENCE VOLUMES – one-stop shopping; provide detailed information for each test listed, including test publisher, author, purpose, intended test population, and test administration time
• JOURNAL ARTICLES – contain reviews of tests
• ONLINE DATABASES – the most widely used bibliographic databases

TYPES OF TESTS
• INDIVIDUAL TEST – given to only one person at a time
• GROUP TEST – administered to more than one person at a time by a single examiner
• ABILITY TESTS:
o ACHIEVEMENT TESTS – refer to previous learning (ex. spelling)
o APTITUDE/PROGNOSTIC TESTS – refer to the potential for learning or acquiring a specific skill
o INTELLIGENCE TESTS – refer to a person's general potential to solve problems
• PERSONALITY TESTS: refer to overt and covert dispositions
o OBJECTIVE/STRUCTURED TESTS – usually self-report; require the subject to choose between two or more alternative responses
o PROJECTIVE/UNSTRUCTURED TESTS – require the subject to respond to relatively ambiguous or unstructured stimuli
o INTEREST TESTS – measure likes and dislikes, as well as preferences for particular activities
CHAPTER 2: HISTORICAL, CULTURAL AND LEGAL/ETHICAL CONSIDERATIONS


A HISTORICAL PERSPECTIVE
ANTIQUITY TO THE 19TH CENTURY
• Tests and testing programs first came into being in China
• Testing was instituted as a means of selecting which of many applicants would obtain government jobs (civil service)
• Job applicants were tested on proficiency in endeavors such as music, archery, and other areas of knowledge and skill
GRECO-ROMAN WRITINGS AND THE MIDDLE AGES
• Greco-Roman writings attributed personality to bodily factors: a deficiency in some bodily fluid was cited as a factor believed to influence personality (Hippocrates and Galen)
• The Middle Ages: a worldview preoccupied with evil
RENAISSANCE
• Christian von Wolff – anticipated psychology as a science and psychological measurement as a specialty within that science
CHARLES DARWIN AND INDIVIDUAL DIFFERENCES
• Inspired tests designed to measure individual differences in ability and personality among people
• "Origin of Species": chance variation in species would be selected or rejected by nature according to adaptivity and survival value — "survival of the fittest"
FRANCIS GALTON
• Explored and quantified individual differences between people
• Classified people "according to their natural gifts"
• Displayed the first anthropometric laboratory
KARL PEARSON
• Developed the product-moment correlation technique
• His work can be traced directly to Galton
WILHELM MAX WUNDT
• Founded the first experimental psychology laboratory, at the University of Leipzig
• Focused more on how people were similar, not different from each other
JAMES McKEEN CATTELL
• Studied individual differences in reaction time
• Coined the term "mental test"
CHARLES SPEARMAN
• Originated the concept of test reliability and built the mathematical framework for the statistical technique of factor analysis
VICTOR HENRI
• Frenchman who collaborated with Binet on papers suggesting how mental tests could be used to measure higher mental processes
EMIL KRAEPELIN
• Early experimenter with the word association technique as a formal test
LIGHTNER WITMER
• "Little-known founder of clinical psychology"
• Founded the first psychological clinic in the U.S.
PSYCHE CATTELL
• Daughter of James McKeen Cattell
• Cattell Infant Intelligence Scale (CIIS) and "The Measurement of Intelligence in Infants and Young Children"
RAYMOND CATTELL
• Believed in the lexical approach to defining personality, which examines human languages for descriptors of personality dimensions
20th CENTURY
- Birth of the first formal tests of intelligence
- Testing shifted to be of more understandable relevance and meaning
A. THE MEASUREMENT OF INTELLIGENCE
o Binet created the first intelligence test, to identify mentally retarded schoolchildren in Paris (individually administered)
o The Binet-Simon Test has been revised repeatedly
o Group intelligence tests emerged with the need to screen the intellect of WWI recruits
o David Wechsler – designed a test to measure adult intelligence
▪ For him, intelligence is the global capacity of the individual to act purposefully, to think rationally, and to deal effectively with his environment
▪ Wechsler-Bellevue Intelligence Scale → Wechsler Adult Intelligence Scale – revised several times, and the age range of testtakers was extended from young children through senior adulthood
B. THE MEASUREMENT OF PERSONALITY
o Criticism arose that the field of psychology was becoming too test oriented; clinical psychology was synonymous with mental testing
o ROBERT WOODWORTH – developed a measure of adjustment and emotional stability that could be administered quickly and efficiently to groups of recruits
▪ To disguise the true purpose of the test, the questionnaire was labeled the Personal Data Sheet
▪ Later called the Woodworth Psychoneurotic Inventory – the first widely used self-report test of personality
o Self-report tests:
▪ Advantages: respondents are the people best qualified to report on themselves
▪ Disadvantages: poor insight into oneself; one might honestly believe something about oneself that isn't true; unwillingness to report seemingly negative qualities
o Projective test: the individual is assumed to project onto some ambiguous stimulus (inkblot, photo, etc.) his or her own unique needs, fears, hopes, and motivations
▪ Ex.) Rorschach inkblots
C. THE ACADEMIC AND APPLIED TRADITIONS

Culture and Assessment

Culture: "the socially transmitted behavior patterns, beliefs, and products of work of a particular population, community, or group of people"

Evolving Interest in Culture-Related Issues
Goddard tested immigrants and found most to be feebleminded
-invalid; overestimated mental deficiency, even in native English speakers
Led to the nature-nurture debate about what intelligence tests actually measure
Needed to "isolate" the cultural variable
Culture-specific tests: tests designed for use with people from one culture, but not from another
-minorities still scored abnormally low
ex.) loaf of bread vs. tortillas
Today tests undergo many steps to ensure they are suitable for a given nation
-testtakers' reactions are taken into account

Some Issues Regarding Culture and Assessment
• Verbal Communication
o Examiner and examinee must speak the same language
o Especially tricky when infrequently used vocabulary or unusual idioms are employed
o A translator may lose nuances of translation or give unintentional hints toward the more desirable answer
o Also requires an understanding of the culture
• Nonverbal Communication and Behavior
o Differs between cultures
o Ex.) the meaning of not making eye contact
o Body movement could even have a physical cause
o Psychoanalysis: Freud's theory of personality and psychological treatment, which holds that symbolic significance is assigned to many nonverbal acts
o Timed tests may disadvantage cultures not preoccupied with speed
o Lack of speaking could reflect reverence for elders
• Standards of Evaluation
o Acceptable roles for women differ across cultures
o "judgments as to who might be the best employee, manager, or leader may differ as a function of culture, as might judgments regarding intelligence, wisdom, courage, and other psychological variables"
o Must ask "how appropriate are the norms or other standards that will be used to make this evaluation?"

Tests and Group Membership
• ex.) a requirement to be 5'4" to become a police officer excludes groups of characteristically shorter stature
• ex.) claims that a Jewish lifestyle was not well suited to corporate America
• affirmative action: voluntary and mandatory efforts to combat discrimination and promote equal opportunity in education and employment for all
• Psychology, tests, and public policy

Legal and Ethical Considerations
Code of professional ethics: defines the standard of care expected of members of a given profession.

The Concerns of the Public
• Beginning in World War I, fear that tests were only testing the ability to take tests
• Legislation
o Minimum competency testing programs: formal testing programs designed to be used in decisions regarding various aspects of students' educations
o Truth-in-testing legislation: state laws providing testtakers with a means of learning the criteria by which they are being judged
• Litigation
o The Daubert ruling made federal judges the gatekeepers in determining what expert testimony is admitted
o This overrode the Frye policy, which admitted only scientific testimony that had won general acceptance in the scientific community

The Concerns of the Profession
• Test-user qualifications
o Who should be allowed to use psychological tests?
o Level A: tests or aids that can adequately be administered, scored, and interpreted with the aid of the manual and a general orientation to the kind of institution or organization in which one is working
o Level B: tests or aids that require some technical knowledge of test construction and use, and of supporting psychological and educational fields
o Level C: tests and aids requiring substantial understanding of testing and supporting psychological fields, together with experience in their use
• Testing people with disabilities
o Difficulty in transforming the test into a form that can be taken by the testtaker
o Transferring responses into scorable form
o Meaningfully interpreting the test data
• Computerized test administration, scoring, and interpretation
o simple, convenient
o easily copied and duplicated
o insufficient research comparing it to pencil-and-paper versions
o the value of computerized interpretation is questionable
o unprofessional, unregulated "psychological testing" online

The Rights of Testtakers
• The right of informed consent
o The right to know why they are being evaluated, how the test data will be used, and what information will be released to whom
o Consent may be obtained from a parent or legal representative
o Must be in written form, covering:
▪ the general purpose of the testing
▪ the specific reason it is being undertaken
▪ the general type of instruments to be administered
o In some cases, revealing this information before the test can contaminate the results
o Deception is used only if absolutely necessary
o Don't use deception if it will cause emotional distress
o Fully debrief participants
• The right to be informed of test findings
o Formerly, test administrators were told to give participants only positive feedback, realistic or not, and to tell test takers as little as possible about the nature of their performance on a particular test, so that the examinee would leave the test session feeling pleased and satisfied
o Test takers also have the right to know what recommendations are being made as a consequence of the test data
• The right to privacy and confidentiality
o Private right: "recognizes the freedom of the individual to pick and choose for himself the time, circumstances, and particularly the extent to which he wishes to share or withhold from others his attitudes, beliefs, behaviors, and opinions"
o Privileged information: information protected by law from being disclosed in a legal proceeding; protects clients from disclosure in judicial proceedings; the privilege belongs to the client, not the psychologist
o Confidentiality: concerns matters of communication outside the courtroom
▪ Safekeeping of test data: it is not a good policy to maintain all records in perpetuity
• The right to the least stigmatizing label
o The standards advise that the least stigmatizing labels should always be assigned when reporting test results
CHAPTER 3: A STATISTICS REFRESHER


Why We Need Statistics
- Statistics are important for purposes of education
o Numbers provide convenient summaries and allow us to evaluate some observations relative to others
- We use statistics to make inferences: logical deductions about events that cannot be observed directly
o Detective work of gathering and displaying clues – exploratory data analysis
o Then confirmatory data analysis
- Descriptive statistics are methods used to provide a concise description of a collection of quantitative information
- Inferential statistics are methods used to make inferences from observations of a small group of people, known as a sample, to a larger group of individuals, known as a population

SCALES OF MEASUREMENT
• MEASUREMENT – the act of assigning numbers or symbols to characteristics of things according to rules; the rules serve as a guideline for representing the magnitude. Measurement always involves error.
• SCALE – a set of numbers whose properties model empirical properties of the objects to which the numbers are assigned
• CONTINUOUS SCALE – interval/ratio; a scale used to measure a continuous variable; always involves error
• DISCRETE SCALE – nominal/ordinal; used to measure a discrete variable (ex. female or male)
• ERROR – the collective influence of all factors on a test score beyond those the test is specifically designed to measure

PROPERTIES OF SCALES
- Magnitude, equal intervals, and an absolute 0
Magnitude
- The property of "moreness"
- A scale has the property of magnitude if we can say that a particular instance of the attribute represents more, less, or equal amounts of the given quantity than does another instance
Equal Intervals
- A scale has the property of equal intervals if the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units
- A psychological test rarely has the property of equal intervals
- When a scale has the property of equal intervals, the relationship between the measured units and some outcome can be described by a straight line or a linear equation of the form Y = a + bX
o Shows that an increase in equal units on a given scale reflects equal increases in the meaningful correlates of units
Absolute 0
- An absolute 0 is obtained when nothing of the property being measured exists
- This is extremely difficult or impossible to achieve for many psychological qualities

NOMINAL SCALE
• Simplest form of measurement
• Classification or categorization
• Arithmetic operations cannot meaningfully be performed on nominal data; observations can only be counted by category
• Ex.) male or female
• Also includes test items (ex. yes/no responses)
ORDINAL SCALE
• Classifies in some kind of ranking order
• Individuals are compared to others and assigned a rank
• Implies nothing about how much greater one ranking is than another
• Numbers/ranks do not indicate units of measurement
• No absolute zero point
• Binet believed that data derived from intelligence tests are ordinal in nature
INTERVAL SCALE
• In addition to the features of nominal and ordinal scales, contains equal intervals between numbers
• No absolute zero point
• Averages can be taken
RATIO SCALE
• In addition to all the properties of nominal, ordinal, and interval measurement, a ratio scale has a true zero point
• Equal intervals between numbers
• Ex.) measuring the amount of pressure a hand can exert
• A true zero doesn't mean someone will receive a score of 0, but it means that 0 has meaning

NOTE: Permissible Operations
- The level of measurement is important because it defines which mathematical operations we can apply to numerical data
- For nominal data, each observation can be placed in only one mutually exclusive category
- Ordinal measurements can be manipulated using arithmetic, though the results are often difficult to interpret
- With interval data, one can apply any arithmetic operation to the differences between scores
o Interval data cannot be used to make statements about ratios

DESCRIBING DATA
• Distribution: a set of scores arrayed for recording or study
• Raw score: a straightforward, unmodified accounting of performance, usually numerical

Frequency Distributions
• Frequency distribution: all scores listed alongside the number of times each score occurred
• Grouped frequency distribution: test-score intervals (class intervals) replace the actual test scores
o The highest and lowest class intervals are the upper and lower limits of the distribution
• Histogram: a graph with vertical lines drawn at the true limits of each test score (or class interval), forming TOUCHING rectangles, with the midpoint at the center of each bar
• Bar graph: the rectangles DON'T touch
• Frequency polygon: data illustrated with a continuous line connecting the points where test scores or class intervals meet frequencies
• A single test score means more if one relates it to other test scores
• A distribution of scores summarizes the scores for a group of individuals
• A frequency distribution displays scores on a variable or a measure to reflect how frequently each value was obtained
o One defines all the possible scores and determines how many people obtained each of those scores
• Income is an example of a variable that has a positive skew
• Whenever you draw a frequency distribution or a frequency polygon, you must decide on the width of the class interval
• Class interval (e.g., for inches of rainfall): the unit on the horizontal axis

Measures of Central Tendency
• Measure of central tendency: a statistic that indicates the average or midmost score between the extreme scores in a distribution
• The Arithmetic Mean
o "X bar"
o The sum of observations divided by the number of observations: ΣX/n
o Used for interval or ratio data when distributions are relatively normal
• The Median
o The middle score of the ordered distribution
o Used for ordinal, interval, and ratio data
o Especially useful when a few scores fall at the extremes
• The Mode
o The most frequently occurring score
o Bimodal distribution: two scores both have the highest frequency
o The only measure of central tendency that can be used with nominal data
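A small worked example (hypothetical scores) of the three measures of central tendency, using only Python's standard library:

```python
# Worked example of the three measures of central tendency defined above.

from statistics import mean, median, mode

scores = [85, 90, 90, 70, 60, 90, 75]   # hypothetical test scores

print(mean(scores))     # arithmetic mean: sum / n = 560/7 = 80
print(median(scores))   # middle score of the ordered list = 85
print(mode(scores))     # most frequently occurring score = 90
```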
Measures of Variability
• Variability: an indication of how scores in a distribution are scattered or dispersed
• The Range
o The difference between the highest and lowest scores
o A quick but gross description of the spread of scores
• The Interquartile and Semi-Interquartile Range
o The distribution is split by 3 quartiles into 4 quarters, each representing 25% of the scores
o Q2 = the median
o Interquartile range: a measure of variability equal to the difference between Q3 and Q1
o Semi-interquartile range: the interquartile range divided by 2
• Quartiles and Deciles
o Quartiles are points that divide the frequency distribution into equal fourths
o The first quartile is the 25th percentile; the second quartile is the median, or 50th percentile; the third quartile is the 75th percentile
o The interquartile range is bounded by the range of scores that represents the middle 50% of the distribution
o Deciles are similar but use points that mark 10% rather than 25% intervals
o Stanine system: converts any set of scores into a transformed scale that ranges from 1 to 9
• The Average Deviation
o X − mean = x (a deviation score)
o Average deviation = (sum of all absolute deviation scores) / total number of scores
o Tells us on average how far scores are from the mean
• The Standard Deviation
o Similar to the average deviation, but to overcome the (+/−) problem, each deviation is squared
o Standard deviation: a measure of variability equal to the square root of the average squared deviation about the mean
o It is the square root of the variance
o Variance: the mean of the squares of the differences between the scores in a distribution and their mean
▪ Found by squaring and summing all the deviation scores and then dividing by the total number of scores
o s = sample standard deviation
o σ (sigma) = population standard deviation

Skewness
• Skewness: the nature and extent to which symmetry is absent
• POSITIVE SKEW ex.) the test was too hard
• NEGATIVE SKEW ex.) the test was too easy
• Can be gauged by examining the relative distances of the quartiles from the median

Kurtosis
• The steepness of a distribution
• platykurtic: relatively flat
• leptokurtic: relatively peaked
• mesokurtic: somewhere in the middle

The Normal Curve

Normal curve: a bell-shaped, smooth, mathematically defined curve, highest at its center; both sides taper as the curve approaches the x-axis asymptotically
-symmetrical, so the mean, median, and mode are the same

Area under the Normal Curve

Divided into the tails and the body

Standard Scores
Standard score: a raw score that has been converted from one scale to another scale, where the latter has an arbitrarily set mean and standard deviation
-used for comparison

Z-score
• The conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution
• The difference between a particular raw score and the mean, divided by the standard deviation
• Used to compare test scores from scales with different units

T-score
• A standard score system composed of a scale that ranges from 5 standard deviations below the mean to 5 standard deviations above the mean, with a mean of 50 and a standard deviation of 10
• No negative values

Other Standard Scores
• SAT
• GRE
• Linear transformation: when a standard score retains a direct numerical relationship to the original raw score
• Nonlinear transformation: required when the data are not normally distributed, yet comparisons with normal distributions need to be made
o Normalized Standard Scores
▪ Used when scores don't fall on a normal distribution
▪ "normalizing a distribution involves 'stretching' the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores, a scale called a normalized standard score scale"
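A short Python sketch (hypothetical raw scores) of the z-score and T-score conversions defined above; the T-score's mean of 50 and standard deviation of 10 are the conventional values for that scale:

```python
# z = (raw score - mean) / standard deviation; T = 50 + 10z (mean 50, SD 10).

from statistics import mean, pstdev

scores = [50, 60, 70, 80, 90]           # hypothetical raw scores
m, sd = mean(scores), pstdev(scores)    # population SD, as in sigma

for x in scores:
    z = (x - m) / sd                    # how many SDs above/below the mean
    t = 50 + 10 * z                     # rescaled so no values are negative
    print(x, round(z, 2), round(t, 1))
```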
CHAPTER 4: OF TESTS AND TESTING


Some Assumptions About Psychological Testing and Assessment
- Assumption 1: Psychological Traits and States Exist
o Trait: any distinguishable, relatively enduring way in which one individual varies from another
o States: also distinguish one person from another, but are relatively less enduring
▪ The trait term an observer applies, as well as the strength or magnitude of the trait presumed present, is based on observing a sample of behavior
o Trait and state definitions also refer to individual variation; comparisons are made with respect to the hypothetical average person
o Samples of behavior:
▪ Direct observation
▪ Analysis of self-report statements
▪ Paper-and-pencil test answers
o "Psychological trait" covers a wide range of possible characteristics, ex:
▪ Intelligence
▪ Specific intellectual abilities
▪ Cognitive style
▪ Psychopathology
o Controversy regarding how psychological traits exist
▪ Psychological traits exist only as constructs: an informed, scientific concept developed or constructed to describe or explain behavior
▪ We can't see, hear, or touch constructs; we infer their existence from overt behavior: an observable action or the product of an observable action, including test- or assessment-related responses
o Traits are not expected to be manifested in behavior 100% of the time
▪ There seems to be rank-order stability in personality traits: relatively high correlations between trait scores at different time points
o Whether and to what degree a trait manifests itself depends on the strength and nature of the situation
- Assumption 2: Psychological Traits and States Can Be Quantified and Measured
o Once it is acknowledged that psychological traits and states exist, the specific traits and states to be measured need to be defined
▪ What types of behaviors are assumed to be indicative of the trait?
▪ The test developer has to provide test users with a clear operational definition of the construct under study
o Once the trait is defined, the test developer considers the types of item content that would provide insight into it
▪ Ex: behaviors that are indicative of a particular trait
o Should all questions be weighted the same?
▪ Weighting the comparative value of a test's items comes about as the result of a complex interplay among many factors:
▪ Technical considerations
▪ The way a construct has been defined (for the particular test)
▪ The value society (and the test developer) attaches to the behaviors evaluated
o Need to find appropriate ways to score the test and interpret the results
▪ Cumulative scoring: a test score is presumed to represent the strength of the targeted ability, trait, or state
▪ The more the testtaker responds in a particular direction (as keyed by the test manual), the higher the testtaker is presumed to be on the targeted trait or ability
- Assumption 3: Test-Related Behavior Predicts Non-Test-Related Behavior
o The objective of the test is to provide some indication of some aspect of the examinee's behavior
▪ Tasks on some tests mimic the actual behaviors that the test user is attempting to understand
o The obtained behavior is usually used to predict future behavior
o It could also be used to postdict behavior: to aid in the understanding of behavior that has already taken place
o Tools of assessment such as a diary or case history data might be of great value in such an evaluation
- Assumption 4: Tests and Other Measurement Techniques Have Strengths and Weaknesses
o Competent test users understand a lot about the tests they use:
▪ How the test was developed
▪ The circumstances under which it is appropriate to administer the test
▪ How the test should be administered and to whom
▪ How the results should be interpreted
o They understand and appreciate the limitations of the tests they use
- Assumption 5: Various Sources of Error Are Part of the Assessment Process
o Everyday error = mistakes and miscalculations
o Assessment error = a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
o Error variance: the component of a test score attributable to sources other than the trait or ability measured
▪ Assessees themselves are sources of error variance
o Classical test theory (CTT)/true score theory: the assumption is made that each testtaker has a true score on a test that would be obtained but for the action of measurement error (see the simulation sketch below)
- Assumption 6: Testing and Assessment Can Be Conducted in a Fair and Unbiased Manner
o Court challenges to various tests and testing programs have sensitized test developers and users to the societal demand for fair tests used in a fair manner
▪ Publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual
o Fairness-related problems/questions:
▪ The testtaker's culture differs from that of the people for whom the test was intended
▪ Politics
- Assumption 7: Testing and Assessment Benefit Society
o Many critical decisions are based on testing and assessment procedures

WHAT'S A "GOOD TEST"?
- Criteria
o Clear instructions for administration, scoring, and interpretation
- Reliability
o A "good test"/measuring tool is reliable
▪ Involves consistency: the precision with which the test measures, and the extent to which error is present in measurements
▪ Unreliable measurement needs to be avoided
- Validity
o A test is considered valid if it does indeed measure what it purports to measure
o If there is controversy over the definition of a construct, then the validity of its test is sure to be criticized as well
o Questions regarding validity focus on the items that collectively make up the test
▪ Do they adequately sample the range of areas that must be sampled to measure the construct?
▪ Individual items contribute to or take away from the test's validity
o Validity may also be questioned on grounds related to the interpretation of test results
- Other Considerations
o A "good test" is one that trained examiners can administer, score, and interpret with minimum difficulty
▪ Useful
▪ Yields actionable results that will ultimately benefit individual testtakers or society at large
o The purpose of a test may be to compare the performance of a testtaker with the performance of other testtakers; such a test should contain adequate norms (normative data)
▪ Normative data provide a standard with which the results of measurement can be compared
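A minimal Python simulation (the true score and error SD are assumed values, not from the text) of the classical test theory idea above: each observed score equals a true score plus random measurement error, so repeated measurements scatter around the true score (error variance):

```python
# Observed score = true score + random measurement error.

import random

random.seed(1)
true_score = 100
observed = [true_score + random.gauss(0, 5) for _ in range(5)]  # error SD = 5

print([round(x, 1) for x in observed])  # scores vary around 100 due to error
```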

NORMS
- Norm-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker's score and comparing it to the scores of a group of testtakers
- The meaning of an individual score is relative to the other scores on the same test
- Norms (scholarly context): usual, average, normal, standard, expected, or typical
- Norms (psychometric context): the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores
- Normative sample: the group of people whose performance on a particular test is analyzed for reference in evaluating the performance of individual testtakers
o Yields a distribution of scores
- Norming: refers to the process of deriving norms
o Race norming: the controversial practice of norming on the basis of race or ethnic background
- Norming a test can be very expensive; user norms/program norms consist of descriptive statistics based on a group of testtakers in a given period of time, rather than norms obtained by formal sampling methods
- Sampling to Develop Norms
o Standardization: the process of administering a test to a representative sample of testtakers for the purpose of establishing norms
▪ A test is standardized when it has clear, specified procedures
o Sampling
▪ The developer targets a defined group as the population for which the test is designed; all members have at least one common, observable characteristic
▪ To obtain a distribution of scores, either administer the test to everyone in the targeted population, or administer it to a sample of the population
▪ Sample: a portion of the universe of people deemed to be representative of the whole population
▪ Sampling: the process of selecting the portion of the universe deemed to be representative of the whole
o Subgroups within a defined population may differ with respect to some characteristics, and it is sometimes essential to have these differences proportionately represented in the sample
▪ Stratified sampling: the sample reflects the statistics of the whole population; helps prevent sampling bias and ultimately aids in the interpretation of findings
▪ Purposive sampling: arbitrarily selecting a sample believed to be representative of the population
▪ Incidental/convenience sampling: a sample that is convenient or available for use; may be very exclusive (contain exclusionary criteria)
- TYPES OF STANDARD ERROR:
o STANDARD ERROR OF MEASUREMENT – an estimate of the extent to which an observed score deviates from a true score
o STANDARD ERROR OF ESTIMATE – in regression, an estimate of the degree of error involved in predicting the value of one variable from another
o STANDARD ERROR OF THE MEAN – a measure of sampling error
o STANDARD ERROR OF THE DIFFERENCE – an estimate of how large a difference between two scores should be before the difference is considered statistically significant
- Developing norms for a standardized test
o Establish a standard set of instructions and conditions under which the test is given; this makes the scores of the normative sample more comparable with the scores of future testtakers
o Once all data are collected and analyzed, the test developer summarizes the data using descriptive statistics (measures of central tendency and variability)
▪ The test developer needs to provide a precise description of the standardization sample itself
▪ Descriptions of normative samples vary widely in detail

Tracking
- Comparisons are usually with people of the same age
- Children at the same age level tend to go through different growth patterns
- Pediatricians must know the child's percentile within a given age group
- The tendency to stay at about the same level relative to one's peers is known as tracking (e.g., height and weight)
- Diets may alter this "track"
- Faults: some believe there is an analogy between the rates of physical growth and the rates of intellectual growth
o Some say that children learn at different rates
o This system discriminates against some children

TYPES OF NORMS
o Classifications of norms, ex: age, grade, national, local, percentile, etc.
o PERCENTILES
▪ Median = 2nd quartile: the point at or below which 50% of the scores fall and above which the remaining 50% fall
▪ One might wish to divide a distribution of scores into deciles (instead of quartiles): 10 equal parts
▪ The Xth percentile is equal to the score at or below which X% of the scores fall
▪ Percentile: an expression of the percentage of people whose score on a test or measure falls below a particular raw score
▪ Percentage correct: refers to the distribution of raw scores — the number of items answered correctly, multiplied by 100 and divided by the total number of items; *not the same as a percentile*
▪ A percentile is a converted score that refers to a percentage of testtakers
▪ Percentiles are easily calculated and a popular way of organizing test-related data
▪ When percentiles are used with a normal distribution, real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle (this worsens with highly skewed data)
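A short Python sketch (hypothetical data) of the distinction drawn above: percentage correct describes one testtaker's raw performance, while a percentile locates that raw score within a distribution of testtakers:

```python
# Percentage correct vs. percentile for the same raw score.

def percentage_correct(n_correct, n_items):
    return 100 * n_correct / n_items

def percentile(raw_score, all_scores):
    """Percentage of testtakers whose score falls below the raw score."""
    below = sum(1 for s in all_scores if s < raw_score)
    return 100 * below / len(all_scores)

group_scores = [12, 15, 18, 22, 25, 27, 30, 33, 36, 38]  # hypothetical group
print(percentage_correct(30, 40))    # 75.0 (answered 30 of 40 items)
print(percentile(30, group_scores))  # 60.0 (scored above 6 of 10 peers)
```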

o AGE NORMS
▪ Age-equivalent scores/age norms: indicate the average performance of different samples of testtakers who were at various ages at the time the test was administered
▪ Age norm tables exist for physical characteristics
▪ "Mental" age vs. physical age (the need to identify a mental age)
o GRADE NORMS
▪ Grade norms: designed to indicate the average test performance of testtakers in a given school grade
▪ Developed by administering the test to representative samples of children over a range of consecutive grades
▪ The mean or median score for children at each grade level is calculated
▪ Great intuitive appeal
▪ Do not provide information as to the content or type of items that a student could or could not answer correctly
▪ Developmental norms (ex: grade norms and age norms): a term applied broadly to norms developed on the basis of any trait, ability, skill, or other characteristic that is presumed to develop, deteriorate, or otherwise be affected by chronological age, school grade, or stage of life
o NATIONAL NORMS
▪ National norms: derived from a normative sample that was nationally representative of the population at the time the norming study was conducted
o NATIONAL ANCHOR NORMS
▪ Many different tests purport to measure the same human characteristics or abilities
▪ National anchor norms: equivalency tables for scores on tests that purport to measure the same thing
▪ Provide a tool for comparisons
▪ Provide stability to test scores by anchoring them to other test scores
▪ Begin with the computation of percentile norms for each test to be compared
▪ Equipercentile method: the equivalency of scores on different tests is calculated with reference to corresponding percentile scores
o SUBGROUP NORMS
▪ A normative sample can be segmented by any of the criteria initially used in selecting subjects for the sample
▪ Subgroup norms: the result of such segmentation; more narrowly defined
o LOCAL NORMS
▪ Local norms: provide normative information with respect to the local population's performance on some test
▪ Typically developed by test users themselves
- Fixed Reference Group Scoring Systems
o Norms provide a context for interpreting the meaning of a test score
o Fixed reference group scoring system: the distribution of scores obtained on the test from one group of testtakers (the fixed reference group) is used as the basis for the calculation of test scores for future administrations of the test
▪ Ex: the SAT (first administered in 1926)

NORM-REFERENCED VERSUS CRITERION-REFERENCED EVALUATION
- One way to derive meaning from a test score is to evaluate the test score in relation to other scores on the same test (norm-referenced)
- Criterion-referenced: derive meaning from a test score by evaluating it on the basis of whether or not some criterion has been met
o Criterion: a standard on which a judgment or decision may be based
- Criterion-referenced testing and assessment: a method of evaluation and a way of deriving meaning from test scores by evaluating an individual's score with reference to a set standard (ex: to drive, one must pass a driving test)
o Derives from the values and standards of an individual or organization
o Also called domain- or content-referenced testing and assessment
o Critique: if followed strictly, important information about an individual's performance relative to others can potentially be lost

Culture and Inference
- Culture is a factor in test administration, scoring, and interpretation
- The test user should research a test's available norms in advance to check how appropriate they are for the targeted testtaker population
o It is helpful to know about the culture of the testtaker

CORRELATION AND INFERENCE

CORRELATION
• The degree and direction of correspondence between two things
• Correlation coefficient (r) – expresses a linear relationship between two continuous variables
o A numerical index that tells us the extent to which X and Y are "co-related"
• Positive correlation: high scores on Y are associated with high scores on X, and low scores on Y correspond to low scores on X
• Negative correlation: higher scores on Y are associated with lower scores on X, and vice versa
• No correlation: the variables are not related
• Ranges from −1 to +1
• Correlation does not imply causation (e.g., weight, height, intelligence)

PEARSON r
• Pearson product-moment correlation coefficient
• Devised by Karl Pearson
• Used when the relationship between the two variables is linear and both are continuous
• Coefficient of determination (r²) – an indication of how much variance is shared by the X and the Y variables

SPEARMAN RHO
• Rank-order correlation coefficient
• Developed by Charles Spearman
• Used when the sample size is small and when both sets of measurements are in ordinal (ranking) form

BISERIAL CORRELATION
• Expresses the relationship between a continuous variable and an artificial dichotomous variable
o If the dichotomous variable had been true, we would use the point-biserial correlation
o When both variables are dichotomous and at least one of the dichotomies is true, the association between them can be estimated using the phi coefficient
o If both dichotomous variables are artificial, we might use a special correlation coefficient — the tetrachoric correlation

REGRESSION
• The analysis of relationships among variables for the purpose of understanding how one variable may predict another
• SIMPLE REGRESSION: one IV (X) and one DV (Y)
- Regression line: defined as the best-fitting straight line through a set of points in a scatter diagram
o Found by using the principle of least squares, which minimizes the squared deviations around the regression line
• Primary use: to predict one score or variable from another
• Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the standard error of estimate
• Standard error of estimate: the higher the correlation between X and Y, the greater the accuracy of the prediction and the smaller the SEE
• MULTIPLE REGRESSION: the use of more than one score to predict Y
• Regression coefficient (b): the slope of the regression line
  o The ratio of the sum of squares for the covariance to the sum of squares for X
  o Sum of squares is defined as the sum of the squared deviations around the mean
  o Covariance is used to express how much two measures covary, or vary together
• Slope describes how much change is expected in Y each time X increases by one unit
• Intercept (a) is the value of Y when X is 0
  o The point at which the regression line crosses the Y axis

THE BEST-FITTING LINE
• The difference between the observed and predicted score (Y − Y′) is called the residual
• The best-fitting line is most appropriately found by squaring each residual
• The best-fitting line is obtained by keeping these squared residuals as small as possible
  o Principle of least squares
• Correlation is a special case of regression in which both variables are in standardized, or Z, units
• In correlation, the intercept is always 0
• The Pearson product-moment correlation coefficient is a ratio used to determine the degree of variation in one variable that can be estimated from knowledge about variation in the other variable

Testing the Statistical Significance of a Correlation Coefficient
- Begin with the null hypothesis that there is no relationship between the variables
- The null hypothesis is rejected if there is evidence that the association between the two variables is significantly different from 0
- The t distribution is not a single distribution but a family of distributions, each with its own degrees of freedom
- Degrees of freedom are defined as the sample size minus 2, or N − 2
- Two-tailed test

How to Interpret a Regression Plot
- Regression plots are pictures that show the relationship between variables
- A common use of correlation is to determine the criterion validity evidence for a test, or the relationship between a test score and some well-defined criterion
- Predicting the middle level of enjoyableness for everyone – because it is the level observed most frequently – is normative prediction: it uses info gained from representative groups
- Using the test as a predictor is not as good as perfect prediction, but it is still better than using the normative info alone
- A flat regression line (such as in Figure 3.9) shows that the test score tells us nothing about the criterion beyond the normative info

TERMS AND ISSUES IN THE USE OF CORRELATION

Residual
- The difference between the predicted and the observed values is called the residual (Y − Y′)
- An important property of the residual is that the sum of the residuals always equals 0
- The sum of the squared residuals is the smallest possible value, according to the principle of least squares

Standard Error of Estimate
- The standard deviation of the residuals is the standard error of estimate
- A measure of the accuracy of prediction
- Prediction is most accurate when the standard error of estimate is relatively small

Coefficient of Determination
- The correlation coefficient squared is known as the coefficient of determination
- Tells us the proportion of the total variation in scores on Y that we know as a function of information about X

Coefficient of Alienation
- The coefficient of alienation is a measure of nonassociation between two variables
- Computed as the square root of 1 − r², where r² is the coefficient of determination
- A high value means there is a high degree of nonassociation between the two variables

Shrinkage
- Tendency to overestimate the relationship, particularly if the sample of subjects is small
- Shrinkage is the amount of decrease observed when a regression equation is created for one population and then applied to another

Cross Validation
- Use the regression equation to predict performance in a group of subjects other than the ones on which the equation was developed
- Obtaining the standard error of estimate for the relationship between the values predicted by the equation and the values actually observed is called cross validation

The Correlation-Causation Problem
- Experiments are required to determine whether manipulation of one variable causes changes in another variable
- A correlation alone does not prove causality, although it might lead to other research that is designed to establish the causal relationships between variables

Third Variable Explanation
- A third variable (e.g., poor social adjustment) may cause both TV viewing and aggression
- The external influence is the third variable

Restricted Range
- Correlation and regression use variability on one variable to explain variability on a second variable
- Restricted range problem: correlation requires variability; if the variability is restricted, then significant correlations are difficult to find

Multivariate Analysis
- Multivariate analysis considers the relationships among combinations of three or more variables

General Approach
- A linear combination of variables is a weighted composite of the original variables
- Y′ = a + b1X1 + b2X2 + … + bkXk
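A short sketch (invented data) tying together the regression ideas above: the least-squares slope and intercept, residuals that sum to zero, the standard error of estimate, and r² as the coefficient of determination:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

# Least-squares slope (b) and intercept (a)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x
residuals = y - y_hat                              # sum to ~0 by construction
see = np.sqrt(np.sum(residuals ** 2) / (n - 2))    # standard error of estimate
r = np.corrcoef(x, y)[0, 1]

print(round(b, 3), round(a, 3))
print(round(residuals.sum(), 10))                  # essentially 0
print(round(see, 3), round(r ** 2, 3))             # r**2 = coefficient of determination
```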

CHAPTER 5: RELIABILITY
RELIABILITY
- Dependability and consistency
- Error implies that there will always be some inaccuracy in our measurements
- Tests that are relatively free of measurement error are deemed to be reliable
- Reliability estimates in the range of .70 and .80 are good enough for most purposes in basic research
- Reliability coefficient: an index that indicates the ratio between the true score variance on a test and the total variance
- HISTORY OF RELIABILITY:
  o Charles Spearman (1904): The Proof and Measurement of Association between Two Things
  o Then Thorndike
  o Item response theory has taken advantage of computer technology to advance psychological measurement significantly
  o Based on Spearman's ideas
- X = T + E (CLASSICAL TEST THEORY)
  o assumes that each person has a true score that would be obtained if there were no errors in measurement
  o The difference between the true score and the observed score results from measurement error
  o The assumption here is that errors of measurement are random
  o Basic sampling theory tells us that the distribution of random errors is bell-shaped
    • The center of the distribution should represent the true score, and the dispersion around the mean of the distribution should display the distribution of sampling errors
  o Classical test theory assumes that the true score for an individual will not change with repeated applications of the same test
  o Variance: standard deviation squared. Useful because it can be broken into components:
    • True variance: variance from true differences, which are assumed to be stable
    • Error variance: variance from random, irrelevant sources
- Standard error of measurement: because we assume that the distribution of random errors will be the same for all people, classical test theory uses the standard deviation of errors as the basic measure of error
  o The standard error of measurement tells us, on average, how much a score varies from the true score
  o The standard deviation of the observed scores and the reliability of the test are used to estimate the standard error of measurement
- Reliability: proportion of the total variance attributed to true variance
  o The greater the proportion of total variance attributed to true variance, the more reliable the test
- Measurement error: refers, collectively, to all of the factors associated with the process of measuring some variable, other than the variable being measured
  o Random error: a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process
    • This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores
  o Systematic error: a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured
    • Error is predictable and fixable
    • Does not affect score consistency

SOURCES OF ERROR VARIANCE
- TEST CONSTRUCTION
  o Item sampling or content sampling – refer to variation among items within a test as well as to variation among items between tests
    • The extent to which a testtaker's score is affected by the content sampled on a test and by the way the content is sampled (that is, the way in which the item is constructed) is a source of error variance
- TEST ADMINISTRATION
  o may influence the testtaker's attention or motivation
  o Environment variables, testtaker variables, examiner variables, level of professionalism
- TEST SCORING AND INTERPRETATION
  o Computer scoring and a growing reliance on objective, computer-scorable items have virtually eliminated error variance caused by scorer differences
  o However, other tools of assessment still require scoring by trained personnel
  o If subjectivity is involved in scoring, then the scorer can be a source of error variance
  o Despite rigorous scoring criteria set forth in many of the better-known tests of intelligence, examiners occasionally are still confronted by situations where an examinee's response lies in a gray area

TEST-RETEST RELIABILITY
- Also known as time-sampling reliability
- Correlating pairs of scores from the same group on two different administrations of the same test
- Appropriate for measuring something that is relatively stable over time
- Sources of error variance:
  o Passage of time: the longer the time that passes, the greater the likelihood that the reliability coefficient will be lower
  o Coefficient of stability: the test-retest estimate when the interval between testings is greater than 6 months
- Consider the possibility of a carryover effect: occurs when the first testing session influences scores from the second session
- If something affects all the testtakers equally, then the results are uniformly affected and no net error occurs
- Practice tests may make this effect happen
- Practice can also affect tests of manual dexterity
- The time interval between testing sessions must be selected and evaluated carefully
- Poor test-retest correlations do not always mean that a test is unreliable – they may suggest that the characteristic under study has changed

PARALLEL-FORMS OR ALTERNATE-FORMS RELIABILITY
- compares two equivalent forms of a test that measure the same attribute
- The two forms should be equally constructed: format, etc.
- When two forms of the test are available, one can compare performance on one form versus the other – equivalent-forms (parallel-forms) reliability
- Coefficient of equivalence: the degree of relationship between various forms of a test, evaluated by means of an alternate-forms correlation
- Parallel forms: for each form of the test, the means and variances of observed test scores are equal
- Alternate forms: different versions of a test that have been constructed so as to be parallel
- (1) two test administrations with the same group are required
- (2) test scores may be affected by factors such as motivation, etc.
- Problem: developing a new version of a test

INTERNAL CONSISTENCY
- How well does each item measure the content/construct under consideration?

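A quick sketch of the standard error of measurement described above (SEM = SD × √(1 − reliability)); the SD and reliability values here are assumed for illustration:

```python
import math

sd = 15.0           # assumed standard deviation of observed scores
reliability = 0.89  # assumed reliability coefficient

sem = sd * math.sqrt(1 - reliability)
observed = 106
# A rough 95% band around the observed score: observed ± 1.96 * SEM
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(round(sem, 2), (round(low, 1), round(high, 1)))
```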
- How consistent the items are with one another
- Used when tests are administered once
- If all items on a test measure the same construct, then the test has good internal consistency
- Split-half reliability, KR20, Cronbach alpha

SPLIT-HALF RELIABILITY
- Obtained by correlating two pairs of scores from equivalent halves of a single test administered once
- Useful when it is impractical to assess reliability with two tests or to administer a test twice
- Results of one half of the test are then compared with the results of the other
- Rules for splitting a test in half:
  o Do not divide the test in the middle, because it would lower the reliability
  o Different amounts of anxiety and differences in item difficulty shall also be considered
  o Randomly assign items to one or the other half of the test
  o Use the odd-even system: one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items
- To correct for half-length, apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test
  o Use this if the test user wishes to shorten a test
  o Used to determine the number of items needed to attain a desired level of reliability
- Reliability increases as the test length increases

KUDER-RICHARDSON FORMULAS (KR20/KR21)
- The Kuder-Richardson technique simultaneously considers all possible ways of splitting the items
- The formula for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1, is the Kuder-Richardson 20 (see p. 114)
- KR21 uses an approximation of the sum of the pq products based on the mean test score

CRONBACH ALPHA
- Cronbach developed a formula that estimates the internal consistency of tests in which the items are not scored as 0 or 1 – a more general reliability estimate, which he called coefficient alpha
- Sum the individual item variances
  o The most general method of finding estimates of reliability through internal consistency
- Domain sampling: define a domain that represents a single trait or characteristic, and each item is an individual sample of this general characteristic
- Factor analysis deals with the situation in which a test apparently measures several different characteristics
  o Good for the process of test construction
- Most widely used measure of reliability because it requires only one administration of the test
- Ranges from 0 to 1: "bigger is always better"

Other Methods of Estimating Internal Consistency
- Inter-item consistency: refers to the degree of correlation among all the items on a scale
  o A measure of inter-item consistency is calculated from a single administration of a single form of a test
  o An index of inter-item consistency, in turn, is useful in assessing the homogeneity of the test
  o Tests are said to be homogeneous if they contain items that measure a single trait
  o Homogeneity: the degree to which a test measures a single factor
  o Heterogeneity: the degree to which a test measures different factors
  o Ex: a test that assesses knowledge only of 3-D television repair skills (homogeneous) vs. a general electronics repair test (heterogeneous)
  o The more homogeneous a test is, the more inter-item consistency it can be expected to have
  o Test homogeneity is desirable because it allows relatively straightforward test-score interpretation
  o Testtakers with the same score on a homogeneous test probably have similar abilities in the area tested
  o Testtakers with the same score on a heterogeneous test may have quite different abilities
  o However, a homogeneous test is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality

Measures of Inter-Scorer Reliability
- In some types of tests under some conditions, the score may be more a function of the scorer than of anything else
- Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure
- Coefficient of inter-scorer reliability: a coefficient of correlation used to determine the degree of consistency among scorers in the scoring of a test
- The kappa statistic is the best method for assessing the level of agreement among several observers
  o Indicates the actual agreement as a proportion of the potential agreement following the correction for chance agreement
  o Cohen's kappa – 2 raters
  o Fleiss' kappa – 3 or more raters

HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS
- Homogeneous items have a high degree of reliability

DYNAMIC VS. STATIC CHARACTERISTICS
- Dynamic: trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
- Static: trait, state, or ability that is relatively unchanging

RESTRICTION OR INFLATION OF RANGE
- If the range is restricted, reliability tends to be lower
- If the range is inflated, reliability tends to be higher

SPEED TESTS VS. POWER TESTS
- Speed test: homogeneous test with easy items but a short time limit
- Power test: few items, but more complex

CRITERION-REFERENCED TESTS
- Provide an indication of where a testtaker stands with respect to some variable or criterion
- Tend to contain material that has been mastered in hierarchical fashion
- Scores here tend to be interpreted in pass-fail terms
- A measure of reliability depends on the variability of the test scores: how different the scores are from one another

The Domain Sampling Model
- This model considers the problems created by using a limited number of items to represent a larger and more complicated construct
- Our task in reliability analysis is to estimate how much error we would make by using the score from the shorter test as an estimate of the true ability
- Conceptualizes reliability as the ratio of the variance of the observed score on the shorter test and the variance of the long-run true score
- Reliability can be estimated from the correlation of the observed test score with the true score

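A sketch (invented 0/1 responses) of three internal-consistency computations named above: the Spearman-Brown correction, KR-20, and coefficient alpha. With population (ddof=0) variances, alpha reduces exactly to KR-20 for dichotomous items:

```python
import numpy as np

def spearman_brown(r, n):
    """Estimated reliability when test length is changed by a factor of n."""
    return n * r / (1 + (n - 1) * r)

# Half-test correlation of .70, corrected to full (double) length:
print(round(spearman_brown(0.70, 2), 3))  # 0.824

scores = np.array([   # rows = testtakers, columns = dichotomous items
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
])
k = scores.shape[1]
totals = scores.sum(axis=1)
var_total = totals.var()           # population variance of total scores

p = scores.mean(axis=0)            # proportion answering each item correctly
kr20 = (k / (k - 1)) * (1 - np.sum(p * (1 - p)) / var_total)

item_vars = scores.var(axis=0)     # for 0/1 items these equal p * (1 - p)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / var_total)
print(round(kr20, 3), round(alpha, 3))   # identical for dichotomous items
```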

Item Response Theory


- Classical test theory requires that exactly the same test items be
administered to each person – BAD
- Item response theory (IRT) is newer – computer is used to focus
on the range of item difficulty that helps assess an individual’s
ability level
o More reliable estimate of ability is obtained using a
shorter test with fewer items
o Takes a lot of items and effort

Generalizability theory
- based on the idea that a person's test scores vary from testing to testing
because of variables in the testing situation
- Instead of conceiving of all variability in a person's scores as error,
Cronbach encouraged test developers and researchers to describe the
details of the particular test situation or universe leading to a specific
test score
- This universe is described in terms of its facets: which include things like
the number of items in the test, the amount of training the test scorers
have had, and the purpose of the test administration
- According to generalizability theory, given the exact same conditions of
all the facets in the universe, the exact same test score should be
obtained
- Universe score: the test score obtained; it is analogous to the true
score in the true score model
- Cronbach suggested that tests be developed with the aid of a
generalizability study followed by a decision study
- Generalizability study: examines how generalizable scores from a
particular test are if the test is administered in different situations
- How much of an impact different facets of the universe have on the test
score
- Ex: is the test score affected by group as opposed to individual
administration
- Coefficients of generalizability: represent the influence of particular
facets on the test score. These coefficients are similar to
reliability coefficients in the true score model
- Decision study: developers examine the usefulness of test scores in
helping the test user make decisions
- The decision study is designed to tell the test user how test scores
should be used and how dependable those scores are as a basis for
decisions, depending on the context of their use

What to Do About Low Reliability


- Two common approaches are to increase the length of the test
and to throw out items that run down the reliability
- Another procedure is to estimate what the true correlation would
have been if the test did not have measurement error
Increase the Number of Items
- The larger the sample, the more likely that the test will represent
the true characteristic
o This could entail a long and costly process however
- Prophecy formula
Factor and Item Analysis
- Reliability of a test depends on the extent to which all of the items
measure one common characteristic
- Factor analysis
o Tests are most reliable if they are unidimensional: one
factor should account for considerably more of the
variance than any other factor
- Or examine the correlation between each item and the total score
for the test
o Called discriminability analysis: when the correlation
between the performance on a single item and the
total test score is low, the item is probably measuring
something different from the other items on the test

Correction for Attenuation


- Potential correlations are attenuated, or diminished, by
measurement error
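The standard correction-for-attenuation formula divides the observed correlation by the square root of the product of the two tests' reliabilities; a worked sketch with assumed values:

```python
import math

r_xy = 0.40   # observed correlation between tests X and Y
r_xx = 0.70   # assumed reliability of X
r_yy = 0.80   # assumed reliability of Y

# Estimated correlation between the two TRUE scores (disattenuated)
r_true = r_xy / math.sqrt(r_xx * r_yy)
print(round(r_true, 3))  # ~0.535, higher than the observed 0.40
```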

CHAPTER 6: VALIDITY
The Concept of Validity
- Validity: as applied to a test, a judgment or estimate of how well a test measures what it purports to measure in a particular context
  o Judgment based on evidence about the appropriateness of inferences drawn from test scores
  o Validity of a test must be shown from time to time to account for culture and advancement
- Inference: a logical result or deduction
- The validity of tests and test scores is described in terms such as "acceptable" or "weak"
- Validation: the process of gathering and evaluating evidence about validity
  o Test user and testtaker both have roles in the validation of a test
  o Test users may conduct their own validation studies: may yield insights regarding a particular population of testtakers as compared to the norming sample (in the manual)
  o Local validation studies: absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test
- Types of Validity (Trinitarian view) *not mutually exclusive: all contribute to a unified picture of a test's validity; critique: the approach is fragmented and incomplete
  o Content validity: measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test
  o Criterion-related validity: measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures
  o Construct validity: measure of validity arrived at by executing a comprehensive analysis of (an umbrella validity – every other variety of validity falls under it):
    • How scores on the test relate to other test scores and measures
    • How scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure
- Strategies: ways of approaching the process of test validation
  o Content validation strategies
  o Criterion-related validation strategies
  o Construct validation strategies
- Face Validity
  o Face validity: relates more to what a test appears to measure to the person being tested than to what the test actually measures
  o A judgment concerning how relevant the test items appear to be – usually from the testtaker, not the test user
  o Lack of face validity = lack of confidence in the perceived effectiveness of the test, which decreases testtaker motivation/cooperation; *the test may still be useful
- Content Validity
  o Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample
    • Ideally, test developers have a clear vision of the construct being measured; that clarity is reflected in the content validity of the test
  o Test blueprint: structure of the evaluation; a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, etc.
    • Behavior observation is a technique frequently used in test blueprinting
  o The quantification of content validity
    • Important in employment settings → tests used to hire and promote
    • One method for gauging agreement among raters or judges regarding how essential a particular item is (C. H. Lawshe):
      "Is the skill or knowledge measured by this item…
        o Essential
        o Useful but not essential
        o Not necessary
      …to the performance of the job?"
    • Content validity ratio (CVR):
      CVR = (ne − N/2) / (N/2)
        o CVR = content validity ratio
        o ne = number of panelists stating "essential"
        o N = total number of panelists
    • CVR is calculated for each item
  o Culture and the relativity of content validity
    • Tests are often thought of as either valid or invalid
    • What constitutes historical fact depends to some extent on who is writing the history
    • Cultural relativity
    • Politics (political correctness)

Criterion-Related Validity
- Criterion-related validity: a judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest (the measure of interest being the criterion)
- 2 types:
  o Concurrent validity: an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently)
  o Predictive validity: an index of the degree to which a test score predicts some criterion measure
- What Is a Criterion?
  o Criterion: a standard on which a judgment or decision may be based; the standard against which a test or a test score is evaluated (in criterion-related validity)
  o Characteristics of a criterion:
    • Relevant → pertinent or applicable to the matter at hand
    • Valid (for the purpose for which it is being used)
    • Uncontaminated → criterion contamination: term applied to a criterion measure that has been based, at least in part, on predictor measures
- Concurrent Validity
  o Test scores are obtained at about the same time as the criterion measures; measures of the relationship between the test scores and the criterion provide evidence of concurrent validity
  o Indicates the extent to which test scores may be used to estimate an individual's present standing on a criterion
  o Once the validity of the inference from the test scores is established, the test provides a faster, less expensive way to offer a diagnosis or a classification decision
  o Concurrent validity of a test can be explored with respect to another test
    • Prior research must have satisfactorily demonstrated the first test's validity
    • First test = validating criterion
- Predictive Validity
  o Test scores may be obtained at one time and the criterion measures obtained at a future time, usually after some intervening event has taken place
    • Intervening event → training, experience, therapy, medication, etc.
    • Measures of the relationship between the test scores and a criterion measure obtained at a future time provide an indication of the predictive validity of the test (how accurately scores on the test predict some criterion measure)
  o Ex: SAT test score and freshman GPA
  o Judgments of criterion validity are based on 2 types of statistical evidence:
    • The validity coefficient
      • Validity coefficient: a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure

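A quick numeric check of Lawshe's CVR formula from the content-validity discussion above; the panel counts are invented:

```python
def cvr(n_essential, n_panelists):
    """Lawshe's content validity ratio: (ne - N/2) / (N/2)."""
    return (n_essential - n_panelists / 2) / (n_panelists / 2)

# 8 of 10 panelists rate an item "essential"
print(cvr(8, 10))   # 0.6
# 5 of 10 (exactly half) gives 0.0; all 10 gives the maximum, 1.0
```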
      • Ex: the Pearson correlation coefficient used to determine validity between 2 measures (r)
      • Affected by restriction or inflation of range
      • Is the range of scores employed appropriate to the objective of the correlational analysis?
      • No fixed rules regarding the validity coefficient (how high or low it should/could be for a test to be valid)
    • Incremental validity
      o More than one predictor
      o Incremental validity: the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use
    • Expectancy data
      • Expectancy data: provide info that can be used in evaluating the criterion-related validity of a test
      • Score obtained on expectancy tests/tables → likelihood that the testtaker will score within some interval of scores on a criterion measure ("passing," "acceptable," etc.)
      • Expectancy table: shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion
        o May be created from a scatterplot
        o Shows relationships
      • Expectancy chart: graphic representation of an expectancy table
        o The higher the initial rating, the greater the probability of job/academic success
    • Taylor-Russell tables – provide an estimate of the extent to which inclusion of a particular test in the selection system will actually improve selection
      • Selection ratio – relationship between the number of people to be hired and the number of people available to be hired
      • Base rate – percentage of people hired under the existing system for a particular position
      • The relationship between predictor and criterion must be linear
    • Naylor-Shine tables – use the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures
  o Decision theory and test utility
    • Base rate – extent to which a particular trait, behavior, characteristic, or attribute exists in the population
    • Hit rate – the proportion of people a test accurately identifies as possessing or exhibiting a particular trait
    • Miss rate – the proportion of people the test fails to identify as having or not having that attribute
      • False positive (Type I error) – predicted to possess the attribute but actually does not. Ex: scored above the cutoff score and was hired, but failed at the job
      • False negative (Type II error) – predicted not to possess the attribute but actually does. Ex: scored below the cutoff score and was not hired, but could have been successful at the job

- Construct Validity
  o Construct validity: a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct
    • Construct: an informed, scientific idea developed or hypothesized to describe or explain behavior
      • Ex: intelligence, depression, motivation, personality, etc.
      • Unobservable, presupposed (underlying) traits that a test developer invokes to describe test behavior/criterion performance
    • Viewed as the unifying concept for all validity evidence
  o Evidence of Construct Validity
    • Various techniques of construct validation provide evidence that:
      • The test is homogeneous, measuring a single construct
      • Test scores increase/decrease as a function of age, passage of time, or experimental manipulation (as theoretically predicted)
      • Test scores obtained after some event or the passage of time differ from pretest scores (as theoretically predicted)
      • Test scores obtained by people from distinct groups vary (as theoretically predicted)
      • Test scores correlate with scores on other tests (as theoretically predicted)
    • Evidence of homogeneity
      • Homogeneity: refers to how uniform a test is in measuring a single concept
      • Evidence → correlations between subtest scores and total test scores
      • Item-analysis procedures have been used in the quest for test homogeneity
      • Desirable but not necessary
      • Contributes no info about how the construct being measured relates to other constructs
    • Evidence of changes with age
      • If a test purports to measure a construct that changes over time, then the test scores, too, should show progressive changes if the test is to be considered a valid measure of the construct
      • Does not in itself provide info about how the construct relates to other constructs
    • Evidence of pretest-posttest changes
      • Can be evidence of construct validity
      • Some typical intervening experiences responsible for changes in test scores:
        o Formal education
        o Therapy/medication
        o Any life experience
    • Evidence from distinct groups / method of contrasted groups
      • Method of contrasted groups: one way of providing evidence for the validity of a test is to demonstrate that scores on the test vary in a predictable way as a function of membership in some group

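A sketch of the decision-theory quantities above (hit rate, miss rate, false positives, false negatives), computed from an invented 2×2 table of selection decision versus actual criterion outcome:

```python
# Invented counts: selection decision vs. actual job success
true_pos = 40    # selected and succeeded (hit)
false_pos = 10   # selected but failed (Type I error)
true_neg = 35    # rejected and would have failed (hit)
false_neg = 15   # rejected but would have succeeded (Type II error)

n = true_pos + false_pos + true_neg + false_neg
hit_rate = (true_pos + true_neg) / n     # proportion correctly identified
miss_rate = (false_pos + false_neg) / n  # proportion misclassified
print(hit_rate, miss_rate)               # 0.75 0.25
```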
 Rationale if a test is a valid measure on the part of the rater to be lenient
of a particular construct, test scores in scoring, marking, and/or grading
from groups of people who would  Severity error: rater exhibits general
presumed with respect to that and systematic reluctance to giving
construct should have correspondingly ratings at either the positive or
different test scores negative extreme
 Convergent evidence  Overcome restriction of range rating errors is to
 Evidence for the construct validity of a use rankings: procedure that requires the rater to
particular test may converge from a measure individuals against one another instead
number of sources, such as tests or of against an absolute scale
measures designed to assess the  Rater is forced to select 1st, 2nd, 3rd, etc.
same/similar construct  Halo effect: fact that for some raters, some rates
 Convergent evidence: scores on a test can do no wrong
undergo construct validity and  Tendency to give a particular ratee a
correlate highly in the predicted higher rating than he or she
direction with scores on older, more objectively deserves
established and already validated tests  Criterion data may be influenced by
designed to measure the same/similar rater’s knowledge of ratee race,
construct gender, etc.
 Discriminant evidence o Test fairness
 Discriminant evidence: validity  Issues of fairness tend to be more difficult and
coefficient showing little relationship involve values
between test scores and /or other  Fairness: the extent to which a test is used in an
variables with which scores on the test impartial, just, and equitable way
being construct-validated should not  Sources of misunderstanding
theoretically be correlated  Discrimination
 Provides evidence of construct validity  Group not included in standardization
 Multitrait-multimethod matrix: “two sample
or more traits”, “two or more  Performance differences between
methods” matrix/table that results identified groups
from correlating variables (traits)
within and between methods Relationship Between Reliability and Validity
 Factor analysis - A test should not correlate more highly with any other variable
 Factor analysis: shorthand term for a than it correlates with itself
class of mathematical procedures - A modest correlation between the true scores on two traits may
designed to identify factors or specific be missed if the test for each of the traits is not highly reliable
variables that are typically attributes, - We can have reliability without validity
characteristics, or dimension on which o It is impossible to demonstrate that an unreliable test
people may differ is valid
 Frequently used as a data reduction
method in which several sets of scores
and correlations between them are
analyzed
 Exploratory factor analysis:
researchers test the degree to which a
hypothetical model fits the actual data
o Factor loading: conveys
information about the
extent to which the factor
determines the test score
or scores
o Complex procedures
- Validity, Bias, and Fairness
o Test Bias
 Bias: a factor inherent in a test that systematically
prevents accurate, impartial measurement
 Technical means to identify and remedy bias
(mathematically)
 Bias implies systematic variation
 Rating error
 Rating: a numerical or verbal
judgment (or both) that places a
person or an attribute along a
continuum identified by a scale of
numerical or word descriptions,
known as a rating scale
 Rating error: judgment resulting from
intentional or unintentional misuse of
a rating scale
 Leniency error/generosity error: error
in rating that arises from the tendency

CHAPTER 7: UTILITY
Utility: usefulness or practical value of testing; used to improve efficiency

Factors that Affect a Test's Utility
• Psychometric Soundness
  o Reliability and validity of a test
  o Gives us the practical value of the scores (reliability and validity)
  o They tell us whether decisions are cost-effective
  o A valid test is not always a useful test
    • especially if testtakers do not follow test directions
• Costs
  o Economic and noneconomic
  o Ex.) using a less expensive and therefore less stringent application process for airline personnel
• Benefits
  o Profits, gains, advantages
  o Ex.) more stringent hiring policy → more productive employees
  o Ex.) maintaining a successful academic environment at a university

Utility Analysis

What is Utility Analysis?
- a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment

Utility analysis: An illustration
What's the company's goal?
• Limit the cost of selection
  o Don't use FERT
• Ensure that qualified candidates are not rejected
  o Set a cut score that yields the lowest false negative rate
• Ensure that all candidates selected will prove to be qualified
  o Lowest false positive rate
• Ensure, to the extent possible, that qualified candidates will be selected and unqualified candidates will be rejected
  o False positives are no better or worse than false negatives
  o Highest hit rate and lowest miss rate

How Is a Utility Analysis Conducted?
- the objective dictates what sort of information will be required as well as the specific methods to be used
• Expectancy Data
  o An expectancy table provides an indication of the likelihood that a testtaker will score within some interval of scores on a criterion measure
  o Used to weigh costs vs. benefits
• Brogden-Cronbach-Gleser formula
  o Utility gain: estimate of the benefit of using a particular test or selection method
  o Most simply, benefits minus costs
  o Productivity gain: estimated increase in work output

Some Practical Considerations
• The Pool of Job Applicants
  o There is rarely a limitless supply of potential employees
  o Dependent on many factors, including the economic environment
  o We assume that top-scoring individuals will accept the job, but those individuals are more likely to be the ones being offered higher positions
• The Complexity of the Job
  o It is questionable whether the same utility analysis methods can be used to measure eligibility for jobs of varying complexity
• The Cut Score in Use
  o Relative cut score: may be defined as a reference point
    • Based on norm-related considerations rather than on the relationship of test scores to a criterion
    • Also called a norm-referenced cut score
    • Ex.) top 10% of test scores get A's
  o Fixed cut score: set with reference to a judgment concerning a minimum level of proficiency required to be included in a particular classification
    • Also called absolute cut scores
  o Multiple cut scores: using two or more cut scores with reference to one predictor for the purpose of categorizing testtakers
    • Ex.) cut scores that mark an A, B, C, etc., all measuring the same predictor
  o Multiple hurdles: for success, requires one individual to complete many tasks, with elimination at each level
    • Ex.) written application → group interview → personal interview, etc.
  o Compensatory model of selection: the assumption is made that high scores on one attribute can compensate for low scores on another attribute

Methods for Setting Cut Scores

The Angoff Method
- Judgments of experts are averaged

The Known Groups Method
- Collection of data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability
- Cut score based on the score that best discriminates the two groups' performance

IRT-Based Method
- Based on the testtaker's performance across all items on a test
- Some portion of test items must be correct
- Item-mapping method: determining the difficulty level reflected by the cut score
- Bookmark method: test items are listed, one per page, in ascending level of difficulty. An expert places a bookmark to mark the divide which separates testtakers who have acquired the minimal knowledge, skills, or abilities from those who have not.
- Problems include training of experts, possible floor and ceiling effects, and the optimal length of item booklets

Other Methods
- Discriminant analysis: a family of statistical techniques used to shed light on the relationship between certain variables and two or more naturally occurring groups
  o ex.) the relationship between scores on tests and people judged to be successful or unsuccessful at a job

Taylor-Russell Tables
- help the test user obtain an estimate of the percentage of employees hired by use of a particular test who will likely be successful at their job
- 3 variables: test validity, selection ratio, and base rate

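A minimal sketch of the Angoff method noted above. In its usual form each expert judges, item by item, the probability that a minimally competent testtaker would answer correctly; the averaged judgments are summed into a cut score. The ratings here are invented:

```python
import numpy as np

# rows = experts, cols = items; entries = judged P(correct) for a
# minimally competent candidate
judgments = np.array([
    [0.90, 0.70, 0.50, 0.80],
    [0.80, 0.60, 0.60, 0.90],
    [0.85, 0.65, 0.55, 0.85],
])

# Average across experts per item, then sum: expected raw score at the cut
cut_score = judgments.mean(axis=0).sum()
print(round(cut_score, 2))   # 2.9 out of 4 items
```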
CHAPTER 8: TEST DEVELOPMENT


STEPS:
1. TEST CONCEPTUALIZATION
2. TEST CONSTRUCTION
3. TEST TRYOUT
4. ITEM ANALYSIS
5. TEST REVISION

TEST CONCEPTUALIZATION
- The thought or stimulus for a new test could be almost anything.
- An emerging social phenomenon or pattern of behavior might serve as the stimulus for the development of a new test.
- Norm-referenced: an item for which high scorers on the test respond correctly; low scorers respond to that same item incorrectly
- Criterion-referenced: high scorers on the test get a particular item right, whereas low scorers on the test get that same item wrong
- Pilot work: pilot study or pilot research. Conducted to know whether some items should be included in the final form of the instrument.
  o the test developer typically attempts to determine how best to measure a targeted construct

TEST CONSTRUCTION
- Scaling: the process of setting rules for assigning numbers in measurement
- L. L. Thurstone: credited with being at the forefront of efforts to develop methodologically sound scaling methods

TYPES OF SCALES:
- Nominal, ordinal, interval, or ratio
- Age-based scale
- Grade-based scale
- Stanine scale (raw score converted to 1-9)
- Unidimensional vs. multidimensional
  o Unidimensional: measuring one construct
  o Multidimensional: measuring more than one construct
- Comparative vs. categorical
  o Comparative scaling: entails judgments of a stimulus in comparison with every other stimulus on the scale
  o Categorical scaling: stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
- Rating scale: can be defined as a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker
- Summative scale: the final score is obtained by summing the ratings across all the items
- Likert scale: each item presents the testtaker with five alternative responses, usually on an agree-disagree or approve-disapprove continuum
- Method of paired comparisons: the testtaker is presented with two stimuli and asked to compare them
- Comparative scaling: judging a stimulus in comparison with every other stimulus on the scale
- Categorical scaling: the testtaker places stimuli into a category; those categories differ quantitatively on a spectrum
- Guttman scale (scalogram analysis): items range from sequentially weaker to stronger expressions of an attitude, belief, or feeling. A testtaker who agrees with the stronger statement is assumed to also agree with the milder statements
- Equal-appearing intervals (Thurstone): direct estimation, because there is no need to transform the testtaker's responses to another scale

WRITING ITEMS
- 3 questions for the test developer:
  o What range of content should the items cover?
  o Which of the many different types of item formats should be employed?
  o How many items should be written in total and for each content area covered?
- Item pool: reservoir from which items will or will not be drawn for the final version of the test (should contain about double the number of items the final version will have)
- Item format
  o Item format: variables such as the form, plan, structure, arrangement, and layout of individual test items
  o 2 types:
  o 1.) selected-response format: the testtaker selects a response from a set of alternative responses
    • includes multiple-choice, true-false, and matching
  o 2.) constructed-response format: the testtaker supplies or creates the correct answer
    • includes completion item, short answer, and essay
- Writing items for computer administration
  o Item bank: a relatively large and easily accessible collection of test questions
  o Computerized Adaptive Testing (CAT): an interactive, computer-administered testtaking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items
  o Floor effect: the diminished utility of an assessment tool for distinguishing testtakers at the low end of the ability, trait, or other attribute being measured
  o Ceiling effect: the diminished utility of an assessment tool for distinguishing testtakers at the high end of the ability, trait, or other attribute being measured
  o Item branching: the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items

SCORING ITEMS
- Cumulative scoring: testtakers earn cumulative credit with regard to a particular construct
- Class/category scoring: testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way
- Ipsative scoring: comparing a testtaker's score on one scale within a test to another scale within that same test
  o ex.) "John's need for achievement is higher than his need for affiliation"

ITEM WRITING (KAPLAN BOOK)
Item Writing
- Personality and intelligence tests require different sorts of responses
- Guidelines for item writing:
  o Define clearly what you want to measure
  o Generate an item pool
  o Avoid exceptionally long items
  o Keep the level of reading difficulty appropriate for those who will complete the scale
  o Avoid "double-barreled" items that convey two or more ideas at the same time
  o Consider mixing positively and negatively worded items
- Must be sensitive to ethnic and cultural differences
- Items that retain their reliability are more likely to focus on skills, while those that lose reliability focus on more abstract concepts

Item Formats
- The simplest test uses the dichotomous format
The Dichotomous Format
- The dichotomous format offers two alternatives for each item
  o i.e., a true-false examination
- Advantages:
  o Simplicity
  o True-false items require absolute judgment
- Disadvantages:
  o True-false items encourage students to memorize material
  o "truth" often comes in shades of gray
  o the mere chance of getting any item correct is 50%
- Yes-no format on personality tests
- Multiple-choice = polytomous
The Polytomous Format
- The polytomous format resembles the dichotomous format except that each item has more than two alternatives
  o Multiple-choice exams
- Advantage:

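A small sketch of the summative (Likert) scoring described above, including the common practice of reverse-scoring negatively worded items; the responses and reverse-keyed item positions are hypothetical:

```python
responses = [5, 4, 2, 5, 1]   # one testtaker's answers on a 1-5 Likert scale
reverse_keyed = {2, 4}        # item indices worded negatively (hypothetical)

score = sum(
    (6 - r) if i in reverse_keyed else r   # 6 - r flips a 1-5 rating
    for i, r in enumerate(responses)
)
print(score)   # 23: the summative scale score
```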


  o Little time for test takers to respond to a particular item because they do not have to write
- Incorrect choices are called distractors
- Disadvantages:
  o How many distractors should a test have? → 3 or 4
  o Distractors can hurt the reliability/validity of the test
  o Three-alternative multiple-choice items may be better than five-alternative items because they retain the psychometric value but take less time to develop and administer
  o Scoring of MC exams? → simple guessing will yield some correct answers
  o Correcting for this, though, the expected score is 0 – as getting a question wrong loses you a point
- Guessing can be good if you can narrow down a couple of answers
- Students are more likely to guess when they anticipate a lower grade on a test than when they are more confident
- The guessing threshold describes the chances that a low-ability test taker will obtain each score
- True-false and MC tests are common in educational and achievement testing
- The Likert format, category scale, and Q-sort are used for personality-attitude tests

Likert Format
- Likert format: requires that a respondent indicate the degree of agreement with a particular attitudinal question
  o Strongly disagree ... Strongly agree
  o For measurements of attitude
- Used to create Likert scales: the scales require assessment of item discriminability
- Familiar and easy → likely to remain popular in personality and attitude tests

Category Format
- Category format: uses more choices than Likert; e.g., a 10-point rating scale
- Disadvantage: responses to items on 10-point scales are affected by the groupings of the people or things being rated
- People change their ratings depending on context
  o This problem can be avoided if the endpoints of the scale are clearly defined and the subjects are frequently reminded of the definitions of the endpoints
- Optimal number of points is 7?
  o The number depends on the fineness of the discrimination that subjects are willing to make
  o When people are highly involved with some issue, they will tend to respond best to a greater number of categories
- Increasing the number of response categories may not increase reliability and validity
- Visual analogue scale: the respondent is given a 100-millimeter line and asked to place a mark between two well-defined endpoints
  o Used for measures like self-rated health

Checklists and Q-Sorts
- Adjective Checklist: the subject receives a long list of adjectives and indicates whether each one is characteristic of himself or herself
  o Requires subjects either to endorse such adjectives or not, thus allowing only two choices for each item
- Q-Sort: increases the number of categories
  o Used to describe oneself or to provide ratings of others

Other Possibilities
- Forced-choice and Likert formats are clearly the most popular in contemporary tests and measures
- Checklists have fallen out of favor because they are more prone to error than are formats that require responses to every item
- Frequent advice is to not use "all of the above" as a response option

TEST TRYOUT
What is a good item?
  o Reliable and valid
  o Helps to discriminate testtakers

ITEM ANALYSIS
  o The Item-Difficulty Index
    o Obtained by calculating the proportion of the total number of testtakers who answered the item correctly: "p"
    o Higher p = easier item
    o "Difficulty" can be replaced with "endorsement" in non-achievement tests
    o The midpoint representing the optimal difficulty is obtained by summing the chance-of-success proportion and 1.00 and then dividing the sum by 2
  o Item Reliability Index
    o An indication of the internal consistency of a test
    o Equal to the product of the item-score standard deviation (s) and the correlation (r) between the item score and the total test score
    o Factor analysis and inter-item consistency
    o Factor analysis determines whether items on a test appear to be measuring the same thing
  o The Item-Validity Index
    o A statistic designed to provide an indication of the degree to which a test is measuring what it purports to measure
    o Requires: the item-score standard deviation and the correlation between the item score and the criterion score
  o The Item-Discrimination Index
    o Measures how adequately an item separates or discriminates between high scorers and low scorers: "d"
    o compares performance on a particular item with performance in the upper and lower regions of a distribution of continuous test scores
    o higher d means a greater number of high scorers answered the item correctly
    o negative d means low-scoring examinees are more likely to answer the item correctly than high-scoring examinees
    o Analysis of item alternatives
  o Item-Characteristic Curves
    o Graphic representation of item difficulty and discrimination
  o Other Considerations in Item Analysis
    o Guessing
    o Usually in some direction
    o Depends on an individual's willingness to take risks
    o Item fairness
    o Bias
    o Speed tests
    o Last items will appear to be more difficult because not everyone got to them

Qualitative Item Analysis
• Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures
• Qualitative item analysis: various nonstatistical procedures designed to explore how individual test items work
  o Through means like interviews and group discussions
• "Think aloud" test administration
  o an approach to cognitive assessment that entails respondents vocalizing thoughts as they occur
  o used to shed light on the testtaker's thought processes during the administration of a test
• Expert panels
  o Sensitivity review: a study of test items in which they are examined for fairness to all prospective testtakers as well as for the presence of offensive language, stereotypes, or situations

ITEM ANALYSIS (KAPLAN BASED)
The Extreme Group Method
- Compares people who have done well with those who have done poorly on a test
- The difference between these proportions is called the discrimination index
The Point Biserial Method

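A sketch (invented 0/1 response matrix) of the item-difficulty index p and a simple upper-versus-lower-group discrimination index d described above:

```python
import numpy as np

scores = np.array([   # rows = testtakers, cols = items (1 = correct)
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
])

p = scores.mean(axis=0)   # item difficulty: proportion answering correctly
# For a 4-option MC item, optimal difficulty would be (0.25 + 1.00) / 2 = 0.625
print(p)

totals = scores.sum(axis=1)
order = np.argsort(totals)
lower, upper = order[:3], order[-3:]    # bottom and top scorers
d = scores[upper].mean(axis=0) - scores[lower].mean(axis=0)
print(d)   # higher d = the item better separates high from low scorers
```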


- Find the correlation between performance on the item and - One challenge in test applications is how to determine linkages
performance on the total test between two different measures
- Correlation between a dichotomous variable and a continuous Items for Criterion-Referenced Tests
variable is called a point biserial correlation - Traditional use of tests requires that we determine how well
- On tests with only a few items, using this is problematic because someone has done on a test by comparing the person’s
performance on the item contributes to the total test score performance to that of others
Pictures of Item Characteristics - Criterion-referenced tests compares performance with some
- Valuable way to learn about items is to graph their characteristics, clearly defined criterion for learning
which you can do with the item characteristic curve o Popular approach in individualized instruction
- Prepare a graph for each individual test item programs
o Total test score is used as an estimate of the amount of o Regarded as diagnostic instruments
a ‘trait’ possessed by individuals - First step in developing these tests involves clearly specifying the
- Relationship between performance on the item and performance objectives by writing clear and precise statements about what the
on the test gives some info about how well the item is tapping the learning program is attempting to achieve
info we want - To evaluate the items: one should give the test to two groups of
Drawing the Item Characteristic Curve students – one that has been exposed to the learning unit and one
- To draw this, we need to define discrete categories of test that has not
performance - Bottom of the V is the antimode – the least frequent score
- If the test has been given to many people, we might choose to - This point divides those who have been exposed to the unit from
make each test score a single category those who have not been exposed and is usually taken as the
- Gradual positive slope of the line demonstrates that the cutting score or point, or what marks the point of decision
proportion of people who pass the item gradually increases as test - When people get scores higher than the antimode, we assume
scores increase that they have met the objective of the test
o This means that the item successfully discriminates at Limitations of Item Analysis
all levels of test performance - Main Problem: though statistical methods for item analysis tell the
- Ranges in which the curve changes suggest that the item is sensitive, while flat ranges suggest areas of low sensitivity
- Item analysis breaks the general rule that increasing the number of items makes a test more reliable
- When bad items are eliminated, the effects of chance responding can be eliminated and the test can become more efficient, reliable, and valid
Item Response Theory
- According to classical test theory, a score is derived from the sum of an individual's responses to various items, which are sampled from a larger domain that represents a specific trait or ability
- Newer approaches that consider the chances of getting particular items right or wrong – item response theory (IRT) – make extensive use of item analysis
o With IRT, each item on a test has its own item characteristic curve that describes the probability of getting each particular item right or wrong given the ability level of each test taker
o Testers can make an ability judgment without subjecting the test taker to all of the test items
- Technical advantage: IRT builds on traditional models of item analysis and can provide info on item functioning, the value of specific items, and the reliability of a scale
- The two dimensions used are difficulty and discriminability
- The most attractive advantage is that IRT tests can easily be adapted for computer administration
o A computer can rapidly identify the specific items that are required to assess a particular ability level
- "Peaked conventional" tests
- "Rectangular conventional" tests – require that test items be selected to create a wide range in level of difficulty
o Problem: only a few items of the test are appropriate for individuals at each ability level; many test takers spend much of their time responding to items either considerably below their ability level or too difficult to solve
- IRT addresses traditional problems in test construction well
- IRT can identify respondents with unusual response patterns and offer insights into the cognitive processes of the test taker
- It may also reduce biases against people who are slow in completing test problems
External Criteria
- Item analysis has been persistently plagued by researchers' continued dependence on internal criteria, or total test score, for evaluating items
Linking Uncommon Measures
- Although item analysis tells the test constructor which items do a good job of separating students, it does not help the students learn
- Although the data are available to give the child feedback on the "bug" in his or her thinking, nothing in the testing procedure initiates this guidance
TEST REVISION
Test Revision in the Life Cycle of an Existing Test
 Tests get old and need revision
 Questions arise over the equivalence of the two versions of a test
 Cross-validation and co-validation
o Cross-validation: revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion
o Validity shrinkage: decrease in item validities that inevitably occurs after cross-validation of findings
o Co-validation: test validation process conducted on two or more tests using the same sample of testtakers
o Co-norming: when co-validation is used in conjunction with the creation of norms or the revision of existing norms
o Quality assurance during test revision
 Test givers must have some degree of qualification, training, and testing
 Anchor protocol: test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies
 Scoring drift: a discrepancy between scoring in an anchor protocol and the scoring of another protocol
The Use of IRT (Item Response Theory) in Building and Revising Tests
 Evaluating the properties of existing tests and guiding test revision
 Determining measurement equivalence across testtaker populations
o Differential item functioning (DIF): phenomenon wherein an item functions differently in one group of testtakers as compared to another group of testtakers known to have the same level of the underlying trait
 Developing item banks
o Items from other instruments → item pool → scrutiny → preliminary item bank → psychometric testing → item bank
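To make the item characteristic curve and computer-adaptive ideas above concrete, here is a minimal sketch — not from the text — using the common two-parameter logistic (2PL) form; the item bank, parameter values, and function names are all invented for illustration:

```python
import math

def p_correct(theta, a, b):
    """2PL item characteristic curve: probability of a correct response
    at ability level theta, given discrimination a and difficulty b.
    (One common IRT model; the notes do not commit to a specific one.)"""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Invented item bank: (discrimination a, difficulty b) for each item
bank = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]

def most_informative_item(theta, bank):
    """Pick the item with the most Fisher information at theta -- the
    rule that lets a computer 'rapidly identify' the item best matched
    to the current ability estimate, so not every item is administered."""
    def info(item):
        a, b = item
        p = p_correct(theta, a, b)
        return a * a * p * (1.0 - p)   # 2PL item information
    return max(bank, key=info)

print(most_informative_item(0.3, bank))  # -> (1.5, 0.5): difficulty nearest theta
```

The selection rule is what makes "peaked" item sets efficient for matched testtakers: items far from the current ability estimate carry little information and can be skipped.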
CHAPTER 9: INTELLIGENCE AND ITS MEASUREMENT
What is Intelligence?
Intelligence: a multifaceted capacity that manifests itself in different ways across the lifespan. It usually includes the abilities to:
 Acquire and apply knowledge
 Reason logically
 Plan effectively
 Infer perceptively
 Make judgments and solve problems
 Grasp and visualize concepts
 Pay attention
 Be intuitive
 Find the right words and thoughts with facility
 Cope with, adjust to, and make the most of new situations
Intelligence Defined: Views of the Lay Public
 Both social and academic
Intelligence Defined: Views of Scholars and Test Professionals
 Francis Galton
o First to publish on the heritability of intelligence
o Most intelligent persons were those with the best sensory abilities
 Alfred Binet
o Made tests of intelligence, but didn't define it
o Components of intelligence: reasoning, judgment, memory, abstraction
o Added that the definition is complex; it requires interaction of the components
o He argued that when one solves a particular problem, the abilities used cannot be separated because they interact to produce the solution.
 David Wechsler
o Best way to measure this global ability was by measuring aspects of several "qualitatively differentiable" abilities
o Complexity of intelligence
o Conceptualization as an "aggregate" or "global" capacity
 Jean Piaget
o Studied children
o Believed the order of maturation to be unchangeable
o With age, increased schema: an organized action or mental structure that, when applied to the world, leads to knowing or understanding.
o Learning occurs through assimilation (actively organizing new information so that it fits in with what already is perceived and thought) and accommodation (changing what is already perceived or thought so that it fits with new information)
o Stages: Sensorimotor (0–2), Preoperational (2–6), Concrete Operational (7–12), Formal Operational (12 and older)
 All of these views share interactionism: a complex concept by which heredity and environment are presumed to interact and influence the development of one's intelligence
 Factor-analytic theories: focus is squarely on identifying the ability(ies) deemed to constitute intelligence
 Information-processing theories: focus is on identifying the specific mental processes that constitute intelligence.
Factor-Analytic Theories of Intelligence
 Charles Spearman: pioneered new techniques to measure intercorrelations between tests.
o Posited the existence of a general intellectual ability factor (g) that is tapped by all other mental abilities.
 g represents the portion of the variance that all intelligence tests have in common; the remaining portions of the variance are accounted for either by specific components (s) or by error components (e)
 The greater the g loading, the better the test was thought to predict overall intelligence
 Group factors: neither as general as g nor as specific as s
o ex.) linguistic, mechanical, arithmetical abilities
 Guilford: multiple-factor model of intelligence
o Explains mental activities by deemphasizing any reference to g
 Thurstone: conceived intelligence as being composed of 7 primary abilities.
 Gardner: developed the theory of multiple intelligences
o Question over whether emotional intelligence exists.
o Logical-mathematical, bodily-kinesthetic, linguistic, musical, spatial, interpersonal, and intrapersonal
 Raymond Cattell: fluid vs. crystallized intelligence
o Crystallized intelligence: acquired skills and knowledge and their retrieval; retrieval of information and application of general knowledge
o Fluid intelligence: nonverbal, relatively culture-free, and independent of specific instruction.
 Horn: added more factors to the model
o Vulnerable abilities: decline with age and tend not to return to preinjury levels following brain damage
o Maintained abilities: tend not to decline with age and may return to preinjury levels following brain damage.
 Carroll:
o Three-stratum theory of cognitive abilities: like geology
o Hierarchical model: all of the abilities listed in a stratum are subsumed by or incorporated in the strata above.
o Those in the first stratum are narrow abilities
 CHC model (Cattell-Horn-Carroll)
o Some overlap, some differences
o Doesn't use g
o Has broader abilities than Carroll's theory
 McGrew: integrated the Cattell-Horn and Carroll models
 McGrew and Flanagan: the integrated McGrew-Flanagan CHC model
o Features 10 broad-stratum abilities
o 70 narrow-stratum abilities
o Makes no provision for the general intellectual ability factor (g)
o g was omitted because it has little practical relevance to cross-battery assessment and interpretation
The Information-Processing View
 Aleksandr Luria
o How (not what) information is processed
o Simultaneous/parallel processing: information integrated all at once
o Successive/sequential processing: each bit of information individually processed
 PASS model (Planning, Attention, Simultaneous, Successive): a model of assessing intelligence
 Sternberg: "The essence of intelligence is that it provides a means to govern ourselves so that our thoughts and actions are organized, coherent, and responsive to both our internally driven needs and to the needs of the environment"
Measuring Intelligence
Types of Tasks Used in Intelligence Tests
 Infants: sensorimotor tasks, interviews with parents
 Older child: verbal and performance abilities
 Mental age: index that refers to the chronological age equivalent of one's test performance
 Adults: retention of general information, quantitative reasoning, expressive language and memory, and social judgment
Theory in Intelligence Test Development and Interpretation
 Wechsler made a dichotomous test (Performance and Verbal), but advocated a multifaceted definition
 Thorndike: intelligence = social, concrete, abstract
 Putting theories into tests is extremely hard
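As a rough illustration of the factor-analytic idea — Spearman inferred g from the positive intercorrelations among tests — here is a sketch that extracts a g-like first component from a made-up correlation matrix. The numbers are invented, and a principal component is only a stand-in for Spearman's actual method:

```python
import numpy as np

# Hypothetical correlation matrix for four mental tests
# (all positive correlations -- the "positive manifold"
# that led Spearman to posit g).
R = np.array([
    [1.00, 0.60, 0.55, 0.50],
    [0.60, 1.00, 0.58, 0.52],
    [0.55, 0.58, 1.00, 0.48],
    [0.50, 0.52, 0.48, 1.00],
])

# First principal component as a rough stand-in for g: the loadings
# show how strongly each test is saturated with the general factor;
# what is left over goes to specific (s) and error (e) components.
eigvals, eigvecs = np.linalg.eigh(R)
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])   # largest component
print(np.round(np.abs(loadings), 2))               # g loadings for the four tests
print(round(eigvals[-1] / R.shape[0], 2))          # share of variance the g-like factor explains
```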
Intelligence: Some Issues:
Nature vs. Nurture
 Currently believed to be a mix of the two
 Preformationism: the idea that all structures, including intelligence, are present at birth and can't be improved upon
 Led to predeterminism: one's abilities are predetermined by genetic inheritance, and no learning or intervention can enhance them
 Interactionists: people inherit a certain intellectual potential
o There's a limit to genetic abilities (e.g., one can't ever have X-ray vision)
The Stability of Intelligence
 Stable pretty much throughout one’s adult life
 Cognitive abilities seem to decline with age
The Construct Validity of Tests of Intelligence
 Having construct validity requires having unified understanding of
what intelligence is
 Very difficult: Spearman says it's one thing, Guilford says it's many
 The Thorndike approach is sort of a compromise
o Look for one central factor with three additional factors
representing social, concrete, and abstract intelligences
Other Issues
 Flynn effect: IQ scores seem to rise every year, but the rise is not coupled with a rise in "true intelligence"
 Personality
o High IQ: Need for achievement, competition, curiosity,
confidence, emotional stability etc.
o Low IQ: passivity, dependence, maladjustment
o Temperament (used to describe infants)
 Gender
o Men usually outscore in visual spatialization tasks and
intelligence scores
o Women tend to outscore in language-skill tasks
o But differences can be bridged
 Family Environment
o Divorce can have negative effects
o Begins with “maternal effects” in womb
 Culture
o Provides specific models for thinking, acting and feeling
o Assumed that if cultural factors can be controlled then
differences between cultural groups will be lessened
o Assumed that culture can be removed by the reliance
on exclusively nonverbal tasks
 Tend not to be very good at predicting
success in various academic and business
settings
o Culture loading: the extent to which a test
incorporates the vocabulary, concepts, traditions,
knowledge and feelings associated with a particular
culture
o No test can be culture free
o Culture-fair intelligence test: test/assessment process
designed to minimize the influence of culture with
regard to various aspects of evaluation procedure
o Another approach called for culture-specific intelligence tests
 Ex.) BITCH measured streetwiseness
 Lacked predictive validity and useful,
practical information
CHAPTER 10: TESTS OF INTELLIGENCE
The Stanford-Binet Intelligence Scales
 First to have detailed administration and scoring instructions
 First American test to employ the concept of IQ
 First to use alternate items (an item that can be used in place of another)
 Lacked minority group representation
 Ratio IQ = (mental age / chronological age) × 100
 Deviation IQ (test composite): performance of one individual compared to the performance of others of the same age; has a mean of 100 and a standard deviation of 16
 Age scale: items grouped by age
 Point scale: items organized by category
The Stanford-Binet Intelligence Scales: Fifth Edition
 Measures fluid intelligence, crystallized knowledge, quantitative knowledge, visual processing, and short-term (working) memory
 Utilizes adaptive testing: testing individually tailored to testtakers to ensure that items are neither too difficult (frustrating) nor too easy (false hope)
 Examiner establishes rapport with the testtaker, then administers a routing test to direct (route) the examinee to the test items most likely at an optimal level of difficulty
 Teaching items: show the testtaker what is expected and how to do it.
o Can be used for qualitative assessment, but not scoring
 Subtests for the verbal and nonverbal tests share the same name, but involve different tasks
 Floor: lowest level of items on a subtest
 Ceiling: highest-level item of a subtest
 Basal level: base-level criterion that must be met for testing on the subtest to continue
 Ceiling level is met when the testtaker fails a certain number of items in a row; the test discontinues there.
 Scores: raw → standard → composite
 Extra-test behavior: behavioral observation
The Wechsler Tests
 Commonality between all versions: all yield deviation IQs with a mean of 100 and a standard deviation of 15
Wechsler Adult Intelligence Scale – Fourth Edition (WAIS-IV)
 Core subtest: administered to obtain a composite score
 Supplemental/optional subtest: provides additional clinical information or extends the number of abilities or processes sampled.
 Yields four index scores: a Verbal Comprehension Index, a Working Memory Index, a Perceptual Reasoning Index, and a Processing Speed Index
The Wechsler Intelligence Scale for Children – Fourth Edition (WISC-IV)
 Process score: index designed to help understand how testtakers process various kinds of information
 WISC-IV compared to the SB5
The Wechsler Preschool and Primary Scale of Intelligence – Third Edition (WPPSI-III)
 New scale for children under 6
 First major intelligence test that adequately sampled the total population of the United States
 Subtests labeled core, supplemental, or optional
Wechsler, Binet, and the Short Form
 Short form: test that has been abbreviated in length to reduce the time needed to administer, score, and interpret
 Used with caution, only for screening
 Provides only estimates
 Reducing the number of items usually reduces reliability and thus validity
 Wechsler Abbreviated Scale of Intelligence
The Wechsler Tests in Perspective
 Factor analysis
o Exploratory factor analysis: summarizing data when we are not sure how many factors are present in our data
o Confirmatory factor analysis: used to test highly specific factor hypotheses
Other Measures of Intelligence
Tests Designed for Individual Administration
 Kaufman Adolescent and Adult Intelligence Test
 Kaufman Brief Intelligence Test
 Kaufman Assessment Battery for Children
 Moved away from information processing and toward a distinction between sequential and simultaneous processing
Tests Designed for Group Administration
 Group Testing in the Military
o WWI created the need for the government to test intelligence as a means of differentiating the "unfit" from those of "exceptionally superior ability"
o Army Alpha Test: for army recruits who could read; included general information questions, analogies, and scrambled sentences to reassemble
o Army Beta Test: for foreign-born or illiterate recruits; included mazes, coding, and picture completion.
o After the war, the Alpha and Beta tests were used rampantly, and oftentimes misused
o Screening tool: instrument or procedure used to identify a particular trait or constellation of traits
o ASVAB (Armed Services Vocational Aptitude Battery): administered to prospective recruits or to high school students looking for career guidance
 5 career areas: clerical, electronics, mechanical, skill-technical, and combat operations
 Group Testing in Schools
o Useful in developing a child's profile, but cannot be the sole indicator
o Groups of 10-15
o Starting in kindergarten
o Also called traditional group testing, because more modern forms can utilize the computer; these are more aptly called individual testing
Measures of Specific Intellectual Abilities
 Widely used intelligence tests test only a sampling of the many attributable factors aiding intelligence
 Ex.) Creativity
o Commonly thought to be composed of originality, fluency, flexibility, and elaboration
o If the focus is too heavily on whether an answer is correct, it doesn't allow for creativity
o Achievement tests require convergent thinking: a deductive reasoning process that entails recall and consideration of facts as well as a series of logical judgments to narrow down solutions and eventually arrive at one solution
o Divergent thinking: a reasoning process in which thought is free to move in many different directions, making several solutions possible
 Associated words, uses of a rubber band, etc.
 Test-retest reliability for some of these tests is near unacceptable
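The two IQ metrics from the Stanford-Binet section above can be made concrete with a short sketch (the values are invented; in practice the age-group mean and SD come from the test's norms):

```python
def ratio_iq(mental_age, chronological_age):
    """Classic Stanford-Binet ratio IQ: (MA / CA) x 100."""
    return 100.0 * mental_age / chronological_age

def deviation_iq(raw_score, age_group_mean, age_group_sd, mean=100.0, sd=15.0):
    """Deviation IQ: locate the testtaker's raw score within his or her
    own age group, then rescale. The Wechsler tests use mean 100 / SD 15;
    the Stanford-Binet historically used SD 16."""
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z

print(ratio_iq(10, 8))          # 125.0 -- an 8-year-old performing like a 10-year-old
print(deviation_iq(34, 28, 6))  # 115.0 -- one SD above the age-group mean
```

The deviation form replaced the ratio form precisely because comparing a person with same-age peers keeps the scale meaningful across the lifespan.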
CHAPTER 11: OTHER INDIVIDUAL TESTS OF ABILITY IN EDUCATION AND SPECIAL EDUCATION

Alternative Individual Ability Tests Compared with the Binet and Wechsler Scales
- None of these are clearly superior from a psychometric standpoint
- Some are less stable, and most are more limited in their documented validity
- Compare poorly to the Binet and Wechsler on all accounts
- They don't rely on a verbal response as much as the Binet and Wechsler
- Just use pointing or Yes/No responses, thus do not depend on the complex integration of visual and motor functioning
- Contain a performance scale or subscale
- Their specificity often limits the range of functions or abilities that they can measure
- Because they are designed for special populations, some alternatives can be administered totally without verbal instructions
Specific Individual Ability Tests
- Earliest individual tests were typically designed for specific purposes or special populations
- One of the first – the Seguin Form Board Test – in the 1800s – produced only a single score
o Used primarily to evaluate mentally retarded adults and emphasized speed and performance
- After, the Healy-Fernald Test was developed as an exclusively nonverbal test for adolescent delinquents
- Knox developed a battery of performance tests for non-English-speaking adult immigrants to the US – administered without language; speed not emphasized
- These early individual tests were designed for specific populations, produced a single score, and had nonverbal performance scales
- Could be administered without visual instructions and used with children as well as adults
Infant Scales
- Where mental retardation or developmental delays are suspected, these tests can supplement observation, genetic testing, and other medical procedures
Brazelton Neonatal Assessment Scale (BNAS)
- Individual test for infants between 3 days and 4 weeks of age
- Purportedly provides an index of a newborn's competence
- Favorable reviews
- Considerable research base
- Wide use as a research tool and as a diagnostic tool for special purposes
- Commonly used scale for the assessment of neonates
- Drawbacks:
o No norms are available
o More research is needed concerning the meaning and implication of scores
o Poorly documented predictive and construct validity
o Test-retest reliability leaves much to be desired
Gesell Developmental Schedules (GDS)
- Infant intelligence measure
- Used as a research tool by those interested in assessing infant intellectual development after exposure to mercury, diagnoses of abnormal brain formation in utero, and assessing infants with autism
- For children 2.3 months to 6.3 years
- Obtains normative data concerning various stages in maturation
- An individual's developmental quotient (DQ) is determined according to a test score, which is evaluated by assessing the presence or absence of behavior associated with maturation
- Provides an intelligence quotient like that of the Binet
o DQ = (developmental age / chronological age) × 100
- But falls short of acceptable psychometric standards
o Standardization sample not representative of the population
- No reliability or validity
- Does appear to help uncover subtle deficits in infants
- Meager validity
Bayley Scales of Infant and Toddler Development – Third Edition (BSID-III)
- Bases assessments on normative maturational developmental data
- Designed for infants between 1 and 42 months
- Assesses development across 5 domains: cognitive, language, motor, socioemotional, and adaptive
- Motor scale: assumes that later mental functions depend on motor development
- Excellent standardization
- Generally positive reviews
- Strong internal consistency
- More validity studies needed
- Widely used in research – children with Down syndrome, pervasive developmental disorders, cerebral palsy, language impairment, etc.
- Most psychometrically sound test of its kind
- Predictive, though?
Cattell Infant Intelligence Scale (CIIS)
- Based on normative developmental data
- Downward extension of the Stanford-Binet scale for 2- to 30-month-olds
- Similar to the Gesell scale
- Rarely used today
- Sample is primarily based on children of parents from the lower and middle classes and therefore does not represent the general population
- Unchanged for 60 years
- Psychometrically unsatisfactory
Major Tests for Young Children
McCarthy Scales of Children's Abilities (MSCA)
- Measure ability in children between 2-8 years
- Present a carefully constructed individual test of human ability
- Produces a pattern of scores as well as a variety of composite scores
- General cognitive index (GCI): standard score with a mean of 100 and a standard deviation of 16
o Index reflects how well the child has integrated prior learning experiences and adapted them to the demands of the scales
- Relatively good psychometric properties
- Reliability coefficients in the low .90s
- Used in research studies
- Good validity? Good assessment tool
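A one-function sketch of the Gesell-style DQ formula noted above (the values are invented; real developmental ages come from the schedule's normative data):

```python
def developmental_quotient(developmental_age_months, chronological_age_months):
    """Gesell-style DQ, paralleling the ratio IQ: the maturational
    (developmental) age indicated by the schedule, divided by
    chronological age, times 100."""
    return 100.0 * developmental_age_months / chronological_age_months

print(developmental_quotient(18, 24))  # 75.0 -- development lagging chronological age
```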
Kaufman Assessment Battery for Children – Second Edition (KABC-II)
- Individual ability test for children between 3-18 years
- 18 subtests in 5 global scales called sequential processing, simultaneous processing, learning, planning, and knowledge
- Intended for psychological, clinical, minority-group, preschool, and neuropsychological assessment as well as research
- Sequential-simultaneous distinction
o Sequential processing refers to a child's ability to solve problems by mentally arranging input in sequential or serial order
o Simultaneous processing refers to a child's ability to synthesize info from mental wholes in order to solve a problem
- Nonverbal measure of ability too
- Well constructed and psychometrically sound
- Not much evidence of (good) validity
- Poorer predictive validity for school achievement – smaller differences between whites and minorities
- Test suffers from a noncorrespondence between its definition and its measurement of intelligence
General Individual Ability Tests for Handicapped and Special Populations
Columbia Mental Maturity Scale – Third Edition (CMMS)
- Purports to evaluate ability in normal and variously handicapped children from 3-12 years
- Requires neither a verbal response nor fine motor skills
- Requires the subject to discriminate similarities and differences by indicating which drawing does not belong on a 6-by-9-inch card containing 3-5 drawings
- Multiple choice
- Standardization sample is impressive
- Vulnerable to random error
- Reliable instrument that is useful in assessing ability in many people with sensory, physical, or language handicaps
- Good screening device
Peabody Picture Vocabulary Test – Fourth Edition (PPVT-IV)
- For ages 2-90 years
- Multiple-choice test that requires the subject to indicate Yes/No in some manner
- Instructions administered aloud (not for the deaf)
- Purports to measure hearing or receptive vocabulary, presumably providing a nonverbal estimate of verbal intelligence
- Can be done in 15 minutes; requires no reading ability
- Good reliability and validity
- Should never be used as a substitute for a Wechsler or Binet IQ
- Important component in a test battery or used as a screening device
- Easy to administer and useful for a variety of groups
- BUT: tendency to underestimate IQ scores, and problems inherent in the multiple-choice format are bad
Leiter International Performance Scale – Revised (LIPS-R)
- Strictly a performance scale
- Aims at providing a nonverbal alternative to the Stanford-Binet scale for 2- to 18-year-olds
- For research and clinical settings, where it is still widely utilized to assess the intellectual function of children with pervasive developmental disorders
- Purports to provide a nonverbal measure of general intelligence by sampling a wide variety of functions from memory to nonverbal reasoning
- Can be applied to the deaf and language-disabled
- Untimed
- Good validity
Porteus Maze Test (PMT)
- Popular but poorly standardized nonverbal performance measure of intelligence
- Individual ability test
- Consists of maze problems (12)
- Administered without verbal instruction, thus used for a variety of special populations
- Needs restandardization
Testing Learning Disabilities
- Major concept is that a child average in intelligence may fail in school because of a specific deficit or disability that prevents learning
- Federal law entitles every eligible child with a disability to a free appropriate public education and emphasizes special education and related services designed to meet his or her unique needs and prepare him or her for further education, employment, and independent living
- To qualify, a child must have a disability and educational performance affected by it
- Educators today can find other ways to determine when a child needs extra help
- Process called Response to Intervention (RTI): premise is that early intervening services can prevent academic failure for many students with learning difficulties
- Signs of a learning problem:
o Disorganization
o Careless effort
o Forgetfulness
o Refusal to do schoolwork or homework
o Slow performance
o Poor attention
o Moodiness
Illinois Test of Psycholinguistic Abilities (ITPA-3)
- Assumes that failure to respond correctly to a stimulus can result not only from a defective output system but also from a defective input or information-processing system
- Stage 1: info must first be received by the senses before it can be analyzed
- Stage 2: info is analyzed or processed
- Stage 3: with processed info, the individual must make a response
- Theorizes that the child may be impaired in one or more specific sensory modalities
- 12 subtests that measure an individual's ability to receive visual, auditory, or tactile input independently of processing and output factors
- Purports to help isolate the specific site of a learning disability
- For children 2-10 years
- Early versions were hard to administer and had no reliability or validity
- Now, with revisions, the ITPA-3 is a psychometrically sound measure of children's psycholinguistic abilities
Woodcock-Johnson III
- Evaluates learning disabilities
- Designed as a broad-range individually administered test to be used in educational settings
- Assesses general intellectual ability, specific cognitive abilities, scholastic aptitude, oral language, and achievement
- Based on the CHC three-stratum theory of intelligence
- Compares a child's score on cognitive ability with the score on achievement – can evaluate possible learning problems
- Relatively good psychometric properties
- For learning disability tests, three conclusions seem warranted:
o 1. Test constructors appear to be responding to the same criticisms that led to changes in the Binet and Wechsler scales and ultimately to the development of the KABC
o 2. Much more empirical and theoretical research is needed
o 3. Users of learning disabilities tests should take great pains to understand the weaknesses of these procedures and not overinterpret results
Visiographic Tests
- Require a subject to copy various designs
Benton Visual Retention Test – Fifth Edition (BVRT-V)
- Tests for brain damage are based on the concept of psychological deficit, in which a poor performance on a specific task is related to or caused by some underlying deficit
- Assumes that brain damage easily impairs visual memory ability
- For individuals 8 years and older
- Consists of geometric designs briefly presented and then removed
- Computerized version developed
Bender Visual Motor Gestalt Test (BVMGT)
- Consists of 9 geometric figures that the subject is simply asked to copy
- By age 9, any child of normal intelligence can copy the figures with only one or two errors
- Errors occur for people whose mental age is less than 9 and for people with brain damage, nonverbal learning disabilities, or emotional problems
- Questionable reliability
Memory-for-Designs (MFD) Test
- Drawing test that involves perceptual-motor coordination
- Used for people 8-60 years
- Good split-half reliability
- Needs validity documentation
- All these tests are criticized because of their limitations in reliability and validity documentation
- Good as screening devices though
Creativity: Torrance Tests of Creative Thinking (TTCT)
- Measurement of creativity
- Creativity: the ability to be original, to combine known facts in new ways, or to find new relationships between known facts
- Evaluating creativity is a possible alternative to IQ
- Creativity tests are in early stages of development
- Torrance tests separately measure aspects of creative thinking such as fluency, originality, and flexibility
- Do not meet the Binet and Wechsler scales in terms of standardization, reliability, or validity
- Unbiased indicator of giftedness
- Inconsistent tests, but available data reflect the tests' merit and fine potential
Individual Achievement Tests: Wide Range Achievement Test-4 (WRAT-4)
- Achievement tests measure what the person has actually acquired or done with that potential
- Discrepancies between IQ and achievement have traditionally been the main defining feature of a learning disability
- Most achievement tests are group tests
- WRAT-4 purportedly permits an estimate of grade-level functioning in word reading, spelling, math computation, and sentence comprehension
- Used for children 5 years and older
- Easy to administer
- Problems:
o Inaccuracy in evaluating grade-level reading ability
o Not proven as psychometrically sound
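To illustrate the traditional IQ-achievement discrepancy logic mentioned above, here is a hedged sketch; the 1.5-SD cutoff and the scores are invented (actual cutoffs varied by jurisdiction, and RTI has largely displaced this approach):

```python
def discrepancy_flag(iq_standard, achievement_standard, sd=15.0, threshold_sd=1.5):
    """Traditional IQ-achievement discrepancy check: both scores on the
    same standard scale (mean 100, SD 15). Flags a testtaker whose
    achievement falls well below what the IQ would predict. The 1.5-SD
    cutoff is only an example."""
    return (iq_standard - achievement_standard) >= threshold_sd * sd

print(discrepancy_flag(110, 85))   # True  -- 25-point gap exceeds 22.5
print(discrepancy_flag(100, 92))   # False -- 8-point gap is unremarkable
```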
CHAPTER 12: STANDARDIZED TESTS IN EDUCATION, CIVIL SERVICE, AND THE MILITARY

When justifying the use of group standardized tests, test users often have problems defining what exactly they are trying to predict, or what the test criterion is.
Comparison of Group and Individual Ability Tests
- Individual tests require a single examiner for a single subject
o Examiner provides instructions
o Subject responds; examiner records the response
o Examiner evaluates the response
o Examiner takes responsibility for eliciting a maximum performance
o Scoring requires considerable skill
- Those who use the results of group tests must assume that the subject was cooperative and motivated
o Many subjects tested at a time
o Subjects record their own responses
o Subjects not praised for responding
o Low scores on group tests are often difficult to interpret
o No safeguards
Advantages of Individual Tests
- Provide info beyond the test score
- Allow the examiner to observe behavior in a standard setting
- Allow individualized interpretation of test scores
Advantages of Group Tests
- Are cost-efficient
- Minimize professional time for administration and scoring
- Require less examiner skill and training
- Have more objective and more reliable scoring procedures
- Have especially broad application
Overview of Group Tests
Characteristics of Group Tests
- Characterized as paper-and-pencil or booklet-and-pencil tests because the only materials needed are a printed booklet of test items, a test manual, a scoring key, an answer sheet, and a pencil
- Computerized group testing is becoming more popular
- Most group tests are multiple choice – some free response
- Group tests outnumber individual tests
o One major difference is whether the test is primarily verbal, nonverbal, or a combination
- Group test scores can be converted to a variety of units
Selecting Group Tests
- The test user need never settle for anything but well-documented and psychometrically sound tests
Using Group Tests
- As reliable and well standardized as the best individual tests
- Validity data for some group tests are weak, meager, or contradictory
Use Results with Caution
- Never consider scores in isolation or as absolutes
- Be careful using tests for prediction
- Avoid overinterpreting test scores
Be Especially Suspicious of Low Scores
- Assume that subjects understand the purpose of testing, want to succeed, and are equally rested and free of stress
Consider Wide Discrepancies a Warning Signal
- May reflect emotional problems or severe stress
When in Doubt, Refer
- With low scores, discrepancies, etc., refer the subject for individual testing by a trained professional
Group Tests in the Schools: Kindergarten Through 12th Grade
- Purpose of these tests is to measure educational achievement in schoolchildren
Achievement Tests versus Aptitude Tests
- Achievement tests attempt to assess what a person has learned following a specific course of instruction
o Evaluate the product of a course of training
o Validity is determined primarily by content-related evidence
- Aptitude tests attempt to evaluate a student's potential for learning rather than how much a student has already learned
o Evaluate effects of unknown and uncontrolled experiences
o Validity is judged primarily on ability to predict future performance
- Intelligence tests measure general ability
- These three types of tests are highly interrelated
Group Achievement Tests
- Stanford Achievement Test: one of the oldest of the standardized achievement tests, widely used in school systems
- Well-normed and criterion-referenced, with excellent psychometric documentation
- Another is the Metropolitan Achievement Test, which measures achievement in reading by evaluating vocabulary, word recognition, and reading comprehension
- Both of these are reliable and normed on big samples
Group Tests of Mental Abilities (Intelligence)
Kuhlmann-Anderson Test (KAT) – 8th Edition
- KAT is a group intelligence test with 8 separate levels covering kindergarten through 12th grade
- Items are primarily nonverbal at lower levels, requiring minimal reading and language ability
- Suited to young children and those who might be handicapped in following verbal procedures
- Scores can be expressed in verbal, quantitative, and total scores
- Scores at other levels can be expressed as percentile bands: like a confidence interval; provides the range of percentiles that most likely represents a subject's true score
- Good construction, standardization, and other excellent psychometric qualities
- Good validity and reliability
- Potential for use and adaptation for non-English-speaking individuals or even countries needs to be explored
Henmon-Nelson Test (H-NT)
- A test of mental abilities
- 2 sets of norms available: one based on raw score distributions by age, the other on raw score distributions by grade
- Reliabilities in the .90s
- Helps predict future academic success quickly
- Does NOT consider multiple intelligences
Cognitive Abilities Test (COGAT)
- Good reliability
- Provides three separate scores: verbal, quantitative, and nonverbal
- Item selection is superior to the H-NT in terms of selecting minority, culturally diverse, and economically disadvantaged children
- Can be adapted for use outside the US
- No cultural bias
- Each of the subtests requires 32-34 minutes of actual working time, which the manual recommends be spread over 2-3 days
- Standard age scores averaged some 15 points lower for African American students on the verbal and quantitative batteries
Summary of K-12 Group Tests
- All are sound, viable instruments
College Entrance Tests
- SAT Reasoning Test, Cooperative School and College Ability Tests, and American College Test
SAT Reasoning Test
- Most widely used college entrance test
- Used by 1000+ private and public institutions
- Renorming of the SAT did not alter the standing of test takers relative to one another in terms of percentile rank
- New scoring (2400) is likely to reduce interpretation errors, as interpreters can no longer rely on comparisons with older versions
- 45 minutes longer – 3 hours and 45 minutes to administer
- May disadvantage students with disabilities such as ADD
- Verbal section now called "critical reading" – focus on reading comprehension
- Math section eliminated much of the basic grammar-school math questions
- Weakness: poor predictive power regarding the grades of students who score in the middle ranges
- Little doubt that the SAT predicts first-year college GPA
o But African Americans and Latinos tend to obtain lower scores on average
o Women score lower on the SAT but higher in GPA
Cooperative School and College Ability Tests
- Falling out of favor
- Developed in 1955; has not been updated
- Purports to measure school-learned abilities as well as an individual's potential to undertake additional schooling
- Psychometric documentation not strong
- Little empirical data support its major assumption – that previous success in acquiring school-learned abilities can predict future success in acquiring such abilities
American College Test (ACT)
- Updated in 2005; particularly useful for non-native speakers of English
- Produces specific content scores and a composite
- Makes use of the Iowa Test of Educational Development Scale
- Compares with the SAT in terms of predicting college GPA alone or in conjunction with high-school GPA
- Internal consistency coefficients are not as strong in the ACT
Graduate and Professional School Entrance Tests
Graduate Record Examination Aptitude Test (GRE)
- GRE purports to measure general scholastic ability
- Most frequently used in conjunction with GPA, letters of recommendation, and other academic factors
- General section with verbal and quantitative scores
- Third section evaluates analytical reasoning – now essay format
- Contains an advanced section that measures achievement in at least 20 majors
- New 130-170 scoring scale
- Standard mean score of 500 and SD of 100
- Normative sample is relatively small
- Psychometric adequacy is less than that of the SAT – validity and reliability
- Predictive validity not great
- Overpredicts the achievement of younger students while underpredicting the performance of older students
- Many schools have developed their own norms and psychometric documentation and can use the GRE to predict success in their programs
- By looking at a GRE score in conjunction with GPA, graduate success can be predicted with greater accuracy than without the GRE
- Graduate schools also frequently complain that grades no longer predict scholastic ability well because of grade inflation – the phenomenon of rising average college grades despite declines in average SAT scores
o Led to a corresponding restriction in the range of grades
- As the validity of grades and letters of recommendation becomes more questionable, reliance on test scores increases
- Definite overall decline in verbal scores while quantitative and analytical scores are gradually rising
Miller Analogies Test
- Designed to measure scholastic aptitude for graduate studies
- Strictly verbal
- 60 minutes
- Knowledge of specific content and a wide vocabulary are very useful
- Most important factors appear to be the ability to see relationships and a knowledge of the various ways analogies can be formed
- Psychometric adequacy is reasonable
- Does not predict research ability, creativity, and other factors important to grad school
The Law School Admission Test (LSAT)
- LSAT problems require almost no specific knowledge
- Extreme time pressure
- Three types of problems: reading comprehension, logical reasoning (~half), and analytical reasoning
- Weight given to the LSAT score is openly published for each school approved by the American Bar Association
- Entrance into schools is based on a weighted sum of score and GPA
- Psychometrically sound; reliability coefficients in the .90s
- Predicts first-year GPA in law school
- Content validity is exceptional
- Bias for minority group members, as well as women
Nonverbal Group Ability Tests
Raven Progressive Matrices (RPM)
- RPM is one of the best known and most popular nonverbal group tests
- Suitable anytime one needs an estimate of an individual's general intelligence
- Groups or individuals, 5 years through adults
- Used throughout the modern world
- Uses matrices – nonverbal; with or without a time limit
- Research supports the RPM as a measure of general intelligence, or Spearman's g
- Appears to minimize the effects of language and culture
- Tends to cut in half the selection bias that occurs with the Binet or Wechsler
Goodenough-Harris Drawing Test (G-HDT)
- Nonverbal intelligence test, group or individual
- Quick, easy, and inexpensive
- Subject instructed to draw a picture of a whole man and to do the best job possible
- Details get points
- One can determine mental ages by comparing scores with those of the normative sample
- Raw scores can be converted to standard scores with a mean of 100 and SD of 15
- Used extensively in test batteries
The Culture Fair Intelligence Test
- Designed to provide an estimate of intelligence relatively free of cultural and language influences
- Paper-and-pencil procedure that covers three age groups
- Two parallel forms are available
- Acceptable measure of fluid intelligence
Standardized Tests Used in the US Civil Service System
- General Aptitude Test Battery (GATB): a reading ability test that purportedly measures aptitude for a variety of occupations
o Used to make employment decisions in government agencies
o Attempts to measure a wide range of aptitudes, from general intelligence to manual dexterity
- Controversial because it used within-group norming prior to the passage of the Civil Rights Act of 1991
- Today, any kind of score adjustment through within-group norming in employment practices is strictly forbidden by law
Standardized Tests in the US Military: The Armed Services Vocational Aptitude Battery (ASVAB)
- ASVAB is administered to more than 1.3 million people a year
- Designed for students in grades 11 and 12 and in postsecondary schools
- Yields scores used in both education and military settings
- Results can help identify students who potentially qualify for entry into the military and can recommend assignment to various military occupational training programs
- Great psychometric qualities
- Reliability coefficients are excellent
- Through the computerized format, subjects can be tested adaptively, meaning that the questions given each person can be based on his or her unique ability
- This cuts testing time in half
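Why adaptive administration can cut testing time roughly in half: a toy simulation — invented logic, not the ASVAB's actual algorithm — that re-estimates ability after each answer and stops early once the estimate is precise enough:

```python
import math
import random

def p_correct(theta, a, b):
    """2PL response probability (same illustrative model as earlier)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def adaptive_session(true_theta, bank, se_target=0.4):
    """Toy adaptive test: give the most informative remaining item,
    update a coarse grid estimate of theta by maximum likelihood, and
    stop early when the standard error is small. Invented for
    illustration only."""
    grid = [g / 10.0 for g in range(-30, 31)]          # candidate abilities
    loglik = {g: 0.0 for g in grid}
    remaining, used = list(bank), []
    while remaining:
        theta_hat = max(loglik, key=loglik.get)        # current ML estimate
        item = max(remaining, key=lambda it: it[0] ** 2 *
                   p_correct(theta_hat, *it) * (1 - p_correct(theta_hat, *it)))
        remaining.remove(item)
        correct = random.random() < p_correct(true_theta, *item)
        for g in grid:                                 # update the likelihood
            p = p_correct(g, *item)
            loglik[g] += math.log(p if correct else 1 - p)
        used.append(item)
        info = sum(a * a * p_correct(theta_hat, a, b) *
                   (1 - p_correct(theta_hat, a, b)) for a, b in used)
        if info > 0 and 1 / math.sqrt(info) < se_target:
            break                                      # estimate stable: stop early
    return max(loglik, key=loglik.get), len(used)

bank = [(1.0 + 0.05 * i, -2.0 + 0.1 * i) for i in range(40)]  # invented 40-item bank
print(adaptive_session(0.5, bank))  # (theta estimate, items actually administered)
```

With a well-spread bank, the session typically stops after a small fraction of the 40 items — the mechanism behind the halved testing time noted above.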