C7 Neuropsychology - Interpreting Psychometric Data
N.B. Matthew highlighted that his lectures don’t follow the 1 lecture: 1 exam question format – I’ve
picked out the exam themes which I believe are most relevant to the content of this lecture but they will
all probably need a bit more contextual information from the other lectures to answer fully
Main references:
BPS (2009) Assessment of Effort in Clinical Testing of Cognitive Functioning for Adults (Dissimulation)
How indirect measures work – reliance on relation and inference – importance/rationale of considering
reliability, construct validity, performance validity and standardisation when designing a test/choosing
an assessment and interpreting the results.
● Critical evaluation
When using psychometric tests, how can we justify the move from a performance (e.g. number of items correct on a reading list) to a construct (e.g. vocabulary)?
Domains such as personality, IQ, visual memory, verbal memory, visuospatial skills etc. are constructs
and not facts.
Factors which may influence/impact upon reliability, construct validity, performance validity and
standardisation when designing a test/choosing an assessment and interpreting the results.
● Professional and ethical considerations
Impact of diagnosis/difficulty with diagnosis particularly given the context of the potential
unreliability/validity of measures – when working with clients who are not native English speakers, who
are clearly distressed, who may be under the influence of medication etc. Wider impact of the diagnosis
on the individual and family – can bring in our own clinical examples
Direct assessment – “directly accessing what you are interested in looking at”
● Observation
● Self-report and self-monitoring
● Role play (e.g. assessing social skills in children)
● Some physiological measures (e.g. blood pressure)
Direct assessment is assumed to be referential and unmediated (although query this for self-report
measures). Performance can be assessed without reference or comparison to anyone else’s behaviour.
Interpretation of this type of data is often based on social normative status e.g. Jane banged her head 9
times in a one-hour period – expectation that Jane would not bang her head at all.
Indirect Assessment – “indirectly accessing what you are interested in looking at through formalised
measures – psychometric testing”
● Relational – you cannot understand one person's score without an understanding of all persons' scores
● Inferential – the scores point to something which ‘cannot be seen’ (measured directly) e.g. IQ,
personality type
Statements can only be made about one person’s performance in relation to the performance of other
people on the same test when using indirect measures. E.g. Nina’s BDI score is higher than average.
● Reliability = Consistency (simplified definition) “if you were to repeat the test given the same
conditions would you achieve the same results”
Reliability is concerned with how people perform in relation to one another – if, for example, a weighing scale measured everyone as 5kg below their true weight, this would still be a reliable measure: everyone would consistently come out as their true weight minus 5kg and as such could still be understood in relation to everyone else. Consistency can be considered across several dimensions:
● Time – depending on the construct being measured. For example, we would expect intelligence to be consistent over time, whereas we would expect something like anxiety to alter across different scenarios
● Place – the environment can impact on a person’s performance. Good psychometric tests
should set out the environment in which the test should be administered – differences in
temperature of the room, brightness, number of people present may all impact on scoring
● Person – the individual may have reasons for altering their performance on the measure –
perhaps they don’t like the examiner, perhaps they are distractible one day, perhaps they want
to demonstrate decline
● In itself – this is relevant when measuring results for an unchanged person. However, when repeating a measure an individual will always be changed – they will have completed the measure before. This is acceptable in terms of reliability, as all individuals repeating the measure are changed in the same way, so relationally the scores can still be meaningful
Types of reliability:
● Internal consistency – this is concerned with whether all items within the measure are measuring the same construct. The expectation is that these items would be responded to in a similar way, as they are all designed to tap into the same construct. This can be measured using split-half reliability – the test is separated into two halves and the halves are correlated; a strong correlation is expected between items which measure the same construct (see the sketch after this list).
● Temporal consistency – consistency of the test over time. Can be measured using test–retest reliability (repeating the test at different time points and correlating the results – a strong correlation is expected; on re-testing everyone may achieve a different score due to familiarity, but that is acceptable so long as everyone moves up or down consistently). Could use an equivalent-forms coefficient, e.g. story recall – change the content of the story for the re-test condition – though this is subject to problems in content sampling due to the changes.
● Scoring consistency – this refers to the degree to which different raters give consistent estimates of the same behaviour. Can be measured using Cohen's kappa. In clinical practice the Standard Error of Measurement (SEM) is the more useful value. Confidence intervals report two values between which the 'true' score may fall – "how confident am I in this score", not "how reliable is the measure".
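To make these ideas concrete, here is a minimal Python sketch using simulated data (all numbers, including the noise level and the +2 practice effect, are invented for illustration, not taken from any real test): it estimates split-half reliability with the Spearman–Brown correction, test–retest reliability, and the SEM with a 95% confidence interval for one observed score.

```python
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=(100, 1))                       # 100 simulated examinees
items = ability + rng.normal(scale=0.8, size=(100, 20))   # 20 items, one construct

# Split-half: correlate the two halves, then apply the Spearman-Brown
# correction because each half is only half the test's length.
odd, even = items[:, 0::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
r_split = 2 * r_half / (1 + r_half)

# Test-retest: correlate total scores from two administrations. Everyone
# shifts a little with practice (+2 here), but a high correlation means
# they shift together, so the measure stays relationally meaningful.
retest = (ability + rng.normal(scale=0.8, size=(100, 20))).sum(axis=1) + 2.0
r_retest = np.corrcoef(items.sum(axis=1), retest)[0, 1]

# SEM and a 95% confidence interval for one observed score:
# "how confident am I in this score", not "how reliable is the measure".
total = items.sum(axis=1)
sem = total.std(ddof=1) * np.sqrt(1 - r_split)
score = total[0]
print(f"split-half r = {r_split:.2f}, test-retest r = {r_retest:.2f}")
print(f"observed {score:.1f}, 95% CI [{score - 1.96*sem:.1f}, {score + 1.96*sem:.1f}]")
```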
● Validity
“The behaviour sample collected is indicative of an underlying but unobservable construct” e.g. number
of correct items on a reading test = vocabulary, responses to an ambiguous design (Rorschach) =
personality
Validation is the process of examining these questions and supporting them with empirical evidence – put simply, is the underlying construct which you are trying to measure a true, independent and measurable phenomenon, or is it dependent on something else? E.g. a self-report measure of pain – we know pain exists but it cannot be measured directly – however, a self-report measure of pain may be mediated by mood state – is the test measuring 'pain' or 'mood state'?
Measures may be valid but unreliable – for example a joke-funniness measure, or one-shot problem-solving tests.
● Qualitative credibility – Content validity: "resonance with the examiner" (subjective assessment based on clinical/professional judgement). Face validity: "resonance with the examinee" (subjective assessment from the examinee – does the test seem to measure what it is supposed to? E.g. a word recognition list measuring vocabulary is quite transparent; a Rorschach test measuring personality is not).
● Criterion prediction – "usefulness". Can the measure predict performance in other tests of the same construct? Concurrent validity. Can the test predict performance in real-world functioning? Predictive validity. How well does the test 'catch' the people who meet the criterion and leave out those who do not? Sensitivity & specificity, e.g. on a test measuring depression, those diagnosed with clinical depression should obtain higher scores than those with no formal diagnosis (in theory! – see the sketch after this list).
● Convergent/Divergent correlation – the test should correlate well with tests which measure the same/similar constructs, and should not correlate with tests which measure unrelated constructs. Can be assessed using factor analysis (statistical measure) as well as experimental manipulation, e.g. carrying out a memory assessment with a group known to have memory problems and comparing their performance to a control group without memory problems.
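A hypothetical Python sketch of these two ideas (simulated scores; the cut-off of 18 and every distribution are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
diagnosed = rng.random(200) < 0.3                        # criterion: formal diagnosis
screen = np.where(diagnosed, rng.normal(24, 6, 200),     # diagnosed score higher...
                  rng.normal(12, 6, 200))                # ...than non-diagnosed

# Sensitivity: proportion of cases the test 'catches' at the cut-off.
# Specificity: proportion of non-cases it correctly leaves out.
positive = screen >= 18
sensitivity = (positive & diagnosed).sum() / diagnosed.sum()
specificity = (~positive & ~diagnosed).sum() / (~diagnosed).sum()
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")

# Convergent/divergent: a second mood measure should track the screen;
# a test of an unrelated construct (spatial skill) should not.
other_mood = screen + rng.normal(0, 4, 200)
spatial = rng.normal(50, 10, 200)
print(f"convergent r = {np.corrcoef(screen, other_mood)[0, 1]:.2f}")
print(f"divergent  r = {np.corrcoef(screen, spatial)[0, 1]:.2f}")
```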
The main purpose of standardization is to establish norms of performance against which any individual
subjected to an assessment procedure may be compared.
Norms are:
● Specific to the procedures followed in administering the tests during the development of the
test e.g. in a brightly lit room, controlled temperature with only the examiner present
● Specific to the population sample in which the test was standardised, e.g. individuals aged 18–45, high-school level of education, 50/50 male : female ratio – CRITIQUE – individuals with learning disabilities are excluded from the normative sample populations on which IQ is collected
● Not enduring – e.g. the improvement in IQ scores over time (Flynn effect) – norms should be updated but often they are not
Interpreting against norms also rests on three assumptions:
● The same administrative procedures are used as in standardisation – good psychometric tests will have clear guidelines for the expected environment (e.g. desk and chair, no one else present in the room), instruction giving – usually a script, e.g. "Listen carefully, I am going to read out ten words…", how to deal with errors, e.g. verbal prompts and non-verbal cues which are permitted, stimulus layout, e.g. task A must be presented first followed by task B, and scoring criteria, e.g. award one mark for accuracy, award a bonus point if completed in less than 10 seconds – CRITIQUE – these tests require practice to administer in order to ensure validity – in reality a) how much practice do clinicians have and b) how likely are we to be able to create the same, standardised environments?
● The client being assessed is representative of the standardisation sample. Tests are often
stratified by demographic variables such as:
o Age
o Years of formal education (not many tests do this – Tombaugh (2004) provides education-stratified norms for the Trail Making Tests A & B)
o Gender
o Ethnicity
CRITIQUE/PROFESSIONAL AND ETHICAL ISSUE – these have been identified as variables which
may matter in terms of performance. They may not necessarily matter however there may be
other variables which impact but have not been identified. It is a psychological decision to
stratify in this way.
● Typical performance (not the same as performance validity) has not changed since standardisation, e.g. the Flynn effect with IQ. Cohort effects – experiential and cultural factors (e.g. the gender gap) and educational opportunities. For example, standardised American tests involving identification of a 'fire hydrant' – this may be an unfamiliar term to British examinees, but could be mediated by culture – younger British generations are more familiar with American culture through TV, film and music. Tests of orientation asking who the prime minister is are mediated by the prevailing cultural focus on politics – it does not mean an individual is not orientated if they do not know the answer. "How are hibernation and migration alike?" Migration was previously most commonly associated with animal behaviour – it is now a hot topic on the political agenda concerning the migration of people – so people are less likely to automatically associate migration with hibernation.
These three assumptions (same administrative procedures, representative sample and typical
performance) are often violated in clinical practice.
e.g. when using the Wechsler Similarities subtest to assess verbal abstraction (deducing conceptual relationships), persons who are persuaded by the question's format ("how are chair and table alike?") to focus on descriptive similarities will achieve a low score. "They both have four legs" is correct and descriptive but would not be awarded points, as the desired answer is "they are both items of furniture" – furniture being the desired conceptual category. This does not mean the examinee has difficulties with verbal abstraction, but due to standardisation the examiner may not instruct or query further.
Since raw scores are in themselves meaningless, standardization also refers to the chosen means of
comparing an individual’s raw score to an external scale in order to determine their position in relation
to the standardisation sample.
o Enables a comparison of the individual's performance with that of others in the standardisation sample (across age, gender, ethnicity, years of education)
o Enables comparison of an individual’s performance on two assessment instruments (provided
both instruments have been standardised using the same norms)
The simplest approach might be merely comparing a result to the mean e.g. mean = 5 so 9 is ‘above the
mean’
A more useful approach is comparing to the mean and variance e.g. mean = 5, standard deviation (SD) =
2 so 9 is ‘two standard deviations above the mean’
This requires that the mean (measure of central tendency) and the standard deviation (measure of
variation around that centre) are themselves useful parameters. In order for this to be the case they
must be based on a normal distribution (Gaussian, bell curve). That is, standardizing transformations
entail that the raw scores were normally distributed. In most cases, non-normal distributions have been
normalised – ‘forced to fit’ the curve using statistical tailoring.
Image 1. Examples of normal distribution – bell curve
The assumption that human cognitive functions are normally distributed is often plainly false – other scientific disciplines do not rely on such an assumption.
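As an illustration of what such 'statistical tailoring' can mean in practice, here is one common approach sketched in Python – a rank-based inverse-normal transformation that forces a skewed raw-score distribution onto the bell curve (the skewed data are simulated):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
raw = rng.exponential(scale=5, size=1000)            # skewed raw scores

ranks = raw.argsort().argsort() + 1                  # rank of each score, 1..n
p = (ranks - 0.5) / raw.size                         # mid-rank percentile in (0, 1)
z = np.array([NormalDist().inv_cdf(x) for x in p])   # 'normalised' scores: z now
                                                     # follows the bell curve even
                                                     # though raw never did
```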
e.g. Scoring Distributions – for most cognitive functions (and other physiological properties) there is a
‘hump’ in the low end of the score distribution due to the cohort of people with congenital disorders,
birth problems, acquired impairments etc. Due to improvements in medical care and survival rates this
statistical ‘problem’ is likely to increase.
e.g. Tests of language based on picture naming – in any language there will be a small number of high
frequency words/names that all adults know and a greater number of less frequently used words/names
that any given sample of adults will not know
Linear Transformations
The ‘standard’ score is the z-score. Raw scores are transformed to the z-scores by subtracting the
sample mean, then dividing by the SD. This yields a normal distribution with a mean = 0 and SD = 1.
(Thus some scores will have negative values).
Standardized scores are transformations based on the z-score: a new mean and SD are chosen and applied, by multiplying by the new SD then adding the new mean. (Thus negative scores can be avoided).
Standard scores express the individual’s distance from the mean in terms of the standard deviation of
the distribution of test scores in the standardisation sample.
● Wechsler Indices and “IQ”: Mean = 100 SD= 15 (WAIS, WISC, WMS, CMS etc)
● Wechsler Scaled Scores: Mean = 10, SD = 3 (subscales - WAIS, WISC, WMS, CMS etc)
● T-scores: Mean = 50, SD = 10 (indices of WASI and BSI)
Stanine (Standard-nine) – normalized score based on a pre-defined 9-point scale, previously used widely
in achievement tests (education and occupational psychology). Mean = 5, SD = 2
STENs (Standard-ten) – normalized score based on a pre-defined 10-point scale: has no mid-point (as
mid-point is 5.5) and so 5 is just below average and 6 is just above. Rarely used. Mean = 5.5, SD = 2.
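A short Python sketch of these linear transformations (the raw score, sample mean and SD are invented for illustration):

```python
def standardize(raw, sample_mean, sample_sd, new_mean, new_sd):
    """Raw -> z -> standardized score on a chosen scale."""
    z = (raw - sample_mean) / sample_sd      # distance from mean in SD units
    return new_mean + z * new_sd             # re-scale; negatives avoided

raw, m, sd = 34, 25, 6                       # hypothetical sample statistics
print((raw - m) / sd)                        # z-score: 1.5
print(standardize(raw, m, sd, 100, 15))      # Wechsler index: 122.5
print(standardize(raw, m, sd, 10, 3))        # Wechsler scaled score: 14.5
print(standardize(raw, m, sd, 50, 10))       # T-score: 65.0
```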
Area transformation
The ‘area’ refers to the area ‘under the normal curve’ e.g. at the mean 50% of people are below and
50% above the score. The most familiar are percentile ranks or ranges.
Whilst a percentage could be taken from any part of the sample, a percentile is a place in the order
(from 1 to 100, usually 1 being the lowest score). Scores are expressed not as an amount of the
construct but as a place in the ordinal scale.
Percentile rank: the percentage of people who would have obtained a lower score e.g. at the 40th
percentile; 39% of people would have scored lower and 60% have scored higher. Based on a normal
distribution…
…expressed as Cumulative percentile if based on non-normal distribution (or when range is highly
attenuated) and expresses the percentage of people who would have achieved that score or lower. They
are based on the frequencies of people scoring at a certain level.
Percentile range: captures the upper and lower limits e.g. scoring in the 10-24th percentile range implies
that 75% of people would have scored higher, though 9% scored lower. Conventionally the quartiles (25,
50, 75) and deciles (10, 90) have a special status
Classically the 5th percentile has been used as a cut-off point; scores below it are regarded as impaired.
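A brief Python sketch of the area transformation (illustrative numbers): the percentile rank of a standard score under the normality assumption, and the cumulative percentile computed directly from sample frequencies when the distribution is non-normal:

```python
import numpy as np
from statistics import NormalDist

# Percentile rank under the normal curve: the area below the score.
def percentile_rank(score, mean, sd):
    return NormalDist(mean, sd).cdf(score) * 100

print(round(percentile_rank(75, 100, 15), 1))   # index 75 -> ~4.8th percentile,
                                                # just below the classic 5th
                                                # percentile cut-off
print(round(percentile_rank(100, 100, 15)))     # the mean -> 50th percentile

# Cumulative percentile for a non-normal sample: the percentage of the
# standardisation sample scoring at or below a given raw score.
sample = np.array([3, 5, 5, 6, 7, 7, 7, 8, 9, 12])
print((sample <= 7).mean() * 100)               # 70.0 -> 70th cumulative percentile
```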
Age and grade equivalents: scores based on the means for given ages or school years, previously used
widely in developmental and achievement tests (e.g. WRAT-3)
An individual's score can be compared to those of others at the same age/year.
PROFESSIONAL AND ETHICAL ISSUES – consider the impact on the individual of being told they have developmental functioning similar to a child's – there is an argument as to whether this is a helpful way of expressing function, as an adult with years of life experience will present globally as functioning in a different way to a child. Consider individuals diagnosed with dementia, brain injury or LD being patronised – functioning expressed as an age equivalent may be damaging. The counter-argument is that age-equivalent expression of function may be easier for staff/families to comprehend, facilitating understanding and managing expectations.
● Performance Validity
The extent to which performance on a test (the test score) reflects the underpinning construct. Mainly affected by:
● Task comprehension – was the examinee doing what they were supposed to do?
● Task impurity – in order to test visuospatial skills the examinee must also have intact verbal
ability in order to listen to and comprehend the task
● Non-cognitive factors
● Motivation and dissimulation
Non-cognitive factors (contributing to performance) – Annie covers these in more detail on page 18 of
her notes so I won’t repeat aside from to summarise (this area has also been flagged as a potential exam
topic!):
● Education
● Substances
● Medicines
● Metabolic factors
● Mood and Psychological Disorder
● Cross-culture and cross-language deficits
Motivation and dissimulation – flagged as a potential exam question several times. See the BPS
document on moodle (mentioned at the top of the page under references) for guidance. Again Annie
has already covered this on pages 16 & 17 of her notes so I’ve briefly summarised and added in some
additional bits:
Definition: dissimulation – the deliberate misrepresentation of one's symptoms or abilities during assessment (exaggeration, fabrication or minimisation).
Suspected where:
• Association with reward (e.g. an injury compensation claim) or avoidance of penalty (e.g. diminished responsibility in a criminal case)
• Presence of antisocial (or other) personality disorder
• Discrepancy between complaints and findings
• Lack of co-operation with assessment/treatment
• Atypical, bizarre or extreme symptoms
• Symptom inconsistency
• Symptom variability
• Nil or inconsistent response to treatment
Identifying dissimulation:
Stand-alone Tests:
Easy tests (that look harder than they are), e.g. the Rey 15-item test; Kapur's "coin in the hand" test – ETHICAL AND PROFESSIONAL/CRITIQUE – Kapur (1994) cited a case study where a client had performed at chance level on this test, not because of malingering but possibly due to poor cooperation related to behavioural problems resulting from frontal lobe damage – this relates back to 'task impurity'.
• NB: Most of these 'manufactured' tests have been investigated only for sensitivity/specificity (detection of 'caseness': the degree to which the accepted standardised diagnostic criteria for a given condition are applicable to a given patient) and so lack the inferential move necessary for adequate interpretation. That is, the tests must be considered in combination with other methods: clinical judgement, observation etc. (A sketch of the below-chance logic follows.)
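The statistical logic behind forced-choice effort tests such as "coin in the hand" can be sketched with a one-tailed binomial test: on a two-alternative task chance performance is 50%, so scoring reliably below chance suggests the examinee knew the answers and avoided them. A minimal Python sketch (the 3/20 result is invented):

```python
from math import comb

# One-tailed binomial probability of scoring `correct` or fewer out of
# `trials` if the examinee were genuinely guessing (chance = 0.5).
def p_at_or_below(correct, trials, chance=0.5):
    return sum(comb(trials, k) * chance**k * (1 - chance)**(trials - k)
               for k in range(correct + 1))

print(p_at_or_below(3, 20))   # ~0.0013: far below chance, which guessing
                              # alone would almost never produce
```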
● Approaches to Interpretation
External Comparison
Normative – compare performance to the (assumed normal) population. Norms may be stratified by age, gender, ethnic group, years of education etc.
Norms usually convert the raw score to a scalar, and these scalars differ across tests and batteries, e.g. standard index scores (mean = 100, SD = 15; WAIS IQ), T-scores (mean = 50, SD = 10; WASI subtests), and percentiles based on the population distribution (D&PT). Some use very limited scales (BADS, RBMT, Hayling & Brixton) and are more akin to criterion measures.
Problems: it can be difficult to compare across differently stratified data sets, and norms with gender/ethnicity strata imply difference.
Internal Comparison
Profile Analysis – analysis of the "scatter" of scores within the patient's performance (usually using normative data).
This assumes that in ‘normals’ cognitive functioning across different domains is roughly equivalent; that
there is limited “scatter”
This is the rationale behind the "significant difference" approach in WAIS/WISC/WPPSI analyses (sketched below) – it is important to remember that there is in fact wide scatter in normal performance, and some functions correlate poorly with each other (e.g. processing speed and verbal comprehension).
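A hedged Python sketch of that significant-difference logic – the index reliabilities below are assumptions for illustration, not values from any Wechsler manual:

```python
import math

sd = 15                                   # index scores: mean 100, SD 15
r_verbal, r_speed = 0.95, 0.88            # assumed reliabilities (illustrative)

sem_v = sd * math.sqrt(1 - r_verbal)      # SEM of each index
sem_s = sd * math.sqrt(1 - r_speed)
se_diff = math.sqrt(sem_v**2 + sem_s**2)  # standard error of the difference
critical = 1.96 * se_diff                 # gap needed at the 95% level (~12 points)

verbal, speed = 112, 89                   # hypothetical profile
print(abs(verbal - speed) > critical)     # True -> a 'significant' discrepancy,
                                          # though wide scatter is also normal
```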
Deficit Analysis – compare present performance to pre-morbid or optimal level of ability (usually
normative data)
Estimating pre-morbid ability (flagged as a possible exam topic, covered in depth on pages 15&16 of
Annie’s notes so I will briefly summarise – Annie’s notes include the CRITICAL EVALUATION/ETHICAL
DILEMMAS)
Recommended: use all available methods and then consider their relative merits in the individual client,
and similarity of estimate.
● Test, Battery and Composite scores – ignore or treat with caution (e.g. IQ). It is not helpful to
the reader to report on a “test by test” basis – useful to gather performance into meaningful
groups when reporting.
● Convert scores to percentiles where possible – this is understood by GPs, psychiatrists and physicians, and can be explained simply in the report and to clients. Some tests are criterion or 'functional' measures.
● Report percentiles and function ranges – this reads better and is understood easily, and avoids over-specificity and arbitrary differences. Report raw data in an appendix for the record and for future reference.
Suggested report structure:
1. Background
2. Neuropsychological Assessment
• Table of data, for future reference (this has been a matter for debate)
Summary outline
Indirect Assessment
1. Methods:
Mood and symptom inventories, Cognitive tests, Personality inventories, Projective tests
2. Relational
a. procedures are used to make statements about a person's performance in relation
to the performance of other people on the same test/procedure
b. Therefore, reliability important
i. consistency of scores/scoring
ii. what people the score is being compared to
3. Inferential
a. the behaviour sample collected is a sign of some underlying but unobservable
theoretical construct
b. Therefore, validity important
i. utility of scores for prediction of construct
ii. whether performance corresponds to construct
c. answers on a self-report questionnaire e.g. depressed mood
d. number of correct items on a reading test e.g. verbal intelligence
e. responses to an ambiguous design e.g. personality
e.g. Nina has a BDI score of 26; Gianni has a reading age of 10; Jo has a Rorschach score of 90PJ;
Nina’s BDI score is higher than average; Gianni’s RA is 2 years below his coevals; Jo’s Rorschach
scores are normal
Reliability
Consistency: same test results for an unchanged person
a. Internal Consistency (in itself)
b. Temporal Consistency (over time)
c. Scoring Consistency (over raters)
d. standard error of measurement (SEM)
e. confidence intervals for a given score (CIs)
Test Validity
Qualitative Credibility
1. Content validity, with test-user
2. Face validity, with test-taker
Criterion validity
1. Predictive: association with real world function;
2. Concurrent: association with performance on another test
Construct validity
1. Correlations matrix
2. Convergent versus divergent correlations
3. Factor analysis
4. Experimental analysis or known-group studies
Performance Validity
1. Extent to which test performance (test score) reflects the underpinning construct
2. Task comprehension
Has the client understood the task instructions? Are people doing badly in a test
because they didn't understand the instructions or because they forgot? Could it
be memory or attention difficulties?
i. cognitive status: attention, learning, memory
ii. cross-language issues
iii. cross-cultural issues
iv. idiom, metaphor & non-specificity in instructions, e.g.:
- "work as hard as you can" – people from different educational backgrounds may have different standards about this
- "go as quickly as you can without making too many mistakes"
- "tell me when you have finished"
3. EXAM – Task impurity – professional/ethical issues in neuropsychology: choice of tool and moral treatment of the person and results – competency and the ethical risks of reaching the wrong diagnosis if we use the wrong tools to assess – consider limitations, culture etc. There are no "pure" neuropsychological measures: all rely on multiple intact systems
i. can be difficult to interpret locus of deficit E.g., the verbal comprehension and
spatial construction skills required to perform a test of perception
ii. sensory and motor requirements are significant
1. most need good vision/acuity, hearing, motor function
2. a problem for older adults and the neurologically impaired
3. no tests are normed for the hearing- and/or visually-impaired
4. Non-cognitive factors affecting validity of tests
a. Age & development
b. Education
c. Substances & Medicines
d. Metabolic factors & physical health
e. Mood & Psychological Disorder
f. Language and culture
Transformation
Raw scores are meaningless:
1. require transformation
2. converted into a relative measure
i. based on a comparison of an individual's performance with that of others in the
standardization sample
ii. simplest might be comparing to the average e.g. mean = 5, so score of 9 is
“above mean”
iii. more useful if compared to average and known range e.g. M=5 and SD=2, so 9 is
“two SDs > mean”
3. Transformations are usually based on the normal distribution
a. Strictly, this entails that the raw scores were normally distributed
b. In some cases, non-normal distributions have been normalized
Normal Distribution
1. Animating principle in psychometrics
2. assumption that what is not normally distributed is difficult to study or not worth studying, is not a 'natural' property of the species, and lies outwith the scientific arena
3. With human (cognitive) function, is often wrong
a. scores at the low end of the distribution
b. criterion performances
Z-score Transformations
1. First derive the standard score
a. z = (Rn − M) / SD, where Rn is the raw score and M the sample mean
b. gives a normal distribution with M = 0 and SD = 1
c. some scores will have negative values
2. Then derive the standardized score, a transformation based on the z-score
a. a new mean and SD are chosen and applied
b. multiply the z-score by the new SD, then add the new mean
c. negative scores are thereby avoided
Percentile transformations
1. A percentage can be anywhere
2. A percentile is a place in a line (1-100)
3. Can be based on normal distribution or any non-normal distribution
a. Usually reported as cumulative percentile (non- normal distribution)
4. The percentile rank is the percentage of people who would have obtained a lower score.
5. Percentile range captures upper and lower limits
a. e.g., scoring in the 10-24th %ile range implies that 75% of people have scored
higher, 9% of people score lower.
6. Cumulative percentage shows how many people had that score or lower.
Interpretative Approaches
a. External Comparison
i. Normative
1. Compare to the ‘normal population’
2. Norms may be stratified
3. Convert the raw score to a scalar
Problems:
1. Structure of norms and scalars offered differs
2. Difficult to compare differently stratified sets
3. Stratification may imply ‘difference’
ii. Criterion
1. Compare performance to an accepted standard
2. Easier “tests”: therefore provide less information
3. Usually define a “cut-off” point
Problems: tests of basic language skills, motor skills, balance and apraxia are necessarily criterion tests
b. Internal Comparison
i. Profile Analysis
1. Analysis of “scatter” within the patient’s scores
2. Identify strengths and weaknesses
3. Match pattern to “known” profiles for diagnosis
Problems:
1. Assumes typically equivalent functioning across domains
2. But wide variability observed in typical performance
3. Some functions correlate poorly with each other
ii. Deficit Analysis
1. Compare present performance to ‘pre-morbid’
2. Richer interpretation: independent of ability level
Problems:
1. Requires estimate of pre-morbid ability
2. Difficult to gain a reliable estimate
3. Most techniques generate single “value”