
C7 Neuropsychology – Interpreting Psychometric Data

Main themes in exam questions


● Effects of psychological presentation/condition on test performance
● Example of a client and named condition – outline the neuropsychological assessment
● Means of assessing optimal (pre-morbid) ability
● Methods of assessing concept formation
● Headline question about psychometrics: means of assessing: reliability, validity, pre-morbid
functions, dissimulation, transformation and standardization
● Performance validity – effect of non-cognitive factors on test performance e.g. education,
substances, metabolic states

N.B. Matthew highlighted that his lectures don’t follow the 1 lecture: 1 exam question format – I’ve
picked out the exam themes which I believe are most relevant to the content of this lecture but they will
all probably need a bit more contextual information from the other lectures to answer fully

Main references:
BPS (2009). Assessment of effort in clinical testing of cognitive functioning for adults. (Dissimulation)

Herrnstein & Murray (1994). The Flynn effect.

Kapur (1994). The 'coin in the hand' test of dissimulation/malingering.

Mitrushina et al. Norms stratified by ethnic group and gender.

Summary of lecture notes


Themes from the lecture put into the exam assessment criteria:

● Theory applied to practice

How indirect measures work – their reliance on relation and inference – and the importance/rationale of
considering reliability, construct validity, performance validity and standardisation when designing a
test, choosing an assessment and interpreting the results.

● Critical evaluation

When using psychometric tests, how can we justify the move from a performance (e.g. number of items
correct on a reading list) to a construct (e.g. vocabulary)?

Domains such as personality, IQ, visual memory, verbal memory, visuospatial skills etc. are constructs
and not facts.

Factors which may influence/impact upon reliability, construct validity, performance validity and
standardisation when designing a test/choosing an assessment and interpreting the results.
● Professional and ethical considerations

Impact of diagnosis, and difficulty with diagnosis, particularly given the potential unreliability/invalidity
of measures – for example when working with clients who are not native English speakers, who are
clearly distressed, or who may be under the influence of medication. Wider impact of the diagnosis
on the individual and family – we can bring in our own clinical examples

Order of notes in the lecture hand out:

● Direct and Indirect Assessment


● Standardization
● Transformation
● Performance Validity
● Approaches to Interpretation
● Interpreting and reporting data

Order of revision notes and how the lecture was delivered:

● Direct & Indirect assessment


● Reliability
● Validity
● Standardization
● Performance validity
● Transformation
● Approaches to interpretation
● Interpreting and reporting data

● Direct and Indirect Assessment

Direct assessment – “directly accessing what you are interested in looking at”

For example through:

● Observation
● Self-report and self-monitoring
● Role play (e.g. assessing social skills in children)
● Some physiological measures (e.g. blood pressure)

Direct assessment is assumed to be referential and unmediated (although query this for self-report
measures). Performance can be assessed without reference or comparison to anyone else’s behaviour.

Interpretation of this type of data is often based on social normative status e.g. Jane banged her head 9
times in a one-hour period – expectation that Jane would not bang her head at all.

Indirect Assessment – “indirectly accessing what you are interested in looking at through formalised
measures – psychometric testing”

For example through:


● Personality inventories and projective tests
● Mood and symptom inventories
● Cognitive tests

The key assumptions underlying indirect assessment are that it is:

● Relational – you cannot understand one person's score without an understanding of all persons'
scores
● Inferential – the scores point to something which ‘cannot be seen’ (measured directly) e.g. IQ,
personality type

Relation – Reliability and Standardisation are very important

Statements can only be made about one person’s performance in relation to the performance of other
people on the same test when using indirect measures. E.g. Nina’s BDI score is higher than average.

● Reliability = Consistency (simplified definition) “if you were to repeat the test given the same
conditions would you achieve the same results”

Reliability is concerned with how people perform in relation to one another – if, for example, a weighing
scale measured everyone as 5kg below their true weight, this would still be a reliable measure: everyone
would consistently come out at their true weight minus 5kg and as such could still be understood in
relation to everyone else.

Reliability is affected by:

● Time – depending on the construct being measured. For example, we would expect intelligence
to be consistent over time, whereas something like anxiety we would expect to alter in different
scenarios
● Place – the environment can impact on a person's performance. Good psychometric tests
should set out the environment in which the test should be administered – differences in
room temperature, brightness and the number of people present may all impact on scoring
● Person – the individual may have reasons for altering their performance on the measure –
perhaps they don't like the examiner, perhaps they are distractible that day, perhaps they want
to demonstrate decline
● In itself – this is relevant when measuring results for an unchanged person. When repeating a
measure an individual will always be changed; this is acceptable in terms of reliability, however,
as all individuals repeating the measure will be changed in the same way – they will all have
completed the measure before – so relationally the result can still be meaningful

Types of reliability:

● Internal consistency – concerned with whether all the items within the measure tap the same
construct. The expectation is that such items will be responded to in a similar way, as they are all
designed to tap into the same construct. Can be measured using split-half reliability – the test is
separated into two halves and the halves are correlated; a strong correlation is expected between
items which measure the same construct (see the sketch after this list).
● Temporal consistency – consistency of the test over time. Can be measured using test–retest
reliability (repeating the test at different time points and correlating the results – a strong
correlation is expected; on re-testing everyone may achieve a different score due to familiarity,
but so long as everyone moves up or down consistently the scores remain meaningful). Could use
an equivalent-forms coefficient, e.g. story recall – change the content of the story for the re-test
condition – though this is subject to problems in content sampling due to the changes.
● Scoring consistency – the degree to which different raters give consistent estimates of the same
behaviour. Can be measured using Cohen's kappa. In clinical practice the Standard Error of
Measurement (SEM) is the more useful value: confidence intervals report two values between
which the 'true' score may fall – "how confident am I in this score?", not "how reliable is the
measure?"
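
A minimal sketch of how these reliability quantities might be computed – the item-score matrix, the odd/even split and all values are invented for illustration, with numpy assumed available:

```python
import numpy as np

# Invented item-response matrix: rows = examinees, columns = test items.
scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
])

# Split-half reliability: correlate totals on the two halves (odd vs even items).
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# The Spearman-Brown correction steps the half-test correlation up to full length.
r_full = (2 * r_half) / (1 + r_half)

# Standard Error of Measurement: SEM = SD * sqrt(1 - reliability).
sd_total = scores.sum(axis=1).std(ddof=1)
sem = sd_total * np.sqrt(1 - r_full)

# 95% confidence interval: two values between which the 'true' score may fall.
observed = 6
print(f"split-half r = {r_half:.2f}; full-length r = {r_full:.2f}; SEM = {sem:.2f}")
print(f"95% CI for an observed score of {observed}: "
      f"{observed - 1.96 * sem:.1f} to {observed + 1.96 * sem:.1f}")
```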

● Validity

Inference – Construct Validity and Performance Validity are very important

“The behaviour sample collected is indicative of an underlying but unobservable construct” e.g. number
of correct items on a reading test = vocabulary, responses to an ambiguous design (Rorschach) =
personality

Construct validity = ‘meaning’ (although Matthew doesn’t like this word!)

● What are the assumptions of the test/task?


● What constructs are inferred from it?
● What predictions can we make based on these scores?

Validation is the process of examining these questions and supporting them with empirical evidence –
put simply, is the underlying construct which you are trying to measure a true, independent and
measurable phenomenon, or is it dependent on something else? E.g. a self-report measure of pain – we
know pain exists but it cannot be measured directly; however, a self-report measure of pain may be
mediated by mood state – is the test measuring 'pain' or 'mood state'?

Measures may be valid but unreliable – for example a joke-funniness measure, or one-shot
problem-solving tests

Construct validity can be assessed in the following ways:

● Qualitative credibility – Content validity: "resonance with the examiner" (subjective assessment
based on clinical/professional judgement). Face validity: "resonance with the examinee"
(subjective assessment from the examinee – does the test seem to measure what it is
supposed to? e.g. a word-recognition list measuring vocabulary is quite transparent; a Rorschach
test measuring personality is not)
● Criterion prediction – "usefulness". Can the measure predict performance on other tests of the
same construct? – concurrent validity. Can the test predict performance in real-world
functioning? – predictive validity. How well does the test 'catch' the people who meet the
criterion and leave out those who do not? – sensitivity & specificity, e.g. on a test measuring
depression, those diagnosed with clinical depression should obtain higher scores than those
with no formal diagnosis (in theory! – see the sketch after this list)
● Convergent/Divergent correlation – the test should correlate well with tests which measure the
same/similar constructs, and should not correlate with tests which measure unrelated
constructs. Can be assessed using factor analysis (a statistical measure) as well as
experimental manipulation, e.g. carrying out a memory assessment with a group known to have
memory problems and comparing their performance to a control group without memory
problems.
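
As a toy illustration of the sensitivity/specificity idea (the cut-off, scores and group sizes are all invented for the example):

```python
# Toy example: a depression measure with a cut-off score, applied to people
# with and without a formal diagnosis. All numbers are invented.
cutoff = 14
diagnosed_scores = [22, 18, 15, 30, 12, 25]   # clinically depressed group
control_scores = [5, 9, 16, 7, 11, 3]         # no formal diagnosis

true_pos = sum(s >= cutoff for s in diagnosed_scores)  # cases 'caught'
false_neg = sum(s < cutoff for s in diagnosed_scores)  # cases missed
true_neg = sum(s < cutoff for s in control_scores)     # non-cases left out
false_pos = sum(s >= cutoff for s in control_scores)   # non-cases wrongly 'caught'

sensitivity = true_pos / (true_pos + false_neg)  # proportion of cases caught
specificity = true_neg / (true_neg + false_pos)  # proportion of non-cases left out
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```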

● Standardization (of administration and population)

The main purpose of standardization is to establish norms of performance against which any individual
subjected to an assessment procedure may be compared.

Norms are:

● Specific to the procedures followed in administering the tests during the development of the
test e.g. in a brightly lit room, controlled temperature with only the examiner present
● Specific to the population sample in which the test was standardised e.g. with individuals aged
18–45, high-school level of education, 50:50 male-to-female ratio – CRITIQUE – individuals with
learning disabilities are excluded from the normative sample populations on which IQ is collected
● Not enduring – norms such as IQ improvement in scores over time (Flynn effect) – norms should
be updated but often they are not

Therefore results are only meaningful (valid) if:

● The same administrative procedures are used as in standardisation – good psychometric tests
will have clear guidelines for the expected environment (e.g. desk and chair, no one else present
in the room), instruction giving – usually a script e.g. “Listen carefully, I am going to read out ten
words…”, how to deal with errors e.g. verbal prompts, non-verbal cues which are permitted,
stimulus layout e.g. task A must be presented first followed by task B, scoring criteria e.g. award
one mark for accuracy, award bonus point if completed in less than 10 seconds - CRITIQUE –
these tests require practice to administer in order to ensure validity – in reality a) how much
practice do clinicians have and b) how likely are we to be able to create the same, standardised
environments
● The client being assessed is representative of the standardisation sample. Tests are often
stratified by demographic variables such as:
o Age
o Years of formal education (not many tasks do this – Tombaugh (2004) found education
norms for the Trail Making A & B tests)
o Gender
o Ethnicity
CRITIQUE/PROFESSIONAL AND ETHICAL ISSUE – these have been identified as variables which
may matter in terms of performance. They may not necessarily matter; conversely, there may be
other variables which have an impact but have not been identified. It is a psychological decision
to stratify in this way.
● Typical performance (not the same as performance validity) has not changed since
standardisation, e.g. the Flynn effect with IQ. Cohort effects – experiential and cultural factors
(e.g. the gender gap) and educational opportunities. For example, standardised American tests
involving identification of a 'fire hydrant' – the term may be unfamiliar to British examinees, but
this could be mediated by culture: younger British generations are more familiar with American
culture through TV, film and music. Tests of orientation asking who the prime minister is are
mediated by the prevailing cultural focus on politics – not knowing the answer does not mean
the individual is not orientated. "How are hibernation and migration alike?" – migration was
previously most commonly associated with animal behaviour, but it is now a hot topic on the
political agenda concerning the migration of people, so people are less likely to automatically
associate migration with hibernation

These three assumptions (same administrative procedures, representative sample and typical
performance) are often violated in clinical practice.

In some instances however standardisation can limit validity:

e.g. when using the Wechsler similarities to assess verbal abstraction (deducing conceptual
relationships) persons who are persuaded by the questions format “how are chair and table alike” to
focus on descriptive similarities will achieve a low score e.g. “they both have four legs” is correct and
descriptive but would not be awarded points as the desired answer is “they are both items of furniture”
– furniture being the desired conceptual category.

This does not mean the examinee has difficulties with verbal abstraction, but due to standardisation the
examiner may not instruct or query further.

● Transformation (or standardization of scoring)

Since raw scores are in themselves meaningless, standardization also refers to the chosen means of
comparing an individual’s raw score to an external scale in order to determine their position in relation
to the standardisation sample.

When the raw score is converted into a relative measure this:

o Enables a comparison of the individual's performance with that of others in the standardisation
sample (across age, gender, ethnicity, years of education)
o Enables comparison of an individual’s performance on two assessment instruments (provided
both instruments have been standardised using the same norms)

The simplest approach might be merely comparing a result to the mean e.g. mean = 5 so 9 is ‘above the
mean’

A more useful approach is comparing to the mean and variance e.g. mean = 5, standard deviation (SD) =
2 so 9 is ‘two standard deviations above the mean’

This requires that the mean (measure of central tendency) and the standard deviation (measure of
variation around that centre) are themselves useful parameters. In order for this to be the case they
must be based on a normal distribution (Gaussian, bell curve). That is, standardizing transformations
entail that the raw scores were normally distributed. In most cases, non-normal distributions have been
normalised – ‘forced to fit’ the curve using statistical tailoring.
Image 1. Examples of normal distribution – bell curve
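
One common way a non-normal distribution is 'forced to fit' is rank-based normalisation: each raw score is converted to its percentile position and mapped onto the corresponding point of the normal curve. A minimal sketch (scores invented; scipy assumed available):

```python
import numpy as np
from scipy.stats import norm, rankdata

# Invented, positively skewed raw scores (most people score low).
raw = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 8, 11, 15])

# Rank each score, convert ranks to cumulative proportions, then map those
# proportions onto the standard normal curve (a 'normalised' z-score).
ranks = rankdata(raw)                   # ties receive average ranks
proportions = (ranks - 0.5) / len(raw)  # keep proportions away from 0 and 1
z_normalised = norm.ppf(proportions)

# Re-express on a familiar scale, e.g. a Wechsler-style M=100, SD=15 index.
standardised = 100 + 15 * z_normalised
print(np.round(standardised))
```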

CRITIQUE - The Normal Distribution

There is the implicit assumption that what is not normally distributed:

● Is difficult to study, examine or assess


● Is not worth studying in that it cannot be a natural property of the species
● Cannot be subjected to rigorous statistical analysis and is outwith the scientific arena

This is plainly false as other scientific disciplines do not rely on this assumption.

The assumption of normal distribution in human function is often wrong

e.g. Scoring Distributions – for most cognitive functions (and other physiological properties) there is a
‘hump’ in the low end of the score distribution due to the cohort of people with congenital disorders,
birth problems, acquired impairments etc. Due to improvements in medical care and survival rates this
statistical ‘problem’ is likely to increase.

e.g. Tests of language based on picture naming – in any language there will be a small number of high
frequency words/names that all adults know and a greater number of less frequently used words/names
that any given sample of adults will not know

Linear Transformations

The ‘standard’ score is the z-score. Raw scores are transformed to the z-scores by subtracting the
sample mean, then dividing by the SD. This yields a normal distribution with a mean = 0 and SD = 1.
(Thus some scores will have negative values).

The standard score of a raw score x is:

z = (x − μ) / σ

where:

μ is the mean of the population;

σ is the standard deviation of the population

Standardized scores are transformations based on the z-score: a new mean and SD are chosen and
applied, by multiplying by the new SD then adding the new mean. (Thus negative scores can be avoided.)

Standard scores express the individual’s distance from the mean in terms of the standard deviation of
the distribution of test scores in the standardisation sample.

Familiar standardized scores include:

● Wechsler Indices and “IQ”: Mean = 100 SD= 15 (WAIS, WISC, WMS, CMS etc)
● Wechsler Scaled Scores: Mean = 10, SD = 3 (subscales - WAIS, WISC, WMS, CMS etc)
● T-scores: Mean = 50, SD = 10 (indices of WASI and BSI)

Stanine (Standard-nine) – normalized score based on a pre-defined 9-point scale, previously used widely
in achievement tests (education and occupational psychology). Mean = 5, SD = 2

STENs (Standard-ten) – normalized score based on a pre-defined 10-point scale: has no mid-point (as
mid-point is 5.5) and so 5 is just below average and 6 is just above. Rarely used. Mean = 5.5, SD = 2.
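
A minimal sketch of these linear transformations, using an invented raw score, sample mean and SD:

```python
# Invented values: raw score of 34 on a test with sample mean 25 and SD 6.
raw, mean, sd = 34, 25, 6

# Standard (z) score: distance from the mean in SD units.
z = (raw - mean) / sd                  # = 1.5

# Standardized scores: multiply by the new SD, then add the new mean.
wechsler_index = 100 + 15 * z          # Wechsler index/IQ scale -> 122.5
scaled_score = 10 + 3 * z              # Wechsler scaled score   -> 14.5
t_score = 50 + 10 * z                  # T-score                 -> 65.0
stanine = round(5 + 2 * z)             # stanine, rounded onto the 1-9 scale -> 8

print(z, wechsler_index, scaled_score, t_score, stanine)
```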

Area transformation

The ‘area’ refers to the area ‘under the normal curve’ e.g. at the mean 50% of people are below and
50% above the score. The most familiar are percentile ranks or ranges.

Whilst a percentage could be taken from any part of the sample, a percentile is a place in the order
(from 1 to 100, usually 1 being the lowest score). Scores are expressed not as an amount of the
construct but as a place in the ordinal scale.

Percentile rank: the percentage of people who would have obtained a lower score, e.g. at the 40th
percentile, 39% of people would have scored lower and 60% would have scored higher. Based on a normal
distribution…

…expressed as Cumulative percentile if based on non-normal distribution (or when range is highly
attenuated) and expresses the percentage of people who would have achieved that score or lower. They
are based on the frequencies of people scoring at a certain level.

Percentile range: captures the upper and lower limits e.g. scoring in the 10-24th percentile range implies
that 75% of people would have scored higher, though 9% scored lower. Conventionally the quartiles (25,
50, 75) and deciles (10, 90) have a special status

Classically the 5th percentile has been used as a cut-off point, scores below which are regarded as
impaired.
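
A sketch of both routes to a percentile (values invented; scipy assumed available):

```python
from scipy.stats import norm

# Route 1: percentile rank from a standard score, assuming a normal
# distribution -- the area under the curve below that score.
index_score = 78                      # Wechsler-style, M=100, SD=15
z = (index_score - 100) / 15
percentile = 100 * norm.cdf(z)        # ~7th percentile
print(f"Percentile rank: {percentile:.0f}")
print("Impaired range" if percentile < 5 else "Above the 5th-percentile cut-off")

# Route 2: cumulative percentile from raw frequencies, for a non-normal
# (or attenuated) distribution: % of people scoring at that level or lower.
sample_scores = [3, 4, 4, 5, 5, 5, 6, 6, 7, 9]   # invented sample
score = 5
cum_pct = 100 * sum(s <= score for s in sample_scores) / len(sample_scores)
print(f"Cumulative percentile for a score of {score}: {cum_pct:.0f}")
```
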
Age and grade equivalents: scores based on the means for given ages or school years, previously used
widely in developmental and achievement tests (e.g. WRAT-3)

An individual's score can be compared to those of others at the same age/year.

The score of a low-functioning individual may be (inauthentically) expressed as an age-equivalent, e.g. a
reading age – it would not be unusual for an adult to have a reading age of nine years and still be able to
function

PROFESSIONAL AND ETHICAL ISSUES – the impact on the individual of being told they have
developmental functioning similar to a child's – there is argument as to whether this is a helpful way of
expressing function, as an adult with years of life experience will present globally as functioning in a
different way to a child. Consider individuals diagnosed with dementia, brain injury or LD being
patronised – functioning expressed as an age equivalent may be damaging. The counter-argument is
that age-equivalent expression of function may be easier for staff/families to comprehend, and may
facilitate understanding and manage expectations

● Performance Validity

The extent to which performance on a test reflects the underpinning construct. Mainly affected by:

● Task comprehension – was the examinee doing what they were supposed to do?
● Task impurity – in order to test visuospatial skills the examinee must also have intact verbal
ability in order to listen to and comprehend the task
● Non – cognitive factors
● Motivation and dissimulation

Task impurity

There are no ‘pure’ neuropsychological measures:

● All rely on multiple intact systems


● It can be difficult to interpret the locus of the deficit
● Sensory and motor requirements are significant:
● Most need vision/acuity, hearing and motor function
● This can be a problem for older adults and the neurologically impaired
● No tests have been normed for the hearing and/or visually impaired

Non-cognitive factors (contributing to performance) – Annie covers these in more detail on page 18 of
her notes so I won't repeat them beyond a summary (this area has also been flagged as a potential exam
topic!):

● Education
● Substances
● Medicines
● Metabolic factors
● Mood and Psychological Disorder
● Cross-culture and cross-language deficits
Motivation and dissimulation – flagged as a potential exam question several times. See the BPS
document on moodle (mentioned at the top of the page under references) for guidance. Again Annie
has already covered this on pages 16 & 17 of her notes so I’ve briefly summarised and added in some
additional bits:

Definition:

● Dissimulation is concealment of one’s thoughts, feelings or character; pretence


● It is intentional falsification or misrepresentation of symptoms, by over-representation or
under-representation, with the intention to appear different from the 'true' state.

Suspected where:

● Association with reward (e.g. a compensation claim for injury) or avoidance of penalty (e.g.
diminished responsibility in a criminal case)
● Presence of antisocial (or other) personality disorder
● Discrepancy between complaints and findings
● Lack of co-operation with assessment/treatment
● Atypical, bizarre or extreme symptoms
● Symptom inconsistency
● Symptom variability
● Nil or inconsistent response to treatment

Identifying dissimulation:

‘Embedded’ Tests & Indices:

● ‘Easy tests’ – digit span forwards, memory recognition tests


● Statistical properties of tests – scoring at chance or below chance level – below chance level
indicates deliberate avoidance of correct response
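
A sketch of the below-chance logic for a two-alternative forced-choice test (trial counts invented; scipy assumed available) – a score reliably below the 50% chance level suggests the correct responses were being deliberately avoided:

```python
from scipy.stats import binomtest

# Invented example: 50 two-alternative forced-choice trials, 15 correct.
# A pure guesser would average about 25 correct (p = 0.5 per trial).
n_trials, n_correct = 50, 15

# One-sided test: how likely is a score this low (or lower) from guessing alone?
result = binomtest(n_correct, n_trials, p=0.5, alternative='less')
print(f"p = {result.pvalue:.4f}")  # a small p suggests below-chance responding
```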

Stand-alone Tests:

Easy tests (that look harder than they are), e.g. the Rey 15-item test; Kapur's "coin in the hand" test –
ETHICAL AND PROFESSIONAL/CRITIQUE – Kapur (1994) cited a case study where a client had performed
at chance level on this test, not because of malingering but possibly because of poor cooperation related
to behavioural problems arising from frontal lobe damage – relates back to 'task impurity'

Forced-choice recognition of words or pictures, e.g. the TOMM, (Green) WMT

• NB: Most of these 'manufactured' tests have been investigated only for sensitivity/specificity
(detection of 'caseness': the degree to which the accepted standardized diagnostic criteria for a given
condition are applicable to a given patient) and so lack the inferential move necessary for adequate
interpretation. That is, the tests must be considered in combination with other methods: clinical
judgement, observation etc.
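
One way to see why sensitivity/specificity alone do not complete that inferential move: the probability that a failed validity test actually indicates dissimulation depends on the base rate, which those figures do not supply. A sketch with invented numbers:

```python
# Invented figures: a validity test with sensitivity 0.80 and specificity 0.90,
# used in a setting where only 10% of examinees are actually dissimulating.
sensitivity, specificity, base_rate = 0.80, 0.90, 0.10

p_fail_given_dissim = sensitivity
p_fail_given_genuine = 1 - specificity

# Bayes' theorem: probability of dissimulation given a failed test.
p_fail = base_rate * p_fail_given_dissim + (1 - base_rate) * p_fail_given_genuine
ppv = base_rate * p_fail_given_dissim / p_fail
print(f"P(dissimulating | failed test) = {ppv:.2f}")  # ~0.47, far from certain
```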

● Approaches to Interpretation

External Comparison

Criterion – compare performance to an accepted 'basic' standard of performance achieved by the normal
population. Tests of basic function (e.g. language/aphasia, motor/apraxia) are necessarily criterion tests.
The test aims to establish if the person can do X or Y, not how well, e.g. Stycar norms in children, brief or
bedside tests of functioning in adults.
These tests tend to be easier and therefore provide less information. They usually define a 'cut-off' point:

● Based on clinical (subjective) impression


● Or below average normative data (e.g. VOSP & BIT – behavioural inattention test used to assess
unilateral visual neglect)

Normative – compare performance to the (assumed normal) population. Norms may be stratified by
age, gender, ethnic group, years of education etc. Raw scores can be converted to scalar. E.g.:

● Age stratified – Wechsler, DKEFS, D&PT


● Stratified for years of education – verbal fluency
● Stratified for ethnic group, gender – Mitrushina et al
● Not stratified – GNT, Hayling & Brixton, HVOT

Problems: It can be difficult to compare across differently stratified data sets. Norms with
gender/ethnicity strata imply difference

Norms usually convert the raw score to a scalar. These also differ across different tests and batteries.
e.g. standard index scores, mean = 100, SD =15 (WAIS IQ), T-scores, mean = 50, SD =10, (WASI subtests),
Percentiles based on population distribution (D&PT)

Some use very limited scales (BADS, RBMT, Hayling & Brixton) and are more akin to criterion measures.

Internal Comparison

Profile Analysis – analysis of “scatter” of scores within the patient performance (usually, normative
data).

This assumes that in ‘normals’ cognitive functioning across different domains is roughly equivalent; that
there is limited “scatter”

● Compare attention to memory, executive functioning etc.


● Identify strengths and weaknesses – useful for rehabilitation
● Match pattern to known profiles, for neuropsychological diagnosis

This is the rationale behind the “significant difference” approach in WAIS/WISC/WPPSI analyses – it is
important to remember that there is in fact wide scatter in normal performance, and some functions
correlate poorly with each other (e.g. processing speed and verbal comprehension)
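
A sketch of the arithmetic behind a 'significant difference', under the standard assumption that the SEM of a difference combines the two indices' SEMs (the reliability values here are invented):

```python
import math

# Invented reliabilities for two Wechsler-style indices (M=100, SD=15).
r_a, r_b, sd = 0.92, 0.88, 15

sem_a = sd * math.sqrt(1 - r_a)
sem_b = sd * math.sqrt(1 - r_b)

# SEM of the difference between the two scores, and the critical difference
# required for significance at the .05 level (z = 1.96).
sem_diff = math.sqrt(sem_a**2 + sem_b**2)
critical_diff = 1.96 * sem_diff
print(f"Critical difference = {critical_diff:.1f} points")
# NB: a 'significant' difference is not necessarily a rare one -- wide scatter
# is common in normal performance.
```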

Deficit Analysis – compare present performance to pre-morbid or optimal level of ability (usually
normative data)

● Requires estimate of pre-morbid ability


● Compare optimal to current performance
● Gives richer interpretation, as independent of ability level

Main drawbacks are:

● It is difficult to gain an estimate


● Difficult to be confident in any estimate achieved
● Most techniques generate a single "value" rather than separate estimates of, for example, verbal
skill, memory function, executive functions and processing speed

Estimating pre-morbid ability (flagged as a possible exam topic, covered in depth on pages 15&16 of
Annie’s notes so I will briefly summarise – Annie’s notes include the CRITICAL EVALUATION/ETHICAL
DILEMMAS)

Assessing the client using the following criteria:

● Over-learned Skills – commonly reading ability


● Best Performance – based on the assumption that individuals perform at a similar level across
all tests – using the ‘best’ score as an estimate of premorbid functioning
● Demographic-based – assumptions about socioeconomic status and years of education in terms
of functioning
● Clinical Judgement – based on the individual’s history and if available, prior assessments

Recommended: use all available methods and then consider their relative merits in the individual client,
and similarity of estimate.
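
A minimal sketch of the deficit-analysis comparison, assuming a premorbid estimate and a current index on the same M=100, SD=15 scale (values invented):

```python
# Invented values: premorbid estimate from an over-learned reading test,
# current score from a memory index; both on an M=100, SD=15 scale.
premorbid_estimate = 108
current_memory_index = 82

discrepancy = premorbid_estimate - current_memory_index
discrepancy_in_sds = discrepancy / 15
print(f"Drop of {discrepancy} points (~{discrepancy_in_sds:.1f} SDs) "
      "from estimated optimal level")
# Interpretation is richer than the normative comparison alone: an 'average'
# current score may still represent a marked decline for an able individual.
```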

● Interpreting and reporting data

Data may be reported in several ways.

● Test, Battery and Composite scores – ignore or treat with caution (e.g. IQ). It is not helpful to
the reader to report on a “test by test” basis – useful to gather performance into meaningful
groups when reporting.
● Convert scores to percentiles where possible – this is understood by GPs, psychiatrists and
physicians, and can be explained simply in the report and to clients. Some tests are criterion or
'functional'.
● Report percentiles and function ranges – reads better and is understood easily. Avoids over
specificity and arbitrary differences. Report raw data in appendix for record and future.

Suggested layout of reports

1. Background

• Reason for referral


• Relevant history
• Current problems

2. Neuropsychological Assessment

• Orientation and Presentation


• Attention and Executive Functions
• Learning and memory
• Visuo-spatial Functions
• Verbal-Academic Functions
• Further functions examined, Mood

3. Opinion and Recommendations


• Summary (of results)
• Formulation (diagnosis)
• Recommendations – treatment, rehabilitation, onward referral

4. Raw Scores & Results

• Table of data, for future reference (this has been a matter for debate)

C7 Interpreting Psychometric Data


References:
Kaufman, A. S. & Lichtenberger, E. O. (1999). Essentials of WAIS-III Assessment. Chichester:
Wiley.
Hebben, N. & Milberg, W. (2002). Essentials of Neuropsychological Assessment. Chichester:
Wiley.
Lichtenberger, E. O., Kaufman, A. S. & Lai, Z. C. (2001). Essentials of WMS-III Assessment.
Chichester: Wiley.

Two Forms of Assessment


Direct
1. Methods:
a. Observation, Self-report and self-monitoring, Role-play (e.g., social skills), Some
physiological measures (e.g., BP)
2. Unmediated - not communicated via, or transformed by, an intervening medium or agency
3. Referential - results directly ‘refer to’ the individual, can be interpreted without comparing
to anyone else
e.g. Jane banged her head 9 times in a one-hour period; Terry made 16 critical comments to his
wife during the interview; John spent 10% of the lesson paying attention to his teacher

Indirect
1. Methods:
Mood and symptom inventories, Cognitive tests, Personality inventories, Projective tests
2. Relational
a. procedures are used to make statements about a person's performance in relation
to the performance of other people on the same test/procedure
b. Therefore, reliability important
i. consistency of scores/scoring
ii. what people the score is being compared to
3. Inferential
a. the behaviour sample collected is a sign of some underlying but unobservable
theoretical construct
b. Therefore, validity important
i. utility of scores for prediction of construct
ii. whether performance corresponds to construct
c. answers on a self-report questionnaire e.g. depressed mood
d. number of correct items on a reading test e.g. verbal intelligence
e. responses to an ambiguous design e.g. personality
e.g. Nina has a BDI score of 26; Gianni has a reading age of 10; Jo has a Rorschach score of 90PJ;
Nina’s BDI score is higher than average; Gianni’s RA is 2 years below his coevals; Jo’s Rorschach
scores are normal

Reliability
Consistency: same test results for an unchanged person
a. Internal Consistency (in itself)
b. Temporal Consistency (over time)
c. Scoring Consistency (over raters)
d. standard error of measurement (SEM)
e. confidence intervals for a given score (CIs)

Standardization: the derivation of the norms of performance to which the individual is compared.
a. Administrative: test format that permits consistency
i. standard rubric, stimulus layout, instructions
ii. processes for dealing with errors
iii. recording form, scoring and grading criteria
iv. NB: good tests are supplied with printed or verbatim instructions, a detailed
record form, and strict scoring criteria. They take practice to learn how to
administer.
b. Normative: people represented in the test sample. Who was the test developed for?
Who was used for standardisation and creating norms?
i. Norms may be stratified by demographic variables
ii. age, gender, ethnicity?
iii. years in formal education
iv. NB: Typical performance changes over time due to:
1. Flynn effect – IQ scores improve c.15 points per generation
2. Cohort effects: experiential and cultural factors e.g., gender gap
Standardization Consistency. Scores are only meaningful:
a. if the same administrative procedures are used as during test development
b. if the examinee is represented in the test development sample
c. These requirements are often violated

Norms are specific:


a. to the procedures followed in administering the tests during test development
b. instructions and procedures
c. scoring and interpretation
d. to the test development normative sample
i. from what persons has the test data been derived
ii. people change with time/cohorts: so norms are not enduring

Test Validity

Qualitative Credibility
1. Content validity, with test-user
2. Face validity, with test-taker
Criterion validity
1. Predictive: association with real world function;
2. Concurrent: association with performance on another test

Construct validity
1. Correlations matrix
2. Convergent versus divergent correlations
3. Factor analysis
4. Experimental analysis or known-group studies

Performance Validity
1. Extent to which test performance (test score) reflects the underpinning construct
2. Task comprehension – has the client understood the task instructions? Are people doing
badly on a test because they didn't understand the instructions, or because they forgot?
Could it be memory or attention difficulties?
i. cognitive status: attention, learning, memory
ii. cross-language issues
iii. cross-cultural issues
iv. idiom, metaphor & non-specificity in instructions, e.g.:
v. "work as hard as you can" – people from different educational backgrounds
may have different standards about this
vi. "go as quickly as you can without making too many mistakes"
vii. "tell me when you have finished"
3. EXAM – Task impurity – professional/ethical issues in neuropsychology: choice of
tool and moral treatment of the person and their results – competency, and the ethical risks
of reaching the wrong diagnosis if we use the wrong tools to assess – consider
limitations, culture etc.
There are no "pure" neuropsychological measures: all rely on multiple intact systems
i. can be difficult to interpret the locus of a deficit, e.g. the verbal comprehension and
spatial construction skills required to perform a test of perception
ii. sensory and motor requirements are significant
1. most need good vision/acuity, hearing, motor function
2. problem for older adults and the neurologically impaired
3. no tests normed for the hearing- and/or visually-impaired.
4. Non-cognitive factors affecting validity of tests
a. Age & development
b. Education
c. Substances & Medicines
d. Metabolic factors & physical health
e. Mood & Psychological Disorder
f. Language and culture

5. Engagement and dissimulation


a. Motivation to take part in the examination
b. Effort
c. Malingering – for reward (e.g. compensation) or to avoid penalty
d. Psychodynamics – conversion
e. Faking
f. Suspected where:
i. association with reward or avoidance of penalty;
ii. presence of antisocial (or other) personality disorder
iii. discrepancy between complaints and findings;
iv. lack of co-operation with assessment/treatment;
v. atypical, bizarre or extreme symptoms
vi. symptom inconsistency
vii. symptom variability
viii. nil or inconsistent response to treatment.
g. Detecting Dissimulation:
i. Embedded Tests & Indices:
1. ‘Easy’ and/or robust tests, low scores are always suggestive
a. Digit Span forward;
b. Recognition Memory: very easy for TDI and mildly impaired
2. Statistical properties of tests
a. Chance or below-chance responding on Forced-choice
Recognition
i. Scores at chance level are same as never having seen
originals
ii. Below-chance scores suggest avoiding giving correct
response
3. Stand-alone Tests:
a. Easy tests (that look harder than they are). e.g., Rey 15 item;
e.g., Kapur’s “coin in the hand” test
b. Forced-choice recognition of words or pictures e.g., ToMM;
e.g., (Green) WMT

Transformation
Raw scores are meaningless:
1. require transformation
2. converted into a relative measure
i. based on a comparison of an individual's performance with that of others in the
standardization sample
ii. simplest might be comparing to the average e.g. mean = 5, so score of 9 is
“above mean”
iii. more useful if compared to average and known range e.g. M=5 and SD=2, so 9 is
“two SDs > mean”
3. Transformations are usually based on the normal distribution
a. Strictly, this entails that the raw scores were normally distributed
b. In some cases, non-normal distributions have been normalized

Normal Distribution
1. Animating principle in psychometrics
2. assumption that what is not normally distributed is difficult to study or not worth
studying, is not a 'natural' property of the species, is outwith the scientific arena
3. With human (cognitive) function, is often wrong
a. scores at the low end of the distribution
b. criterion performances

Z-score Transformations
1. First derive the Standard scores
a. z score = (raw score − M) / SD
b. normal distribution with a M=0 and SD=1.
c. some scores will have negative values
2. Then derive Standardized scores transformation based on the z-score
a. new Mean and SD are chosen and applied
b. multiply score by the new SD then add new mean.
c. negative scores will be avoided.

Familiar Standardized Score


1. Wechsler Indices and deviation IQs
a. M=100, SD=15
b. e.g., indices of WAIS, WISC, WMS, CMS etc.
2. Wechsler Scaled Scores
a. M=10, SD=3
b. e.g., subtests of WAIS, WISC, WMS, CMS etc.
3. T-scores
a. M=50, SD=10
b. e.g., indices of WASI and BSI

Other Standardized Scores


1. Stanine (standard nine)
a. M = 5, SD = 2, range from 1 to 9
b. subtests of many personality scales
2. Sten (standard ten)
a. M = 5.5, SD = 2, range from 1 to 10
b. subtests of Hayling & Brixton tests
3. Age-equivalents
a. yield the age at which the obtained score is the mean
b. subtests of WIAT

Percentile transformations
1. A percentage can be anywhere
2. A percentile is a place in a line (1-100)
3. Can be based on normal distribution or any non-normal distribution
a. Usually reported as cumulative percentile (non- normal distribution)
4. The percentile rank is the %age of people who would have obtained a lower score.
5. Percentile range captures upper and lower limits
a. e.g., scoring in the 10-24th %ile range implies that 75% of people have scored
higher, 9% of people score lower.
6. Cumulative percentage shows how many people had that score or lower.

Interpretative Approaches
a. External Comparison
i. Normative
1. Compare to the ‘normal population’
2. Norms may be stratified
3. Convert the raw score to a scalar
Problems:
1. Structure of norms and scalars offered differs
2. Difficult to compare differently stratified sets
3. Stratification may imply ‘difference’
ii. Criterion
1. Compare performance to an accepted standard
2. Easier “tests”: therefore provide less information
3. Usually define a “cut-off” point
Problems: Tests of basic language skills, motor skills, balance and apraxia, are
necessarily criterion tests
b. Internal Comparison
i. Profile Analysis
1. Analysis of “scatter” within the patient’s scores
2. Identify strengths and weaknesses
3. Match pattern to “known” profiles for diagnosis
Problems:
1. Assumes typically equivalent functioning across domains
2. But wide variability observed in typical performance
3. Some functions correlate poorly with each other
ii. Deficit Analysis
1. Compare present performance to ‘pre-morbid’
2. Richer interpretation: independent of ability level
Problems:
1. Requires estimate of pre-morbid ability
2. Difficult to gain a reliable estimate
3. Most techniques generate single “value”

Estimating Pre-morbid Ability


i. Over-learned Skills
ii. Best Performance
iii. Demographic-based
iv. Clinical Judgement
v. Recommended: use all available methods and then consider their relative merits in the
individual client, and similarity of estimate.
