
Essentials of a Good Psychological Test
Last updated:
25 Jul 2004
Reliability - overview
Types of reliability
How reliable should tests be?
Validity
Types of validity
Sources of invalidity
Generalizability
Standardization
Recommended Links
Reliability - overview
Reliability is the extent to which a test is repeatable and yields consistent scores.
Note: In order to be valid, a test must be reliable; but reliability does not guarantee validity.
All measurement procedures have the potential for error, so the aim is to minimize it. An
observed test score is made up of the true score plus measurement error.
The goal of estimating reliability (consistency) is to determine how much of the variability in
test scores is due to measurement error and how much is due to variability in true scores.
Measurement errors are essentially random: a person's test score might not reflect their true
score because they were sick, hungover, anxious, in a noisy room, etc.
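
To make the true-score-plus-error idea concrete, here is a minimal Python sketch (not from the original notes; all numbers are invented) that simulates observed scores as true scores plus random error, and recovers reliability as the proportion of observed-score variance due to true scores:

```python
# Minimal simulation of classical test theory: observed = true + error.
# All values are hypothetical, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
true_scores = rng.normal(loc=100, scale=15, size=n)  # stable individual differences
error = rng.normal(loc=0, scale=7, size=n)           # random measurement error
observed = true_scores + error                       # observed score = true + error

# Reliability = proportion of observed-score variance due to true scores
reliability = true_scores.var() / observed.var()
print(f"Estimated reliability: {reliability:.2f}")   # ~ 15**2 / (15**2 + 7**2) = 0.82
```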
Reliability can be improved by:
getting repeated measurements using the same test and
getting many different measures using slightly different techniques and methods.
- e.g. consider how university assessment of grades involves several sources. You would not
consider one multiple-choice exam question to be a reliable basis for testing your knowledge
of "individual differences". Many questions are asked in many different formats (e.g., exam,
essay, presentation) to help provide a more reliable score.
Types of reliability
There are a number of ways to estimate a test's reliability. I'll mention a few of them now:
1. Test-retest reliability
The test-retest method of estimating a test's reliability involves administering the test to the
same group of people at least twice. Then the first set of scores is correlated with the
second set of scores. Correlations range between 0 (low reliability) and 1 (high reliability)
(highly unlikely they will be negative!)
Remember that change might be due to measurement error, e.g. if you use a tape measure to
measure a room on two different days, any differences in the result are likely due to
measurement error rather than a change in the room size. However, if you measure children's
reading ability in February and then again in June, the change is likely due to changes in
children's reading ability. Also, the actual experience of taking the test can have an impact
(called reactivity): with a history quiz, for example, people might look up the answers
afterwards and do better next time, or might simply remember their original answers.
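
As a minimal sketch (with made-up scores), test-retest reliability is just the Pearson correlation between two administrations of the same test:

```python
# Test-retest reliability: correlate scores from two administrations
# of the same test to the same people. Scores are hypothetical.
import numpy as np

time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])  # first administration
time2 = np.array([14, 17, 26, 29, 24, 13, 27, 21])  # same people, retested later

r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r:.2f}")  # near 1 = highly consistent scores
```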
2. Alternate Forms
Administer Test A to a group and then administer Test B to the same group. The correlation
between the two sets of scores is the estimate of the test's reliability.
3. Split Half reliability
The correlation between scores on one half of the items and scores on the other half.
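Because each half is only half the length of the full test, the half-test correlation is usually stepped up with the Spearman-Brown formula. A minimal sketch with hypothetical item responses:

```python
# Split-half reliability with the Spearman-Brown correction.
# Item responses (1 = correct, 0 = incorrect) are hypothetical.
import numpy as np

items = np.array([  # rows = people, columns = items
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
])

odd = items[:, 0::2].sum(axis=1)      # score on odd-numbered items
even = items[:, 1::2].sum(axis=1)     # score on even-numbered items

r_half = np.corrcoef(odd, even)[0, 1]
r_full = (2 * r_half) / (1 + r_half)  # Spearman-Brown step-up
print(f"Half-test r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```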
4. Inter-rater Reliability
Compare scores given by different raters, e.g. for important work in higher education (e.g.
theses), there are multiple markers to help ensure accurate assessment by checking inter-
rater reliability.
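
One common chance-corrected agreement index for categorical ratings is Cohen's kappa. A minimal sketch with hypothetical pass/fail judgments from two markers:

```python
# Cohen's kappa: agreement between two raters, corrected for the
# agreement expected by chance. Ratings are hypothetical.
import numpy as np

rater1 = np.array(["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail"])
rater2 = np.array(["pass", "fail", "pass", "fail", "fail", "pass", "pass", "pass"])

p_obs = np.mean(rater1 == rater2)  # raw proportion of agreement
# chance agreement from each rater's marginal proportions
cats = np.unique(np.concatenate([rater1, rater2]))
p_exp = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in cats)

kappa = (p_obs - p_exp) / (1 - p_exp)
print(f"Agreement = {p_obs:.2f}, Cohen's kappa = {kappa:.2f}")
```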
5. Internal consistency
Internal consistency is commonly measured as Cronbach's alpha (based on inter-item
correlations), which ranges between 0 (low) and 1 (high). The greater the number of similar
items, the greater the internal consistency. That's why you sometimes get very long scales
asking a question in a myriad of different ways: if you add more items, you get a higher
Cronbach's alpha. Generally, an alpha of .80 is considered a reasonable benchmark.
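
A minimal sketch of Cronbach's alpha computed directly from its formula, alpha = k/(k-1) x (1 - sum of item variances / variance of total scores), using hypothetical Likert-type responses:

```python
# Cronbach's alpha from an item-response matrix (responses are hypothetical).
import numpy as np

items = np.array([  # rows = people, columns = items on the same scale
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
])

k = items.shape[1]
item_vars = items.var(axis=0, ddof=1)      # variance of each item
total_var = items.sum(axis=1).var(ddof=1)  # variance of people's total scores

alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")
```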
How reliable should tests be? Some reliability guidelines
.90 = high reliability
.80 = moderate reliability
.70 = low reliability
High reliability is required when:
tests are used to make important decisions
individuals are sorted into many different categories based upon relatively small
individual differences, e.g. intelligence.
(Note: Most standardized tests of intelligence report reliability estimates around .90, i.e. high.)
Lower reliability is acceptable when:
tests are used for preliminary rather than final decisions
tests are used to sort people into a small number of groups based on gross individual
differences, e.g. height or sociability/extraversion.
(Note: For most testing applications, reliability estimates around .70 are usually regarded as
low. Since the reliability coefficient estimates the proportion of true-score variance, a
reliability of .70 implies that about 30% of the variability in scores is measurement error.)
Reliability estimates of .80 or higher are typically regarded as moderate to high (approx. 20%
of the variability in test scores is attributable to error).
Reliability estimates below .60 are usually regarded as unacceptably low.
Levels of reliability typically reported for different types of tests and measurement devices are
reported in Table 7-6 of Murphy and Davidshofer (2001, p. 142).
Validity
Validity is the extent to which a test measures what it is supposed to measure.
Validity is a subjective judgment made on the basis of experience and empirical indicators.
Validity asks "Is the test measuring what you think it's measuring?"
For example, we might define "aggression" as an act intended to cause harm to another
person (a conceptual definition) but the operational definition might be seeing:
how many times a child hits a doll
how often a child pushes to the front of the queue
how many physical scraps he/she gets into in the playground.
Are these valid measures of aggression? i.e., how well does the operational definition match
the conceptual definition?
Remember: In order to be valid, a test must be reliable; but reliability does not guarantee
validity, i.e. it is possible to have a highly reliable test which is meaningless (invalid).
Note that where validity coefficients are calculated, they will range between 0 (low) and 1
(high).
Types of Validity
Face validity
Face validity is the least important aspect of validity, because validity still needs to be
directly checked through other methods. All that face validity means is:
"Does the measure, on the face it, seem to measure what is intended?"
Sometimes researchers try to obscure a measure's face validity - say, if it's measuring a
socially undesirable characteristic (such as modern racism). But the more practical point is to
be suspicious of any measures that purport to measure one thing, but seem to measure
something different, e.g. political polls - a politician's current popularity is not necessarily a
valid indicator of who is going to win an election.
Construct validity
Construct validity is the most important kind of validity.
If a measure has construct validity it measures what it purports to measure.
Establishing construct validity is a long and complex process.
The various qualities that contribute to construct validity include:
criterion validity (includes predictive and concurrent)
convergent validity
discriminant validity
To create a measure with construct validity, first define the domain of interest (i.e. what is
to be measured), then design measurement items which adequately sample that domain.
Then a scientific process of rigorously testing and modifying the measure is undertaken.
Note that in psychological testing there may be a bias towards selecting items which can be
objectively written down, etc., rather than other indicators of the domain of interest (i.e. a
source of invalidity).
Criterion validity
Criterion validity consists of concurrent and predictive validity.
Concurrent validity: "Does the measure relate to other manifestations of the construct
the device is supposed to be measuring?"
Predictive validity: "Does the test predict an individual's performance in specific
abilities?"
Convergent validity
It is important to know whether a test returns similar results to other tests which purport
to measure the same or related constructs.
Does the measure match with an external 'criterion', e.g. behaviour or another, well-
established, test? Does it measure it concurrently and can it predict this behaviour?
Observations of dominant behaviour (criterion) can be compared with self-report
dominance scores (measure)
Trained interviewer ratings (criterion) can be compared with self-report dominance
scores (measure)
Discriminant validity
Important to show that a measure doesn't measure what it isn't meant to measure - i.e. it
discriminates.
For example, discriminant validity would be evidenced by a low correlation between a
quantitative reasoning test and scores on a reading comprehension test, since reading ability
is an irrelevant variable in a test designed to measure quantitative reasoning.
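
A minimal sketch of both checks at once (all scores are hypothetical): the new test should correlate highly with an established test of the same construct (convergent) and weakly with a test of an irrelevant construct (discriminant):

```python
# Convergent and discriminant validity as a pair of correlations.
# All test scores are hypothetical.
import numpy as np

quant_new = np.array([55, 62, 70, 48, 66, 59, 74, 51])  # new quantitative test
quant_est = np.array([53, 65, 72, 45, 63, 61, 70, 50])  # established quantitative test
reading = np.array([62, 57, 59, 63, 58, 61, 60, 56])    # reading comprehension test

convergent = np.corrcoef(quant_new, quant_est)[0, 1]    # should be high
discriminant = np.corrcoef(quant_new, reading)[0, 1]    # should be low
print(f"Convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```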
Sources of Invalidity
Unreliability
Response sets = a psychological orientation or bias towards answering in a particular
way:
Acquiescence: the tendency to agree, i.e. to say "yes". Hence the use of half
negatively and half positively worded items (but there can be semantic
difficulties with negative wording).
Social desirability: the tendency to portray oneself in a positive light. Try to design
questions so that social desirability isn't salient.
Faking bad: purposely saying "no" or looking bad if there's a 'reward' (e.g.
attention, compensation, social welfare, etc.).
Bias
Cultural bias: does the psychological construct have the same meaning from one
culture to another; how are the different items interpreted by people from
different cultures; actual content (face) validity may be different for different
cultures.
Gender bias may also be possible.
Test Bias
Bias in measurement occurs when the test makes systematic errors in
measuring a particular characteristic or attribute, e.g. many say that most
IQ tests may well be valid for middle-class whites but not for blacks or
other minorities. In interviews, which are a type of test, research shows
that there is a bias in favour of good-looking applicants.
Bias in prediction occurs when the test makes systematic errors in
predicting some outcome (or criterion). It is often suggested that tests used
in academic admissions and in personnel selection under-predict the
performance of minority applicants. Also, a test may be useful for predicting
the performance of one group, e.g. males, but be less accurate in predicting
the performance of females.
Generalizability
Just a brief word on generalizability. Reliability and validity are often discussed separately but
sometimes you will see them both referred to as aspects of generalizability. Often we want to
know whether the results of a measure or a test used with a particular group can be
generalized to other tests or other groups.
So, is the result you get with one test, let's say the WISC-III, equivalent to the result you
would get using the Stanford-Binet? Do both these tests give a similar IQ score? And do the
results you get from the people you assessed apply to other kinds of people? Are the results
generalizable?
So a test may be reliable and it may be valid but its results may not be generalizable to other
tests measuring the same construct nor to populations other than the one sampled.
Let me give you an example. If I measured the levels of aggression of a very large random
sample of children in primary schools in the ACT, I may use a scale which is perfectly reliable
and a perfectly valid measure of aggression. But would my results be exactly the same had I
used another equally valid and reliable measure of aggression? Probably not, as it's difficult to
get a perfect measure of a construct like aggression.
Furthermore, could I then generalize my findings to ALL children in the world, or even in
Australia? No. The demographics of the ACT are quite different from those in Australia and
my sample is only truly representative of the population of primary school children in the
ACT. Could I generalize my findings of levels of aggression for all 5-18 year olds in the ACT?
No, because I've only measured primary school children, and their levels of aggression are
not necessarily similar to the levels of aggression shown by adolescents.
Standardization
Standardization: Standardized tests are:
administered under uniform conditions. i.e. no matter where, when, by whom or to
whom it is given, the test is administered in a similar way.
scored objectively, i.e. the procedures for scoring the test are specified in detail so that
any number of trained scorers will arrive at the same score for the same set of
responses. So, for example, questions that need subjective evaluation (e.g. essay
questions) are generally not included in standardized tests.
designed to measure relative performance, i.e. they are not designed to measure
ABSOLUTE ability on a task. In order to measure relative performance, standardized
tests are interpreted with reference to a comparable group of people: the
standardization, or normative, sample. e.g. the highest possible grade in a test is 100. A
child scores 60 on a standardized achievement test. You may feel that the child has not
demonstrated mastery of the material covered in the test (absolute ability), BUT if the
average of the standardization sample was 55, the child has done quite well (RELATIVE
performance) - see the sketch below.
The normative sample should (for hopefully obvious reasons!) be representative of the target
population - however this is not always the case, thus norms and the structure of the test
would need to be interpreted with appropriate caution.
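
A minimal sketch of this relative interpretation, using the example figures above (score 60, normative mean 55) plus an assumed normative standard deviation of 10:

```python
# Interpreting a score relative to a normative sample via a z-score and
# (assuming normally distributed scores) a percentile. The SD of 10 is
# an assumption for illustration.
import math

child_score = 60
norm_mean = 55   # mean of the standardization (normative) sample
norm_sd = 10     # assumed standard deviation of the normative sample

z = (child_score - norm_mean) / norm_sd
percentile = 0.5 * (1 + math.erf(z / math.sqrt(2))) * 100  # standard normal CDF
print(f"z = {z:.2f}, roughly the {percentile:.0f}th percentile")  # z = 0.50, ~69th
```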
Recommended Links
What are the essentials of a good psychological testing report on an older adult?
(American Psychological Association)
Factors influencing internal and external validity (Campbell & Stanley, 1963)
How to choose tools, instruments, & questionnaires for intervention research &
evaluation (James Neill, 2004)
