Chapter6_Validity and reliability
There are many different types of validity and reliability. Threats to validity and reliability can never be erased completely; rather the effects of these threats can be attenuated by attention to validity and reliability throughout a piece of research.

This chapter discusses validity and reliability in quantitative and qualitative, naturalistic research. It suggests that both of these terms can be applied to these two types of research, though how validity and reliability are addressed in these two approaches varies. Finally, validity and reliability are addressed, using different instruments for data collection. It is suggested that reliability is a necessary but insufficient condition for validity in research; reliability is a necessary precondition of validity, and validity may be a sufficient but not necessary condition for reliability. Brock-Utne (1996: 612) contends that the widely held view that reliability is the sole preserve of quantitative research has to be exploded, and this chapter demonstrates the significance of her view.

Defining validity

Validity is an important key to effective research. If a piece of research is invalid then it is worthless. Validity is thus a requirement for both quantitative and qualitative/naturalistic research (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.1. ppt).

While earlier versions of validity were based on the view that it was essentially a demonstration that a particular instrument in fact measures what it purports to measure, more recently validity has taken many forms. For example, in qualitative data validity might be addressed through the honesty, depth, richness and scope of the data achieved, the participants approached, the extent of triangulation and the disinterestedness or objectivity of the researcher (Winter 2000). In quantitative data validity might be improved through careful sampling, appropriate instrumentation and appropriate statistical treatments of the data. It is impossible for research to be 100 per cent valid; that is the optimism of perfection. Quantitative research possesses a measure of standard error which is inbuilt and which has to be acknowledged. In qualitative data the subjectivity of respondents, their opinions, attitudes and perspectives together contribute to a degree of bias. Validity, then, should be seen as a matter of degree rather than as an absolute state (Gronlund 1981). Hence at best we strive to minimize invalidity and maximize validity.

There are several different kinds of validity (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.2. ppt):

O content validity
O criterion-related validity
O construct validity
O internal validity
O external validity
O concurrent validity
O face validity
O jury validity
O predictive validity
O consequential validity
O systemic validity
O catalytic validity
O ecological validity
O cultural validity
O descriptive validity
O interpretive validity
O theoretical validity
O evaluative validity.
involvement and in-depth responses of individuals secure a sufficient level of validity and reliability. This claim is contested by Hammersley (1992: 144) and Silverman (1993: 153), who argue that these are insufficient grounds for validity and reliability, and that the individuals concerned have no privileged position on interpretation. (Of course, neither are actors ‘cultural dopes’ who need a sociologist or researcher to tell them what is ‘really’ happening!) Silverman (1993) argues that, while immediacy and authenticity make for interesting journalism, ethnography must have more rigorous notions of validity and reliability. This involves moving beyond selecting data simply to fit a preconceived or ideal conception of the phenomenon or because they are spectacularly interesting (Fielding and Fielding 1986). Data selected must be representative of the sample, the whole data set, the field, i.e. they must address content, construct and concurrent validity.

Hammersley (1992: 50–1) suggests that validity in qualitative research replaces certainty with confidence in our results, and that, as reality is independent of the claims made for it by researchers, our accounts will be only representations of that reality rather than reproductions of it.

Maxwell (1992) argues for five kinds of validity in qualitative methods that explore his notion of ‘understanding’:

O Descriptive validity (the factual accuracy of the account, that it is not made up, selective or distorted): in this respect validity subsumes reliability; it is akin to Blumenfeld-Jones’s (1995) notion of ‘truth’ in research – what actually happened (objectively factual).
O Interpretive validity (the ability of the research to catch the meaning, interpretations, terms, intentions that situations and events, i.e. data, have for the participants/subjects themselves, in their terms): it is akin to Blumenfeld-Jones’s (1995) notion of ‘fidelity’ – what it means to the researched person or group (subjectively meaningful); interpretive validity has no clear counterpart in experimental/positivist methodologies.
O Theoretical validity (the theoretical constructions that the researcher brings to the research, including those of the researched): theory here is regarded as explanation. Theoretical validity is the extent to which the research explains phenomena; in this respect it is akin to construct validity (discussed below); in theoretical validity the constructs are those of all the participants.
O Generalizability (the view that the theory generated may be useful in understanding other similar situations): generalizing here refers to generalizing within specific groups or communities, situations or circumstances validly and, beyond, to specific outsider communities, situations or circumstances (external validity); internal validity has greater significance here than external validity.
O Evaluative validity (the application of an evaluative, judgemental framework to that which is being researched, rather than a descriptive, explanatory or interpretive framework). Clearly this resonates with critical-theoretical perspectives, in that the researcher’s own evaluative agenda might intrude.

Both qualitative and quantitative methods can address internal and external validity.

Internal validity

Internal validity seeks to demonstrate that the explanation of a particular event, issue or set of data which a piece of research provides can actually be sustained by the data. In some degree this concerns accuracy, which can be applied to quantitative and qualitative research. The findings must describe accurately the phenomena being researched.

In ethnographic research internal validity can be addressed in several ways (LeCompte and Preissle 1993: 338):

O using low-inference descriptors
O using multiple researchers
O using participant researchers
O using peer examination of data
O using mechanical means to record, store and retrieve data.

In ethnographic, qualitative research there are several overriding kinds of internal validity (LeCompte and Preissle 1993: 323–4):

O confidence in the data
O the authenticity of the data (the ability of the research to report a situation through the eyes of the participants)
O the cogency of the data
O the soundness of the research design
O the credibility of the data
O the auditability of the data
O the dependability of the data
O the confirmability of the data.

LeCompte and Preissle (1993) provide greater detail on the issue of authenticity, arguing for the following:

O Fairness: there should be a complete and balanced representation of the multiple realities in, and constructions of, a situation.
O Ontological authenticity: the research should provide a fresh and more sophisticated understanding of a situation, e.g. making the familiar strange, a significant feature in reducing ‘cultural blindness’ in a researcher, a problem which might be encountered in moving from being a participant to being an observer (Brock-Utne 1996: 610).
O Educative authenticity: the research should generate a new appreciation of these understandings.
O Catalytic authenticity: the research gives rise to specific courses of action.
O Tactical authenticity: the research should bring benefit to all involved – the ethical issue of ‘beneficence’.

Hammersley (1992: 71) suggests that internal validity for qualitative data requires attention to

O plausibility and credibility
O the kinds and amounts of evidence required (such that the greater the claim that is being made, the more convincing the evidence has to be for that claim)
O clarity on the kinds of claim made from the research (e.g. definitional, descriptive, explanatory, theory generative).

Lincoln and Guba (1985: 219, 301) suggest that credibility in naturalistic inquiry can be addressed by

O Prolonged engagement in the field.
O Persistent observation: in order to establish the relevance of the characteristics for the focus.
O Triangulation: of methods, sources, investigators and theories.
O Peer debriefing: exposing oneself to a disinterested peer in a manner akin to cross-examination, in order to test honesty, working hypotheses and to identify the next steps in the research.
O Negative case analysis: in order to establish a theory that fits every case, revising hypotheses retrospectively.
O Member checking: respondent validation, to assess intentionality, to correct factual errors, to offer respondents the opportunity to add further information or to put information on record; to provide summaries and to check the adequacy of the analysis.

Whereas in positivist research history and maturation are viewed as threats to the validity of the research, ethnographic research simply assumes that these will happen; ethnographic research allows for change over time – it builds it in. Internal validity in ethnographic research is also addressed by the reduction of observer effects, by having the observers both sample widely and stay in the situation for such a long time that their presence is taken for granted. Further, by tracking and storing information clearly, it is possible for the ethnographer to eliminate rival explanations of events and situations.

External validity

External validity refers to the degree to which the results can be generalized to the wider population, cases or situations. The issue of
generalization is problematical. For positivist researchers generalizability is a sine qua non, while this is attenuated in naturalistic research. For one school of thought, generalizability through stripping out contextual variables is fundamental, while, for another, generalizations that say little about the context have little that is useful to say about human behaviour (Schofield 1990). For positivists variables have to be isolated and controlled, and samples randomized, while for ethnographers human behaviour is infinitely complex, irreducible, socially situated and unique.

Generalizability in naturalistic research is interpreted as comparability and transferability (Lincoln and Guba 1985; Eisenhart and Howe 1992: 647). These writers suggest that it is possible to assess the typicality of a situation – the participants and settings, to identify possible comparison groups, and to indicate how data might translate into different settings and cultures (see also LeCompte and Preissle 1993: 348). Schofield (1990: 200) suggests that it is important in qualitative research to provide a clear, detailed and in-depth description so that others can decide the extent to which findings from one piece of research are generalizable to another situation, i.e. to address the twin issues of comparability and translatability. Indeed, qualitative research can be generalizable (Schofield 1990: 209), by studying the typical (for its applicability to other situations – the issue of transferability: LeCompte and Preissle 1993: 324) and by performing multi-site studies (e.g. Miles and Huberman 1984), though it could be argued that this is injecting a degree of positivism into non-positivist research. Lincoln and Guba (1985: 316) caution the naturalistic researcher against this; they argue that it is not the researcher’s task to provide an index of transferability; rather, they suggest, researchers should provide sufficiently rich data for the readers and users of research to determine whether transferability is possible. In this respect transferability requires thick description.

Bogdan and Biklen (1992: 45) argue that generalizability, construed differently from its usage in positivist methodologies, can be addressed in qualitative research. Positivist researchers, they argue, are more concerned to derive universal statements of general social processes rather than to provide accounts of the degree of commonality between various social settings (e.g. schools and classrooms). Bogdan and Biklen (1992) are interested not in the issue of whether their findings are generalizable in the widest sense but in the question of the settings, people and situations to which they might be generalizable.

In naturalistic research threats to external validity include (Lincoln and Guba 1985: 189, 300):

O selection effects: where constructs selected in fact are only relevant to a certain group
O setting effects: where the results are largely a function of their context
O history effects: where the situations have been arrived at by unique circumstances and, therefore, are not comparable
O construct effects: where the constructs being used are peculiar to a certain group.

Content validity

To demonstrate this form of validity the instrument must show that it fairly and comprehensively covers the domain or items that it purports to cover. It is unlikely that each issue can be addressed in its entirety simply because of the time available or respondents’ motivation to complete, for example, a long questionnaire. If this is the case, then the researcher must ensure both that the elements of the main issue to be covered in the research are a fair representation of the wider issue under investigation (and its weighting) and that the elements chosen for the research sample are themselves addressed in depth and breadth. Careful sampling of items is required to ensure their representativeness. For example, if the researcher wished to see how well a group of students could spell 1,000 words in French but decided to have a sample of only 50 words for the spelling test, then that test would have to ensure that it represented the range of spellings in the 1,000 words – maybe by ensuring that the spelling rules had all been
included or that possible spelling errors had been covered in the test in the proportions in which they occurred in the 1,000 words.

Construct validity

A construct is an abstract; this separates it from the previous types of validity which dealt in actualities – defined content. In this type of validity agreement is sought on the ‘operationalized’ forms of a construct, clarifying what we mean when we use this construct. Hence in this form of validity the articulation of the construct is important; is the researcher’s understanding of this construct similar to that which is generally accepted to be the construct? For example, let us say that the researcher wished to assess a child’s intelligence (assuming, for the sake of this example, that it is a unitary quality). The researcher could say that he or she construed intelligence to be demonstrated in the ability to sharpen a pencil. How acceptable a construction of intelligence is this? Is not intelligence something else (e.g. that which is demonstrated by a high result in an intelligence test)?

To establish construct validity the researcher would need to be assured that his or her construction of a particular issue agreed with other constructions of the same underlying issue, e.g. intelligence, creativity, anxiety, motivation. This can be achieved through correlations with other measures of the issue or by rooting the researcher’s construction in a wide literature search which teases out the meaning of a particular construct (i.e. a theory of what that construct is) and its constituent elements. Demonstrating construct validity means not only confirming the construction with that given in relevant literature, but also looking for counter-examples which might falsify the researcher’s construction. When the confirming and refuting evidence is balanced, the researcher is in a position to demonstrate construct validity, and can stipulate what he or she takes this construct to be. In the case of conflicting interpretations of a construct, the researcher might have to acknowledge that conflict and then stipulate the interpretation that will be used.

In qualitative/ethnographic research construct validity must demonstrate that the categories that the researchers are using are meaningful to the participants themselves (Eisenhart and Howe 1992: 648), i.e. that they reflect the way in which the participants actually experience and construe the situations in the research, that they see the situation through the actors’ eyes.

Campbell and Fiske (1959), Brock-Utne (1996) and Cooper and Schindler (2001) suggest that construct validity is addressed by convergent and discriminant techniques. Convergent techniques imply that different methods for researching the same construct should give a relatively high inter-correlation, while discriminant techniques suggest that using similar methods for researching different constructs should yield relatively low inter-correlations, i.e. that the construct in question is different from other potentially similar constructs. Such discriminant validity can also be yielded by factor analysis, which clusters together similar issues and separates them from others.

Ecological validity

In quantitative, positivist research variables are frequently isolated, controlled and manipulated in contrived settings. For qualitative, naturalistic research a fundamental premise is that the researcher deliberately does not try to manipulate variables or conditions, that the situations in the research occur naturally. The intention here is to give accurate portrayals of the realities of social situations in their own terms, in their natural or conventional settings. In education, ecological validity is particularly important and useful in charting how policies are actually happening ‘at the chalk face’ (Brock-Utne 1996: 617). For ecological validity to be demonstrated it is important to include and address in the research as many characteristics in, and factors of, a given situation as possible. The difficulty for this is that the more characteristics are included and described, the more difficult it
is to abide by central ethical tenets of much research – non-traceability, anonymity and non-identifiability.

Cultural validity

A type of validity related to ecological validity is cultural validity (Morgan 1999). This is particularly an issue in cross-cultural, intercultural and comparative kinds of research, where the intention is to shape research so that it is appropriate to the culture of the researched, and where the researcher and the researched are members of different cultures. Cultural validity is defined as ‘the degree to which a study is appropriate to the cultural setting where research is to be carried out’ (Joy 2003: 1). Cultural validity, Morgan (1999) suggests, applies at all stages of the research, and affects its planning, implementation and dissemination. It involves a degree of sensitivity to the participants, cultures and circumstances being studied. Morgan (2005) writes that

cultural validity entails an appreciation of the cultural values of those being researched. This could include: understanding possibly different target culture attitudes to research; identifying and understanding salient terms as used in the target culture; reviewing appropriate target language literature; choosing research instruments that are acceptable to the target participants; checking interpretations and translations of data with native speakers; and being aware of one’s own cultural filters as a researcher.
(Morgan 2005: 1)

Joy (2003: 1) presents twelve important questions that researchers in different cultural contexts may face, to ensure that research is culture-fair and culturally sensitive:

O Is the research question understandable and of importance to the target group?
O Is the researcher the appropriate person to conduct the research?
O Are the sources of the theories that the research is based on appropriate for the target culture?
O How do researchers in the target culture deal with the issues related to the research question (including their method and findings)?
O Are appropriate gatekeepers and informants chosen?
O Are the research design and research instruments ethical and appropriate according to the standards of the target culture?
O How do members of the target culture define the salient terms of the research?
O Are documents and other information translated in a culturally appropriate way?
O Are the possible results of the research of potential value and benefit to the target culture?
O Does interpretation of the results include the opinions and views of members of the target culture?
O Are the results made available to members of the target culture for review and comment?
O Does the researcher accurately and fairly communicate the results in their cultural context to people who are not members of the target culture?

Catalytic validity

Catalytic validity embraces the paradigm of critical theory discussed in Chapter 1. Put neutrally, catalytic validity simply strives to ensure that research leads to action. However, the story does not end there, for discussions of catalytic validity are substantive; like critical theory, catalytic validity suggests an agenda. Lather (1986, 1991) and Kincheloe and McLaren (1994) suggest that the agenda for catalytic validity is to help participants to understand their worlds in order to transform them. The agenda is explicitly political, for catalytic validity suggests the need to expose whose definitions of the situation are operating in the situation. Lincoln and Guba (1986) suggest that the criterion of ‘fairness’ should be applied to research, meaning that it should not only augment and improve the participants’ experience of the world, but also improve the empowerment of the participants. In this respect the research might focus on what might be (the
leading edge of innovations and future trends) and what could be (the ideal, possible futures) (Schofield 1990: 209).

Catalytic validity – a major feature in feminist research which, Usher (1996) suggests, needs to permeate all research – requires solidarity in the participants, an ability of the research to promote emancipation, autonomy and freedom within a just, egalitarian and democratic society (Masschelein 1991), to reveal the distortions, ideological deformations and limitations that reside in research, communication and social structures (see also LeCompte and Preissle 1993).

Validity, it is argued (Mishler 1990; Scheurich 1996), is no longer an ahistorical given, but contestable, suggesting that the definitions of valid research reside in the academic communities of the powerful. Lather (1986) calls for research to be emancipatory and to empower those who are being researched, suggesting that catalytic validity, akin to Freire’s (1970) notion of ‘conscientization’, should empower participants to understand and transform their oppressed situation.

Validity, it is proposed (Scheurich 1996), is but a mask that in fact polices and sets boundaries to what is considered to be acceptable research by powerful research communities; discourses of validity in reality are discourses of power to define worthwhile knowledge.

How defensible it is to suggest that researchers should have such ideological intents is, perhaps, a moot point, though not to address this area is to perpetuate inequality by omission and neglect. Catalytic validity reasserts the centrality of ethics in the research process, for it requires researchers to interrogate their allegiances, responsibilities and self-interestedness (Burgess 1989).

Consequential validity

Partially related to catalytic validity is consequential validity, which argues that the ways in which research data are used (the consequences of the research) are in keeping with the capability or intentions of the research, i.e. the consequences of the research do not exceed the capability of the research and the action-related consequences of the research are both legitimate and fulfilled. Clearly, once the research is in the public domain, the researcher has little or no control over the way in which it is used. However, and this is often a political matter, research should not be used in ways in which it was not intended to be used, for example by exceeding the capability of the research data to make claims, by acting on the research in ways that the research does not support (e.g. by using the research for illegitimate epistemic support), by making illegitimate claims by using the research in unacceptable ways (e.g. by selection, distortion) and by not acting on the research in ways that were agreed, i.e. errors of omission and commission.

A clear example of consequential validity is formative assessment. This is concerned with the extent to which students improve as a result of feedback given; hence, if there is insufficient feedback for students to improve, or if students are unable to improve as a result of – a consequence of – the feedback, then the formative assessment has little consequential validity.

Criterion-related validity

This form of validity endeavours to relate the results of one particular instrument to another external criterion. Within this type of validity there are two principal forms: predictive validity and concurrent validity.

Predictive validity is achieved if the data acquired at the first round of research correlate highly with data acquired at a future date. For example, if the results of examinations taken by 16-year-olds correlate highly with the examination results gained by the same students when aged 18, then we might wish to say that the first examination demonstrated strong predictive validity.

A variation on this theme is encountered in the notion of concurrent validity. To demonstrate this form of validity the data gathered from using one instrument must correlate highly with data gathered from using another instrument. For example, suppose it was decided to research a
student’s problem-solving ability. The researcher might observe the student working on a problem, or might talk to the student about how she is tackling the problem, or might ask the student to write down how she tackled the problem. Here the researcher has three different data-collecting instruments – observation, interview and documentation respectively. If the results all agreed – concurred – that, according to given criteria for problem-solving ability, the student demonstrated a good ability to solve a problem, then the researcher would be able to say with greater confidence (validity) that the student was good at problem-solving than if the researcher had arrived at that judgement simply from using one instrument.

Concurrent validity is very similar to its partner – predictive validity – in its core concept (i.e. agreement with a second measure); what differentiates concurrent and predictive validity is the absence of a time element in the former; concurrence can be demonstrated simultaneously with another instrument.

An important partner to concurrent validity, which is also a bridge into later discussions of reliability, is triangulation.

Triangulation

Triangulation may be defined as the use of two or more methods of data collection in the study of some aspect of human behaviour. The use of multiple methods, or the multi-method approach as it is sometimes called, contrasts with the ubiquitous but generally more vulnerable single-method approach that characterizes so much of research in the social sciences. In its original and literal sense, triangulation is a technique of physical measurement: maritime navigators, military strategists and surveyors, for example, use (or used to use) several locational markers in their endeavours to pinpoint a single spot or objective. By analogy, triangular techniques in the social sciences attempt to map out, or explain more fully, the richness and complexity of human behaviour by studying it from more than one standpoint and, in so doing, by making use of both quantitative and qualitative data. Triangulation is a powerful way of demonstrating concurrent validity, particularly in qualitative research (Campbell and Fiske 1959).

The advantages of the multi-method approach in social research are manifold and we examine two of them. First, whereas the single observation in fields such as medicine, chemistry and physics normally yields sufficient and unambiguous information on selected phenomena, it provides only a limited view of the complexity of human behaviour and of situations in which human beings interact. It has been observed that as research methods act as filters through which the environment is selectively experienced, they are never atheoretical or neutral in representing the world of experience (Smith 1975). Exclusive reliance on one method, therefore, may bias or distort the researcher’s picture of the particular slice of reality being investigated. The researcher needs to be confident that the data generated are not simply artefacts of one specific method of collection (Lin 1976). Such confidence can be achieved, as far as nomothetic research is concerned, when different methods of data collection yield substantially the same results. (Where triangulation is used in interpretive research to investigate different actors’ viewpoints, the same method, e.g. accounts, will naturally produce different sets of data.)

Further, the more the methods contrast with each other, the greater the researcher’s confidence. If, for example, the outcomes of a questionnaire survey correspond to those of an observational study of the same phenomena, then the researcher will be more confident about the findings. Or, more extreme, where the results of a rigorous experimental investigation are replicated in, say, a role-playing exercise, the researcher will experience even greater assurance. If findings are artefacts of method, then the use of contrasting methods considerably reduces the chances of any consistent findings being attributable to similarities of method (Lin 1976).

We come now to a second advantage: some theorists have been sharply critical of the limited use to which existing methods of inquiry in the
social sciences have been put. One writer, for example, comments:

Much research has employed particular methods or techniques out of methodological parochialism or ethnocentrism. Methodologists often push particular pet methods either because those are the only ones they have familiarity with, or because they believe their method is superior to all others.
(Smith 1975)

The use of triangular techniques, it is argued, will help to overcome the problem of ‘method-boundedness’, as it has been termed; indeed Gorard and Taylor (2004) demonstrate the value of combining qualitative and quantitative methods.

In its use of multiple methods, triangulation may utilize either normative or interpretive techniques; or it may draw on methods from both these approaches and use them in combination.

Returning to naturalistic inquiry, Lincoln and Guba (1985: 315) suggest that triangulation is intended as a check on data, while member checking, and elements of credibility, are to be used as a check on members’ constructions of data.

Types of triangulation and their characteristics

We have just seen how triangulation is characterized by a multi-method approach to a problem in contrast to a single-method approach. Denzin (1970b) has, however, extended this view of triangulation to take in several other types as well as the multi-method kind which he

the same country or within the same subculture by making use of cross-cultural techniques.
O Combined levels of triangulation: this type uses more than one level of analysis from the three principal levels used in the social sciences, namely, the individual level, the interactive level (groups), and the level of collectivities (organizational, cultural or societal).
O Theoretical triangulation: this type draws upon alternative or competing theories in preference to utilizing one viewpoint only.
O Investigator triangulation: this type engages more than one observer; data are discovered independently by more than one observer (Silverman 1993: 99).
O Methodological triangulation: this type uses either the same method on different occasions, or different methods on the same object of study.

Many studies in the social sciences are conducted at one point only in time, thereby ignoring the effects of social change and process. Time triangulation goes some way to rectifying these omissions by making use of cross-sectional and longitudinal approaches. Cross-sectional studies collect data at one point in time; longitudinal studies collect data from the same group at different points in the time sequence. The use of panel studies and trend studies may also be mentioned in this connection. The former compare the same measurements for the same individuals in a sample at several different points in time; and the latter examine selected processes continually over time. The
terms ‘methodological triangulation’: weaknesses of each of these methods can be
strengthened by using a combined approach to
O Time triangulation: this type attempts to take a given problem.
into consideration the factors of change Space triangulation attempts to overcome the
and process by utilizing cross-sectional and limitations of studies conducted within one
longitudinal designs. Kirk and Miller (1986) culture or subculture. As Smith (1975) says,
suggest that diachronic reliability seeks stability ‘Not only are the behavioural sciences culture-
of observations over time, while synchronic bound, they are sub-culture-bound. Yet many
reliability seeks similarity of data gathered in such scholarly works are written as if basic
the same time. principles have been discovered which would
O Space triangulation: this type attempts to over- hold true as tendencies in any society, anywhere,
come the parochialism of studies conducted in anytime’. Cross-cultural studies may involve the
testing of theories among different people, as in Piagetian and Freudian psychology; or they may measure differences between populations by using several different measuring instruments. We have addressed cultural validity earlier.

Social scientists are concerned in their research with the individual, the group and society. These reflect the three levels of analysis adopted by researchers in their work. Those who are critical of much present-day research argue that some of it uses the wrong level of analysis, individual when it should be societal, for instance, or limits itself to one level only when a more meaningful picture would emerge by using more than one level. Smith (1975) extends this analysis and identifies seven possible levels: the aggregative or individual level, and six levels that are more global in that ‘they characterize the collective as a whole, and do not derive from an accumulation of individual characteristics’ (Smith 1975). The six levels include:

• group analysis: the interaction patterns of individuals and groups
• organizational units of analysis: units which have qualities not possessed by the individuals making them up
• institutional analysis: relationships within and across the legal, political, economic and familial institutions of society
• ecological analysis: concerned with spatial explanation
• cultural analysis: concerned with the norms, values, practices, traditions and ideologies of a culture
• societal analysis: concerned with gross factors such as urbanization, industrialization, education, wealth, etc.

Where possible, studies combining several levels of analysis are to be preferred.

Researchers are sometimes taken to task for their rigid adherence to one particular theory or theoretical orientation to the exclusion of competing theories. Indeed Smith (1975) recommends the use of research to test competing theories.

Investigator triangulation refers to the use of more than one observer (or participant) in a research setting. Observers and participants working on their own each have their own observational styles and this is reflected in the resulting data. The careful use of two or more observers or participants independently, therefore, can lead to more valid and reliable data (Smith 1975), with checks on divergences between researchers leading to minimal divergence, i.e. reliability. In this respect the notion of triangulation bridges issues of reliability and validity.

We have already considered methodological triangulation earlier. Denzin (1970b) identifies two categories in his typology: ‘within methods’ triangulation and ‘between methods’ triangulation. Triangulation within methods concerns the replication of a study as a check on reliability and theory confirmation. Triangulation between methods involves the use of more than one method in the pursuit of a given objective. As a check on validity, the between methods approach embraces the notion of convergence between independent measures of the same objective (Campbell and Fiske 1959).

Of the six categories of triangulation in Denzin’s typology, four are frequently used in education. These are: time triangulation with its longitudinal and cross-sectional studies; space triangulation as on the occasions when a number of schools in an area or across the country are investigated in some way; investigator triangulation as when two observers independently rate the same classroom phenomena; and methodological triangulation. Of these four, methodological triangulation is the one used most frequently and the one that possibly has the most to offer.

Triangular techniques are suitable when a more holistic view of educational outcomes is sought (e.g. Mortimore et al.’s (1988) search for school effectiveness), or where a complex phenomenon requires elucidation. Triangulation is useful when an established approach yields a limited and frequently distorted picture. Finally, triangulation can be a useful technique where a researcher is engaged in a case study, a particular example of complex phenomena (Adelman et al. 1980).

Triangulation is not without its critics. For example, Silverman (1985) suggests that the very notion of triangulation is positivistic, and that
this is exposed most clearly in data triangulation, as it is presumed that a multiple data source (concurrent validity) is superior to a single data source or instrument. The assumption that a single unit can always be measured more than once violates the interactionist principles of emergence, fluidity, uniqueness and specificity (Denzin 1997: 320). Further, Patton (1980) suggests that even having multiple data sources, particularly of qualitative data, does not ensure consistency or replication. Fielding and Fielding (1986) hold that methodological triangulation does not necessarily increase validity, reduce bias or bring objectivity to research.

With regard to investigator triangulation, Lincoln and Guba (1985: 307) contend that it is erroneous to assume that one investigator will corroborate another, nor is this defensible, particularly in qualitative, reflexive inquiry. They extend their concern to include theory and methodological triangulation, arguing that the search for theory and methodological triangulation is epistemologically incoherent and empirically empty (see also Patton 1980). No two theories, it is argued, will ever yield a sufficiently complete explanation of the phenomenon being researched. These criticisms are trenchant, but they have been answered equally trenchantly by Denzin (1997).

Ensuring validity

It is very easy to slip into invalidity; it is both insidious and pernicious as it can enter at every stage of a piece of research. The attempt to build out invalidity is essential if the researcher is to be able to have confidence in the elements of the research plan, data acquisition, data processing analysis, interpretation and its ensuing judgement (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.3.ppt).

At the design stage, threats to validity can be minimized by:

• choosing an appropriate time scale
• ensuring that there are adequate resources for the required research to be undertaken
• selecting an appropriate methodology for answering the research questions
• selecting appropriate instrumentation for gathering the type of data required
• using an appropriate sample (e.g. one which is representative, not too small or too large)
• demonstrating internal, external, content, concurrent and construct validity and ‘operationalizing’ the constructs fairly
• ensuring reliability in terms of stability (consistency, equivalence, split-half analysis of test material)
• selecting appropriate foci to answer the research questions
• devising and using appropriate instruments: for example, to catch accurate, representative, relevant and comprehensive data (King et al. 1987); ensuring that readability levels are appropriate; avoiding any ambiguity of instructions, terms and questions; using instruments that will catch the complexity of issues; avoiding leading questions; ensuring that the level of test is appropriate – e.g. neither too easy nor too difficult; avoiding test items with little discriminability; avoiding making the instruments too short or too long; avoiding too many or too few items for each issue
• avoiding a biased choice of researcher or research team (e.g. insiders or outsiders as researchers).

There are several areas where invalidity or bias might creep into the research at the stage of data gathering; these can be minimized by:

• reducing the Hawthorne effect (see the accompanying web site: http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.1.doc)
• minimizing reactivity effects: respondents behaving differently when subjected to scrutiny or being placed in new situations, for example the interview situation – we distort people’s lives in the way we go about studying them (Lave and Kvale 1995: 226)
• trying to avoid dropout rates among respondents
• taking steps to avoid non-return of questionnaires
• avoiding having too long or too short an
• avoiding unfair aggregation of data (particu-
• presenting the data without misrepresenting its message
• making claims which are sustainable by the data
• avoiding inaccurate or wrong reporting of data (i.e. technical errors or orthographic errors)
• ensuring that the research questions are answered; releasing research results neither too soon nor too late.

Having identified where invalidity lurks, the researcher can take steps to ensure that, as far as possible, invalidity has been minimized in all areas of the research.

Reliability in quantitative research

The meaning of reliability differs in quantitative and qualitative research (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.4.ppt). We explore these concepts separately in the next two sections. Reliability in quantitative research is essentially a synonym for dependability, consistency and replicability over time, over instruments and over groups of respondents. It is concerned with precision and accuracy; some features, e.g. height, can be measured precisely, while others, e.g. musical ability, cannot. For research to be reliable it must demonstrate that if it were to be carried out on a similar group of respondents in a similar context (however defined), then similar results would be found. There are three principal types of reliability: stability, equivalence and internal consistency (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.5.ppt).

Reliability as stability

In this form reliability is a measure of consistency over time and over similar samples. A reliable instrument for a piece of research will yield similar data from similar respondents over time. A leaking tap which each day leaks one litre is leaking reliably whereas a tap which leaks one litre some days and two litres on others is not. In the experimental and survey models of research this would mean that if a test and then a retest were undertaken within an appropriate time span, then similar results would be obtained. The researcher has to decide what an appropriate length of time is: too short a time and respondents may remember what they said or did in the first test situation; too long a time and there may be extraneous effects operating to distort the data (for example, maturation in students, outside influences on the students). A researcher seeking to demonstrate this type of reliability will have to choose an appropriate time scale between the test and retest. Correlation coefficients can be calculated for the reliability of pretests and post-tests, using formulae which are readily available in books on statistics and test construction.

In addition to stability over time, reliability as stability can also be stability over a similar sample. For example, we would assume that if we were to administer a test or a questionnaire simultaneously to two groups of students who were very closely matched on significant characteristics (e.g. age, gender, ability etc. – whatever characteristics are deemed to have a significant bearing on the responses), then similar results (on a test) or responses (to a questionnaire) would be obtained. Reliability on this form of the test/retest method can be examined either for the whole test (e.g. by calculating the Pearson correlation, or by using a t-test to compare the mean scores of the two groups) or for sections of the questionnaire (e.g. by using the Spearman or Pearson statistic as appropriate, or a t-test). The statistical significance of the correlation coefficient can be found and should reach the 0.05 level or better (i.e. p ≤ 0.05) if reliability is to be claimed. This form of reliability over a sample is particularly useful in piloting tests and questionnaires.

In using the test-retest method, care has to be taken to ensure (Cooper and Schindler 2001: 216) the following:

• The time period between the test and retest is not so long that situational factors may change.
• The time period between the test and retest is not so short that the participants will remember the first test.
• The participants may have become interested in the field and may have followed it up
themselves between the test and the retest times.

Reliability as equivalence

Within this type of reliability there are two main sorts. Reliability may be achieved first through using equivalent forms (also known as alternative forms) of a test or data-gathering instrument. If an equivalent form of the test or instrument is devised and yields similar results, then the instrument can be said to demonstrate this form of reliability. For example, the pretest and post-test in an experiment are predicated on this type of reliability, being alternate forms of instrument to measure the same issues. This type of reliability might also be demonstrated if the equivalent forms of a test or other instrument yield consistent results if applied simultaneously to matched samples (e.g. a control and experimental group or two random stratified samples in a survey). Here reliability can be measured through a t-test, through the demonstration of a high correlation coefficient and through the demonstration of similar means and standard deviations between two groups.

Second, reliability as equivalence may be achieved through inter-rater reliability. If more than one researcher is taking part in a piece of research then, human judgement being fallible, agreement between all researchers must be achieved through ensuring that each researcher enters data in the same way. This would be particularly pertinent to a team of researchers gathering structured observational or semi-structured interview data where each member of the team would have to agree on which data would be entered in which categories. For observational data, reliability is addressed in the training sessions for researchers where they work on video material to ensure parity in how they enter the data.

At a simple level one can calculate the inter-rater agreement as a percentage:

$$\frac{\text{Number of actual agreements}}{\text{Number of possible agreements}} \times 100$$

Robson (2002: 341) sets out a more sophisticated way of measuring inter-rater reliability in coded observational data, and his method can be used with other types of data.

Reliability as internal consistency

Whereas the test/retest method and the equivalent forms method of demonstrating reliability require the tests or instruments to be done twice, demonstrating internal consistency demands that the instrument or tests be run once only through the split-half method.

Let us imagine that a test is to be administered to a group of students. Here the test items are divided into two halves, ensuring that each half is matched in terms of item difficulty and content. Each half is marked separately. If the test is to demonstrate split-half reliability, then the marks obtained on each half should be correlated highly with the other. Any student’s marks on the one half should match his or her marks on the other half. This can be calculated using the Spearman-Brown formula:

$$\text{Reliability} = \frac{2r}{1 + r}$$

where r = the actual correlation between the halves of the instrument (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.6.ppt).

This calculation requires a correlation coefficient to be calculated, e.g. a Spearman rank order correlation or a Pearson product moment correlation.

Let us say that the correlation coefficient between the two halves is 0.85; in this case the Spearman-Brown formula for reliability is set out thus:

$$\text{Reliability} = \frac{2 \times 0.85}{1 + 0.85} = \frac{1.70}{1.85} = 0.919$$

Given that the maximum value of the coefficient is 1.00 we can see that the reliability of this instrument, calculated for the split-half form of reliability, is very high indeed.

This type of reliability assumes that the test administered can be split into two matched halves; many tests have a gradient of difficulty or different items of content in each half. If this is the case and, for example, the test contains twenty items, then the researcher, instead of splitting the test into two
by assigning items one to ten to one half and items eleven to twenty to the second half, may assign all the even numbered items to one group and all the odd numbered items to another. This would move towards the two halves being matched in terms of content and cumulative degrees of difficulty.

An alternative measure of reliability as internal consistency is the Cronbach alpha, frequently referred to as the alpha coefficient of reliability, or simply the alpha. The Cronbach alpha provides a coefficient of inter-item correlations, that is, the correlation of each item with the sum of all the other relevant items, and is useful for multi-item scales. This is a measure of the internal consistency among the items (not, for example, the people). We address the alpha coefficient and its calculation in Part Five.

Reliability, thus construed, makes several assumptions, for example that instrumentation, data and findings should be controllable, predictable, consistent and replicable. This presupposes a particular style of research, typically within the positivist paradigm. Cooper and Schindler (2001: 218) suggest that, in this paradigm, reliability can be improved by minimizing any external sources of variation: standardizing and controlling the conditions under which the data collection and measurement take place; training the researchers in order to ensure consistency (inter-rater reliability); widening the number of items on a particular topic; excluding extreme responses from the data analysis (e.g. outliers, which can be done with SPSS).

Reliability in qualitative research

While we discuss reliability in qualitative research here, the suitability of the term for qualitative research is contested (e.g. Winter 2000; Stenbacka 2001; Golafshani 2003). Lincoln and Guba (1985) prefer to replace ‘reliability’ with terms such as ‘credibility’, ‘neutrality’, ‘confirmability’, ‘dependability’, ‘consistency’, ‘applicability’, ‘trustworthiness’ and ‘transferability’, in particular the notion of ‘dependability’.

LeCompte and Preissle (1993: 332) suggest that the canons of reliability for quantitative research may be simply unworkable for qualitative research. Quantitative research assumes the possibility of replication; if the same methods are used with the same sample then the results should be the same. Typically quantitative methods require a degree of control and manipulation of phenomena. This distorts the natural occurrence of phenomena (see earlier: ecological validity). Indeed the premises of naturalistic studies include the uniqueness and idiosyncrasy of situations, such that the study cannot be replicated – that is their strength rather than their weakness.

On the other hand, this is not to say that qualitative research need not strive for replication in generating, refining, comparing and validating constructs (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.7.ppt). Indeed LeCompte and Preissle (1993: 334) argue that such replication might include repeating:

• the status position of the researcher
• the choice of informant/respondents
• the social situations and conditions
• the analytic constructs and premises that are used
• the methods of data collection and analysis.

Further, Denzin and Lincoln (1994) suggest that reliability as replicability in qualitative research can be addressed in several ways:

• stability of observations: whether the researcher would have made the same observations and interpretation of these if they had been observed at a different time or in a different place
• parallel forms: whether the researcher would have made the same observations and interpretations of what had been seen if he or she had paid attention to other phenomena during the observation
• inter-rater reliability: whether another observer with the same theoretical framework and observing the same phenomena would have interpreted them in the same way.

Clearly this is a contentious issue, for it is seeking to apply to qualitative research the canons of
reliability of quantitative research. Purists might argue against the legitimacy, relevance or need for this in qualitative studies.

In qualitative research reliability can be regarded as a fit between what researchers record as data and what actually occurs in the natural setting that is being researched, i.e. a degree of accuracy and comprehensiveness of coverage (Bogdan and Biklen 1992: 48). This is not to strive for uniformity; two researchers who are studying a single setting may come up with very different findings but both sets of findings might be reliable. Indeed Kvale (1996: 181) suggests that, in interviewing, there might be as many different interpretations of the qualitative data as there are researchers. A clear example of this is the study of the Nissan automobile factory in the United Kingdom, where Wickens (1987) found a ‘virtuous circle’ of work organization practices that demonstrated flexibility, teamwork and quality consciousness, whereas the same practices were investigated by Garrahan and Stewart (1992), who found a ‘vicious circle’ of exploitation, surveillance and control. Both versions of the same reality coexist because reality is multilayered. What is being argued for here is the notion of reliability through an eclectic use of instruments, researchers, perspectives and interpretations (echoing the comments earlier about triangulation) (see also Eisenhart and Howe 1992).

Brock-Utne (1996) argues that qualitative research, being holistic, strives to record the multiple interpretations of, intention in and meanings given to situations and events. Here the notion of reliability is construed as dependability (Lincoln and Guba 1985: 108–9; Anfara et al. 2002), recalling the earlier discussion on internal validity. For them, dependability involves member checks (respondent validation), debriefing by peers, triangulation, prolonged engagement in the field, persistent observations in the field, reflexive journals, negative case analysis, and independent audits (identifying acceptable processes of conducting the inquiry so that the results are consistent with the data). Audit trails enable the research to address the issue of confirmability of results, in terms of process and product (Golafshani 2003: 601). These are a safeguard against the charge levelled against qualitative researchers, namely that they respond only to the ‘loudest bangs or the brightest lights’.

Dependability raises the important issue of respondent validation (see also McCormick and James 1988). While dependability might suggest that researchers need to go back to respondents to check that their findings are dependable, researchers also need to be cautious in placing exclusive store on respondents, for, as Hammersley and Atkinson (1983) suggest, they are not in a privileged position to be sole commentators on their actions.

Bloor (1978) suggests three means by which respondent validation can be addressed:

• researchers attempt to predict what the participants’ classifications of situations will be
• researchers prepare hypothetical cases and then predict respondents’ likely responses to them
• researchers take back their research report to the respondents and record their reactions to that report.

The argument rehearses the paradigm wars discussed in the opening chapter: quantitative measures are criticized for combining sophistication and refinement of process with crudity of concept (Ruddock 1981) and for failing to distinguish between educational and statistical significance (Eisner 1985); qualitative methodologies, while possessing immediacy, flexibility, authenticity, richness and candour, are criticized for being impressionistic, biased, commonplace, insignificant, ungeneralizable, idiosyncratic, subjective and short-sighted (Ruddock 1981). This is an arid debate; rather the issue is one of fitness for purpose. For our purposes here we need to note that criteria of reliability in quantitative methodologies differ from those in qualitative methodologies. In qualitative methodologies reliability includes fidelity to real life, context- and situation-specificity, authenticity, comprehensiveness, detail, honesty, depth of response and meaningfulness to the respondents.
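
Looking back to the quantitative measures of reliability discussed earlier in this chapter, the arithmetic of inter-rater agreement as a percentage and of split-half reliability with the Spearman-Brown formula can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the example codes and item scores are invented for demonstration, the halves are formed by an odd/even split of items, and the correlation between halves is taken to be a Pearson product moment correlation (a Spearman rank order correlation could be substituted).

```python
from math import sqrt


def percent_agreement(codes_a, codes_b):
    """Inter-rater agreement: actual agreements / possible agreements x 100."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both raters must code the same, non-empty set of events")
    agreements = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100 * agreements / len(codes_a)


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r):
    """Spearman-Brown formula: reliability = 2r / (1 + r)."""
    return 2 * r / (1 + r)


def split_half_reliability(item_scores):
    """Split-half reliability using an odd/even split of the items.

    item_scores: one row per student, one column per test item.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))


# The worked example from the text: a correlation of 0.85 between the halves.
print(round(spearman_brown(0.85), 3))  # 0.919

# Invented data: two raters coding the same four observed events.
print(percent_agreement(["a", "b", "a", "c"], ["a", "b", "c", "c"]))  # 75.0
```

Percentage agreement takes no account of agreement that would occur by chance, which is one reason the text points to more sophisticated measures for coded observational data.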
Validity and reliability in interviews

In interviews, inferences about validity are made too often on the basis of face validity (Cannell and Kahn 1968), that is, whether the questions asked look as if they are measuring what they claim to measure. One cause of invalidity is bias, defined as ‘a systematic or persistent tendency to make errors in the same direction, that is, to overstate or understate the “true value” of an attribute’ (Lansing et al. 1961). One way of validating interview measures is to compare the interview measure with another measure that has already been shown to be valid. This kind of comparison is known as ‘convergent validity’. If the two measures agree, it can be assumed that the validity of the interview is comparable with the proven validity of the other measure.

Perhaps the most practical way of achieving greater validity is to minimize the amount of bias as much as possible. The sources of bias are the characteristics of the interviewer, the characteristics of the respondent, and the substantive content of the questions. More particularly, these will include:

• the attitudes, opinions and expectations of the interviewer
• a tendency for the interviewer to see the respondent in his or her own image
• a tendency for the interviewer to seek answers that support preconceived notions
• misperceptions on the part of the interviewer of what the respondent is saying
• misunderstandings on the part of the respondent of what is being asked.

Studies have also shown that race, religion, gender, sexual orientation, status, social class and age in certain contexts can be potent sources of bias, i.e. interviewer effects (Lee 1993; Scheurich 1995). Interviewers and interviewees alike bring their own, often unconscious, experiential and biographical baggage with them into the interview situation. Indeed Hitchcock and Hughes (1989) argue that because interviews are interpersonal, humans interacting with humans, it is inevitable that the researcher will have some influence on the interviewee and, thereby, on the data. Fielding and Fielding (1986: 12) make the telling comment that even the most sophisticated surveys only manipulate data that at some time had to be gained by asking people! Interviewer neutrality is a chimera (Denscombe 1995).

Lee (1993) indicates the problems of conducting interviews perhaps at their sharpest, where the researcher is researching sensitive subjects, i.e. research that might pose a significant threat to those involved (be they interviewers or interviewees). Here the interview might be seen as an intrusion into private worlds, or the interviewer might be regarded as someone who can impose sanctions on the interviewee, or as someone who can exploit the powerless; the interviewee is in the searchlight that is being held by the interviewer (see also Scheurich 1995). Indeed Gadd (2004) reports that an interviewee may reduce his or her willingness to ‘open up’ to an interviewer if the dynamics of the interview situation are too threatening, taking the role of the ‘defended subject’. The issues also embrace transference and counter-transference, which have their basis in psychoanalysis. In transference the interviewees project onto the interviewer their feelings, fears, desires, needs and attitudes that derive from their own experiences (Scheurich 1995). In counter-transference the process is reversed.

One way of controlling for reliability is to have a highly structured interview, with the same format and sequence of words and questions for each respondent (Silverman 1993), though Scheurich (1995: 241–9) suggests that this is to misread the infinite complexity and open-endedness of social interaction. Controlling the wording is no guarantee of controlling the interview. Oppenheim (1992: 147) argues that wording is a particularly important factor in attitudinal questions rather than factual questions. He suggests that changes in wording, context and emphasis undermine reliability, because it ceases to be the same question for each respondent. Indeed he argues that error and bias can stem from alterations to wording, procedure, sequence, recording and rapport, and that training for interviewers is essential to
minimize this. Silverman (1993) suggests that it is important for each interviewee to understand the question in the same way. He suggests that the reliability of interviews can be enhanced by: careful piloting of interview schedules; training of interviewers; inter-rater reliability in the coding of responses; and the extended use of closed questions.
On the other hand, Silverman (1993) argues for the importance of open-ended interviews, as this enables respondents to demonstrate their unique way of looking at the world – their definition of the situation. It recognizes that what is a suitable sequence of questions for one respondent might be less suitable for another, and open-ended questions enable important but unanticipated issues to be raised.
Oppenheim (1992: 96–7) suggests several causes of bias in interviewing:
O biased sampling (sometimes created by the researcher not adhering to sampling instructions)
O poor rapport between interviewer and interviewee
O changes to question wording (e.g. in attitudinal and factual questions)
O poor prompting and biased probing
O poor use and management of support materials (e.g. show cards)
O alterations to the sequence of questions
O inconsistent coding of responses
O selective or interpreted recording of data/transcripts
O poor handling of difficult interviews.
One can add to this the issue of ‘acquiescence’ (Breakwell 2000: 254), the tendency that respondents may have to say ‘yes’, regardless of the question or, indeed, regardless of what they really feel or think.
There is also the issue of leading questions. A leading question is one which makes assumptions about interviewees or ‘puts words into their mouths’, where the question influences the answer, perhaps illegitimately. For example (Morrison 1993: 66–7) the question ‘When did you stop complaining to the headteacher?’ assumes that the interviewee had been a frequent complainer, and the question ‘How satisfied are you with the new Mathematics scheme?’ assumes a degree of satisfaction with the scheme. The leading questions here might be rendered less leading by rephrasing, for example: ‘How frequently do you have conversations with the headteacher?’ and ‘What is your opinion of the new Mathematics scheme?’ respectively.
In discussing the issue of leading questions, we are not necessarily suggesting that there is not a place for them. Indeed Kvale (1996: 158) makes a powerful case for leading questions, arguing that they may be necessary in order to obtain information that the interviewer suspects the interviewee might be withholding. Here it might be important to put the ‘burden of denial’ onto the interviewee (e.g. ‘When did you stop beating your wife?’). Leading questions, frequently used in police interviews, may be used for reliability checks with what the interviewee has already said, or may be deliberately used to elicit particular non-verbal behaviours that give an indication of the sensitivity of the interviewee’s remarks.
Hence reducing bias becomes more than simply: careful formulation of questions so that the meaning is crystal clear; thorough training procedures so that an interviewer is more aware of the possible problems; probability sampling of respondents; and sometimes matching interviewer characteristics with those of the sample being interviewed. Oppenheim (1992: 148) argues, for example, that interviewers seeking attitudinal responses have to ensure that people with known characteristics are included in the sample – the criterion group. We need to recognize that the interview is a shared, negotiated and dynamic social moment.
The notion of power is significant in the interview situation, for the interview is not simply a data collection situation but a social and frequently a political situation. Literally the word ‘inter-view’ is a view between people, mutually, not the interviewer extracting data, one-way, from the interviewee. Power can reside with interviewer and interviewee alike (Thapar-Björkert and Henry 2004), though Scheurich (1995: 246) argues that,
typically, more power resides with the interviewer: the interviewer generates the questions and the interviewee answers them; the interviewee is under scrutiny while the interviewer is not. Kvale (1996: 126), too, suggests that there are definite asymmetries of power as the interviewer tends to define the situation, the topics, and the course of the interview.
J. Cassell (cited in Lee 1993) suggests that elites and powerful people might feel demeaned or insulted when being interviewed by those with a lower status or less power. Further, those with power, resources and expertise might be anxious to maintain their reputation, and so will be more guarded in what they say, wrapping this up in well-chosen, articulate phrases. Lee (1993) comments on the asymmetries of power in several interview situations, with one party having more power and control over the interview than the other. Interviewers need to be aware of the potentially distorting effects of power, a significant feature of critical theory, as discussed in Chapter 1.
Neal (1995) draws attention to the feelings of powerlessness and anxieties about physical presentation and status on the part of interviewers when interviewing powerful people. This is particularly so for frequently lone, low-status research students interviewing powerful people; a low-status female research student might find that an interview with a male in a position of power (e.g. a university Vice-chancellor, a senior politician or a senior manager) might turn out to be very different from an interview with the same person if conducted by a male university professor where it is perceived by the interviewee to be more of a dialogue between equals (see also Gewirtz and Ozga 1993, 1994). Ball (1994b) comments that, when powerful people are being interviewed, interviews must be seen as an extension of the ‘play of power’ – with its game-like connotations. He suggests that powerful people control the agenda and course of the interview, and are usually very adept at this because they have both a personal and professional investment in being interviewed (see also Batteson and Ball 1995; Phillips 1998).
The effect of power can be felt even before the interview commences, notes Neal (1995), where she instances being kept waiting, and subsequently being interrupted, being patronized, and being interviewed by the interviewee (see also Walford 1994d). Indeed Scheurich (1995) suggests that many powerful interviewees will rephrase or not answer the question. Connell et al. (1996) argue that a working-class female talking with a multinational director will be very different from a middle-class professor talking to the same person. Limerick et al. (1996) comment on occasions where interviewers have felt themselves to be passive, vulnerable, helpless and indeed manipulated. One way of overcoming this is to have two interviewers conducting each interview (Walford 1994c: 227). On the other hand, Hitchcock and Hughes (1989) observe that if the researchers are known to the interviewees and they are peers, however powerful, then a degree of reciprocity might be taking place, with interviewees giving answers that they think the researchers might want to hear.
The issue of power has not been lost on feminist research (e.g. Thapar-Björkert and Henry 2004), that is, research that emphasizes subjectivity, equality, reciprocity, collaboration, non-hierarchical relations and emancipatory potential (catalytic and consequential validity) (Neal 1995), echoing the comments about research that is influenced by the paradigm of critical theory. Here feminist research addresses a dilemma of interviews that are constructed in the dominant, male paradigm of pitching questions that demand answers from a passive respondent.
Limerick et al. (1996) suggest that, in fact, it is wiser to regard the interview as a gift, as interviewees have the power to withhold information, to choose the location of the interview, to choose how seriously to attend to the interview, how long it will last, when it will take place, what will be discussed – and in what and whose terms – what knowledge is important, even how the data will be analysed and used (see also Thapar-Björkert and Henry 2004). Echoing Foucault, they argue that power is fluid and is discursively constructed through the interview rather than being the province of either party.
Miller and Cannell (1997) identify some particular problems in conducting telephone interviews, where the reduction of the interview situation to just auditory sensory cues can be particularly problematical. There are sampling problems, as not everyone will have a telephone. Further, there are practical issues, for example, interviewees can retain only a certain amount of information in their short-term memory, so bombarding the interviewee with too many choices (the non-written form of ‘show cards’ of possible responses) becomes unworkable. Hence the reliability of responses is subject to the memory capabilities of the interviewee – how many scale points and descriptors, for example, can an interviewee retain about a single item? Further, the absence of non-verbal cues is significant, e.g. facial expression, gestures, posture, the significance of silences and pauses (Robinson 1982), as interviewees may be unclear about the meaning behind words and statements. This problem is compounded if the interviewer is unknown to the interviewee.
Miller and Cannell (1997) report important research evidence to support the significance of the non-verbal mediation of verbal dialogue. As discussed earlier, the interview is a social situation; in telephone interviews the absence of essential social elements could undermine the salient conduct of the interview, and hence its reliability and validity. Non-verbal paralinguistic cues affect the conduct, pacing and relationships in the interview and the support, threat and confidence felt by the interviewees. Telephone interviews can easily slide into becoming mechanical and cold. Further, the problem of loss of non-verbal cues is compounded by the asymmetries of power that often exist between interviewer and interviewee; the interviewer will need to take immediate steps to address these issues (e.g. by putting interviewees at their ease).
On the other hand, Nias (1991) and Miller and Cannell (1997) suggest that the very fact that interviews are not face-to-face may strengthen their reliability, as the interviewee might disclose information that may not be so readily forthcoming in a face-to-face, more intimate situation. Hence, telephone interviews have their strengths and weaknesses, and their use should be governed by the criterion of fitness for purpose. They tend to be shorter, more focused and useful for contacting busy people (Harvey 1988; Miller 1995).
In his critique of the interview as a research tool, Kitwood (1977) draws attention to the conflict it generates between the traditional concepts of validity and reliability. Where increased reliability of the interview is brought about by greater control of its elements, this is achieved, he argues, at the cost of reduced validity. He explains:
In proportion to the extent to which ‘reliability’ is enhanced by rationalization, ‘validity’ would decrease. For the main purpose of using an interview in research is that it is believed that in an interpersonal encounter people are more likely to disclose aspects of themselves, their thoughts, their feelings and values, than they would in a less human situation. At least for some purposes, it is necessary to generate a kind of conversation in which the ‘respondent’ feels at ease. In other words, the distinctively human element in the interview is necessary to its ‘validity’. The more the interviewer becomes rational, calculating, and detached, the less likely the interview is to be perceived as a friendly transaction, and the more calculated the response also is likely to be.
(Kitwood 1977)
Kitwood (1977) suggests that a solution to the problem of validity and reliability might lie in the direction of a ‘judicious compromise’.
A cluster of problems surrounds the person being interviewed. Tuckman (1972), for example, has observed that, when formulating their questions, interviewers have to consider the extent to which a question might influence respondents to show themselves in a good light; or the extent to which a question might influence respondents to be unduly helpful by attempting to anticipate what the interviewer wants to hear; or the extent to which a question might be asking for information about respondents that they are not certain or likely to know themselves. Further, interviewing procedures are based on the assumption that the
people interviewed have insight into the cause of their behaviour. Insight of this kind may be rarely achieved and, when it is, it is after long and difficult effort, usually in the context of repeated clinical interviews.
In educational circles interviewing might be a particular problem in working with children. Simons (1982) and McCormick and James (1988) comment on particular problems involved in interviewing children, for example:
O establishing trust
O overcoming reticence
O maintaining informality
O avoiding assuming that children ‘know the answers’
O overcoming the problems of inarticulate children
O pitching the question at the right level
O choosing the right vocabulary
O being aware of the giving and receiving of non-verbal cues
O moving beyond the institutional response or receiving what children think the interviewer wants to hear
O avoiding the interviewer being seen as an authority, spy or plant
O keeping to the point
O breaking silences on taboo areas and those which are reinforced by peer-group pressure
O seeing children as being of lesser importance than adults (maybe in the sequence in which interviews are conducted, e.g. the headteacher, then the teaching staff, then the children).
These are not new matters. The studies by Labov in the 1960s showed how students reacted very strongly to contextual matters in an interview situation (Labov 1969). The language of children varied according to the ethnicity of the interviewee, the friendliness of the surroundings, the opportunity for the children to be interviewed with friends, the ease with which the scene was set for the interview, the demeanour of the adult (e.g. whether the adult was standing or sitting) and the nature of the topics covered. The differences were significant, varying from monosyllabic responses by children in unfamiliar and uncongenial surroundings to extended responses in the more congenial and less threatening surroundings – more sympathetic to the children’s everyday world. The language, argot and jargon (Edwards 1976), social and cultural factors of the interviewer and interviewee all exert a powerful influence on the interview situation.
The issue is also raised here (Lee 1993) of whether there should be a single interview that maintains the detachment of the researcher (perhaps particularly useful in addressing sensitive topics), or whether there should be repeated interviews to gain depth and to show fidelity to the collaborative nature of research (a feature, as was noted above, which is significant for feminist research: Oakley 1981).
Kvale (1996: 148–9) suggests that a skilled interviewer should:
O know the subject matter in order to conduct an informed conversation
O structure the interview well, so that each stage of the interview is clear to the participant
O be clear in the terminology and coverage of the material
O allow participants to take their time and answer in their own way
O be sensitive and empathic, using active listening and being sensitive to how something is said and the non-verbal communication involved
O be alert to those aspects of the interview which may hold significance for the participant
O keep to the point and the matter in hand, steering the interview where necessary in order to address this
O check the reliability, validity and consistency of responses by well-placed questioning
O be able to recall and refer to earlier statements made by the participant
O be able to clarify, confirm and modify the participants’ comments with the participant.
Walford (1994c: 225) adds to this the need for interviewers to have done their homework when interviewing powerful people, as such people could well interrogate the interviewer – they will assume up-to-dateness, competence and knowledge in the
interviewer. Powerful interviewees are usually busy people and will expect the interviewer to have read the material that is in the public domain.
The issues of reliability do not reside solely in the preparations for and conduct of the interview; they extend to the ways in which interviews are analysed. For example, Lee (1993) and Kvale (1996: 163) comment on the issue of ‘transcriber selectivity’. Here transcripts of interviews, however detailed and full they might be, remain selective, since they are interpretations of social situations. They become decontextualized, abstracted, even if they record silences, intonation, non-verbal behaviour etc. The issue, then, is how useful they are to researchers overall rather than whether they are completely reliable.
One of the problems that has to be considered when open-ended questions are used in the interview is that of developing a satisfactory method of recording replies. One way is to summarize responses in the course of the interview. This has the disadvantage of breaking the continuity of the interview and may result in bias because the interviewer may unconsciously emphasize responses that agree with his or her expectations and fail to note those that do not. It is sometimes possible to summarize an individual’s responses at the end of the interview. Although this preserves the continuity of the interview, it is likely to induce greater bias because the delay may lead to the interviewer forgetting some of the details. It is these forgotten details that are most likely to be the ones that disagree with the interviewer’s own expectations.
Validity and reliability in experiments
As we have seen, the fundamental purpose of experimental design is to impose control over conditions that would otherwise cloud the true effects of the independent variables upon the dependent variables.
Clouding conditions that threaten to jeopardize the validity of experiments have been identified by Campbell and Stanley (1963), Bracht and Glass (1968) and Lewis-Beck (1993), conditions that are of greater consequence to the validity of quasi-experiments (more typical in educational research) than to true experiments in which random assignment to treatments occurs and where both treatment and measurement can be more adequately controlled by the researcher. The following summaries adapted from Campbell and Stanley (1963), Bracht and Glass (1968) and Lewis-Beck (1993) distinguish between ‘internal validity’ and ‘external validity’. Internal validity is concerned with the question, ‘Do the experimental treatments, in fact, make a difference in the specific experiments under scrutiny?’ External validity, on the other hand, asks the question, ‘Given these demonstrable effects, to what populations or settings can they be generalized?’ (see http://www.routledge.com/textbooks/9780415368780 – Chapter 6, file 6.8. ppt).
Threats to internal validity
O History: Frequently in educational research, events other than the experimental treatments occur during the time between pretest and post-test observations. Such events produce effects that can mistakenly be attributed to differences in treatment.
O Maturation: Between any two observations subjects change in a variety of ways. Such changes can produce differences that are independent of the experimental treatments. The problem of maturation is more acute in protracted educational studies than in brief laboratory experiments.
O Statistical regression: Like maturation effects, regression effects increase systematically with the time interval between pretests and post-tests. Statistical regression occurs in educational (and other) research due to the unreliability of measuring instruments and to extraneous factors unique to each experimental group. Regression means, simply, that subjects scoring highest on a pretest are likely to score relatively lower on a post-test; conversely, those scoring lowest on a pretest are likely to score relatively higher on a post-test. In short, in pretest-post-test situations, there is
regression to the mean. Regression effects can lead the educational researcher mistakenly to attribute post-test gains and losses to low scoring and high scoring respectively.
O Testing: Pretests at the beginning of experiments can produce effects other than those due to the experimental treatments. Such effects can include sensitizing subjects to the true purposes of the experiment and practice effects which produce higher scores on post-test measures.
O Instrumentation: Unreliable tests or instruments can introduce serious errors into experiments. With human observers or judges or changes in instrumentation and calibration, error can result from changes in their skills and levels of concentration over the course of the experiment.
O Selection: Bias may be introduced as a result of differences in the selection of subjects for the comparison groups or when intact classes are employed as experimental or control groups. Selection bias, moreover, may interact with other factors (history, maturation, etc.) to cloud even further the effects of the comparative treatments.
O Experimental mortality: The loss of subjects through dropout often occurs in long-running experiments and may result in confounding the effects of the experimental variables, for whereas initially the groups may have been randomly selected, the residue that stays the course is likely to be different from the unbiased sample that began it.
O Instrument reactivity: The effects that the instruments of the study exert on the people in the study (see also Vulliamy et al. 1990).
O Selection-maturation interaction: This can occur where there is a confusion between the research design effects and the variable’s effects.
Threats to external validity
Threats to external validity are likely to limit the degree to which generalizations can be made from the particular experimental conditions to other populations or settings. We summarize here a number of factors (adapted from Campbell and Stanley 1963; Bracht and Glass 1968; Hammersley and Atkinson 1983; Vulliamy 1990; Lewis-Beck 1993) that jeopardize external validity.
O Failure to describe independent variables explicitly: Unless independent variables are adequately described by the researcher, future replications of the experimental conditions are virtually impossible.
O Lack of representativeness of available and target populations: While those participating in the experiment may be representative of an available population, they may not be representative of the population to which the experimenter seeks to generalize the findings, i.e. poor sampling and/or randomization.
O Hawthorne effect: Medical research has long recognized the psychological effects that arise out of mere participation in drug experiments, and placebos and double-blind designs are commonly employed to counteract the biasing effects of participation. Similarly, so-called Hawthorne effects threaten to contaminate experimental treatments in educational research when subjects realize their role as guinea pigs.
O Inadequate operationalizing of dependent variables: Dependent variables that experimenters operationalize must have validity in the non-experimental setting to which they wish to generalize their findings. A paper and pencil questionnaire on career choice, for example, may have little validity in respect of the actual employment decisions made by undergraduates on leaving university.
O Sensitization/reactivity to experimental conditions: As with threats to internal validity, pretests may cause changes in the subjects’ sensitivity to the experimental variables and thus cloud the true effects of the experimental treatment.
O Interaction effects of extraneous factors and experimental treatments: All of the above threats to external validity represent interactions of various clouding factors with treatments. As well as these, interaction effects may also arise as a result of any or all of those factors
identified under the section on ‘Threats to internal validity’.
O Invalidity or unreliability of instruments: The use of instruments which yield data in which confidence cannot be placed (see below on tests).
O Ecological validity, and its partner, the extent to which behaviour observed in one context can be generalized to another: Hammersley and Atkinson (1983: 10) comment on the serious problems that surround attempts to relate inferences from responses gained under experimental conditions, or from interviews, to everyday life.
By way of summary, we have seen that an experiment can be said to be internally valid to the extent that, within its own confines, its results are credible (Pilliner 1973); but for those results to be useful, they must be generalizable beyond the confines of the particular experiment. In a word, they must be externally valid also: see also Morrison (2001b) for a critique of randomized controlled experiments and the problems of generalizability. Pilliner (1973) points to a lopsided relationship between internal and external validity. Without internal validity an experiment cannot possibly be externally valid. But the converse does not necessarily follow; an internally valid experiment may or may not have external validity. Thus, the most carefully designed experiment involving a sample of Welsh-speaking children is not necessarily generalizable to a target population which includes non-Welsh-speaking subjects.
It follows, then, that the way to good experimentation in schools, or indeed any other organizational setting, lies in maximizing both internal and external validity.
Validity and reliability in questionnaires
Validity of postal questionnaires can be seen from two viewpoints (Belson 1986). First, whether respondents who complete questionnaires do so accurately, honestly and correctly; and second, whether those who fail to return their questionnaires would have given the same distribution of answers as did the returnees. The question of accuracy can be checked by means of the intensive interview method, a technique consisting of twelve principal tactics that include familiarization, temporal reconstruction, probing and challenging. The interested reader should consult Belson (1986: 35–8).
The problem of non-response – the issue of ‘volunteer bias’ as Belson (1986) calls it – can, in part, be checked on and controlled for, particularly when the postal questionnaire is sent out on a continuous basis. It involves follow-up contact with non-respondents by means of interviewers trained to secure interviews with such people. A comparison is then made between the replies of respondents and non-respondents. Further, Hudson and Miller (1997) suggest several strategies for maximizing the response rate to postal questionnaires (and, thereby, to increase reliability). They involve:
O including stamped addressed envelopes
O organizing multiple rounds of follow-up to request returns (maybe up to three follow-ups)
O stressing the importance and benefits of the questionnaire
O stressing the importance of, and benefits to, the client group being targeted (particularly if it is a minority group that is struggling to have a voice)
O providing interim data from returns to non-returners to involve and engage them in the research
O checking addresses and changing them if necessary
O following up questionnaires with a personal telephone call
O tailoring follow-up requests to individuals (with indications to them that they are personally known and/or important to the research – including providing respondents with clues by giving some personal information to show that they are known) rather than blanket generalized letters
O detailing features of the questionnaire itself (ease of completion, time to be spent,
sensitivity of the questions asked, length of the questionnaire)
O issuing invitations to a follow-up interview (face-to-face or by telephone)
O providing encouragement to participate by a friendly third party
O understanding the nature of the sample population in depth, so that effective targeting strategies can be used.
The advantages of the questionnaire over interviews, for instance, are: it tends to be more reliable; because it is anonymous, it encourages greater honesty (though, of course, dishonesty and falsification might not be able to be discovered in a questionnaire); it is more economical than the interview in terms of time and money; and there is the possibility that it can be mailed. Its disadvantages, on the other hand, are: there is often too low a percentage of returns; the interviewer is unable to answer questions concerning both the purpose of the interview and any misunderstandings experienced by the interviewee, for it sometimes happens in the case of the latter that the same questions have different meanings for different people; if only closed items are used, the questionnaire may lack coverage or authenticity; if only open items are used, respondents may be unwilling to write their answers for one reason or another; questionnaires present problems to people of limited literacy; and an interview can be conducted at an appropriate speed whereas questionnaires are often filled in hurriedly. There is a need, therefore, to pilot questionnaires and refine their contents, wording, length, etc. as appropriate for the sample being targeted.
One central issue in considering the reliability and validity of questionnaire surveys is that of sampling. An unrepresentative, skewed sample, one that is too small or too large, can easily distort the data, and indeed, in the case of very small samples, prohibit statistical analysis (Morrison 1993). The issue of sampling was covered in Chapter 4.
Validity and reliability in observations
There are questions about two types of validity in observation-based research. In effect, comments about the subjective and idiosyncratic nature of the participant observation study are about its external validity. How do we know that the results of this one piece of research are applicable to other situations? Fears that observers’ judgements will be affected by their close involvement in the group relate to the internal validity of the method. How do we know that the results of this one piece of research represent the real thing, the genuine product? In Chapter 4 on sampling, we refer to a number of techniques (quota sampling, snowball sampling, purposive sampling) that researchers employ as a way of checking on the representativeness of the events that they observe and of cross-checking their interpretations of the meanings of those events.
In addition to external validity, participant observation also has to be rigorous in its internal validity checks. There are several threats to validity and reliability here, for example:
O the researcher, in exploring the present, may be unaware of important antecedent events
O informants may be unrepresentative of the sample in the study
O the presence of the observer might bring about different behaviours (reactivity and ecological validity)
O the researcher might ‘go native’, becoming too attached to the group to see it sufficiently dispassionately.
To address this Denzin (1970a) suggests triangulation of data sources and methodologies. Chapter 18 discusses the principal ways of overcoming problems of reliability and validity in observational research in naturalistic inquiry. In essence it is suggested that the notion of ‘trustworthiness’ (Lincoln and Guba 1985) replaces more conventional views of reliability and validity, and that this notion is devolved on issues of credibility, confirmability, transferability and dependability. Chapter 18 indicates how these areas can be addressed.
If observational research is much more structured in its nature, yielding quantitative data, then the conventions of intra- and inter-rater
reliability apply. Here steps are taken to ensure that observers enter data into the appropriate categories consistently (i.e. intra- and inter-rater reliability) and accurately. Further, to ensure validity, a pilot must have been conducted to ensure that the observational categories themselves are appropriate, exhaustive, discrete, unambiguous and effectively operationalize the purposes of the research.

Validity and reliability in tests

O situational factors: the psychological and physical conditions for the test – the context
O test marker factors: idiosyncrasy and subjectivity
O instrument variables: poor domain sampling, errors in sampling tasks, the realism of the tasks and relatedness to the experience of the testees, poor question items, the assumption or extent of unidimensionality in item response theory, length of the test, mechanical errors, scoring errors, computer errors.
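
Inter-rater reliability, on which structured observation depends and which also bears on marker subjectivity, can be checked numerically. The sketch below is illustrative only. It assumes two raters have coded the same ten events into nominal categories (the labels and data are invented), and it reports simple proportion agreement together with Cohen's kappa, a chance-corrected statistic that the text itself does not name.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of events that both raters placed in the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same, non-empty set of events")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    # Chance agreement: product of each rater's marginal proportions, summed.
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Hypothetical codes: was the pupil "on" or "off" task in each interval?
    rater_a = ["on", "on", "off", "off", "on", "off", "on", "on", "off", "on"]
    rater_b = ["on", "on", "off", "on", "on", "off", "on", "off", "off", "on"]
    print(round(percent_agreement(rater_a, rater_b), 2))
    print(round(cohens_kappa(rater_a, rater_b), 2))
```

For these invented codes the raters agree on 8 of the 10 events (0.80); allowing for the agreement expected by chance (0.52) gives a kappa of about 0.58. The same calculation applied to one rater's codes on two occasions gives an index of intra-rater consistency.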
certificate or employment or entry into higher education). The results of a test completed in a desultory fashion by resentful pupils are hardly likely to supply the students’ teacher with reliable information about the students’ capabilities (Wiggins 1998). Motivation to participate in test-taking sessions is strongest when students have been helped to see its purpose, and where the examiner maintains a warm, purposeful attitude toward them during the testing session (Airasian 2001).
O The relationship (positive to negative) between the assessor and the testee exerts an influence on the assessment. This takes on increasing significance in teacher assessment, where the students know the teachers personally and professionally – and vice versa – and where the assessment situation involves face-to-face contact between the teacher and the student. Both test-takers and test-givers mutually influence one another during examinations, oral assessments and the like (Harlen 1994). During the test situation, students respond to such characteristics of the evaluator as the person’s sex, age and personality.
O The conditions – physical, emotional, social – exert an influence on the assessment, particularly if they are unfamiliar. Wherever possible, students should take tests in familiar settings, preferably in their own classrooms under normal school conditions. Distractions in the form of extraneous noise, walking about the room by the examiner and intrusions into the room, all have significant impact upon the scores of the test-takers, particularly when they are younger pupils (Gipps 1994). An important factor in reducing students’ anxiety and tension during an examination is the extent to which they are quite clear about what exactly they are required to do. Simple instructions, clearly and calmly given by the examiner, can significantly lower the general level of tension in the test-room. Teachers who intend to conduct testing sessions may find it beneficial in this respect to rehearse the instructions they wish to give to pupils before the actual testing session. Ideally, test instructions should be simple, direct and as brief as possible.
O The Hawthorne effect, wherein, in this context, simply informing students that this is an assessment situation will be enough to disturb their performance – for the better or the worse (either case not being a fair reflection of their usual abilities).
O Distractions, including superfluous information, will have an effect.
O Students respond to the tester in terms of their perceptions of what he/she expects of them (Haladyna 1997; Tombari and Borich 1999; Stiggins 2001).
O The time of the day, week, month will exert an influence on performance. Some students are fresher in the morning and more capable of concentration (Stiggins 2001).
O Students are not always clear on what they think is being asked in the question; they may know the right answer but not infer that this is what is required in the question.
O The students may vary from one question to another – a student may have performed better with a different set of questions which tested the same matters. Black (1998) argues that two questions which, to the expert, may seem to be asking the same thing but in different ways, to the students might well be seen as completely different questions.
O Students (and teachers) practise test-like materials, which, even though scores are raised, might make them better at taking tests but the results might not indicate increased performance.
O A student may be able to perform a specific skill in a test but not be able to select or perform it in the wider context of learning.
O Cultural, ethnic and gender background affect how meaningful an assessment task or activity is to students, and meaningfulness affects their performance.
O Students’ personalities may make a difference to their test performance.
O Students’ learning strategies and styles may make a difference to their test performance.
O Marking practices are not always reliable, markers may be being too generous, marking by effort and ability rather than performance.
O The context in which the task is presented affects performance: some students can perform the task in everyday life but not under test conditions.

With regard to the test items themselves, there may be problems (e.g. test bias):

O The task itself may be multidimensional, for example, testing ‘reading’ may require several components and constructs. Students can execute a Mathematics operation in the Mathematics class but they cannot perform the same operation in, for example, a Physics class; students will disregard English grammar in a Science class but observe it in an English class. This raises the issue of the number of contexts in which the behaviour must be demonstrated before a criterion is deemed to have been achieved (Cohen et al. 2004). The question of transferability of knowledge and skills is also raised in this connection. The context of the task affects the student’s performance.
O The validity of the items may be in question.
O The language of the assessment and the assessor exerts an influence on the testee, for example if the assessment is carried out in the testee’s second language or in a ‘middle-class’ code (Haladyna 1997).
O The readability level of the task can exert an influence on the test, e.g. a difficulty in reading might distract from the purpose of a test which is of the use of a mathematical algorithm.
O The size and complexity of numbers or operations in a test (e.g. of Mathematics) might distract the testee who actually understands the operations and concepts.
O The number and type of operations and stages to a task: the students might know how to perform each element, but when they are presented in combination the size of the task can be overwhelming.
O The form and presentation of questions affects the results, giving variability in students’ performances.
O A single error early on in a complex sequence may confound the later stages of the sequence (within a question or across a set of questions), even though the student might have been able to perform the later stages of the sequence, thereby preventing the student from gaining credit for all she or he can, in fact, do.
O Questions might favour boys more than girls or vice versa.
O Essay questions favour boys if they concern impersonal topics and girls if they concern personal and interpersonal topics (Haladyna 1997; Wedeen et al. 2002).
O Boys perform better than girls on multiple-choice questions and girls perform better than boys on essay-type questions (perhaps because boys are more willing than girls to guess in multiple-choice items), and girls perform better in written work than boys.
O Questions and assessment may be culture-bound: what is comprehensible in one culture may be incomprehensible in another.
O The test may be so long, in order to ensure coverage, that boredom and loss of concentration may impair reliability.

Hence specific contextual factors can exert a significant influence on learning and this has to be recognised in conducting assessments, to render an assessment as unthreatening and natural as possible.

Harlen (1994: 140-2) suggests that inconsistency and unreliability in teacher-based and school-based assessment may derive from differences in:

O interpreting the assessment purposes, tasks and contents, by teachers or assessors
O the actual task set, or the contexts and circumstances surrounding the tasks (e.g. time and place)
O how much help is given to the test-takers during the test
O the degree of specificity in the marking criteria
O the application of the marking criteria and the grading or marking system that accompanies it
O how much additional information about the student or situation is being referred to in the assessment.

Harlen (1994) advocates the use of a range of moderation strategies, both before and after the tests, including:

O statistical reference/scaling tests
O inspection of samples (by post or by visit)
O group moderation of grades
O post-hoc adjustment of marks
O accreditation of institutions
O visits of verifiers
O agreement panels
O defining marking criteria
O exemplification
O group moderation meetings.

While moderation procedures are essentially post-hoc adjustments to scores, agreement trials and practice-marking can be undertaken before the administration of a test, which is particularly important if there are large numbers of scripts or several markers.

The issue here is that the results as well as the instruments should be reliable. Reliability is also addressed by:

O calculating coefficients of reliability, split-half techniques, the Kuder-Richardson formula, parallel/equivalent forms of a test, test/retest methods, the alpha coefficient
O calculating and controlling the standard error of measurement
O increasing the sample size (to maximize the range and spread of scores in a norm-referenced test), though criterion-referenced tests recognize that scores may bunch around the high level (in mastery learning for example), i.e. that the range of scores might be limited, thereby lowering the correlation coefficients that can be calculated
O increasing the number of observations made and items included in the test (in order to increase the range of scores)
O ensuring effective domain sampling of items in tests based on item response theory (a particular issue in Computer Adaptive Testing, see Chapter 19: Thissen 1990)
O ensuring effective levels of item discriminability and item difficulty.

Reliability has to be not only achieved but also seen to be achieved, particularly in ‘high stakes’ testing (where a lot hangs on the results of the test, e.g. entrance to higher education or employment). Hence the procedures for ensuring reliability must be transparent. The difficulty here is that the more one moves towards reliability as defined above, the more the test will become objective, the more students will be measured as though they are inanimate objects, and the more the test will become decontextualized.

An alternative form of reliability, which is premissed on a more constructivist psychology, emphasizes the significance of context, the importance of subjectivity and the need to engage and involve the testee more fully than a simple test. This rehearses the tension between positivism and more interpretive approaches outlined in Chapter 1 of this book. Objective tests, as described in this chapter, lean strongly towards the positivist paradigm, while more phenomenological and interpretive paradigms of social science research will emphasize the importance of settings, of individual perceptions, of attitudes, in short, of ‘authentic’ testing (e.g. by using non-contrived, non-artificial forms of test data, for example portfolios, documents, course work, tasks that are stronger in realism and more ‘hands on’). Though this latter adopts a view which is closer to assessment rather than narrowly ‘testing’, nevertheless the two overlap, both can yield marks, grades and awards, both can be formative as well as summative, both can be criterion-referenced.

With regard to validity, it is important to note here that an effective test will adequately ensure the following:

O Content validity (e.g. adequate and representative coverage of programme and test objectives in the test items, a key feature of domain sampling): this is achieved by ensuring that the content of the test fairly samples the class or fields of the situations or subject matter
in question. Content validity is achieved by making professional judgements about the relevance and sampling of the contents of the test to a particular domain. It is concerned with coverage and representativeness rather than with patterns of response or scores. It is a matter of judgement rather than measurement (Kerlinger 1986). Content validity will need to ensure several features of a test (Wolf 1994): (a) test coverage (the extent to which the test covers the relevant field); (b) test relevance (the extent to which the test items are taught through, or are relevant to, a particular programme); (c) programme coverage (the extent to which the programme covers the overall field in question).
O Criterion-related validity is where a high correlation coefficient exists between the scores on the test and the scores on other accepted tests of the same performance: this is achieved by comparing the scores on the test with one or more variables (criteria) from other measures or tests that are considered to measure the same factor. Wolf (1994) argues that a major problem facing test devisers addressing criterion-related validity is the selection of the suitable criterion measure. He cites the example of the difficulty of selecting a suitable criterion of academic achievement in a test of academic aptitude. The criterion must be: relevant (and agreed to be relevant); free from bias (i.e. where external factors that might contaminate the criterion are removed); reliable – precise and accurate; capable of being measured or achieved.
O Construct validity (e.g. the clear relatedness of a test item to its proposed construct/unobservable quality or trait, demonstrated by both empirical data and logical analysis and debate, i.e. the extent to which particular constructs or concepts can give an account for performance on the test): this is achieved by ensuring that performance on the test is fairly explained by particular appropriate constructs or concepts. As with content validity, it is not based on test scores, but is more a matter of whether the test items are indicators of the underlying, latent construct in question. In this respect construct validity also subsumes content and criterion-related validity. It is argued (Loevinger 1957) that, in fact, construct validity is the queen of the types of validity because it is subsumptive and because it concerns constructs or explanations rather than methodological factors. Construct validity is threatened by under-representation of the construct, i.e. the test is too narrow and neglects significant facets of a construct, and by the inclusion of irrelevancies – excess reliable variance.
O Concurrent validity is where the results of the test concur with results on other tests or instruments that are testing/assessing the same construct/performance – similar to predictive validity but without the time dimension. Concurrent validity can occur simultaneously with another instrument rather than after some time has elapsed.
O Face validity is where, superficially, the test appears – at face value – to test what it is designed to test.
O Jury validity is an important element in construct validity, where it is important to agree on the conceptions and operationalization of an unobservable construct.
O Predictive validity is where results on a test accurately predict subsequent performance – akin to criterion-related validity.
O Consequential validity is where the inferences that can be made from a test are sound.
O Systemic validity (Frederiksen and Collins 1989) is where programme activities both enhance test performance and enhance performance of the construct that is being addressed in the objective. Cunningham (1998) gives an example of systemic validity where, if the test and the objective of vocabulary performance leads to testees increasing their vocabulary, then systemic validity has been addressed.

To ensure test validity, then, the test must demonstrate fitness for purpose as well as addressing the several types of validity outlined above. The most difficult for researchers to address, perhaps, is construct validity, for it argues for agreement on the definition and