Kobayashi (2002) Method effects on reading comprehension test performance: text organization and response format
I Introduction
The theoretical framework for this article is twofold: research in the areas of language testing and of reading. In the area of language testing, Bachman's (1990) model of language ability (later revised in Bachman and Palmer, 1996) was the main inspiration for this study. By including 'method facets' as well as 'trait facets' in his discussion of language ability, Bachman draws our attention to a range of factors that can affect test performance and, therefore, jeopardize test validity. His model is the most influential and comprehensive available, although there has hitherto been little further research to validate it empirically.
According to Bachman, method facets can be divided into five
categories:
1) testing environment;
2) test rubrics;
3) the nature of input;
4) the nature of the expected response; and
5) the interaction between input and response.
This study focuses on the third and fourth of these facets by manipulating text organization and test format, both of which play a significant role in reading comprehension tests.
1 Text organization
This study draws on insights from reading research to shed more light on the third of Bachman's categories: the 'nature of input'. A review of studies examining text characteristics and readability suggests that the coherence and organization of the text are significant factors influencing reading comprehension (Reder and Anderson, 1980; Davison and Kantor, 1982; Duffy and Kabance, 1982; Klare, 1985; Duffy et al., 1989; Olsen and Johnson, 1989; a series of studies by Beck and his colleagues, e.g., 1982, 1984, 1989, 1991, 1995). Surface-level features, such as syntactic or lexical elements, also affect readability but are secondary. Background knowledge is an important factor, but this has been extensively researched elsewhere (see, for example, Steffensen et al., 1979; Johnson, 1981; 1982; Carrell and Eisterhold, 1983; Alderson and Urquhart, 1983; 1985; 1988; Mohammed and Swales, 1984; Steffensen and Joag-Dev, 1984; Ulijn and Strother, 1990; Bernhardt, 1991; Salager-Meyer, 1991; Clapham, 1996).
There have been some attempts to characterize coherence by:
1) examining how sentences are related to one another (Connor, 1987; Olsen and Johnson, 1989; Connor and Farmer, 1990);
2) quantifying and mapping the links between key words and phrases (Hasan, 1984; Hoey, 1991); and
3) rating coherence holistically (Bamberg, 1984; Golden et al., 1988).
These attempts have yielded useful insights. However, some of them are too complicated for practical use, and none seems to characterize the concept of 'coherence' precisely enough for research purposes.
Various researchers have also tried to establish schemes to identify the overall structure of a text. Such schemes include:
· story grammar (e.g., Mandler, 1982);
· macro-structure (e.g., Kintsch and van Dijk, 1978);
· content structure analysis (Meyer, 1975a; 1985); and
· causal chains (e.g., Trabasso et al., 1984).
The Meyer model of text analysis has been applied by a great number of researchers (Kintsch and Yarbrough, 1982; McGee, 1982; Taylor and Samuels, 1983; Carrell, 1984; Richgels et al., 1987; Golden et al., 1988; Goh, 1990; Salager-Meyer, 1991). Their findings suggest that text organization has a significant effect on comprehension and that texts with a better or more natural structure enhance comprehension (see also Dixon et al., 1984; Urquhart, 1984). The present study builds on these findings and explores their applicability in foreign language reading comprehension tests.
My preliminary attempts to identify text types in naturally-occurring texts suggested that the 'comparison' text type could be regarded as an elaboration of the 'description' text type. Therefore, it was decided to modify Meyer's framework by combining 'description' and 'comparison' into a single category called 'description'. In addition, since too many text types would complicate the research design, it was decided to adopt only the first category of 'collection' as an example of the most loosely-organized text type: this was renamed 'association'. As noted above, Meyer herself renamed the 'response' text type, calling it 'problem–solution', which is a better indication of what it entails. Thus, this study investigated the comprehension of four types of top-level rhetorical organization: 'association', 'description', 'causation' and 'problem–solution'.
2 Response format
Returning to Bachman's model and his concern with test format, it should be noted that Meyer and her associates used 'recall' as a way of measuring reading comprehension performance when examining the effects of text organization.
III Methodology
1 Pilot study
The pilot study was conducted before the main study and involved 219 Japanese university students. Its purpose was, first, to examine the viability of the research questions and, secondly, to identify potential pitfalls in the proposed research methodology. To this end, the influence of a number of relevant variables was explored. These included: topic areas of reading passages, text length, text readability, the number of questions, the nature of questions, students' language proficiency and appropriacy of test level for the students. Although the variables were not tightly controlled, the findings suggested that text structure and response format had an important impact on reading comprehension. The main study was therefore designed to explore this further. In addition to this relatively large-scale pilot study, the preparation involved a series of mini-pilots and reviews by expert judges (see Section III.4 below) to ensure the quality of the test materials.
2 Participants
A total of 754 Japanese university students participated in the main study, the majority being 18–19 years of age and in the first or second years of their courses. All had previously had six years of English language learning at secondary schools. The students in intact English language classes were randomly divided into twelve groups, with each group taking one combination of text type and response format (four text types × three response formats).
3 Materials
a English proficiency test In order to establish the comparability of the twelve groups, an English proficiency test, consisting of 50 multiple-choice grammar and vocabulary items, was conducted (for the relationship between knowledge of grammar and/or vocabulary and reading ability see, for example, Grabe, 1991; Alderson, 1993b). The test was designed to fit the level of the students in the light of the pilot study results. It drew on past papers of the Cambridge First Certificate and an English proficiency test for overseas students used in a British university. Statistical analysis confirmed that there was no significant difference between the twelve groups in their English language proficiency (F = .39, d.f. = 11, 723, n.s.). The test results were also used to divide the participants into three proficiency groups – Low, Middle and High – according to the rank order of their scores, as a basis for comparison at a later stage of the study. The test statistics were: x̄ = 29.7 out of 50; s.d. = 8.07; reliability (alpha) = .82; facility values ranging from .17 to .99 with a mean of .59; and item-total correlations ranging from .08 to .53 with a mean of .34.
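For readers who wish to reproduce such classical item analyses, the sketch below shows one way the statistics reported above (Cronbach's alpha, facility values, item-total correlations and the one-way F test of group comparability) can be computed. It is a minimal illustration only; the array and variable names are hypothetical, not the study's actual data.

```python
# Minimal sketch of classical item statistics, assuming `items` is an
# (examinees x items) 0/1 NumPy array of scored responses and `group`
# labels the twelve class groups. All names here are illustrative.
import numpy as np
from scipy import stats

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def facility_values(items):
    """Proportion of examinees answering each item correctly."""
    return items.mean(axis=0)

def item_total_correlations(items):
    """Pearson correlation of each item with the total score."""
    totals = items.sum(axis=1)
    return np.array([stats.pearsonr(items[:, j], totals)[0]
                     for j in range(items.shape[1])])

# Group comparability check (cf. F = .39, d.f. = 11, 723, n.s. above):
# F, p = stats.f_oneway(*[items.sum(axis=1)[group == g]
#                         for g in np.unique(group)])
```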
b Reading comprehension tests The texts used in the study were specially prepared to maximize control over the variables identified in the pilot study. Topic areas were first chosen and model texts were selected from several educational sources. Care was taken to minimize the potential effects of cultural bias or student familiarity with the topic (cf. Alderson and Urquhart, 1985; 1988; Clapham, 1996). Six topics were chosen and, for each topic, four different texts representing the four text types were prepared, resulting in a total of 24 texts. From the six topics, two sets of texts, concerning 'international aid' and 'sea safety', were finally selected for use in the study on the basis of expert judgement (see Section III.4 below) regarding their suitability as representative samples of the selected text types. The mean length of the texts was 369.3 words (range: 352–384), and the mean score was 64.4 (range: 58.5–69.9) on the Flesch Reading Ease formula, which is one of the most widely recognized readability indices.
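For reference, the Flesch Reading Ease score is calculated from average sentence length and average word length in syllables:

```latex
\mathrm{RE} = 206.835 \;-\; 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) \;-\; 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
```

Higher scores indicate easier text; the mean of 64.4 reported here falls within the band (roughly 60–70) conventionally described as 'standard' or plain English.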
After the eight texts had been selected, test items were developed for each text in three formats: cloze, open-ended questions and summary writing. The number of items for each test was:
· 25 for the cloze test;
4 Expert judgement
Use of expert judgement is a fairly recent development in the second language testing field (e.g., Zuck and Zuck, 1984; Alderson and Lukmani, 1989; Alderson, 1993a; Cohen, 1993). In this study expert judges were asked to assist at different stages, ranging from text selection and item analysis to establishing marker reliability. Most of the judges had MAs in applied linguistics and were currently engaged in EFL teaching, materials development or testing consultancy. Where non-native speakers were involved, their English proficiency was of a sufficient level to enable them to study for higher degrees at British universities. Varying numbers of people were involved at different points. For example, four educated native speakers of English were asked to answer the cloze tests and to identify item characteristics; 10 educated native and non-native speakers of English were invited to identify and rate the importance of ideas in the texts to provide a basis for marking summaries, and so on.
Another example is text selection, which was conducted in the following manner: 27 people were given a description of the four text types adapted from Meyer (see Section I.1) and a set of 12 passages.
5 Procedure
6 Statistical analysis
The cloze tests were marked by the semantically and syntactically acceptable word scoring method. The results were analysed using SPSS/PC. For both the proficiency test and the reading comprehension tests, descriptive statistics (i.e., means, standard deviations, item-total correlations for individual items and reliability) were calculated. In addition, for the reading comprehension tests, the analyses included correlations with the proficiency test and t-tests. On the basis of the results of these initial statistics, ANOVAs (both one-way and two-way) were conducted to test the research hypotheses. The significance level was set at p < .05.
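As an illustration of the kind of analysis described here, the following sketch runs a two-way ANOVA with interaction in Python using statsmodels rather than SPSS/PC. The data frame and its values are randomly generated stand-ins, used only to make the example runnable; they are not the study's data.

```python
# Sketch: two-way ANOVA (text type x response format) on comprehension
# scores, analogous to the SPSS analyses described above.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 240  # hypothetical sample size
df = pd.DataFrame({
    'text_type': rng.choice(['association', 'description',
                             'causation', 'problem-solution'], size=n),
    'response_format': rng.choice(['cloze', 'open-ended', 'summary'], size=n),
    'score': rng.normal(50, 10, size=n),
})

# Fit a linear model with both main effects and their interaction,
# then produce the ANOVA table (F and p values; test at p < .05).
model = smf.ols('score ~ C(text_type) * C(response_format)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```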
To assess the reliability of marking, 15% of the papers (n = 64) for open-ended questions and summary writing were independently marked by other expert judges (one other for open-ended questions and two others for summary writing) in addition to the researcher. All of these were native speakers of Japanese and experienced teachers of English, with MA degrees in TESOL from a British university. The correlations were .92 between the two markers for open-ended questions and between .85 and .90 among the three markers for summary writing.
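Inter-marker correlations of this kind can be computed directly; the brief sketch below uses scipy with invented scores for eight scripts, purely to show the computation.

```python
# Sketch: inter-marker reliability as a Pearson correlation between two
# markers' scores for the same scripts. The score arrays are invented.
import numpy as np
from scipy import stats

marker_a = np.array([12, 15, 9, 18, 14, 11, 16, 13])
marker_b = np.array([13, 14, 10, 17, 15, 10, 16, 12])
r, p = stats.pearsonr(marker_a, marker_b)
print(f'inter-marker r = {r:.2f}')  # cf. the .85-.92 values reported above
```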
IV Results
1 Overall results
On the whole, reliability values were higher for the cloze tests, regardless of text type (α = .86–.90), than for open-ended questions and summary writing (α = .69–.79). However, this seemed to be because there were more items in the cloze test; the values were therefore adjusted using the Spearman–Brown prophecy formula to take the difference in test length into account.
[Figure 2: Mean scores by text type – Association, Description, Causation, Problem–solution – for the three response formats: cloze, open-ended questions and summary writing.]
2 Hypothesis testing 1
To test the first research hypothesis, ANOVAs were conducted to examine the differences in text type effects on reading comprehension test performance. The results showed that the observed effects were statistically significant in each of the three test formats individually and overall, as shown in Table 2.
The effects of response format were also examined for each text type, and it was confirmed that such effects were statistically significant in all text types except the Causation texts (see Table 3). The response format effect across the four text types overall was also statistically significant.
The most important and interesting aspect of the results is that the two-way interaction between the two effects proved to be statistically significant (F(11, 723) = 6.149, p < .005). This means that text type and response format not only have significant effects on reading comprehension performance separately, but also interact with each other. This confirms the statistical significance of the pattern shown in Figure 2.
The statistical results reported here clearly reject the first null hypothesis. That is, the differences in test performance observed across text types and response formats were statistically significant, and it therefore cannot be maintained that test performance is unaffected by text type or response format.
[Figures: mean scores by text type – Association, Description, Causation, Problem–solution – for the High, Middle and Low proficiency groups, shown separately for each response format.]
texts – were better able to exploit text structure in line with their greater ability, and had their proficiency magnified. The impact of different kinds of text organization varied considerably across the different proficiency groups, and this suggests that it is essential to take the text types of reading passages into account when summary writing is to be used as a measure of reading comprehension.
Table 4 Comparison of text types: correlation coefficients with the proficiency test
5 Hypothesis testing 2
Table 5 summarizes the results of the one-way ANOVAs, which examined text type effects for the three proficiency groups. The table presents an interesting picture. First of all, F values were always highest for the High proficiency group, in all three response formats. This suggests that the effects of text type were most evident for this group of learners, regardless of the response format. More interestingly, the difference between the proficiency groups was greatest in summary writing. Text type had no significant effect on the Low group, whereas its effect on the High group was greater than in either of the other two response formats. This suggests that, for learners of lower language proficiency, it does not matter what kind of text structure is involved in the passages used as input for summary writing, but it matters to a great extent for learners of higher proficiency. For open-ended questions, the difference between the proficiency groups was not so striking, but there was still a gradual increase in F values as the proficiency level rose. This suggests that the choice of passages is also important in this response format when learners' language proficiency is higher. By comparison, in the cloze tests both the Low and High groups showed significant values. However, as seen in Section 3 above, the significant text type effects were less problematic in cloze tests because text types did not seem to affect discrimination between the different proficiency groups.
Table 6, which shows the results of the one-way ANOVAs examining response format effects, presents an even more striking contrast between the High and Low proficiency groups. While the Low group reached a significance level only in the Description text type, in the High group F values were significant in three text types and, moreover, the values were always greater. This suggests that response format effects are more evident when the learners' proficiency level is higher.
Table 5 The results of analysis of variance: text type effects by proficiency levels

                        Low                Middle             High
Main effect
  Response format       n.s.               8.57**             14.28**
                        (d.f. = 2, 236)    (d.f. = 2, 235)    (d.f. = 2, 255)
  Text type             n.s.               3.82*              8.87**
                        (d.f. = 3, 235)    (d.f. = 3, 234)    (d.f. = 3, 254)
Interaction effect      2.63*              3.06*              6.61**
                        (d.f. = 6, 227)    (d.f. = 6, 226)    (d.f. = 6, 246)
V Implications
1 Selection of reading passages
Typically, passages for reading comprehension tests have been selected arbitrarily, without any coherent guiding principles. The basis of selection may be linguistic difficulty (e.g., vocabulary or syntactic complexity) or the tester's preferred topics. However, the findings of this study clearly suggest that it is essential to know in advance what type of text organization is involved in passages used for reading comprehension tests. Types of text organization do not seem to make much difference if reading comprehension is measured by cloze tests, or if the learners' level of language proficiency is not high enough for them to be able to exploit text organization for comprehension. However, text types become most important if – as in summary writing – the test is intended to measure overall understanding. This is particularly significant with learners of higher language proficiency, because they seem to be unfairly disadvantaged, and their proficiency will not be reflected accurately in test performance, when unstructured texts are presented. It is important that testing boards take these findings into account, especially since they may need to adjust test methods according to the test-takers' language proficiency levels.
It is of particular interest that learners of lower language proficiency did not benefit from clear text structure. At first sight this finding may seem surprising: clear structure ought to help them, and some research studies suggest that this is the case (e.g., Reder and Anderson, 1980). However, other studies suggest that better readers are more aware of overall text organization, and that this awareness enhances their comprehension (e.g., Meyer et al., 1980; Taylor and Samuels, 1983; Golden et al., 1988). The finding of the present study is in accordance with the second set of results.
This apparent contradiction in research findings may be related to the learners' level of language proficiency. That is, learners may need to have reached a certain proficiency level before being able to utilize text organization for overall understanding of the text. This seems to support the concept of a linguistic threshold (e.g., Clarke, 1979; Alderson, 1984; Devine, 1988; Clapham, 1996; Ridgway, 1997). When the level of language proficiency is low, the learners have difficulty at a more basic, linguistic level of processing.
VI Conclusions
1 Limitations of the study and future directions
This study has examined the test performance of a limited sample of Japanese university students, whose English language proficiency levels ranged from lower-intermediate to intermediate. It would therefore be interesting to replicate the study by extending learner variables, such as language proficiency level, age and different language backgrounds.
2 Final word
This research has employed Bachman's influential model of language ability as an organizing framework. The findings of this study have provided data supporting two aspects of his model: the nature of input and the nature of the expected response. More research needs to be conducted in this area so that the findings reported here can be illuminated further, but the main implications are clear.
Test results are often used to make major educational decisions, and they play an essential role in many applied linguistics research projects. This study has clearly demonstrated that there is a systematic relationship between students' test performance and the two variables examined: text type and response format. It is therefore vitally important for language testers, and anyone else involved in assessment, to pay close attention to the test methods they use.
VII References
Alderson, J.C. 1984: Reading in a foreign language: a reading problem or a language problem? In Alderson, J.C. and Urquhart, A.H., editors, Reading in a foreign language. London: Longman, 1–27.
Open-ended questions
Answer the following questions in Japanese.
1. When they give aid to Third World countries, what do industrialised countries want to happen in the future? (Literal understanding; Local)
Summary writing
Write a summary of the passage in about 100 Japanese characters.
Appendix 2
Descriptive statistics of the reading comprehension tests (converted to percentages)