JOURNAL OF MEMORY AND LANGUAGE 27, 572-585 (1988)

Speaking and Writing: Distinct Patterns of Word Choice

DONALD P. HAYES
Cornell University
Natural conversations and popular television shows provide sufficient lexical diversity for children to develop novice levels of lexical expertise, but they are ill-suited for developing the higher levels of such expertise. In large part, this is due to distinct patterns of word choice in speaking and writing. Those patterns are revealed in a half-million word corpus, designed to represent all major language sources (print, television, and conversation). Each sample text is first compared against a common 10,000-type reference lexicon. This permits comparisons between texts from different language sources. The pattern of word choice in most printed texts is described by a simple linear equation, but conversations are fit by one or another cubic equation. Linear and S-curve texts differ in their relative use of the major articles, conjunctions, prepositions, pronouns, and most other function words, as well as the more common and rarer content words. The ranking of the 10 most common function words is uncorrelated in the two patterns. Consequently, conversations between college graduates more closely resemble a preschool child's speech to its parents than texts from newspapers. Implications of the linear and S-curved patterns for models of lexical acquisition are considered. © 1988 Academic Press, Inc.

Word choice has long been the subject of statistical analysis. Regularities, such as the lognormal model of word frequencies (Herdan, 1960; Mandelbrot, 1953, 1961; Carroll, 1971), have been identified and described. Several pioneers in child development and psycholinguistics were involved in these investigations, but theoretical and applied mathematicians, biologists, statisticians, linguists, and educators have also made contributions. Their research has been applied to education, publishing, computer software, and artificial intelligence and to several important new areas in psycholinguistics, e.g., lexical acquisition, connectionism, and parallel distributed processing. As a sociologist, I share with others an interest in the pronounced differences in rates of lexical acquisition among children from diverse social backgrounds and in how these differences are related to the child's subsequent academic achievement and adult status attainment. My analyses have focused on the lexical resources children encounter in the three major sources of their input language experience: conversations, television, and printed texts. The analyses show there are distinctive patterns of word choice in speaking and writing.

My first objective is to describe these general patterns of word choice and demonstrate their robustness by examining how the 10,000 most common types in English were used in sample texts representative of all common forms of print, television, and natural conversation. The second objective is to determine which terms (including the first 10 most common function words) are responsible for the differences in word choice in writing and speaking. Third, the role of lexical access during real-time speech and off-line writing is examined. Fourth, the effects of the audience on the speaker's and writer's word choice are described. Finally, the effects which these two major patterns of word choice may have on children's lexical acquisition are considered.

Our best estimate for how often the terms in the English lexicon occur in natural language is found in the American Heritage Dictionary study (Carroll, Davies, & Richman, 1971). This is the largest, most modern, broadest in topic coverage, machine-oriented compilation available. In a corpus of over 5 million tokens (based on over 1000 samples, each with 500 or more words in complete sentences), 86,741 word types (terms having unique spellings) were found. The frequency of each word's use (per million tokens) and rank of each word were determined. The first and most common 10,000 types in that study (all terms occurring three or more times per million tokens) are used as a common yardstick for comparing a variety of texts.

I thank Anthony Wootton, Gordon Wells, Catherine Snow, and Harry Levin for allowing me to use samples from their field recordings of parent-child conversations. I especially thank Bruce Hayes and Margaret Ahrens for their helpful comments and assistance. The entire line of research would not have been possible without the anonymous gift of a minicomputer by a generous Cornell alumnus. Requests for reprints should be addressed to Donald P. Hayes, Dept. of Sociology, 346 Uris Hall, Cornell University, Ithaca, NY 14853.

0749-596X/88 $3.00. Copyright © 1988 by Academic Press, Inc. All rights of reproduction in any form reserved.

THE DATA SETS AND THEIR LEXICAL ANALYSIS

This section briefly describes the sample texts, how they were sampled, where they came from, and how they were analyzed.

All the data for these analyses were taken from printed texts, broadcasts, or conversations recorded in their natural contexts (the corpus is described in detail in Hayes, 1986a). The conversations between parents and their children were recorded in their homes. These samples came from families living in Scotland, England, and the United States. No stranger was present during these recordings. Other types of natural conversation were recorded in nursery schools, classrooms, a hospital nursery, a hospital "pre-op" room, courtrooms, dormitories, a bar, and a fire station. Courtroom testimony is based on transcriptions made by court recorders. Samples of printed matter include a wide selection of British and American newspapers, the major high-circulation magazines in the United States, school readers, comic books, children's and adult fiction, and scientific writing. All television samples were based on transcriptions from the sound tracks of three different programs of the same show.

All sample texts taken from print or television are based on 1000-word samples. Each sample consists of 10 subsamples of 100 or more words, in complete sentences, stratified to ensure that all segments of a text or TV show were included. Not all conversations reach the full 1000-word standard. In those cases, they are based on the full text, not samples.

Lexical Analyses

After a conversation was transcribed or a text copied in machine-readable form, a set of computer programs examined the sample text for its word choices.¹ The programs determine how often each of the 10,000 most common English types was used in that specific text. The most comprehensive description of a text's word choice from these 10,000 word types is a figure. Figure 1 shows how those types were used in 17 elite and mass British and American newspapers.

¹ The programs used in these analyses were developed by the author and are available to researchers. The programs were initially written by Peter Bond and Scott McAllister in MING-BASIC for DEC PDP-11 machines. They were rewritten by David Post in Turbo-PASCAL for IBM-PC compatible machines (DOS 2.0 or higher; 256 kb minimum). A new release of LEX containing the latest lexical pitch measures is under development for the IBM-PC families of machines. Programs and protocols sufficient to undertake independent analyses of lexical pitch in any text can be obtained, at nominal cost, by writing to the author.

The horizontal axis arrays the reference lexicon according to word frequency rank (based on the U statistic in the American Heritage Dictionary study). "The," the most common word, is located at the left margin. The 10,000th ranked term is at the right margin. On this logarithmic scale, the 10th word is "it," the 100th word is "most," the 1000th word is "chief," and

[Figure 1. Horizontal axis: word rank (log scale), 1 to 10,000.]

FIG. 1. Linear patterns of word choice: newspapers and popular magazines (cumulative proportions
of tokens in texts). Star = newspapers; circle = popular magazines.
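
The cumulative curves plotted in Fig. 1 can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the author's LEX programs: the function name `cumulative_profile` is hypothetical, the reference lexicon is assumed to be a plain list of lower-case types ordered from most to least common, and tokens are assumed to be already split into words.

```python
from collections import Counter


def cumulative_profile(tokens, ranked_lexicon):
    """Cumulative share of a text's tokens covered by lexicon ranks 1..N.

    tokens         -- list of word tokens from one sample text
    ranked_lexicon -- reference types, most common first (assumed lower case)

    Element k-1 of the result is the proportion of all tokens accounted
    for by the k most common reference types. The last element is the
    share of the text that came from the reference lexicon at all.
    """
    total = len(tokens)
    if total == 0:
        return [0.0] * len(ranked_lexicon)

    # Capitalized and lower-case forms are counted together.
    counts = Counter(token.lower() for token in tokens)

    profile = []
    running = 0
    for word in ranked_lexicon:
        running += counts.get(word, 0)
        profile.append(running / total)
    return profile


if __name__ == "__main__":
    # Toy text and a four-word stand-in for the 10,000-type lexicon.
    sample = "The cat sat on the mat".split()
    toy_lexicon = ["the", "of", "on", "cat"]
    for rank, share in enumerate(cumulative_profile(sample, toy_lexicon), 1):
        print(rank, round(share, 3))
```

Plotting the returned values against the logarithm of rank would give one line of the kind shown in the figure. A final value below 1.0 means the text used words outside the reference list.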

the 10,000th is "bleak."* The vertical axis, ranging from 0 to 100%, represents the cumulative proportion of all the tokens in a text.

* In the Carroll et al. listing of 86,741 word types, instances of the same term, one capitalized and one in lower case, were kept distinct. In this analysis, their relative frequencies were combined. The result is that the 10,000th type in this analysis was ranked approximately 10,500th on the Carroll et al. list. These analyses also exclude all numbers beyond 10, all proper names, and treat filled pauses (e.g., uh) with special codes. Word polysemy has simply been ignored in these analyses, pending computational developments which will permit their valid and reliable recognition.

The complete pattern of word choice in a text is shown by a line running from the lower left to the upper right. It begins with the proportion that "the" is of all words in that text. In the Carroll et al. reference lexicon, "the" alone accounted for 73 of every 1000 tokens. In Fig. 1, the average of the 17 British and American newspapers was 72 per 1000 tokens. When the second most common type, "of," is cumulated with "the," together they account for 102 of every 1000 tokens of text in the reference list (9 per 1000 in the newspaper sample). This line continues to cumulate toward the upper right when instances of the third through the fifth ranked terms ("and," "a," and "to") are added, producing the cumulative proportion of all terms in a text accounted for by these five most common function terms. That process is repeated by the computer until all instances of the first and most common 10,000 American Heritage Dictionary types have been cumulated. The intercept of the line with the right vertical axis is the proportion of the text's word choices which came from the reference lexicon. If that intercept occurs below the 100% level, then the author or speaker used types whose rank lies beyond the 10,000th ranked term. Most such words are simple inflected variants of stem terms already listed among the most common 10,000; but if not, and the term is neither a number nor a proper name, then the word is considered to be "rare." All rare words occur fewer than three times in a million words of running text according to the Carroll et al. list.

THE PRINCIPAL PATTERNS OF LEXICAL CHOICE

In this section, the complex lexical choices made by speakers and writers are shown to be variants of two principal patterns: linear and S-curved. The patterns
apply to all texts drawn to represent the full range of printed matter, television shows, and natural conversations.

The Linear Pattern: Newspapers, Magazines, and Many Books

The most popular forms of print have the same general pattern of word choice: a linear pattern. The clearest instance of the linear pattern is the most widely read of all texts, the newspaper. Figure 1 describes this pattern in British and American newspapers and in 13 of the highest circulation magazines in the United States. Such texts are generally expressed in formal style.

The full pattern of word use in newspapers is described by a simple linear equation, and the empirical fit is tight. Journalists evidently draw upon the lexicon in virtually the same way as the authors of the over 1000 texts which Carroll et al. sampled in producing the Word Frequency Book. At the level of 1000-word samples, word choice is remarkably orderly. This linear pattern applies to each of the individual newspapers, showing that this pattern is robust and not simply an artifact of pooling word choices across 17 newspapers.

Popular magazine texts exhibit the same linear pattern. There is, however, a small departure from linearity at two regions of the lexicon. Both of these discrepancies foreshadow the second major pattern of word choice found in all natural conversations, dialogues in print, and the simulated conversations on popular television shows. Magazines slightly underuse the most common of the function words ("the" and "of"), and slightly overuse words ranked between approximately the 50th and 1000th words. These discrepancies in probability of word use are so small that the linear model remains an excellent description for word choice in popular magazines.

The similarity of word choice in the Carroll et al. reference corpus to that in newspapers and magazines also extends to the most common individual words. "The" alone accounted for over 7% of all tokens in both the Word Frequency Book corpus and newspapers. The first 10 most common types in that corpus accounted for 23% of all tokens, 24% in newspapers. The first 50 most common types (all grammatical terms) account for 40% of all tokens in the Carroll et al. corpus, 41% in newspapers. This resemblance continues among the less common content terms. Sampling error in such a small sample of newspapers (18,000 tokens vs the 5 million tokens in the Dictionary study) may account for the small discrepancies between these two sets of percentages. The robustness of the pattern shows there must be an identifiable (but presently unknown) set of determinants of word choice to produce so similar a pattern of word choice from newspaper journalists and authors of school books, despite the differences in topics and stages of language development among the readers.

Such a high level of similarity is important to this kind of analysis. It implies that the Dictionary study's word rankings and relative frequencies of the 10,000 most common types in English constitute a reasonable common reference lexicon for a wide variety of printed texts.

Finally, a key feature of these three sets of texts should be noted. Newspapers, magazines, and school children's books and magazines all cover a broad range of topics. Since every topic has its own sublexicon of topic-related terms, wide topic coverage ensures that these three sets of texts will use a wide-ranging vocabulary. This fact becomes relevant when the linear pattern is later compared with the nonlinear patterns of word choice in all forms of spontaneous conversations, whose topic range is typically much narrower.

Nonlinear Patterns of Word Choice

While generally following an S-curved pattern, there is a wide range of nonlinear patterns of word choice.

The Linear-with-a-Bow Pattern: Scientific and Technical Writing

In this entire corpus, samples of science

[Figure 2. Vertical axis: cumulative percentage of tokens, 0 to 100; horizontal axis: word rank (log scale), 1 to 10,000.]


FIG. 2. Nonlinear patterns of word choice: scientific writing and preschool books (cumulative
proportions of tokens in texts). Square = Scientific American; circle = science, long articles; triangle
= science, abstracts of technical reports; hatch = preschool books; star = the linear pattern: news-
papers.
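
One hedged way to quantify how far a curve such as those in Fig. 2 departs from the linear pattern is to fit a straight line to the cumulative proportions and inspect the residuals. The sketch below assumes that the "simple linear equation" relates cumulative proportion to the base-10 logarithm of word rank, which matches the figure axes but is not spelled out in the text. The function names `fit_line` and `residuals` are hypothetical, and the numbers in the usage lines are synthetic, not measurements from the corpus.

```python
import math


def fit_line(ranks, cum_props):
    """Ordinary least-squares line: cum_prop ~ intercept + slope * log10(rank)."""
    xs = [math.log10(r) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cum_props) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct ranks")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cum_props))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(ranks, cum_props):
    """Observed minus fitted cumulative proportion at each rank.

    Residuals near zero everywhere suggest the linear pattern. A run of
    negative residuals in the middle ranks would correspond to a bow
    below the line, and a sign change from positive to negative past
    the middle ranks to an S-shaped curve.
    """
    slope, intercept = fit_line(ranks, cum_props)
    return [y - (intercept + slope * math.log10(r))
            for r, y in zip(ranks, cum_props)]


if __name__ == "__main__":
    ranks = [1, 10, 100, 1000, 10000]
    synthetic_linear = [0.07, 0.27, 0.47, 0.67, 0.87]  # made-up values
    print(fit_line(ranks, synthetic_linear))
    print(residuals(ranks, synthetic_linear))
```

A cubic in log rank, as the abstract mentions for conversations, could be fitted the same way with a general least-squares routine; the straight-line case is shown because it needs only the standard library.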

writing are unique, forming one extreme of word choice. Conversations involving children, books for preschool children, and television shows designed for preschool children constitute the other extreme. Three 1000-word samples of science writing were taken to illustrate the continuous nature of a transition from the linear to this distinctly nonlinear word choice pattern. The first sample was taken from Scientific American, written for a generally literate audience interested in science. That sample is contrasted with the journal published by the American Association for the Advancement of Science for its members, the weekly Science. The first of two Science samples was taken from the long articles at the beginning of each issue, which are addressed to the scientifically literate audience. The third sample of 1000 words was taken from abstracts of "technical reports," short articles on highly specific research designed for subject specialists. Lexically, these abstracts are the most demanding texts ever measured by these procedures (e.g., they contain the largest proportion of rare words).

What makes science texts unique is their unusual use of function words, the reduced use of the common content terms, and the extensive use of uncommon technical content words. Their use of the 10 most common types (including "the," "of," "and," "a," and "to") approximates that of linear texts, but from the 11th ranked term to the word ranked approximately 500, each text cumulatively underuses the remaining function words and common content words. Science texts are also unique in making relatively less use of the most common 100 types of English: 40%. The closest category of language texts is newspapers (48%), while 57% of the tokens in basal school readers come from that short list. Scientific American articles, the least technical of these science samples, depart least from the linear pattern. Considered alone, those articles might be described as following the linear pattern, but that modest bow is evident and is more pronounced (at the same points) in the other more technical science samples, especially in the abstracts. Together, the pattern of these three samples suggests that whatever the determinants of
word choice turn out to be, their effect is continuous, not discrete.

A Shallow S-Curved Pattern: Books for Preschool Children

The first clear indication of the common S-curved pattern of word choice appeared in books designed to be read to preschool children. While books generally follow the linear pattern, these books are intermediate between the linear and full S-curved patterns. Twenty-one such books (including such familiar characters as Winnie the Pooh, Babar, and Peter Rabbit) are summarized by the top curve in Fig. 2. Except for the principal preposition, "of," preschool books use the 10 most common function words as do linear texts. Beginning in the region of the 11th ranked word, preschool children's books (in contrast to science texts) use function words and the most common content words more often than linear texts. As would be expected, rare words are statistically uncommon in preschool books. The two bends of the S-curve are produced by the early dip, the greater use of words ranked in the 10 to 1000 range, and the lower-than-normal use of words above rank 1000.

The Full S-Curved Pattern: Elementary School Readers

Readers designed for first through sixth graders were sampled from two of the major educational publishers, Ginn and Scott, Foresman (Fig. 3). While written in formal style and designed for children ranging in age from 5 to 12 years, all exhibit the full S-curved pattern. The most striking feature of this set of readers is the systematic variation in those curves, as was the case with the progressively technical science samples. The most S-curved basal readers are those designed for first graders; the least curved are for sixth graders. One contributing factor to this transition toward the linear pattern is the reduction in the proportion of dialogue in the advanced texts.

The Full S-Curved Pattern: Popular Television Shows

The texts of most popular television shows simulate conversations. Word choice in 17 such shows (including

[Figure 3. Vertical axis: proportion of tokens, 0 to 100%; horizontal axis: word rank (log scale), 1 to 10,000.]

FIG. 3. Nonlinear patterns of word choice: elementary school readers (cumulative proportions of
tokens in texts). Square = first-grade school readers; circle = second-grade school readers; triangle
= third-grade school readers; plus = fifth-grade school readers; diamond = sixth-grade school read-
ers; star = the linear pattern: newspapers.

M*A*S*H, Dallas, and Love Boat) is shown in Fig. 4. In producing scripts for the actors, writers have followed the word choice pattern of natural conversation: the full S-curved pattern. These scripted conversations are difficult to distinguish (lexically) from natural conversations. The shape of the S-curve for these popular adult television shows is approximately the same as that for school readers designed for third and fourth grade pupils.

The Full S-Curved Pattern: Informal, Natural Conversations

All spontaneous conversations follow the S-curved pattern of word choice. Figure 4 depicts the cumulative frequency curves for (a) 47 adults (parents, teachers, and nurses) talking with children ranging in age from newborns through 12; (b) 18 children, age 4 through 12, talking with their parents at home; and (c) 30 college-educated adults in informal, spontaneous conversations in natural contexts. Despite the wide range of lexical development represented by young children and college-educated adults, the three curves are remarkably similar. Adults and children underuse (compared with the linear pattern) the same words and overuse the same sets of function and content words.

The Full S-Curved Pattern: Formal Courtroom Testimony

The same S-curved pattern of word choice found in conversations expressed in informal style is also used in speech in one of the most formal of social settings, the courtroom. Figure 5 describes the curves for the same 22 witnesses while giving testimony under direct interrogation and again under cross-examination in the Patricia Hearst trial. A third curve depicts the testimony of 17 expert witnesses in the Hearst and other trials and U.S. Senate hearings on abortion.

Though the setting is formal, the pattern of word choice among both lay and expert witnesses remains S-shaped, though the latter is less bowed than the former. All witnesses make reduced use of the primary articles, prepositions, and main conjunctions; they make greater use of the other function and content words (through approximately

[Figure 4. Vertical axis: proportion of tokens, 0 to 100%; horizontal axis: word rank (log scale), 1 to 10,000.]

FIG. 4. Nonlinear patterns of word choice: popular television shows and informal natural conver-
sations (cumulative proportions of tokens in texts). Square = popular television shows; circle =
adults speaking to children (newborn to 12 years); triangle = children (ages 4-12 years) speaking to
parents; hatch = adults speaking to adults (college-educated); star = the linear pattern: newspapers.

[Figure 5. Vertical axis: proportion of tokens, 0 to 100%; horizontal axis: word rank (log scale), 1 to 10,000.]

FIG. 5. Nonlinear patterns of word choice: formal courtroom testimony (cumulative proportions of
tokens in texts). Square = witnesses under direct examination; circle = witnesses under cross-
examination; triangle = expert witnesses; star = the linear pattern: newspapers.
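
The abstract's statement that the ranking of the 10 most common function words is uncorrelated between linear and S-curved texts could be checked with a rank correlation. The paper does not name the statistic used, so the sketch below assumes Spearman's rho for rankings without ties. The function name `spearman_rho` is hypothetical, and the only word order shown is the reference-lexicon order given in the text ("the," "of," "and," "a," "to," "in," "is," "you," "that," "it"); no conversational ranking is invented here.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two orderings of the same words.

    Each argument lists the same set of words from most to least frequent.
    Uses the no-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    """
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two words")
    if len(order_b) != n or len(set(order_a)) != n or set(order_a) != set(order_b):
        raise ValueError("both orderings must contain the same distinct words")

    rank_in_b = {word: i for i, word in enumerate(order_b, start=1)}
    sum_d2 = sum((i - rank_in_b[word]) ** 2
                 for i, word in enumerate(order_a, start=1))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Reference-lexicon order of the ten most common types, as given in the text.
    reference = ["the", "of", "and", "a", "to", "in", "is", "you", "that", "it"]
    # Sanity checks only: identical and fully reversed orderings.
    print(spearman_rho(reference, reference))
    print(spearman_rho(reference, list(reversed(reference))))
```

A value near zero for the ordering observed in a set of conversations, compared with the reference order, would correspond to the "uncorrelated" finding.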

word rank 1000); and they make lesser use of types beyond 1000 than would be expected from the linear pattern.

To conclude this section on patterns of word choice, a point illustrated by science texts and elementary school readers should be reinforced. By focusing on the central tendency of the word choice patterns, important and systematic variations between texts in the same language category have been neglected. These variations generally reflect the speaker's or writer's attempt to tailor word choice to specific audiences of readers or listeners. Among comic books, Richie Rich is designed for young readers. Such comics follow the full S-curved pattern. Others, like Dazzler and X-Men (purchased mainly by high school and college-age students), have lexically demanding texts and are nearly linear in their word choice pattern. The most popular comics are intermediate, more demanding in their lexicon but not as sharply S-curved as Richie Rich. I find the same adjustment of text to audience in all language categories, not just comics and science articles. There is one notable exception: parental talk to children under age 12 (Hayes & Ahrens, 1988), which we attribute to the utterly mundane nature of the topics discussed.

DIFFERENTIAL WORD USE PRODUCES THE S-CURVE AND LINEAR PATTERNS

The programs which describe word use make it possible to identify those terms which are used with different frequencies in linear and S-curve patterns. Their differential use provides clues to the underlying determinants of word choice.

The Initial Bend in the S-Curve

By themselves, the 10 most common function words account for approximately 22-24% of all tokens used, varying little across all categories of printed texts, television shows, and natural conversations. Compared with their use in linear texts, S-curved texts systematically underutilize five of the first six most common types in English and overutilize the next four. The five underused types are the primary articles ("the" and "a"), the primary prepositions ("of" and "in"), and the primary conjunction ("and"). About half of the initial bend in S-curves is accounted for by the underuse of "the" alone. There is no systematic bias in the use
of "to," the fifth ranked term. The remaining most common types are used more in S-curved than in linear texts. "Is," the seventh ranked term, is used slightly more often. The eighth, "you," appears far more often in S-curved texts than in linear texts. Both the ninth and tenth ranked terms, "that" and "it," are used primarily as indefinite pronouns. In conversation, the context often provides the missing referents for "that" and "it." The systematic underuse of five of the first six terms, followed by the overuse of these next four most common words in English, produces the initial bend of the full S-curve pattern.

The term "I," though listed as the 24th ranked term on the American Heritage Dictionary list, has a strikingly different use in this corpus of natural conversations. "I" is ranked as the first or second most common type in adult-adult conversations and in children's speech with their parents. It is not so highly ranked in adult speech with children, probably because such conversations are typically more child-oriented. "I" is ranked fifth in frequency when adults speak with infants, sixth when they speak with preschool children, and eighth when they talk with elementary school-age children. The effect of this greater use of "I" in S-curved texts is to increase sharply the angle of the cumulative frequency curve in that region. "I" is, however, by no means the sole contributor to that sharper angle, because conversationalists make heavier use of the entire set of function terms, especially when talking with young children.

The Second Bend in the S-Curve

The second bend in the S-curve is produced when the probabilities of using function words and the most common content words begin to drop, relative to linear texts, generally around word rank 1000, but earlier in conversations involving children. The use of terms ranked 1000 to 10,000 is generally lower than that in linear texts; thus the differential use of these two regions of the lexicon causes the formation of the second bend in the curve. Content terms in the region up to approximately word rank 1000 are in especially heavy use in discussions about mundane topics.

In short, the highly personalized nature of natural conversation (as shown by the extraordinary use of "you" and "I"), the contextualized nature of face-to-face conversation (permitting the more extensive use of "that" and "it"), and the mundaneness of natural conversation topics (as shown by the heavy reliance upon the major content words ranked 100 through 1000) help to account for the distinct patterns of word choice in writing and conversation.

DISCUSSION

Our ability to describe the differences between word choice in writing and conversation greatly exceeds our understanding of their determinants. Two probable determinants of these distinct but robust patterns of word choice are considered, as is an objection to the methodology, and one implication.

Probable Determinants of Word Choice Patterns

Speech and writing normally occur under different time pressures. Differential access to words in our productive/recognition vocabularies has long been suspected and is now known to be frequency dependent (Carpenter & Just, 1983; Just & Carpenter, 1987). The less common the term, the longer the latency for its recognition and retrieval from memory, as though the words were arrayed according to the Carroll et al. list (Marshalek, Lohman, & Snow, 1981). The implication is that under the time pressures of natural conversation, and especially where control of the floor is problematic, speakers are compelled to rely upon the most accessible (common) words. If this laboratory finding can be generalized, conversation should contain a bias toward the common words, relative to the standard of written texts. Among the texts
in this corpus, that is precisely what these data show. In any one individual's case, however, the pattern would not necessarily be the same. For example, while at work, we may often use occupational jargon, terms others might consider rare. Our frequent use of such words puts them "on the tip of our tongues," as it were, making those terms much more accessible than they would be to someone else for whom such terms would be rare.

Lexical access, while predicting the general skew toward common words in S-curved texts, does not predict that the top 10 words in linear texts will have a different rank order in S-curved texts; in fact, their rank orders are uncorrelated. Such a prediction is made by a second probable determinant: a general audience effect.

Audience effects were among the first and best documented observations about social interaction (Chapple, 1940; Giles, Mulac, Bradac, & Johnson, 1986). The immediate, continuing relevance of others during interaction produces a wide range of behavioral accommodations to one's partner, including such language accommodations as "motherese" (Snow & Ferguson, 1977). More so than in writing, conversational topics are chosen so as to overlap speaker and audience knowledge, interests, and activities. Having to take partners into account increases references to one another in conversation (increasing the use of "I" and "you"), and increases the probability that mundane topics will dominate. Mundane topics rely heavily upon common content words, skewing word choice into the S-curve pattern.

In writing, there is normally more time to reflect on the most suitable (generally a less common) word. The skewing effect of real-time decisions about word choice is relaxed in writing. There are also audience effects in print, though the text cannot be tailored as precisely to the audience as in conversation. The best evidence of such lexical accommodation in print was the systematic variations in word choice among comic books, basal readers, and science texts, where texts were tailored to a child's reading level or to an adult reader's scientific knowledge. Not all forms of print, however, have adjusted word choice to their audience. Texts from the New York Times and New York Daily News, on the same stories, differ little in their use of uncommon words, despite major differences in the style of the coverage and average educational attainments of their readers.

Finally, because writing is more decontextualized than conversation, words must be chosen to overcome the advantages of a variety of nonverbal and contextual cues available during face-to-face conversation. Explicit reference is more important to comprehension in writing. Evidence for that is the relative infrequency with which indefinite pronouns (e.g., "that" and "it") are used and the greater reliance in writing upon words belonging to the much larger, but less frequently used, lexicon beyond the first 5000 types.

Describing Word Choice in a Text by Its Lexical Pitch

Lexical access and audience effect are two among the factors which produce and distinguish the major patterns of word choice. Their effects produce a continuous, not discrete, shift in word choice toward or away from the most common content words. This variation in lexical choice, ranging from lexically undemanding texts like first grade readers, Richie Rich comics, and preschool children's speech to their parents to lexically demanding texts like newspapers, magazines, and science articles, is the dimension to which the expression "lexical pitch" refers. "Pitch" is used in the sense of such lay expressions as: "I had to pitch my remarks to my audience"; "He talked down to her"; or "My doctor can't help talking over my head." Statistical measures of lexical pitch describe this range of continuous variations in word
choice. While every word has its own lexical pitch (its relative frequency of use in English), lexical pitch generally refers to statistical descriptions of texts, such as a passage from a book, an entire speech, or the script of a television show. This is the use to which lexical pitch was put in this paper.

Among the most easily interpreted measures of lexical pitch are the sample text’s median and third quartile (Q2 and Q3) content words. When all the words of a particular text sample (e.g., Winnie the Pooh) are arrayed according to their frequency of use in that sample and all the 75 most common (function) words in English have been excluded, the text’s Q2 content term is that term which divides the remaining array in half. What is the lexical pitch of that Q2 term? It is the frequency per million words of that term in the master reference list, Carroll et al.’s Word Frequency Book, which serves as the common standard for all comparisons. Texts are compared by these Q2 and Q3 word frequencies. Texts may be compared by their number of genuine rare words (per thousand tokens) and by more complex measures.³

Every written and spoken text has been “pitched” to one or another lexical pitch level. Variations in lexical pitch levels indicate the lexical resources in texts and show how much speakers and writers have adjusted their word choice for their audience. The failure of some speakers and writers to make appropriate lexical pitch accommodations for their audiences may have clinical or educational significance.

³ Elsewhere (Hayes, 1986a; Hayes & Ahrens, 1988), I have described the lexical resources that children and adults encounter in a wide variety of natural conversations, television shows, and written materials. Tables containing the statistical comparisons of the major language categories are given in those papers. Previously reported statistics are outdated by periodic enlargements and reanalyses of the entire corpus. The most inclusive measure of a text’s lexical pitch is given by an equation describing the extent to which that text departs from the linear model. Those values are then decomposed to show how much of the departure from linearity can be attributed to different segments of the first 10,000 most common English terms, e.g., the function terms.

A Methodological Issue in This Lexical Analysis

An objection can be raised as to the use of the first 10,000 types on the Carroll et al. list as the reference lexicon for all text comparisons. Though based on a careful sampling from over a thousand school-related books and magazines, that corpus is not representative of word use in all sources of natural language. The analyses of newspapers, popular magazines, and books for adults and children showed that their word choices closely fit the ranking of the Carroll et al. list, justifying its use as a reference lexicon for other written texts, but is that list appropriate for analyses of natural conversations?

The biological foundations for our current human language capacities were laid down long before writing developed. Given the primacy of speech in the evolution of our biological capacity for language, it may be more appropriate to use spoken word order, not written word order, as the reference lexicon for these analyses and comparisons. At present, that is impossible. A large and representative corpus of natural conversation, comparable to the Carroll et al. list, does not yet exist. There are large data sets of natural speech in their normal contexts, recorded in Australia, Great Britain, and the United States, but they are not sufficiently contemporary, large, or broad in their demographic sampling or topic range or sufficiently diverse in their degrees of formality to serve as a reference lexicon. The half-million word corpus of natural conversations used for this report falls far short of what would be required for a corpus comparable to the Word Frequency Book.

If such a corpus existed and were used as the Carroll et al. list was used here, the
major distinction between speech and writing would not change; only their statistical descriptions would change. Texts described in these analyses as linear would probably have an inverted S-curve pattern, and texts described as one or another of the S-curved patterns would probably be described as linear or some variant of the inverted S-curved pattern. The provisional conclusion must be that the two major patterns of word choice identified here are valid but their description is method dependent. Word choice in most forms of writing and conversation differs in pronounced ways, but the shapes of their distributions are a function of what corpus serves as the reference lexicon.

An Implication for Lexical Acquisition

In conclusion, one implication of these analyses bears on how children’s biological capacities for language and their natural language experiences interact to set the trajectory of their lexical development. Unlike the controversy over the origins of a child’s grammar, there is little dispute that children must have reached a certain developmental stage for lexical acquisition to occur, must encounter concepts, learn the conventional word names by which concepts are communicated, and learn the uses to which those words may be put.

One perspective is to treat lexical acquisition like other forms of human expertise. There are many gradations separating the inexperienced/novice chess player, violinist, dancer, basketball player, or cab driver from those operating at still higher levels of competence, proficiency, and mastery. To attain even the novice level of lexical expertise requires years of immersion in overhearing conversations, daily participation in conversations, and listening to an average of 3½ h of television daily. At the end of 5 years of such experience, all healthy children have come to know the main uses for all function words and most of the 5000 word types which make up a language’s most common terms, what may be called the basic vocabulary. The profound effects of lifelong deafness, in otherwise healthy children, suggest how powerful these lexical experiences must be to lexical development.

Knowledge of this 5000-word-type basic vocabulary is so widespread among children entering school that vocabulary subtests of standardized achievement and intelligence tests use only a few basic vocabulary terms as test words (and then only for testing lexical development among the youngest children). The real power of these vocabulary tests to discriminate novice from competent, proficient, and mastery levels of lexical expertise comes from the child’s knowledge of the other 600,000 rarely used types in English (Carroll, 1971), those which provide the language with its breadth, depth, and subtlety of reference and meaning. Of the 45 test words making up the Stanford-Binet vocabulary subtest of intelligence, 25 have word frequencies of less than once in a million. Thirteen word types on the list did not occur once in the entire 5-million-word Word Frequency Book. Knowledge of such rare words distinguishes those at the highest levels of “verbal intelligence” (the mastery level) from the novice, the competent, and the proficient word users.

Between the first and sixth grades, there is a 10-fold increase in the use of “rare” words in the Ginn and Scott, Foresman basal readers (i.e., terms whose rank is beyond the first 10,000 and which are not numbers, proper names, or inflections of more common terms). To advance beyond the limitations of conversation and popular television (which are superbly suited to produce novice-level lexical expertise), the child must both become literate and practice that literacy. As with chess or dance, the most rapid development occurs with hours of daily practice, carried on over many years with language sources rich in topic breadth. “Natural ability” may make it possible for some children to develop this expertise more rapidly and more efficiently and to
attain higher levels than others, but the higher levels of any expertise require prodigious amounts of practice with lexically rich sources.

Recent work on children’s language inputs suggests that lexical knowledge is strongly affected by the quantity of inputs. Carroll (1975) showed that acquisition of expertise in a second language (French) was strongly related to the sheer amount of time students devoted to its study. More recently, Dreeben and Gamoran’s (1986) microanalyses in Chicago’s first grade classrooms showed that differential progress in black and white students’ vocabulary acquisition can be traced to the number of minutes of classroom instruction devoted to actual daily reading and to differences in the richness of the lexical resources in the basal readers from which the children worked. Illiterates and nonreaders will continue to enlarge their conceptual and lexical repertoires from low-pitched conversations and popular television, but the pace of that lexical development is predicted to be much slower than that for children who are both literate and read broadly across diverse topics.

Texts rich in lexical resources are readily available, inexpensive, and often tailored to children’s interests. Daily newspapers, most popular magazines, and even most comic books contain several times as many rare terms as conversation and television (Hayes, 1986a, 1986b). While some sections of the newspaper are avoided by children, the comics, sports, Ann Landers-like columns, and the entertainment sections all provide at least twice as many rare words/thousand as natural conversation. Children’s books, such as the Black Stallion or the Nancy Drew series, are, on average, three times richer in these terms than texts of comparable length from parent or teacher speech. Books designed to be read to preschool children have texts whose lexical pitch is 50% higher than average adult-child talk. Even Peter Rabbit and the Bugs Bunny cartoons are expressed at two to three times higher a lexical pitch level than adult-child talk. Though these lexical resources are readily available, the input experience with them may vary considerably across different population groups, even within the same family. For example, Gottfried and Gottfried (1984) found that parents claim to have read over three times as many minutes a day to first-born as they have to later-born children. Such information, together with these lexical analyses of texts, may help to explain Zajonc’s (1986) findings about sibling order and verbal “IQ” scores. These and other studies suggest that global measures of the richness of family environments are simply not sufficiently sensitive to detect important differences between children’s input lexical experiences and their effects on personal lexicons and general knowledge.

Our knowledge of the processes affecting word acquisition is limited, but we are confident that sheer exposure to novel terms alone cannot be sufficient to account for differentials in lexical knowledge (Keil, 1982; Anglin, 1978; Just & Carpenter, 1987). Extensive contact is a necessary condition but the child must have reached some essential cognitive developmental milestones. Those encounters must be in contextually rich settings and should sample a wide range of topics for word acquisition and generalization of knowledge to proceed maximally. Rational public policy to raise achievement levels of many low-income children requires a much better understanding of how these income-related lexical differentials develop. Since Gordon’s (1923) early research on children who were unschooled because they lived with their families on coal and grain barges in continuous movement around England, our understanding of the relationship between a child’s natural input language experience and the size and depth of his or her conceptual and lexical knowledge has grown very slowly. The line of research represented by the work of Carroll (1975), Dreeben and Gamoran (1986), Gottfried and Gottfried
(1984), Hart (1982), McDermott (1978), and others suggests that we may soon be able to specify much more clearly how a child’s specific input language experiences relate to his or her lexical knowledge. That should permit more effective public policy to narrow this most important gap in children’s cognitive resources.

REFERENCES

ANGLIN, J. M. (1978). From reference to meaning. Child Development, 49, 969-976.
CARPENTER, P. A., & JUST, M. A. (1983). What your eyes do while your mind is reading. In K. Rayner (Ed.), Eye movements in reading: Perceptual and language processes. New York: Academic Press.
CARROLL, J. B. (1971). Statistical analysis of the corpus. In J. B. Carroll, P. Davis, & H. Richman (Eds.), Word frequency book. Boston: Houghton-Mifflin.
CARROLL, J. B. (1975). The teaching of French as a foreign language in eight countries. International studies of evaluations, V. International Association for the Evaluation of Educational Achievement. New York: Halsted.
CARROLL, J. B., DAVIS, P., & RICHMAN, H. (1971). Word frequency book. Boston: Houghton-Mifflin.
CHAPPLE, E. D. (1940). Measuring human relations: An introduction to the study of the interaction of individuals. General Psychology Monographs, 23, 3-147.
DREEBEN, R., & GAMORAN, A. (1986). Race, instruction, and learning. American Sociological Review, 51, 660-669.
GILES, H., MULAC, A., BRADAC, J. J., & JOHNSON, P. (1986). Speech accommodation theory: The first decade and beyond. In M. McClough (Ed.), Communication yearbook. Beverly Hills: Sage.
GORDON, H. (1923). Mental and scholastic tests among retarded children: An inquiry into the effects of schooling on the various tests. (Educational Pamphlets No. 44). London: Board of Education.
GOTTFRIED, A. W., & GOTTFRIED, A. E. (1984). Home environment and cognitive development in young children of middle-socioeconomic-status families. In A. Gottfried (Ed.), Home environment and early cognitive development. New York: Academic Press.
HART, B. (1982). Process in the teaching of pragmatics. In L. Feagans & D. C. Farran, The language of children reared in poverty. New York: Academic Press.
HAYES, D. P. (1986a). The Cornell corpus (Technical Report No. 86-1). Ithaca: Dept. of Sociology, Cornell University.
HAYES, D. P. (1986b). Validation analyses of lexical pitch measures (Technical Report No. 86-3). Ithaca: Dept. of Sociology, Cornell University.
HAYES, D. P., & AHRENS, M. G. (1988). Vocabulary simplification for children: a special case of “Motherese”? Journal of Child Language, 15, 395-410.
HERDEN, G. (1960). Type-token mathematics. The Hague: Mouton.
JUST, M. A., & CARPENTER, P. A. (1987). The psychology of reading and language comprehension. Boston: Allyn & Bacon.
KEIL, F. C. (1982). Intelligence and the rest of cognition. Intelligence, 6, 1-21.
MANDELBROT, B. (1953). An information theory of the structure of language based on the theory of statistical matching of messages and codes. In W. Jackson (Ed.), Proceedings of a Symposium on Applications in Communication Theory. London: Butterworth Scientific.
MANDELBROT, B. (1961). On the theory of word frequencies and on related Markovian models of discourse. In R. Jacobson (Ed.), Structure of Language and its Mathematical Aspects. Providence: American Mathematical Society.
MARSHALEK, B., LOHMAN, & SNOW, C. E. (1981). Trait and process aspects of vocabulary knowledge and verbal ability. (Technical Report No. 15). Stanford, CA: Aptitude Research Project, School of Education, Stanford University.
MCDERMOTT, R. P. (1978). Relating and learning: An analysis of two classrooms reading groups. In R. Shuyp (Ed.), Linguistics and reading. Rowley, MA: Newberry.
SNOW, C. E., & FERGUSON, C. A. (1977). Talking to children. Cambridge: Cambridge University Press.
ZAJONC, R. B. (1986). The decline and rise in scholastic aptitude scores. American Psychologist, 41, 862-867.
ZIPF, G. K. (1949). Human behavior and the principle of least effort. Cambridge: Addison-Wesley.

(Received September 24, 1987)
(Revision received March 18, 1988)
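
The Q2 and Q3 measures of lexical pitch described in the section on lexical pitch can be illustrated with a short Python sketch. It is offered only as one reading of the procedure, not as the software used for the reported analyses. Several details are assumptions: the "array" is taken to be the list of content tokens ordered from the most to the least frequent term in the sample; the function names, the tie-breaking rule, and the treatment of terms missing from the reference list (a pitch of 0.0) are hypothetical; and the rare-word count excludes only digit strings, not the proper names and inflections that the paper also excludes.

```python
from collections import Counter


def quartile_content_terms(tokens, function_words):
    """Return the (Q2, Q3) content terms of a tokenized text sample.

    Assumption: content tokens are arrayed from the most to the least
    frequent term in the sample; Q2 and Q3 are the terms found one half
    and three quarters of the way through that array.
    """
    content = [t.lower() for t in tokens if t.lower() not in function_words]
    if not content:
        raise ValueError("sample has no content words")
    counts = Counter(content)
    # Most frequent terms first; ties broken alphabetically for determinism.
    ordered = sorted(content, key=lambda t: (-counts[t], t))
    n = len(ordered)
    return ordered[n // 2], ordered[(3 * n) // 4]


def lexical_pitch(term, reference_per_million):
    """Frequency per million of `term` in the reference list (0.0 if absent)."""
    return reference_per_million.get(term, 0.0)


def rare_words_per_thousand(tokens, reference_terms):
    """Tokens absent from the reference lexicon, per thousand tokens.

    Simplification: only digit strings are excluded from the rare count.
    """
    words = [t.lower() for t in tokens]
    if not words:
        raise ValueError("sample is empty")
    rare = [w for w in words if w not in reference_terms and not w.isdigit()]
    return 1000.0 * len(rare) / len(words)
```

In use, `function_words` would hold the 75 most common words in lowercase, `reference_per_million` would map each of the first 10,000 reference types to its frequency per million, and a text's Q2 pitch would be `lexical_pitch(q2, reference_per_million)`.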

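
The statement that the rank orders of the 10 most common words are uncorrelated in linear and S-curved texts can likewise be checked with a rank correlation. The sketch below uses Spearman's coefficient, which is an assumption: the text does not say which statistic was computed. The function name is hypothetical, and the formula used applies only when neither ranking contains ties.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same items.

    Each argument lists the same distinct items, most frequent first.
    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties.
    """
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items")
    if len(set(ranking_a)) != n or len(ranking_b) != n:
        raise ValueError("rankings must list the same distinct items")
    if set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must list the same distinct items")
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    # d is the difference between an item's ranks in the two orderings.
    d_squared = sum(
        (rank - position_b[item]) ** 2 for rank, item in enumerate(ranking_a)
    )
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

A value near zero for the top 10 words, ranked once by their frequency in linear texts and once by their frequency in S-curved texts, would correspond to the uncorrelated rank orders reported above.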