
THE IMMEDIACY OF LINGUISTIC COMPUTATION

Spencer Caplan

A DISSERTATION

in

Linguistics

Presented to the Faculties of the University of Pennsylvania

in

Partial Fulfillment of the Requirements for the

Degree of Doctor of Philosophy

2021

Supervisor of Dissertation
Charles Yang, Professor of Linguistics

Co-Supervisor of Dissertation
John C. Trueswell, Professor of Psychology

Graduate Group Chairperson

Eugene Buckley, Associate Professor of Linguistics

Dissertation Committee

Mitchell P. Marcus, RCA Professor of Artificial Intelligence



THE IMMEDIACY OF LINGUISTIC COMPUTATION

© COPYRIGHT

2021

Spencer Philip Caplan


For Julia/n

.........

I’d say you make a perfect Angel in the snow

All crushed out on the way you are

Better stop before it goes too far

Don’t you know that I love you

ACKNOWLEDGEMENT

A dissertation is the culmination of a lot of work and a lot of growth. On both fronts I

have an immense amount to be thankful for. Yet, as I write this I can’t help but feel a bit

conflicted. This bookends a major chapter in my life, both personally and academically; and

I will always be tremendously grateful for the wonderful opportunities and conversations I’ve

had at Penn over the last six years. But there’s also an element of bittersweetness for me

(you’re only a graduate student once after all!), particularly as I reflect on an environment

that will never again be quite as I remember it. Nonetheless, many acknowledgments are in

order.

First among those whom I’d like to thank for guiding and supporting me throughout my

time at Penn is my exceptional team of advisors: Charles Yang, Mitch Marcus, and John

Trueswell — my triumvirate, the computational psycholinguistics dream team. I always felt


“built up” by you all: fostering great confidence for the things I could go on to study, while

always pushing me to improve at my shortcomings. I was always given high expectations


but never demands; freedom to pursue whatever path I chose, but always a patient ear to

guide me towards the real questions, wherever they were hiding. I’m certain that I would

have been far worse off had I chosen to study elsewhere.



I am immensely thankful for the time I spent in conversation/debate with Tony Kroch,

who taught me as much as anyone else in grad school. I could always express exactly what

I was thinking to Tony, and I’d get nothing but the same intensity and earnestness right

back. I would also like to thank Lila Gleitman — whose depth of knowledge and infectiously

invested attitude were unparalleled in psycholinguistics — for many wonderful comments and

discussions during lab meetings.

I have co-authored papers with a number of people during my time here, all of whom

deserve mention: Deniz Beser, Kajsa Djärv, Alon Hafri, Jordan Kodner, Mitch Marcus,

Katie Schuler, John Trueswell, Hongzhi Xu, and Charles Yang.

My time at Penn was also made much brighter by a host of friends, both near and far. Worthy

of particular mention is Jordan Kodner, my Comrade in StarLab and academic double-

sibling. The CIS- and Ling-fueled journey was far more fulfilling taken together. Thank you

to Doug Guilbeault, with whom I’ve shared countless laughs and an even greater countless

number of stimulating conversations (“soak to squeeze, cherish to yearn, 8 to 11 jug milk, 7

days”). A big thank you to Patrick O’Callahan and Billy Shinevar for our continued weekly

correspondence over the last six years: the Adorno-inspired, Zohar-curious, Freudo-Marxist

reading group was as good an intellectual outlet as it gets, and in many ways represents

an instantiation of the academic ideal. I would like to acknowledge all my classmates in

the cohort of 2015 (the Smartbeginners) and the licorice-fueled DARPA LORELEI team.

Additional thanks to Faruk Akkuş, Ryan Budnick, Andrea Ceolin, Victor Gomes, Alex

Kalomoiros, Steve O’Neill, Vichet Ou, Zack Wiener, Hongzhi Xu, and many others not

listed here. And lastly, thank you to Hannah Brooks, whose radiant and persistent kindness
is so rare in this world and always appreciated.

I would like to thank everyone I’ve known through the Ballroom dance community, as dance

was a particularly helpful outlet to escape the stresses of graduate life. In particular I’d

like to acknowledge my dance partners Maria Peifer and Alexa Gamburg for hundreds of

rewarding hours of practice, training, co-teaching, and competition; as well as my coaches



Emanuele Pappacena and Francesca Lazzari for never letting my ego get too big.

Finally, I would like to thank my parents, who taught me to learn independently and never

offered anything less than their unconditional support. And to Julian, you still have such an

impact on me. I only wish you would have been able to read this and tell me it’s beautiful,

or tell me it’s shit :)

— Thanks for the ride!

ABSTRACT

THE IMMEDIACY OF LINGUISTIC COMPUTATION

Spencer Caplan

Charles Yang

John C. Trueswell

This dissertation investigates the wide-ranging implications of a simple fact: language un-

folds over time. Whether as cognitive symbols in our minds, or as their physical realization

in the world, if linguistic computations are not made over transient and shifting information

as it occurs, they cannot be made at all. This dissertation explores the interaction between

the computations, mechanisms, and representations of language acquisition and language


processing — with a central theme being the unique study of the temporal restrictions in-

herent to information processing that I term the immediacy of linguistic computation. This
program motivates the study of intermediate representations recruited during online pro-

cessing and acquisition rather than simply an Input/Output mapping. While ultimately

extracted from linguistic input, such intermediate representations may differ significantly

from the underlying distributional signal. I demonstrate that, due to the immediacy of

linguistic computation, such intermediate representations are necessary, discoverable, and

offer an explanatory connection between competence (linguistic representation) and per-

formance (psycholinguistic behavior). The dissertation comprises four case studies.

First, I present experimental evidence from a perceptual learning paradigm that the inter-

mediate representation of speech consists of probabilistic activation over discrete linguistic

categories but includes no direct information about the original acoustic-phonetic signal.

Second, I present a computational model of word learning grounded in category formation.

Instead of retaining experiential statistics over words and all their potential meanings, my

model constructs hypotheses for word meanings as they occur. Uses of the same word

are evaluated (and revised) with respect to the learner’s intermediate representation rather

than to their complete distribution of experience. In the third case study, I probe predic-

tions about the time-course, content, and structure of these intermediate representations of

meaning via a new eye-tracking paradigm. Finally, the fourth case study uses large-scale

corpus data to explore syntactic choices during language production. I demonstrate how a

mechanistic account of production can give rise to highly “efficient” outcomes even without

explicit optimization. Taken together these case studies represent a rich analysis of the im-

mediacy of linguistic computation and its system-wide impact on the mental representations

and cognitive algorithms of language.

TABLE OF CONTENTS

ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

LIST OF ILLUSTRATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xix

CHAPTER 1 : Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Outline of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Study 1: The Intermediate Representation of Speech . . . . . . . . . . 2

1.1.2 Study 2: Word Learning as Category Formation . . . . . . . . . . . . 3

1.1.3 Study 3: A More Direct Probe of Intermediate Representations during

Word Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.4 Study 4: Choices in Language Production . . . . . . . . . . . . . . . . 6

1.2 In Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

CHAPTER 2 : The Immediacy of Linguistic Computation and the Representation

of Speech . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1.2 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.1.3 Stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.4 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.1.5 Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.1.6 Exclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.1.7 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.1.8 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.1.9 Interim Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2 Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.2 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.2.3 Stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.4 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3 General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

CHAPTER 3 : Word Learning as Category Formation . . . . . . . . . . . . . . . . . 34

3.0.1 Word Learning and Generalization . . . . . . . . . . . . . . . . . . . . 35

3.0.2 Algorithms and Rational Behavior . . . . . . . . . . . . . . . . . . . . 37

3.0.3 Organization of the Chapter . . . . . . . . . . . . . . . . . . . . . . . . 39

3.1 Models, Experiments, and Major Findings in Generalization . . . . . . . . . 40

3.1.1 Word Learning as Bayesian Inference . . . . . . . . . . . . . . . . . . . 40



3.1.2 Immediate Generalization Paradigm . . . . . . . . . . . . . . . . . . . 41

3.1.3 Experimental Phenomena . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2 Robustness of the Presentation-Style Effect . . . . . . . . . . . . . . . . . . . 46



3.2.1 Analyzing data from Lewis and Frank . . . . . . . . . . . . . . . . . . 47

3.2.2 Presentation-Style and Learning in Similar Domains . . . . . . . . . . 50

3.3 Naïve Generalization Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.3.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.3.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.3.3 Computing distances . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.4 Modeling Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.4.1 Scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.4.2 Parameter-independent Evaluation . . . . . . . . . . . . . . . . . . . . 61

3.4.3 Parameter-tuned Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 62

3.5 General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

CHAPTER 4 : Selective Attention and the Intermediate Representation of Word

Meanings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.0.1 Design Constraints on Word Learning . . . . . . . . . . . . . . . . . . 70

4.0.2 Hypothesis Generation vs. Evaluation . . . . . . . . . . . . . . . . . . 73

4.1 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.1.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.1.2 Stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.1.3 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.1.4 Measures for Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.1.5 Exclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

4.1.6 Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4.2.1 Effect of Timing on generalization . . . . . . . . . . . . . . . . . . . . 84

4.2.2 Relationship between eye-gaze and learning outcome . . . . . . . . . . 85



4.3 General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

CHAPTER 5 : The Incremental Mechanisms of Functional Design: Language Pro-



duction and the Immediacy of Computation . . . . . . . . . . . . . . 92

5.1 Language Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.2 Verb-Particle Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.2.1 Data Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.2.2 Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.3 IG Predictions on the Verb-Particle Construction . . . . . . . . . . . . . . . . 102

5.3.1 Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.3.2 Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.3.3 Definiteness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5.3.4 Object Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.3.5 Prior Mention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.4 Primary Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

5.5 Efficiency, Optimization and Uniform Information Density . . . . . . . . . . . 109

5.5.1 UID and Levels of Analysis . . . . . . . . . . . . . . . . . . . . . . . . 112

5.5.2 UIDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

5.6 Object Length Experiment between IG and UIDA . . . . . . . . . . . . . . . 115

5.7 General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

CHAPTER 6 : Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

APPENDIX A: Supplemental Material for Chapter 2 . . . . . . . . . . . . . . . . . . . 124

A.1 Stimulus Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

A.2 Full Null Model Structures for Mixed Effects Regression Analyses . . . . . . . 125

A.3 Full Regression Outputs for Best Fitting Models . . . . . . . . . . . . . . . . 127


A.4 Bayes Factor Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

A.5 Secondary Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133


A.6 Effect of Pre-Registered Exclusion Criteria for All Experiments . . . . . . . . 137

A.7 Distribution of Participant Exclusions . . . . . . . . . . . . . . . . . . . . . . 140

A.8 Visualizing three-way interactions in main experiments . . . . . . . . . . . . . 141



A.9 Norming study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

A.10 Experiment S1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

APPENDIX B: Supplemental Material for Chapter 3 . . . . . . . . . . . . . . . . . . . 151

B.1 Parameter Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

B.2 Gradient analysis of PSE in Lewis and Frank (2018) . . . . . . . . . . . . . . 151

APPENDIX C: Supplemental Material for Chapter 4 . . . . . . . . . . . . . . . . . . . 153

C.1 Nonce word labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

C.2 Possible Feature Alternations . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

C.3 Gaze Heatmaps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

C.4 Timecourse Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

LIST OF TABLES

TABLE 1 : Output of the best fitting model predicting /t/ responses on the
first half of test trials for Experiment 1. Bracketed values are 95%
confidence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
TABLE 2 : Output of the best fitting model predicting /t/ responses on the
first half of test trials for Experiment 2. Bracketed values are 95%
confidence intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

TABLE 3 : Data from Lewis and Frank (2018). Dependent variable is the out-
come of broad vs. narrow generalization on all trials. Mixed-effects
logistic regression predicting generalization based on listed effects as
well as random slopes for subject and stimulus class. PSE and SCE
emerge as significant main effects along with a three-way interaction
between Presentation-Style, Training-Number, and Block-Order. . . . 48
TABLE 4 : Mixed-effects logistic regression predicting generalization outcome

on second-block trials (data from Lewis and Frank (2018)) based on
listed effects (Presentation-Style, Training-Number, Number-Timing
Interaction) as well as random slopes for each subject and stimulus
class. Neither SCE nor PSE manifest on second-block trials. . . . . . 49
TABLE 5 : Mixed-effects logistic regression predicting generalization outcome
on first-block trials (data from Lewis and Frank (2018)) based on
listed effects (Presentation-Style, Training-Number, Number-Timing
Interaction) as well as random slopes for each subject and stimulus
class. PSE and SCE emerge as significant main effects. . . . . . . . . 49
TABLE 6 : In this toy example, the initial training set contains a single instance
of a dalmatian with features (A:1, B:1, C:1, D:0). From this, the
learner extracts a mental representation of (A:0.3, B:0.8, C:0.3, D:0).
During testing, a few potential items are all compared against men-
tal representation in order to select category members. Only val-
ues present in mental representation but missing from the evaluated
items incur a penalty. If the maximum category cutoff were 1.0, then
both the dalmatian and the poodle (shown with shaded background)
would be selected in this case. . . . . . . . . . . . . . . . . . . . . . . 60
TABLE 7 : Major patterns to be captured by models of word learning and gen-
eralization. Both the size of the training set (SCE) as well as the
temporal manner of presentation (PSE) have reliable effects on the
meanings posited by learners. “0.15” represents the typical standard
deviation from results in Spencer et al. (2011) . . . . . . . . . . . . . 62

TABLE 8 : Output of the best fitting model predicting Narrow generalization.


Bracketed values are 95% confidence intervals. . . . . . . . . . . . . . 84

TABLE 9 : Output of primary logistic regression model where the dependent


variable was particle-first order. . . . . . . . . . . . . . . . . . . . . . 109

TABLE 10 : Evaluating cases of N=2 or more words . . . . . . . . . . . . . . . . . 118
TABLE 11 : Evaluating cases of N=4 or more words. The effect of conditional
probability is absent, while the effects of frequency, object length,
and definiteness remain. . . . . . . . . . . . . . . . . . . . . . . . . . 118

TABLE 12 : Target stimulus pairs used in Experiments 1, 2, and S1. . . . . . . . . 124


TABLE 13 : Filler stimuli used in Experiments 1, 2, and S1. . . . . . . . . . . . . 125
TABLE 14 : Output of the best fitting model on all trials for Experiment 1 . . . . 127
TABLE 15 : Output of the best fitting model on the first half of test trials for
Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
TABLE 16 : Output of the best fitting model on the last half of test trials for
Experiment 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
TABLE 17 : Output of the best fitting model on the first half of test trials, text-
before condition for Experiment 1 . . . . . . . . . . . . . . . . . . . . 128
TABLE 18 : Output of the best fitting model on the first half of test trials, text-
after condition for Experiment 1 . . . . . . . . . . . . . . . . . . . . . 128
TABLE 19 : Output of the best fitting model on all trials for Experiment 2 . . . . 128

TABLE 20 : Output of the best fitting model on the last half of test trials for
Experiment 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
TABLE 21 : Output of the best fitting model on the first half of test trials, text-
before condition for Experiment 2 . . . . . . . . . . . . . . . . . . . . 129
TABLE 22 : Output of the best fitting model on the first half of test trials, text-
after condition for Experiment 2 . . . . . . . . . . . . . . . . . . . . . 129
TABLE 23 : Exclusions with ceiling/floor cutoff for each experiment (by condition) . . . 137
TABLE 24 : Tuned parameter values from Section 3.4.3 . . . . . . . . . . . . . . . 151
TABLE 25 : Data from Lewis and Frank (2018). Dependent variable is the generalization-
level outcome on all trials. Linear mixed model predicting general-
ization based on listed effects as well as random slopes for subject
and stimulus class. PSE and SCE emerge as significant main ef-
fects along with a three-way interaction between Presentation-Style,


Training-Number, and Block-Order. . . . . . . . . . . . . . . . . . . . 152
TABLE 26 : Data from Lewis and Frank (2018). Dependent variable is the out-
come of broad vs. narrow generalization proportion on second-block
trials. Linear mixed model predicting generalization based on presentation-
style, training-number, the presentation-number interaction, as well
as random slopes for subject and stimulus class. Neither SCE nor
PSE manifest on second-block trials. . . . . . . . . . . . . . . . . . . 152
TABLE 27 : Data from Lewis and Frank (2018). Dependent variable is the out-
come of broad vs. narrow generalization on first-block trials. Linear
mixed model predicting generalization based on presentation-style,
training-number, the presentation-number interaction, as well as ran-
dom slopes for subject and stimulus class. PSE and SCE emerge as
significant main effects. . . . . . . . . . . . . . . . . . . . . . . . . . . 152

TABLE 28 : Disyllabic nonce word labels used in Experiment 1 (Chapter 4) . . . 153

TABLE 29 : Potential feature alternations for each domain. . . . . . . . . . . . . . 154

LIST OF ILLUSTRATIONS

FIGURE 1 : Timeline of the main experimental manipulation. Participants were


provided with disambiguating text either before (a) or after (b)
hearing the corresponding audio. . . . . . . . . . . . . . . . . . . . . 12
FIGURE 2 : Design of the exposure and test phases in both experiments. Each
participant was assigned to one of four possible conditions during
the exposure phase (a), which had a 2 x 2 design: shifted phone (/d/
or /t/) and audio–text order (text before or text after). All partic-
ipants then completed the same task at test (b), categorizing audio
on a continuum of voice-onset time (VOT) as either “ta” or “da.”
The graph illustrates predicted categorization patterns (separately
for each shifted-phone condition) in cases in which adaptation occurs. 14
FIGURE 3 : Pairing of text and audio used in Experiments 1 and 2 in the shifted-
/d/ and shifted-/t/ conditions. Although all participants were ex-
posed to the same text, participants in the shifted-/d/ condition

heard audio with ambiguous voice-onset times (VOTs) paired with
“d” text, whereas participants in the shifted-/t/ condition heard
audio with ambiguous VOTs paired with “t” text. . . . . . . . . . . 17
FIGURE 4 : Psychometric functions for Experiment 1: proportion of /t/ choices
as a function of voice-onset time (VOT) and shifted-phone condition
(/t/ or /d/), plotted separately for the text-before and text-after
conditions. Data points are the average of participant means, and
error bars are within-subject 95% confidence intervals. Adaptation
occurred in the text-before condition, but did not occur in the text-
after condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
FIGURE 5 : Psychometric functions for Experiment 2: proportion of /t/ choices
as a function of voice-onset time (VOT) and shifted-phone condition
(/t/ or /d/), plotted separately for the text-before and text-after


conditions. Data points are the average of participant means, and
error bars are within-subject 95% confidence intervals. Adaptation
occurred in the text-before condition, but did not occur in the text-
after condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

FIGURE 6 : Example word learning trial with test-grid shown to participants


in Xu and Tenenbaum (2007b); Spencer et al. (2011); Lewis and
Frank (2018). Figure adapted from Spencer et al. (2011). . . . . . . 43
FIGURE 7 : Proportion of broad property-projections based on sample size and
presentation-style (figure reproduced from Lawson (2014b) with
permission). When presented simultaneously, the size of training
has no effect on projection. When presented in sequence, the rates
of broad property-projection quickly approach a ceiling condition
as the size of training increases. . . . . . . . . . . . . . . . . . . . . 51

FIGURE 8 : Computation of mental representation from single training example
and subsequent comparison to test objects. Values are schematic
and for illustration only. . . . . . . . . . . . . . . . . . . . . . . . . 55
FIGURE 9 : Algorithmic flow charts highlighting some possible paths of NGM
behavior. This illustrates the common difference in experimental
outcome under parallel (left) and sequential (right) presentation of
stimuli. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
FIGURE 10 : Implementation of distance computation between an object and
mental representation under the NGM . . . . . . . . . . . . . . . . 59
FIGURE 11 : Chart of all seven training configurations. Conditions used for pa-
rameter tuning shown in light red. Time during training is indicated
within each block vertically; the objects in the parallel condition are
co-present at the same time, while the “sequential” trials training
objects are never co-present. . . . . . . . . . . . . . . . . . . . . . . 63
FIGURE 12 : Training on a single item. Experimental results from Spencer et al.
(2011) are shown in gold. Output of NGM in grey. Bars indicate
standard deviations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

FIGURE 13 : Training items presented in sequence. Experimental results in gold.
Output of NGM in grey. Bars indicate standard deviations. . . . . . 64
FIGURE 14 : Training items presented simultaneously. Experimental results in
gold. Output of NGM in grey. Bars indicate standard deviations. . 64
FIGURE 15 : Visualization of parallel vs. sequential training conditions. Word1
in red and Word2 in blue. The total number of exemplars and
display time remained constant across conditions. . . . . . . . . . . 76
FIGURE 16 : Sample pairs of maximally divergent stimuli (differing on all five
features) for each domain. . . . . . . . . . . . . . . . . . . . . . . . 79
FIGURE 17 : Example AOI calculation. The RF might be the front and the
bottom in blue, while the NF are the tail and the top in red. . . . . 82
FIGURE 18 : Bar graph of proportion of learning outcomes as a function of train-


ing condition (parallel vs. sequential) . . . . . . . . . . . . . . . . . 85
FIGURE 19 : Violin plot of the proportion of gaze-time allocated to RFs vs. NFs
as a function of Learning Outcome (learned vs. mislearned) . . . . . 87
FIGURE 20 : Violin plot showing RF-Skew as a function of learning outcome
(Narrow vs. Broad) . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
FIGURE 21 : Gaze to posited features is not affected by learning outcome (learned
vs mislearned). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
FIGURE 22 : Gaze to RF-set during each training exposure. Participants, in ag-
gregate, are likely to converge on their initial hypothesis. This fig-
ure includes trials from both parallel- and sequential-participants —
however, for parallel-participants the first three “trials” were actu-
ally objects on the screen simultaneously. . . . . . . . . . . . . . . . 90

FIGURE 23 : Basic outline of the language production architecture—adapted from


Bock and Levelt (2002). Time is indicated from left-to-right. . . . . 95

FIGURE 24 : Example illustration of IG output of verb-particle construction.
Variations in lemmas access or constituent assembly speed mani-
fest as comparable variations in linear order (when permitted by
the grammar): whichever element is retrieved and constructed first
is sent off to positional processing first. . . . . . . . . . . . . . . . . 103
FIGURE 25 : Distribution of object length (in words) within the present sample
of verb-particle data (67,905 sentences) . . . . . . . . . . . . . . . . 117

FIGURE 26 : Distributions of the 50% categorization thresholds in Experiment 1


from the main manuscript. No evidence for bimodality was observed
in any condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
FIGURE 27 : Distributions of the 50% categorization thresholds in Experiment 2
from the main manuscript. No evidence for bimodality was observed
in any condition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
FIGURE 28 : There is no significant relationship between RTs and categorization
thresholds at test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
FIGURE 29 : There is no significant relationship between RTs and categorization

thresholds at test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
FIGURE 30 : Psychometric functions for phoneme categorization during testing
in Experiment 1. Output split by Shifted Phone (/t/ or /d/), Tim-
ing condition (text-before or text-after), and test-phase half (first
four test blocks or remaining five). Adaptation occurred in the text-
before but not text-after condition and faded over the course of the
test phase (first vs. last half). Data points are subject means and
error bars are within-subject 95% confidence intervals (Morey, 2008) . . . 141
FIGURE 31 : Psychometric functions for phoneme categorization during testing
in Experiment 2. Output split by Shifted Phone (/t/ or /d/), Tim-
ing condition (text-before or text-after), and test-phase half (first
four test blocks or remaining five). Adaptation occurred in the text-
before but not text-after condition and faded over the course of the
test phase (first vs. last half). Data points are subject means and
error bars are within-subject 95% confidence intervals (Morey, 2008) . . . 142
FIGURE 32 : Violin plot of the median 50% threshold for “t” / “d” categorization
for each continuum in the norming study. Red line shows the over-
all median 50% threshold at 46.9ms. “_SHIFTED” and “_ORIG”
correspond to pitch-edited and original-pitch CV continua respec-
tively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
FIGURE 33 : Assignment of audio to text in Experiment S1. There is a confound
between “edited speech” and the particular phonological category
under manipulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
FIGURE 34 : Main results for Experiment S1. Psychometric functions for phoneme
discrimination during testing. Output split by Shifted Phone (/t/
or /d/) and Timing condition (text-before or text-after). Unlike in
Experiments 1 and 2, adaptation occurred in both the text-before
and text-after conditions. Data points are subject means and error
bars are within-subject 95% confidence intervals (Morey, 2008). . . 148

FIGURE 35 : Heatmap of gaze (within stimulus bounding box) throughout all
training trials and all stimulus domains. . . . . . . . . . . . . . . . . 155
FIGURE 36 : Heatmaps of overall gaze split by domain and overlaid on example
stimulus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
FIGURE 37 : Plots showing timecourse of gaze-time to the RF-set as a function
of learning outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

CHAPTER 1 : Introduction

One of the fundamental questions of linguistics asks “what does a person know when they

know a language?” What are the mental representations that underlie our cognitive system

of linguistic meaning, how do we learn them, and how are they processed in real time?

To address these questions, this dissertation investigates the wide-ranging implications of a

simple fact: language unfolds over time.

Whether as cognitive symbols in our minds, or as their physical realization in sound waves

and ever-changing referents in the world, if linguistic computations are not made over tran-

sient and shifting information as it occurs, they cannot be made at all. This dissertation

explores the interaction between the computations, mechanisms, and representations of lan-

guage acquisition and language processing — with a central theme being the unique study

of the temporal restrictions inherent to information processing that I term the immediacy
of linguistic computation. This program motivates the study of intermediate representations

recruited during online processing and acquisition rather than simply an Input/Output map-
ping. While ultimately extracted from linguistic input, such intermediate representations

may differ significantly from the underlying distributional signal. I demonstrate in sev-

eral lines of work that, due to the immediacy of linguistic computation, such intermediate

representations are necessary, discoverable, and offer an explanatory connection between

competence (linguistic representation) and performance (psycholinguistic behavior).

Given the system-wide impact that temporal restrictions have on language processing and

acquisition, this dissertation examines the immediacy of computation and the impact of

intermediate representations through a diverse set of projects involving computational mod-

eling, quantitative corpus analysis of language use, and psycholinguistic experimentation.

In particular, I address:

1. How are linguistic hypotheses (i.e. “what phone did I just hear”, or “what does that

word mean”) formed in real-time and what contents do these hypotheses comprise?

How does the process of generating hypotheses shape language compared with the

statistical evaluation of those ideas? A widespread intuition is that linguistic knowl-

edge and behavior are somehow governed by the raw computing power available to

the brain, but instead I argue that temporal restrictions have a far greater impact by

shaping the structure and content of hypotheses themselves.1

2. How do the algorithms implemented in our minds use simple tools to actually com-

pute the often complicated Input/Output relationships we see in language processing

and acquisition? A set of outputs is often largely consistent with many possible algo-

rithms — this work attempts to identify which algorithms are most likely at play and

why, disentangling causes from effects.

1.1. Outline of the Dissertation
The dissertation comprises four distinct studies: a set of experiments
probing the intermediate representation of speech during online processing (Chapter 2), a

computational model and corresponding eye-tracking study of how learners handle semantic

ambiguity during word learning (Chapters 3 and 4 respectively), and a statistical analysis of

how speakers make real-time choices during language production (Chapter 5). I discuss each

of these projects and the methods used to address these questions below in the remainder

of Chapter 1. While the individual findings stand on their own, when taken together they

represent a rich analysis of the immediacy of linguistic computation and its system-wide

impact on the mental representations and cognitive algorithms of language.

1.1.1. Study 1: The Intermediate Representation of Speech

Chapter 2 answers a question in speech processing: what happens to the acoustic-phonetic

signal after it enters the mind of a listener? Previous work (Connine et al., 1991, inter alia)

demonstrates that listeners maintain intermediate speech representations over time. Suc-

cessful parsing necessitates the maintenance of some sort of intermediate representation in

order for listeners to use subsequent context to aid in the interpretation of prior phonetic
1. Thus an alternative title of the dissertation might have been: “How to Get (Linguistically) Rich when
you’re (Computationally) Poor”

input. Consider for instance how one would decide between the interpretation of a poten-

tially ambiguous word pair like “[t/d]ent” in a sentence such as “That was the [t/d]ent that

we saw in the forest/fender.” However, the internal structure of such representations — be

they the acoustic-phonetic signal or more general information about the probability of possi-

ble categories — has remained underspecified. I present experimental evidence from a novel

perceptual learning (“accent adaptation”) paradigm which supports the view that informa-

tion about the acoustic-phonetic signal is not maintained over time. In particular, I exposed

listeners to a speaker whose utterances contained acoustically ambiguous information con-

cerning phones/words and manipulated the temporal availability of disambiguating cues via

visually presented text (i.e., presentation before or after each utterance). Results show that

listeners adapt to the modified acoustic distribution only when disambiguating text is pro-

vided before the auditory information, but not after. This finding supports the position that

intermediate representations of speech consist of probabilistic activation over discrete lin-


guistic categories (an account I call “AOC”) but not a direct record of the acoustic-phonetic

signal. Such results have impactful ramifications far beyond speech processing: limits to the
storage of sensory input place real limits on mental representations. This may inform long-

standing debates in other areas of linguistics regarding the exemplar vs. abstract/discrete

representation of phones, morphemes, syntactic units and general mental categories.
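
To make the AOC proposal concrete, the following toy sketch illustrates why a listener who retains only category probabilities adapts when disambiguating text precedes the audio but not when it follows. It is a minimal illustration under invented assumptions (the class, the logistic slope, and the boundary-update rule are mine for exposition), not the models or analyses reported in Chapter 2.

import math

class CategoryOnlyListener:
    """Toy listener that stores only probabilistic activation over /t/ vs. /d/;
    the raw voice-onset time (VOT) is discarded once the token is categorized."""

    def __init__(self, boundary_ms=45.0):
        self.boundary_ms = boundary_ms  # current /t/~/d/ category boundary

    def p_t(self, vot_ms):
        # Toy psychometric function mapping VOT to P(/t/).
        return 1.0 / (1.0 + math.exp(-(vot_ms - self.boundary_ms) / 5.0))

    def hear(self, vot_ms, text_before=None):
        if text_before is not None:
            # Disambiguating text arrives while the acoustic detail is still present:
            # the ambiguous token can be credited to that category and the boundary
            # shifted toward it (perceptual adaptation).
            target = 1.0 if text_before == "t" else 0.0
            self.boundary_ms -= 5.0 * (target - self.p_t(vot_ms))
            return text_before
        # No text yet: only the categorical percept survives.
        return "t" if self.p_t(vot_ms) > 0.5 else "d"

    def read_after(self, percept, text_after):
        # Text arriving later meets only a category label; the acoustic record
        # needed to recalibrate the boundary is gone, so nothing adapts.
        return text_after

before_listener = CategoryOnlyListener()
before_listener.hear(45.0, text_before="d")   # shifted-/d/ exposure, text before audio
print(before_listener.boundary_ms)            # boundary has shifted (adaptation)

after_listener = CategoryOnlyListener()
percept = after_listener.hear(45.0)           # category committed immediately
after_listener.read_after(percept, "d")       # text arrives too late
print(after_listener.boundary_ms)             # boundary unchanged (no adaptation)

In this sketch the text-after condition leaves the category boundary untouched, mirroring the asymmetry reported in Chapter 2; a listener that additionally stored the acoustic trace could, in principle, re-analyze it once the text arrived.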



1.1.2. Study 2: Word Learning as Category Formation

Children famously face ambiguity during morphological and syntactic acquisition (Yang,

2002, 2016; Tyler and Nagy, 1989; Pinker, 1989; Rumelhart and McClelland, 1985): how

does the learner deal with such ambiguity when multiple grammars are, in principle, con-

sistent with the words and sentences they have heard (Gold, 1967)? While words, unlike

syntactic units, are often thought of as atomic, a fundamental question in word learning

is strikingly similar: how, given only evidence about what objects a word has previously

referred to, are children able to generalize to the total class? How does a child end up

knowing that “poodle” picks out a specific subset of dogs despite their overlapping exten-

sions? Chapter 3 presents a model of word learning grounded in category formation (the

Naïve Generalization Model or “NGM”). While learners have been argued to display optimal

behavior by performing statistical inference over the input distribution of their experience

(e.g. via Bayesian inference — Xu and Tenenbaum (2007b)), they are also sensitive to input

conditions that are orthogonal to purely statistical reasoning (Spencer et al., 2011), like the

timing with which referents are encountered (for instance, whether stimuli are co-present on

the screen or viewed in sequence one second apart).

I contrast the NGM with the popular Bayesian inference theory of generalization (Xu and

Tenenbaum, 2007b). On the Bayesian account, learners have some representation of many

potential meanings for a word, and engage in statistically sensitive calculations to select the

hypothesis that is most probable given a distribution of attested exemplars. The “heavy-

lifting” and explanatory power reside in evaluation (via statistical inference) of many hy-

potheses without specifying the process which generates them. In contrast with previous
Bayesian (Xu and Tenenbaum, 2007b) or associative (Regier, 2005) accounts, computation

in the NGM is local and lacks any global optimization over an evaluation metric. On my

view, word learning is an incremental and mechanistic process. Instead of retaining experi-

ential statistics over words and all their potential meanings, the NGM constructs hypotheses

for word meanings as they occur. Uses of the same word are evaluated (and revised) with

respect to the learner’s intermediate representation (e.g. their current working conception)

rather than to their complete distribution of experience. While in some cases this “working

conception” ends up being extremely similar to the distribution of experience, other cases

lead to divergent and highly-informative outcomes, in particular when stimuli are presented

sequentially rather than simultaneously. What you see (during learning) is not necessarily

what you get (in subsequent mental representation).
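
This contrast can be made concrete with a deliberately simplified sketch; the feature sets, the revision rule, and the tolerance parameter below are assumptions introduced for illustration, not the NGM's actual representations, salience weights, or distance computation from Chapter 3. The incremental learner revises a single working hypothesis against each new use of a word, while the batch learner keeps the full distribution of exemplars and evaluates it at the end.

from collections import Counter

def incremental_learner(uses, tolerance=1):
    # Posit a hypothesis from the first use; later uses are compared only
    # against the current hypothesis, never against the stored history.
    hypothesis = set(uses[0])
    for features in uses[1:]:
        mismatch = len(hypothesis - set(features))
        if mismatch > tolerance:
            # Poor fit: revise by dropping the features the new use lacks.
            hypothesis &= set(features)
        # Otherwise the current working conception is retained unchanged.
    return hypothesis

def batch_learner(uses):
    # Keep the complete distribution of experience and evaluate at the end
    # (roughly the spirit of a statistical-inference account).
    counts = Counter(f for features in uses for f in features)
    return {f for f, c in counts.items() if c == len(uses)}

uses = [("animal", "dog", "dalmatian"),
        ("animal", "dog", "dalmatian"),
        ("animal", "dog", "poodle")]
print(sorted(incremental_learner(uses)))  # ['animal', 'dalmatian', 'dog']
print(sorted(batch_learner(uses)))        # ['animal', 'dog']

Whether the two diverge depends entirely on the order and grouping of the uses and on the invented tolerance parameter; the point is only that an intermediate working conception, once formed, can part ways with the input distribution, which is the kind of behavior the NGM is built to capture and that the presentation-style data in Chapter 3 probe.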

I evaluate, and find support for, the NGM on a range of experimental data — varying the

number and presentation-timing of stimuli, among other factors — on semantic generaliza-

tion in word learning (Xu and Tenenbaum, 2007b; Spencer et al., 2011). Learning behavior

is shaped by the immediacy of linguistic computation: learners are limited to locally eval-

uating only the fit of whatever structures they posit. Through this temporally constrained

process, one hypothesis will end up winning out because it offers a satisfactory fit to the

data, but this does not mean that the final meaning or grammar is provably optimal (as

often assumed by alternative accounts). Learners do the best job they can, not the best job

possible.

1.1.3. Study 3: A More Direct Probe of Intermediate Representations during Word Learning

The experiments modeled in Chapter 3 are informative as to the word representations that

result from successful learning, and the NGM makes predictions about intermediate states

of acquisition, but this does not provide direct evidence as to the fine-grained time-course

over which the relevant semantic generalizations emerge. Just like the generation/evaluation

duality, it is important for work in cognitive science to distinguish between an underlying

function (intension) and measuring its output (extension) as this is frequently a many-to-

one mapping. Chapter 4 introduces and presents results from a new eye-tracking paradigm
(inspired by Rehder and Hoffman (2005a)) designed to test the predictions of broad classes

of word learning theories: accounts grounded in hypothesis generation like the NGM in
contrast with accounts based on the statistical accumulation and evaluation of evidence. The

paradigm uses artificially created stimuli with spatially distributed features — each region

uniquely corresponding to a particular semantic dimension. By using eye-gaze as a measure



of selective attention to these individual features, we are able to study the content and time-

course of intermediate representations as they emerge throughout learning. A statistical

accumulation theory predicts that learners should initially attend to all the dimensions that

they can in order to extract a representative sample before applying any evaluative filter

for the most likely meaning. Conversely the NGM predicts that learners should extract an

intermediate hypothesis on the basis of initial exposure; given the immediacy of linguistic

computation subsequent trials are evaluated only with respect to the hypothesized meaning2 .

I find that, consistent with the NGM, learners’ attention is limited only to the features
2. Perhaps the modernist movement had it right all along! “Nothing is less real than realism. Details are
confusing. It is only by selection, by elimination, by emphasis, that we get at the real meaning of things”
(attributed to Georgia O’Keeffe — as quoted in Stuhlman (2007))
