Probabilistic Language Models in Cognitive Neuro - 2017 - Neuroscience - Biobeha
Review article
Keywords: Cognitive neuroscience of language; Computational linguistics; EEG; MEG; fMRI; Probabilistic language models; Information theory; Surprisal; Entropy

Abstract: Cognitive neuroscientists of language comprehension study how neural computations relate to cognitive computations during comprehension. On the cognitive part of the equation, it is important that the computations and processing complexity are explicitly defined. Probabilistic language models can be used to give a computationally explicit account of language complexity during comprehension. Whereas such models have so far predominantly been evaluated against behavioral data, only recently have the models been used to explain neurobiological signals. Measures obtained from these models emphasize the probabilistic, information-processing view of language understanding and provide a set of tools that can be used for testing neural hypotheses about language comprehension. Here, we provide a cursory review of the theoretical foundations and example neuroimaging studies employing probabilistic language models. We highlight the advantages and potential pitfalls of this approach and indicate avenues for future research.
⁎ Corresponding author at: Kapittelweg 29, 6525 EN Nijmegen, The Netherlands. E-mail address: [email protected] (K. Armeni).
http://dx.doi.org/10.1016/j.neubiorev.2017.09.001
Received 9 December 2016; Received in revised form 19 July 2017; Accepted 2 September 2017; Available online 5 September 2017
0149-7634/© 2017 Elsevier Ltd. All rights reserved.
K. Armeni et al. Neuroscience and Biobehavioral Reviews 83 (2017) 579–588
2.1. Probabilistic constraints in language processing

Probabilistic models of cognition have witnessed a surge of interest in recent years (Chater et al., 2006). In the domain of language, it has generally been recognised that the cognitive system is sensitive to distributional properties of the language input and that probabilistic constraints play a role in both early language acquisition and later language processing (Kuhl, 2010; Griffiths, 2011; Seidenberg, 1997). Empirical support for human sensitivity to statistical/probabilistic constraints at the level of words has been shown through the word frequency effect on word recognition, disambiguation, and ease of processing (see Jurafsky, 2002, for review and evidence). Additionally, the role of statistical/probabilistic constraints in language processing and production has been shown through the effect of contextual constraints, that is, as graded sensitivity of behavioural or neural measures (e.g., reading times or amplitudes of event-related potentials) to how constraining the prior context is on possible sentence continuations (Gibson and Pearlmutter, 1998).

The effects of contextual constraints and word probabilities are commonly interpreted as reflecting some form of graded prediction, expectation or anticipation in language comprehension (Huettig, 2015; see also Kuperberg and Jaeger, 2015, for a terminological remark). Word probability in sentences is normally measured by means of human judgments in the cloze task (Taylor, 1953). In this task, participants are presented with sentence contexts where the target word position is blank. They are asked to fill the blank with a plausible word. The cloze probability of a word is then determined by counting the number of participants that used the word to continue the sentence.

Word probability effects and effects of contextual constraints provide evidence that graded statistical/probabilistic constraints in the linguistic signal and linguistic experience more broadly impact the real-time human language processing system; however, in experimental settings the exact computations explaining such effects are often not modelled explicitly. In what follows, we provide a cursory review of how probabilistic information can be modelled and quantified formally.

In a nutshell, probabilistic language models are mathematical formalisms describing probability distributions over language data. One of the most common applications of probabilistic language models is in so-called sequence-prediction tasks. In the case of language, this means probabilistic models can be used for generating expectations about upcoming words given the words seen so far in a sentence (usually up to a limited length).

A distinction can be made between sequence-based language models that predict words based on sequences of past words—a domain also called "statistical language modelling"—and models that estimate the probability of a syntactic structure underlying the observed sequence of words, or the probability of the upcoming word given the syntactic parse so far (see also Section 2.2.1 and Fig. 2). The latter is a domain proper to "computational linguistics" and as such normally considered distinct from statistical language modelling; there is, however, a great deal of overlap between the two research domains (Rosenfeld, 2000).

For the sake of convenience, we will in this review subsume this distinction and use the single term "probabilistic language models", because the neuroimaging studies reviewed presently in Section 3 employ descriptions at both levels of granularity.

2.2.1. Common ways of estimating probabilities

How are language probabilities estimated? Three broader classes of models are commonly used in computational psycholinguistics: n-gram models, phrase structure grammars (PSGs), and neural networks.

N-gram models, also known as Markov models, represent the simplest architecture for estimating the probabilities. The term n-gram stands for any sequence of n items, where the model order n denotes the number of context words (n − 1) plus the word (the nth word) for which the probability is computed (Jurafsky and Martin, 2009). Therefore, a 4-gram model takes into account three preceding words in a sequence for computing the conditional probability of occurrence for the fourth word. The basis for computing these probabilities are the relative frequencies of co-occurrence of word sequences derived from the training data in language corpora. We add that an n-gram can stand for a sequence of actual words or, alternatively, of the syntactic categories of words (or parts-of-speech).

Apart from n-grams, probabilities of upcoming words can also be estimated by using either feed-forward (FNN) or recurrent neural network (RNN) architectures (Bengio et al., 2003; see De Mulder et al., 2015, for a recent review on RNNs). In these architectures, the words are not represented as symbol strings as with n-grams, but are instead converted into vector representations; each word is coded as a sequence of real numbers—a real-valued feature vector. These vector representations are given as input to a pre-specified number of neuron-like hidden units, where the activation of these units is given by mathematical functions and transformations applied to the word vectors.

In RNNs, the hidden units also receive recurrent input from the states encoded in previous steps (see Fig. 2, bottom right), which means any current state of the layer reflects the history (of an undetermined length) of past network states (e.g., representing sentential context in language tasks). During model training in word prediction tasks, the models adjust the weights (or parameters) assigned to each hidden unit and individual components of word vectors such that the difference between words predicted by the model and words that actually appear is minimized. The activations of the output units are rescaled such that the output vector can be interpreted as a probability distribution over words. Each unit's activation is the estimated probability that the corresponding word will appear next, given the word sequence presented to the model.

PSGs are sets of so-called rewrite rules relating a phrasal class (e.g., a noun phrase) to its constituent parts-of-speech (e.g., a determiner, an adjective, and a noun) and to the actual word strings (e.g., "a red hat"). A PSG therefore provides, by sequentially applying the rewrite rules in a process called derivation, the structural description underlying a given sequence of words. A probabilistic (or stochastic) PSG assigns a probability to a syntactic parse given a surface-level string, or to the upcoming word given the syntactic parse so far (Roark, 2001; see also Fig. 2, top right). The probability of the entire parse is determined as a joint probability of all rewrite rules needed to generate the complete parse. The probabilities of the rewrite rules are determined from occurrences in syntactically-annotated corpora known as tree-banks (see, e.g., Marcus et al., 1993).

2.2.2. Context-boundedness and representations

To see how these classes of models compare to one another, it is useful to consider their characteristics along two key dimensions: whether there is a limit to the amount of context considered for computing the conditional probabilities (boundedness), and what the nature of the representations is over which these models compute. This gross classification by boundedness and representations is schematically depicted in Fig. 2.

In terms of the amount of context that can be taken into account for estimating the probabilities, models fall into either the category of bounded or unbounded models (represented column-wise in Fig. 2). Bounded models impose a finite bound on the length of the preceding context considered; model classes with a bounded limit are the n-gram models and FNNs, where the probabilities are conditioned on a fixed number of preceding words. RNNs and PSGs, on the other hand, are unbounded models. An RNN's hidden layer activation depends on the entire input string so far (Fig. 2, bottom right), whereas in PSGs the current word can depend on words at any earlier point, which makes it possible to model long-distance dependencies between words—a hallmark of language structure.

The second classification dimension is the nature of the representations over which the models compute (represented row-wise in Fig. 2); specifically, the representations can be either symbolic or vector-based (the latter are also termed analogue, continuous, or distributed representations). N-grams and PSGs fall into the first category, whereas FNNs and RNNs operate over continuous or distributed vector word representations. The critical difference between the two types of representations is that symbolic representations (e.g., the word strings "dog" and "cat") can only be equal or unequal, with no inherent measure of similarity apart from the relationship reflected in the frequency of co-occurrence; in contrast, numerical vector representations in neural networks can be compared using a similarity measure. For example, because every vector has a direction in a vector space, the distance between two word vectors (quantifying the semantic distance between the two words encoded by these vectors) can be computed mathematically as a function of the angle between the two vectors (a smaller angle indicates more closely related words).
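To make the angle-based comparison concrete, the following sketch computes the cosine of the angle between two word vectors. This is a minimal illustration with made-up three-dimensional vectors; vectors learned by FNNs or RNNs typically have tens to hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: values near 1.0
    indicate vectors pointing in similar directions (closely related
    words); values near 0.0 indicate unrelated words."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors for illustration only; real word vectors
# are learned from corpora during model training.
dog = [0.8, 0.3, 0.1]
cat = [0.7, 0.4, 0.2]
hat = [0.1, 0.9, 0.6]

print(cosine_similarity(dog, cat))  # high: small angle, semantically close
print(cosine_similarity(dog, hat))  # lower: larger angle, more distant
```

Note that with symbolic representations such a graded comparison is impossible: the strings "dog" and "cat" are simply unequal.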
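Bridging model estimation and the complexity metrics discussed next, the sketch below estimates bigram (2-gram) conditional probabilities as relative co-occurrence frequencies in a toy corpus and derives from them the word surprisal (Eq. (1)) and word entropy (Eq. (2)) measures defined in the following section. This is a minimal, unsmoothed illustration; practical n-gram models are trained on large corpora and use smoothing to assign probability to unseen sequences.

```python
import math
from collections import Counter

# Toy corpus; real models are estimated from corpora of millions of words.
corpus = "the dog barks the cat meows the dog sleeps".split()

# Relative frequencies of co-occurrence: bigram and context counts.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def p(word, context):
    """Conditional bigram probability P(word | context)."""
    return bigrams[(context, word)] / contexts[context]

def surprisal(word, context):
    """Eq. (1): S(w_t) = -log2 P(w_t | context), expressed in bits."""
    return -math.log2(p(word, context))

def entropy(context):
    """Eq. (2): expected surprisal over possible next words, in bits."""
    return -sum(p(w, c) * math.log2(p(w, c))
                for (c, w) in bigrams if c == context)

# "the" is followed by "dog" twice and by "cat" once in the toy corpus:
print(p("dog", "the"))          # 2/3
print(surprisal("dog", "the"))  # -log2(2/3) ≈ 0.58 bits
print(entropy("the"))           # ≈ 0.92 bits of next-word uncertainty
```

The same two functions applied to part-of-speech sequences instead of word strings would yield the unlexicalized variants of the metrics discussed below.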
On the basis of probabilities estimated with the probabilistic models described above, it is possible to compute the amount of information conveyed by each word in a sequence. This is quantified with information-theoretic complexity metrics such as word surprisal and word entropy. A complexity metric is any measure quantifying hypothesized processing difficulty at the current word and need not be probabilistic; the number of nodes traversed in a hierarchical syntactic derivation is another example of a metric capturing comprehension difficulty (Gibson and Thomas, 1999). For a complete treatment of information-theoretic complexity metrics specifically, we point the reader to a recent review by Hale (2016); here, we provide a brief overview to establish the necessary coherence with the rest of the paper.

Surprisal is an information-theoretic measure quantifying how unexpected, and thus how informative, the current word (wt) is given the words that precede it (w1, …, wt−1). A higher word surprisal value indicates that the currently encountered word is less expected given the context. In mathematical terms, surprisal S(wt) is defined as the negative logarithm of the word's conditional probability of occurrence:

S(wt) = −log P(wt | w1, …, wt−1)    (1)

If the base-2 logarithm is used, surprisal is expressed in bits. The same is true for the word entropy information measure, which quantifies how narrow or spread-out the probability distribution of possible next words is. If taken as a measure of cognitive effort, it models the degree of the listener's or reader's uncertainty about the upcoming word given the words encountered so far. Higher entropy values represent a higher degree of uncertainty (due to a higher number of possible candidate continuations), whereas lower entropy values signify a higher degree of certainty, with fewer, highly probable continuations given the context so far. Mathematically, entropy at the current word position H(t) is defined as the expected value of surprisal for the upcoming word (wt+1) given the words encountered so far (w1, …, wt):

H(t) = − Σ_{wt+1 ∈ W} P(wt+1 | w1, …, wt) log P(wt+1 | w1, …, wt)    (2)

where W denotes the set of all possible words.

Above, we introduced surprisal and entropy as defined over actually observed words in sentences; however, both metrics can also be computed on the basis of words' parts-of-speech (Frank, 2010) or syntactic structures as obtained from probabilistic grammars (Hale, 2003; Roark et al., 2009). If the models take into account the actually observed words, a metric is said to be lexicalized, whereas in the case of an unlexicalized metric, only structural probabilities or probabilities of parts-of-speech are used for computing complexity (Demberg and Keller, 2008). In other words, unlexicalized complexity metrics are not concerned with lexical-semantic properties of the language input. However, additional assumptions are required on the type of syntactic structures plausibly involved in human comprehension (Hale, 2003; Frank, 2013).

In addition to surprisal and entropy, another relevant complexity metric is entropy reduction. Originally, Hale (2006) defined the entropy reduction resulting from integrating word wt into the derivation of the sentence so far as the amount by which uncertainty about the complete sentence's structure gets reduced by excluding structures incompatible with wt. In practice, however, estimating the probabilities of all possible sentence structures is not feasible. For this reason, the scope of the entropy computation has been reduced to, for example, the possible sentence continuations (Wu et al., 2010), a subset of the upcoming four words (Frank, 2013), or even just the single next word (Roark et al., 2009).

In brief, cognitive neuroscience and probabilistic language modelling conceptually share a common point in emphasizing information-processing and probabilistic aspects of language comprehension. We now turn to the literature where probabilistic language models were used to analyze neural measures of interest.

Until recently, probabilistic language models were predominantly tested against behavioural data, such as grammaticality judgments, self-paced reading times, and eye-movements (e.g., Boston et al., 2008; Demberg and Keller, 2008; Frank and Bod, 2011; Linzen and Jaeger, 2014; Lau et al., 2017). The use of probabilistic language models in the cognitive neuroscience of language comprehension represents a recent trend; here we review example studies where probabilistic language models were used word-by-word to quantify complexity in sentence or story comprehension tasks. We begin by reviewing studies where information measures represented the predictor of interest and continue with those where they were used as an additional predictor to non-probabilistic complexity measures.

3.1. Information measures as the predictor of interest

Given that word surprisal and entropy quantify different aspects of the incoming linguistic signal, Willems et al. (2016) used 3-gram language models and asked whether the two measures yield distinct loci of activation in the brain while participants listened to auditory narratives. Word entropy correlated negatively with the blood oxygen level dependent (BOLD) signal in the right inferior frontal gyrus, the left ventral premotor cortex, left middle frontal gyrus, supplementary motor area, and the left inferior parietal lobule, whereas word surprisal showed positive correlations bilaterally in the superior temporal lobes and in a set of (sub)cortical regions in the right hemisphere (see Fig. 3). These results were interpreted within the predictive coding framework; regions sensitive to entropy were taken to reflect active predictions of the coming words (predictions are possible in low-entropy states), and areas related with word surprisal (how surprising the current word is) were interpreted as possibly reflecting prediction errors in the early auditory areas.

As explained in Sections 2.2.1 and 2.3, language probabilities and complexity metrics can also be computed on the basis of syntactic structures. Henderson et al. (2016) used the probabilistic phrase structure parser by Roark (2001) to study the cortical infrastructure sensitive to syntactic surprisal during naturalistic comprehension. The authors simultaneously measured BOLD responses and eye-movements while participants silently read stories in paragraphs. A whole-brain comparison between word groups with high and low syntactic surprisal revealed significant differences in the inferior frontal gyrus bilaterally, left anterior temporal lobes (under a less conservative statistical threshold), bilateral insula, fusiform gyrus, and the putamen. There were no statistically significant predicted differences in the superior temporal lobes or the superior temporal sulcus.

The authors discuss the results as in line with current neurobiological models that place the cortical systems for syntactic computations in inferior frontal and anterior temporal cortices. It is interesting to note that the eye-tracking data revealed no differences for the syntactic surprisal contrast; this stands in contrast to previous reports showing relations between syntactic surprisal metrics and eye movements (e.g., Boston et al., 2008; Demberg and Keller, 2008). The authors speculate that the novel use of a lexicalized syntactic surprisal—as opposed to the unlexicalized syntactic surprisal used in previous reports—might be a possible source of the discrepancy.

In cognitive electrophysiology, one of the most studied signals is the event-related potential (ERP): time-averaged voltage deflections reflecting an integrated (summed) response of large populations of spatially and temporally coherent cortical pyramidal neurons (Luck, 2005). Under the assumption that those models and complexity metrics that best explain the data also more closely resemble putative cognitive mechanisms, Frank et al. (2015) computed word surprisal and entropy reduction of words and their parts-of-speech under three types of models: n-grams (n = 2, 3, and 4), PSGs, and RNNs.

Out of all the possible relations between word information measures
and six candidate ERP component amplitudes from an exploratory analysis, the word surprisal measure computed on the basis of 4-grams and RNNs significantly improved the fit of the regression model to the N400 ERP amplitude over and above PSGs, but not vice versa; that is, the inclusion of hierarchical syntactic information in the models was not reflected in a better statistical fit. In terms of mechanistic interpretation, the authors take this result as compatible with the lexical retrieval account of the N400 component (Kutas and Federmeier, 2000).

3.2. Information measures as additional predictor

The studies reviewed above looked exclusively at the effects of information measures computed by probabilistic language models. We now turn to studies where such measures are investigated in addition to non-probabilistic measures of complexity.

Brennan et al. (2016) investigated the neural correlates of syntactic complexity during naturalistic comprehension. Comprehension difficulty was characterized with n-grams, PSGs, and minimalist grammars (a formal grammar that accounts for syntactic phenomena not accounted for by PSGs). A stepwise inclusion of progressively more "syntactically sophisticated" language predictors improved the statistical fit to BOLD time courses in the bilateral anterior temporal lobes, left inferior frontal gyrus, left posterior temporal lobe, left inferior parietal lobule, and left premotor area. When taken on their own, the 2- and 3-gram surprisal measures revealed significant effects in the anterior temporal lobes, left inferior frontal gyrus, and the left posterior temporal lobe.

Based on the fact that models including knowledge of hierarchical syntax explained variance over and above the models that incorporate only linear, word sequence-based statistics, the authors take their results as evidence for the involvement of abstract syntactic linguistic knowledge in everyday sentence comprehension. The effects of surprisal are in part consistent with the results by Willems et al. (2016), who similarly report a word surprisal effect in the posterior temporal lobe.

Nelson et al. (2017) investigated modulations of average high-frequency (70–150 Hz) power in intracranially recorded electrophysiological signals by hypothesized syntactic phrase-structure building operations during a word-by-word sentence reading task. In a model-comparison analysis, they contrasted the explanatory power of non-probabilistic hierarchical syntactic predictors (counting the number of open syntactic nodes at the moment when each word was presented) and probabilistic language models. The former showed significant effects in several left superior temporal and inferior frontal electrode sites, whereas lexical and part-of-speech bigram surprisal (i.e., transition probability) and next-word entropy showed positive and negative effects, respectively, in electrodes surrounding the left posterior superior and middle temporal gyri.

Based on these results, the authors argue in favor of the neurophysiological reality of hierarchical syntactic operations during sentence comprehension. They interpreted the probabilistic predictability effects as consistent with other reports localizing the neural generators of single-word semantic priming, N400, and repetition suppression effects to posterior temporal regions.

van Schijndel et al. (2015) investigated the role of syntactic memory load during auditory story comprehension. The strength of spectral coherence of MEG oscillatory neural activity in the 10 Hz (alpha-band) range was taken as a neural indicator of increased working memory usage. Syntactic complexity was quantified as the number of incomplete syntactic structures maintained at any word position (depth of syntactic embeddedness estimated based on the most likely parse of a probabilistic PSG). N-gram probability predictors and a PSG surprisal predictor were used as control measures.

The authors report that the average alpha-band coherence in a pair of left posterior and anterior sensors was significantly different for two levels of syntactic depth while controlling for n-gram probability effects; 3-gram probability showed marginal alpha-band coherence effects prior to correcting for multiple comparisons. Similar to the interpretations by Brennan et al. (2016) and Nelson et al. (2017), the authors interpreted the results as showing that hierarchical linguistic structure is computed during comprehension, because it improves the fit to empirical data over competitive non-hierarchical models.

Finally, apart from regression-based analyses and factorial designs, the statistical relationship between neural data and language model output can also be ascertained by means of multivariate statistical techniques, for example, by using features of a language model in an intermediate step for decoding stimulus identity from multivariate neural data. Wehbe et al. (2015) report that binary word classification accuracy based on MEG amplitudes, which in turn were predicted by RNN output vectors (interpreted as word probabilities), was highest approximately 400 ms after word onset, which can be seen as consistent
with results by Frank et al. (2015), who found a positive correlation between lexical surprisal and the N400 amplitude. On the basis of the time-course of classification accuracy, the authors linked the late effect of word probability to word integration processes (which differ between unpredictable and predictable words).

3.3. Summary

Current applications of probabilistic language models in cognitive neuroscience show that probabilistic language models can be used with hemodynamic and electrophysiological methods and allow researchers to investigate and focus on spatial fingerprints of specific linguistic computations in cortical regions (Willems et al., 2016; Henderson et al., 2016), or to compare predictions of different models against each other on the basis of the same neurobiological data, be it fMRI time courses (Brennan et al., 2016), language event-related M/EEG components (Frank et al., 2015; Wehbe et al., 2015), or the spectral contents of electrophysiological signals (van Schijndel et al., 2015; Nelson et al., 2017). The studies employed language stimuli in both auditory and visual modalities and, with the exception of the studies by Frank et al. (2015) and Nelson et al. (2017), used language stimuli in naturalistic, narrative contexts. We now turn to a more detailed discussion of the specific advantages and disadvantages of the approach.

4. Advantages

4.1. Formalized cognitive computations

What can we expect to learn from model-based analyses? Probabilistic language models represent the computational level of explanation in cognitive neuroscience in the time-honoured sense of Marr (1982): What aspect of the language input enters into the computation? What is being computed and why? Quantitative methods represent a complement to subtraction paradigms in neuroimaging (see Hagoort, 2014, for a recent review on sentence comprehension), where cognitive computations are inferred on the basis of informal, qualitative task-based cognitive contrasts.

Reading off cognitive computations from tasks is not straightforward (Boone and Piccinini, 2016) in that it must first be assured that the task taps into the target linguistic computation and not, for example, meta-linguistic processes. This can be assured by comparing several informal task contrasts (see, e.g., Kaan and Swaab, 2002, for a discussion on task contrasts for syntactic computations) or by computationally modelling the task itself (see, e.g., Norris et al., 2000, for a model of phoneme monitoring). Once this is established, it is possible to draw links to the observed neural effects. In model-based approaches, however, markers of sentence-level cognitive computations, for example syntactic surprisal, are directly statistically related to neural signals.

From a methodological perspective, explicit mathematical definitions and computational implementations lead to a more rigorous and standardized quantification of independent variables, which reduces dependence on researchers' operationalizations of specific concepts (but see Section 5.1 for potential pitfalls related to the allures of formalization).

4.2. Theory evaluation

In other domains of cognitive neuroscience, such as decision-making and cognitive control, linking neural data to parameters of formal models has served as a fruitful way to overcome the impasse when competing models could not be distinguished based on overt behavioral responses alone (Forstmann et al., 2011). As such, model comparison proved to be a major contribution of combining model-based approaches with neurobiological data (Mars et al., 2012). Given the fast pace and incrementality of language comprehension processes, covert, online measures of comprehension difficulty such as eye-movement records have been a key component of empirical evaluations of competing models in reading and spoken comprehension (see Rayner, 1998; Huettig et al., 2011, for reviews).

Brain signals can similarly be considered as covert markers of online cognitive difficulty and as such taken as an empirical test bed for cognitive hypotheses implemented in language models. Whereas in language, compared to model-based approaches in other domains, neural measures do not necessarily represent exclusive diagnostic data for evaluating cognitive-computational theories, any neurophysiologically valid cognitive theory should ultimately account for neural measures, as these are closely linked to the underlying neural computations. As such, neural validation of cognitive theories provides cognitive-computational constraints on plausible neuronal computations (Mars et al., 2012; Palmeri et al., 2016; see also Section 5.5).

4.3. Statistical efficiency in analyses

In most current empirical applications of language models, complexity metrics are computed for all words in the experiment, which improves statistical sensitivity compared to the traditional experimental approach. For example, the three stories used by Willems et al. (2016) yielded approximately 3000 words, all of which were considered as separate trials in the analysis. This contrasts with the currently prevailing experimental approaches, where, most often, studies will only investigate neurobiological effects on target words in non-filler items. This of course follows from the logic of experimental designs; however, it also means that large stretches of neural data are collected without being inspected or considered in the analysis.

Further, probabilistic language models provide a quantification over a range of values, rather than only the extreme poles of the spectrum, which is common in subtraction-based designs (but see, e.g., Pallier et al., 2011, for an exception). In case of significant statistical dependence between variables, parametric variation gives stronger support to the actual workings posited by the model compared to factorial designs (Bechtel and Abrahamsen, 2010).

4.4. Naturalistic stimuli and data reuse

Apart from the explicitness and increased statistical sensitivity of research designs, there is another potential advantage of language modelling: it makes it easier to study brain responses to naturalistic stimuli (Brennan, 2016). Even though the study of language in its ecological setting has in certain cognitive traditions been regarded as an ill-advised enterprise on principled and practical grounds (Chomsky, 1959, 1995), it was highlighted as a necessary empirical step for studying the brain at the systems level (see Hasson and Honey, 2012; Small and Nusbaum, 2004).

The approaches reviewed here strike a balance between the two perspectives: while the computational part enables rigorous formalization of the cognitive hypothesis, the absence of a secondary task during the experiment enables the study of brain responses to more ecologically valid stimuli. Studying the brain in naturalistic settings is a desirable research approach (for a recent overview of challenges and developments, see the contributions in Willems, 2015); nevertheless, we hasten to add that it should complement established experimental approaches which capitalize on well-controlled task-based designs (see, e.g., Fetsch, 2016, for a recent opinion on the importance of experimental designs), for example because a specific cognitive hypothesis might not be available and implemented as a probabilistic language model.

It is also worth emphasizing that the absence of specific task constraints in the experimental design makes these types of neuroimaging data sets appropriate for reuse and sharing for analyses with new
language models that embody novel hypotheses; a component of contemporary research practice which is being actively recognized in the neuroscience community (Poldrack and Gorgolewski, 2014).

5. Limitations and pitfalls

Even though probabilistic modelling comes with evident advantages, it has, as is true for any methodological advancement, specific limitations. In light of the increasing acceptance of model-based analyses by experimental cognitive neuroscientists, it is important to render these pitfalls explicit.

5.1. Allures of formalization

Due to their computational implementation and quantitative nature, formally estimated language probabilities can be seen as representing a more objective estimate than measures of cloze probability obtained on the basis of subjective, human judgments (Staub, 2015). It is true that language models and complexity metrics improve the comparability between experiments and can be viewed as objective from that point of view.

Nevertheless, even for formal estimates the extent to which they capture the “ground truth” can be debated. Using complexity measures obtained from a single language model on experimental stimuli would be comparable to using the judgments of a single participant for quantifying measures of cloze probability (see also Smith and Levy, 2011, for discussion on the two types of language probabilities). The complementarity of the two ways of estimating probabilities is further underscored if we consider that in speech recognition tasks, for example, human judgments (providing knowledge not captured in the models alone) can be used to improve model performance (Rosenfeld, 2000).

Second, probabilistic language models describe probability distributions over words but do not model the human language acquisition trajectory. Specifically, models are trained on large amounts of language data, which does not correspond to how such knowledge is acquired by humans, who exploit a variety of other multimodal sensory and social cues (see Kuhl, 2010; Saffran, 2003, for reviews). From an explanatory perspective, it would therefore be inaccurate to implicitly treat models trained on collections of text as models of language acquisition.
5.2. Lexical confounds

All ways of estimating formal language probabilities, in one way or another, rely on observed frequencies of occurrence in collections of texts—language corpora. Together with the fact that complexity measures are computed on a word-by-word basis, this means that by construction probabilistic complexity measures are likely to correlate with well-known lexical nuisance variables in psycholinguistics, for example, lexical frequency (i.e., unigram probability), word length, phonological neighbourhood size, transitional probability (i.e., bigram probability), etc.

These lexical measures characterize separate aspects of words. For example, lexical frequency is a property of the word alone, whereas a 4-gram probability is conditioned on the three preceding words and therefore operationalizes context-dependent computations. Whereas both can be viewed as effects of “lexical predictability”, they can be related to distinct cognitive computations; for example, ease of lexical retrieval and expectation-based processing, respectively (Staub, 2015; Huettig, 2015; see also Kuperberg and Jaeger, 2015).
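The contrast can be made concrete with a toy computation: two words can have identical context-free unigram probabilities while differing sharply in context-dependent transitional probability (the miniature corpus below is invented for illustration; a bigram stands in for the higher-order n-grams discussed above):

```python
from collections import Counter

# Miniature corpus, invented for illustration.
tokens = "the dog saw a cat and the dog chased the cat".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def unigram_p(word):
    """Context-free lexical frequency: P(word)."""
    return unigrams[word] / len(tokens)

def bigram_p(word, prev):
    """Context-dependent transitional probability: P(word | previous word)."""
    return bigrams[(prev, word)] / unigrams[prev]

# "dog" and "cat" are equally frequent overall ...
print(unigram_p("dog") == unigram_p("cat"))            # True
# ... yet differ in how expected they are after "the".
print(bigram_p("dog", "the"), bigram_p("cat", "the"))
```

On realistic corpora the two measures correlate substantially, which is exactly why they act as mutual confounds in word-by-word analyses.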
Given that probabilistic language models afford the use of less experimentally constrained, naturalistic stimuli, confound variables must be controlled statistically. They should be included as covariates of no interest in regression-type analyses; for example, Frank et al. (2015) included word frequency, word length, and word position in the sentence as nuisance variables. Alternatively, in factorial designs, it must be ensured that experimental conditions are chosen such that they are matched on other lexical variables, as was done in Henderson et al. (2016). The list of potentially confounding variables can extend depending on the experimental settings; in an eye-tracking study, Demberg and Keller (2008), for example, also included eye-movement-specific variables (whether the previous word was fixated or not, launch distance, and fixation landing position) in addition to word length, word frequency, forward transitional probability, backward transitional probability, and word position in the sentence.
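In a regression framework, this amounts to letting the word-level predictor of interest compete with the nuisance variables for variance. A simulated ordinary-least-squares sketch (all variable values and effect sizes are invented; `signal` stands in for any word-locked neural measure, not for data from the studies cited):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words = 200

# Word-level variables (simulated; in practice taken from corpora and the stimulus).
log_freq = rng.normal(0, 1, n_words)      # lexical frequency (nuisance)
word_len = rng.normal(0, 1, n_words)      # word length (nuisance)
position = rng.normal(0, 1, n_words)      # position in sentence (nuisance)
surprisal = 0.5 * log_freq + rng.normal(0, 1, n_words)  # correlated with frequency by construction

# Simulated neural signal driven by both surprisal and frequency.
signal = 2.0 * surprisal - 1.0 * log_freq + rng.normal(0, 0.1, n_words)

# Design matrix: intercept, predictor of interest, covariates of no interest.
X = np.column_stack([np.ones(n_words), surprisal, log_freq, word_len, position])
beta, *_ = np.linalg.lstsq(X, signal, rcond=None)
print(dict(zip(["intercept", "surprisal", "log_freq", "word_len", "position"], beta.round(2))))
```

Because the nuisance variables are in the model, the surprisal coefficient reflects its unique contribution over and above frequency, length, and position, rather than their shared variance.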
5.3. Syntactic and semantic complexity

A distinction between abstract, syntactic computations and meaning-bearing semantic operations has been a cornerstone in the cognitive sciences of language and represents a theoretical framework for research in cognitive neuroscience (see, e.g., Friederici and Weissenborn, 2007; Kuperberg, 2007, for discussion). A word's frequency of co-occurrence is in principle governed by both its syntactic valence and its lexical–semantic relationships to neighbouring words. In terms of probabilistic language models, it is important to note that the lexical, word-based probabilistic language models (n-grams, RNNs) reviewed presently cannot tease the sources of semantic and syntactic complexity apart.

Whereas the issue of resolving semantic and syntactic influences at the level of words seems to be a technical rather than a principled one (see Padó et al., 2009; Frank and Vigliocco, 2011, for suggestions on formalizing syntactic versus semantic probabilities), at present lexical–semantic influences on probability estimates can be overcome by using predictors based on unlexicalized complexity measures derived from parts-of-speech n-gram models, as in Frank et al. (2015), or probabilistic PSGs, as in Henderson et al. (2016) and Brennan et al. (2016), rather than lexicalized metrics based on the actual words themselves.
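The unlexicalized strategy can be sketched by estimating the n-gram model over part-of-speech tags rather than words, so that two sentences with the same syntactic frame receive identical surprisal regardless of their words. A toy illustration (the tag inventory and hand-tagged sentences are invented; the studies cited used trained taggers, large corpora, and higher-order models):

```python
import math
from collections import Counter

# POS-tagged training sentences (hand-tagged toy data, standing in for a tagged corpus).
tagged = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def pos_bigram_model(tagged_sents, alpha=0.1):
    """Smoothed bigram model over POS tags: lexical identity is discarded."""
    uni, bi = Counter(), Counter()
    for sent in tagged_sents:
        tags = ["<s>"] + [tag for _, tag in sent]
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
    v = len(uni)
    return lambda tag, prev: (bi[(prev, tag)] + alpha) / (uni[prev] + alpha * v)

def pos_surprisal(prob, sent):
    """Per-word unlexicalized surprisal computed from the tag sequence alone."""
    tags = ["<s>"] + [tag for _, tag in sent]
    return [-math.log2(prob(t, prev)) for prev, t in zip(tags, tags[1:])]

prob = pos_bigram_model(tagged)
# Same syntactic frame (DET NOUN VERB), different words:
s1 = pos_surprisal(prob, [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")])
s2 = pos_surprisal(prob, [("a", "DET"), ("mouse", "NOUN"), ("squeaks", "VERB")])
print(s1 == s2)  # True: unlexicalized surprisal ignores the words themselves
```

By construction, such a predictor cannot be driven by lexical–semantic co-occurrence, which is what makes it useful for isolating syntactic sources of complexity.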
5.4. Linguistic levels of analysis

A hallmark of linguistic analyses is to view the language system as comprising different levels of linguistic granularity, minimally the phonological, lexical-semantic (word-based), and syntactic levels (Jackendoff, 2002). One of the important properties of language models and complexity metrics is that in practice these can be computed for each word in a sentence, capturing the incrementality of human sentence processing (Hale, 2016).

However, it must be emphasized that neural effects cannot always be assessed for all individual words. For example, the temporal evolution of the BOLD response as measured with fMRI is slower than the presentation rate of words. This limitation can be overcome, for instance, by performing linear regression with a regressor that differs on a word-by-word basis, such as perplexity or lexical frequency (see Yarkoni et al., 2008, for an illustration of this approach).
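One common way to relate such a word-by-word regressor to the sluggish BOLD signal is to place each word's value at its onset, convolve the resulting impulse series with a haemodynamic response function, and downsample to the scanner's acquisition rate. A minimal sketch (onsets, surprisal values, and the crude gamma-shaped HRF are all invented for illustration, not taken from any study reviewed here):

```python
import numpy as np

dt = 0.5                               # temporal resolution in seconds
t = np.arange(0, 20, dt)
hrf = (t ** 5) * np.exp(-t)            # crude gamma-shaped HRF, peaking around 5 s
hrf /= hrf.sum()                       # normalize to unit area

# Word onsets (s) and per-word surprisal values (invented for the sketch).
onsets = np.array([0.0, 0.6, 1.2, 1.8, 2.4])
surprisal = np.array([3.2, 1.1, 4.7, 2.0, 5.3])

duration = 30.0
stick = np.zeros(int(duration / dt))   # impulse ("stick") function at word onsets
stick[(onsets / dt).astype(int)] += surprisal

# Convolve impulses with the HRF to obtain the word-by-word fMRI regressor.
regressor = np.convolve(stick, hrf)[: stick.size]

# Downsample to the scanner's repetition time (e.g., TR = 2 s), one value per volume.
TR = 2.0
design_column = regressor[:: int(TR / dt)]
print(design_column.shape)
```

The resulting column enters the design matrix exactly like any conventional event regressor, even though the underlying events occur faster than the BOLD response itself.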
5.5. Explanatory status: maps or mapping?

Finally, it is worth touching upon the explanatory scope of the approach presented here. What constitutes an adequate account of explanation (in the sense of Craver, 2007) in cognitive neuroscience, and how to approach it, remains a debated topic and has received increased attention in cognitive neuroscience communities recently (see Pulvermüller et al., 2014; Embick and Poeppel, 2015; Jonas and Kording, 2017; Krakauer et al., 2017, for some recent discussions). It has been emphasized previously that localizing specific cognitive computations to circumscribed cortical areas does not in itself constitute a sufficient explanation (Poeppel, 2012).

Seeking a fit between probabilistically modelled cognitive states and neural data by means of a statistical model remains silent on the algorithmic and neural levels of explanation. Specifically, complexity metrics are estimators of comprehension difficulty and can provide evidence for or against cognitive theories to the extent that the latter make distinct predictions about where in a sentence the human cognitive system will experience difficulty (Martin, 2016). Currently, probabilistic models do not offer explanations in terms of how the cognitive (and neural) computation is achieved (but see Hale, 2011, for an algorithmic proposal). Clearly, any empirical success of probabilistic language models in explaining neural signals does not entail that mathematical formalisms, information measures, or language probabilities per se are instantiated in the brain (Jurafsky, 2002).

From the perspective of neurophysiological explanation, current fMRI-based applications stay within what has been dubbed the “cartographic imperative” (Poeppel, 2012), with the goal of tentatively localizing hypothesized computations to gross-level brain areas (as in Willems et al., 2016; Henderson et al., 2016). On the other hand, electrophysiological results predominantly inform cognitive theories (as in Frank et al., 2015; van Schijndel et al., 2015). However, it is becoming increasingly clear in cognitive and systems neuroscience that brain signals are not only indices representing diagnostic evidence for theories cast at the cognitive-computational levels of analysis, but are biophysically meaningful signals reflecting underlying neuronal computations and circuit configurations (Cohen, 2017) occurring at lower levels of spatio-temporal cortical organization (this is conveyed by the upper part of our schematic in Fig. 1). In this respect, electrophysiological methods represent a powerful tool, compared to hemodynamic methods, due to a closer link to electrophysiological events at lower spatial scales (as in Nelson et al., 2017, where high-frequency power is taken to reflect relevant neural computations).

Although the model-based analyses reviewed above can reveal what information content during comprehension makes a difference in terms of neural signals, this type of correlational “bridging” represents an initial step towards the more ambitious goal of describing the plausible neural computational principles that explain the mapping to hypothesized linguistic/cognitive computations and taxonomies (Dehaene et al., 2015; see also Marcus et al., 2014). If probabilistic computations at some level represent a valid cognitive hypothesis underlying the behaviour, this should provide constraints on the target neural computations, mechanisms, and algorithmic descriptions. Before concluding, we outline below some outstanding challenges that deserve further attention in the future.

6. Future challenges

Cognitive neuroscience shows that human listeners can integrate several sources of information to interpret an utterance (Hagoort and van Berkum, 2007). This translates into a long-standing challenge in the language modelling community: how can we bring probabilistic models to bear on larger linguistic units and contextually relevant information, for example by making use of discourse coherence in models of sentence comprehension, long short-term memory neural networks, etc. (e.g., Dubey et al., 2013; Hochreiter and Schmidhuber, 1997)?

Similarly, different classes of models perform with different success rates on empirical data. If a certain class of models (e.g., n-grams or PSGs) turns out to be consistently more successful empirically, what are the consequences for neurocognitive theories? Which aspect of the model architecture (the underlying cognitive hypothesis) or model training yields this difference compared to other models?

Theoretical and empirical investigations in psycholinguistics and cognitive neuroscience show that language processing operates at distinct representational and temporal scales, including, but not limited to, the levels of phonemes, words, sentences, and discourse (Jackendoff, 2002; Lerner et al., 2011). Typically, these stages are investigated in separate experiments with different experimental paradigms. Can probabilistic language models be used as a tool for investigating expectation-based processing at distinct representational and temporal levels of complexity concurrently, in a single experiment within the same dataset (e.g., Lopopolo et al., 2017)?

Regardless of the specific computational theory embedded in the models, effort should be spent on laying out the constraints on algorithmic and neurophysiological explanations (see Embick and Poeppel, 2015; Martin, 2016). How does probabilistic cognitive computation relate to the general principles of cortical organization for language and other cognitive-perceptual systems (e.g., Battaglia et al., 2012; Friederici and Singer, 2015)? What general property of cortical circuitry is required to explain any observed correlations and directions of the effects between probabilistic computation and neurobiological signals? More specifically, what neuronal circuit configuration and computation allows us to make a linking hypothesis to probabilistic cognitive computation? What statistical learning mechanisms must be in place to account for the development of probabilistic computation in language (as in Kumaran et al., 2016)?

Lastly, probabilistic language models reduce the dimensions of language comprehension by focusing on the properties of the linguistic signal alone. An important explanatory consideration of the what and the why of probabilistic language computation will eventually have to account for the pragmatic and communicative perspective on language understanding: What purpose would probabilistic language computation serve in models of pragmatic language understanding as probabilistic inference (Goodman and Frank, 2016)? What does probabilistic computation entail for the rapid and flexible human communicative behaviour in social and interactional settings (see, e.g., Levinson, 2015; Stolk et al., 2015)?

7. Conclusion

In the present paper, we provided a general overview of probabilistic language models, presented example applications in neuroscience studies, and discussed their advantages and disadvantages. The approach advocated here should be viewed as complementary to the established experimental paradigms in cognitive neuroscience. Probabilistic language models provide computationally implemented tools for evaluating cognitive theories on neural data and mapping cognitive computations to gross-level brain areas, and they offer tentative cognitive-computational explanations of electrophysiological responses. Future challenges lie in widening the scope of language models to meet the known characteristics of human linguistic-communicative capacities and in moving from brain mapping to linking specific cognitive explanations of macroscopic brain signals to plausible underlying neuronal computations.

Conflict of interests

None declared.

Acknowledgments

The work presented here was partly funded by the Netherlands Organisation for Scientific Research (NWO) Gravitation Grant 024.001.006 to the Language in Interaction Consortium and Vidi Grant 276-89-007 to Roel Willems. The authors would like to thank Tom Mitchell for valuable comments on a previous version of the manuscript.

References

Battaglia, F.P., Borensztajn, G., Bod, R., 2012. Structured cognition and neural systems: from rats to language. Neurosci. Biobehav. Rev. 36, 1626–1639. http://dx.doi.org/10.1016/j.neubiorev.2012.04.004.
Bechtel, W., Abrahamsen, A., 2010. Dynamic mechanistic explanation: computational modeling of circadian rhythms as an exemplar for cognitive science. Stud. Hist. Philos. Sci. A 41, 321–333. http://dx.doi.org/10.1016/j.shpsa.2010.07.003.
Bengio, Y., Ducharme, R., Vincent, P., Janvin, C., 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155. http://dx.doi.org/10.1162/153244303322533223.
Boone, W., Piccinini, G., 2016. The cognitive neuroscience revolution. Synthese 193, 1509–1534. http://dx.doi.org/10.1007/s11229-015-0783-4.
Boston, M.F., Hale, J.T., Kliegl, R., Patil, U., Vasishth, S., 2008. Parsing costs as predictors of reading difficulty: an evaluation using the Potsdam Sentence Corpus. J. Eye Movem. Res. 2, 1–12. http://dx.doi.org/10.16910/jemr.2.1.1.
Brennan, J., 2016. Naturalistic sentence comprehension in the brain. Lang. Linguist. Compass 10, 299–313. http://dx.doi.org/10.1111/lnc3.12198.
Brennan, J.R., Stabler, E.P., Van Wagenen, S.E., Luh, W.-M., Hale, J.T., 2016. Linguistic structure correlates with temporal activity during naturalistic comprehension. Brain Lang. 157–158, 81–94. http://dx.doi.org/10.1016/j.bandl.2016.04.008.
Chater, N., Tenenbaum, J.B., Yuille, A., 2006. Probabilistic models of cognition: conceptual foundations. Trends Cogn. Sci. 10, 287–291. http://dx.doi.org/10.1016/j.tics.2006.05.007.
Chomsky, N., 1959. A review of B. F. Skinner's Verbal Behavior. Language 35, 26–58.
Chomsky, N., 1995. The Minimalist Program. MIT Press, Cambridge, MA.
Cohen, M.X., 2017. Where does EEG come from and what does it mean? Trends Neurosci. 40, 208–218. http://dx.doi.org/10.1016/j.tins.2017.02.004.
Craver, C.F., 2007. Explaining the Brain: Mechanisms and the Mosaic Unity of Neuroscience. Clarendon Press/Oxford University Press, New York, NY.
De Mulder, W., Bethard, S., Moens, M.-F., 2015. A survey on the application of recurrent neural networks to statistical language modeling. Comput. Speech Lang. 30, 61–98. http://dx.doi.org/10.1016/j.csl.2014.09.005.
Dehaene, S., Meyniel, F., Wacongne, C., Wang, L., Pallier, C., 2015. The neural representation of sequences: from transition probabilities to algebraic patterns and linguistic trees. Neuron 88, 2–19. http://dx.doi.org/10.1016/j.neuron.2015.09.019.
Demberg, V., Keller, F., 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition 109, 193–210. http://dx.doi.org/10.1016/j.cognition.2008.07.008.
Devlin, J.T., Watkins, K.E., 2007. Stimulating language: insights from TMS. Brain 130, 610–622. http://dx.doi.org/10.1093/brain/awl331.
Dubey, A., Keller, F., Sturt, P., 2013. Probabilistic modeling of discourse-aware sentence processing. Top. Cogn. Sci. 5, 425–451. http://dx.doi.org/10.1111/tops.12023.
Embick, D., Poeppel, D., 2015. Towards a computational(ist) neurobiology of language: correlational, integrated and explanatory neurolinguistics. Lang. Cogn. Neurosci. 30, 357–366. http://dx.doi.org/10.1080/23273798.2014.980750.
Fetsch, C.R., 2016. The importance of task design and behavioral control for understanding the neural basis of cognitive functions. Curr. Opin. Neurobiol. 37, 16–22. http://dx.doi.org/10.1016/j.conb.2015.12.002.
Forstmann, B.U., Wagenmakers, E.-J. (Eds.), 2015. An Introduction to Model-Based Cognitive Neuroscience. Springer New York, New York, NY. http://dx.doi.org/10.1007/978-1-4939-2236-9.
Forstmann, B.U., Wagenmakers, E.-J., Eichele, T., Brown, S., Serences, J.T., 2011. Reciprocal relations between cognitive neuroscience and formal cognitive models: opposites attract? Trends Cogn. Sci. 15, 272–279. http://dx.doi.org/10.1016/j.tics.2011.04.002.
Frank, S.L., 2010. Uncertainty reduction as a measure of cognitive processing effort. In: Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics. Association for Computational Linguistics, Uppsala, Sweden, pp. 81–89. http://www.aclweb.org/anthology/W10-2010.
Frank, S.L., 2013. Uncertainty reduction as a measure of cognitive load in sentence comprehension. Top. Cogn. Sci. 5, 475–494. http://dx.doi.org/10.1111/tops.12025.
Frank, S.L., Bod, R., 2011. Insensitivity of the human sentence-processing system to hierarchical structure. Psychol. Sci. 22, 829–834. http://dx.doi.org/10.1177/0956797611409589.
Frank, S.L., Otten, L.J., Galli, G., Vigliocco, G., 2015. The ERP response to the amount of information conveyed by words in sentences. Brain Lang. 140, 1–25. http://dx.doi.org/10.1016/j.bandl.2014.10.006.
Frank, S.L., Vigliocco, G., 2011. Sentence comprehension as mental simulation: an information-theoretic perspective. Information 2, 672–696. http://dx.doi.org/10.3390/info2040672.
Friederici, A., Singer, W., 2015. Grounding language processing on basic neurophysiological principles. Trends Cogn. Sci. 19, 1–10. http://dx.doi.org/10.1016/j.tics.2015.03.012.
Friederici, A.D., 2012. The cortical language circuit: from auditory perception to sentence comprehension. Trends Cogn. Sci. 16, 262–268. http://dx.doi.org/10.1016/j.tics.2012.04.001.
Friederici, A.D., Weissenborn, J., 2007. Mapping sentence form onto meaning: the syntax–semantic interface. Brain Res. 1146, 50–58. http://dx.doi.org/10.1016/j.brainres.2006.08.038.
Gibson, E., Pearlmutter, N.J., 1998. Constraints on sentence comprehension. Trends Cogn. Sci. 2, 262–268. http://dx.doi.org/10.1016/S1364-6613(98)01187-5.
Gibson, E., Thomas, J., 1999. Memory limitations and structural forgetting: the perception of complex ungrammatical sentences as grammatical. Lang. Cogn. Process. 14, 225–248. http://dx.doi.org/10.1080/016909699386293.
Goodman, N.D., Frank, M.C., 2016. Pragmatic language interpretation as probabilistic inference. Trends Cogn. Sci. 20, 818–829. http://dx.doi.org/10.1016/j.tics.2016.08.005.
Griffiths, T.L., 2011. Rethinking language: how probabilities shape the words we use. Proc. Natl. Acad. Sci. U. S. A. 108, 3825–3826. http://dx.doi.org/10.1073/pnas.1100760108.
Hagoort, P., 2009. Reflections on the neurobiology of syntax. In: Bickerton, D., Szathmáry, E. (Eds.), Biological Foundations and Origin of Syntax. MIT Press, Cambridge, MA.
Hagoort, P., 2013. MUC (Memory, Unification, Control) and beyond. Front. Psychol. 4, 416. http://dx.doi.org/10.3389/fpsyg.2013.00416.
Hagoort, P., 2014. Nodes and networks in the neural architecture for language: Broca's region and beyond. Curr. Opin. Neurobiol. 28, 136–141. http://dx.doi.org/10.1016/j.conb.2014.07.013.
Hagoort, P., van Berkum, J., 2007. Beyond the sentence given. Philos. Trans. R. Soc. 362, 801–811. http://dx.doi.org/10.1098/rstb.2007.2089.
Hale, J., 2001. A probabilistic Earley parser as a psycholinguistic model. In: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies 2001 – NAACL '01. Association for Computational Linguistics, Morristown, NJ, USA, pp. 1–8. http://dx.doi.org/10.3115/1073336.1073357.
Hale, J., 2003. The information conveyed by words in sentences. J. Psycholinguist. Res. 32, 101–123. http://dx.doi.org/10.1023/A:1022492123056.
Hale, J., 2006. Uncertainty about the rest of the sentence. Cogn. Sci. 30, 643–672. http://dx.doi.org/10.1207/s15516709cog0000_64.
Hale, J., 2011. What a rational parser would do. Cogn. Sci. 35, 399–443. http://dx.doi.org/10.1111/j.1551-6709.2010.01145.x.
Hale, J., 2016. Information-theoretical complexity metrics. Lang. Linguist. Compass 10, 397–412. http://dx.doi.org/10.1111/lnc3.12196.
Hansen, P.C., Kringelbach, M.L., Salmelin, R. (Eds.), 2010. MEG: An Introduction to Methods. Oxford University Press, New York, NY.
Hasson, U., Honey, C.J., 2012. Future trends in neuroimaging: neural processes as expressed within real-life contexts. NeuroImage 62, 1272–1278. http://dx.doi.org/10.1016/j.neuroimage.2012.02.004.
Henderson, J.M., Choi, W., Lowder, M.W., Ferreira, F., 2016. Language structure in the brain: a fixation-related fMRI study of syntactic surprisal in reading. NeuroImage 132, 293–300. http://dx.doi.org/10.1016/j.neuroimage.2016.02.050.
Hickok, G., Poeppel, D., 2007. The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402. http://dx.doi.org/10.1038/nrn2113.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, 1735–1780. http://dx.doi.org/10.1162/neco.1997.9.8.1735.
Huettig, F., 2015. Four central questions about prediction in language processing. Brain Res. 118–135. http://dx.doi.org/10.1016/j.brainres.2015.02.014.
Huettig, F., Rommers, J., Meyer, A.S., 2011. Using the visual world paradigm to study language processing: a review and critical evaluation. Acta Psychol. 137, 151–171. http://dx.doi.org/10.1016/j.actpsy.2010.11.003.
Jackendoff, R., 2002. Foundations of Language. Oxford University Press, New York.
Jonas, E., Kording, K.P., 2017. Could a neuroscientist understand a microprocessor? PLOS Comput. Biol. 13. http://dx.doi.org/10.1371/journal.pcbi.1005268.
Jurafsky, D., 2002. Probabilistic modeling in psycholinguistics: linguistic comprehension and production. Prob. Linguist. 30, 1–50.
Jurafsky, D., Martin, J.H., 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed. Pearson/Prentice Hall, Upper Saddle River, NJ.
Kaan, E., Swaab, T.Y., 2002. The brain circuitry of syntactic comprehension. Trends Cogn. Sci. 6, 350–356. http://dx.doi.org/10.1016/S1364-6613(02)01947-2.
Krakauer, J.W., Ghazanfar, A.A., Gomez-Marin, A., MacIver, M.A., Poeppel, D., 2017. Neuroscience needs behavior: correcting a reductionist bias. Neuron 93, 480–490. http://dx.doi.org/10.1016/j.neuron.2016.12.041.
Kuhl, P.K., 2010. Brain mechanisms in early language acquisition. Neuron 67, 713–727. http://dx.doi.org/10.1016/j.neuron.2010.08.038.
Kumaran, D., Hassabis, D., McClelland, J.L., 2016. What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends Cogn. Sci. http://dx.doi.org/10.1016/j.tics.2016.05.004.
Kuperberg, G.R., 2007. Neural mechanisms of language comprehension: challenges to syntax. Brain Res. 1146, 23–49. http://dx.doi.org/10.1016/j.brainres.2006.12.063.
Kuperberg, G.R., Jaeger, T.F., 2015. What do we mean by prediction in language comprehension? Lang. Cogn. Neurosci. 3798, 1–70. http://dx.doi.org/10.1080/23273798.2015.1102299.
Kutas, M., Federmeier, K.D., 2000. Electrophysiology reveals semantic memory use in language comprehension. Trends Cogn. Sci. 12, 463–470. http://dx.doi.org/10.1016/S1364-6613(00)01560-6.
Lau, J.H., Clark, A., Lappin, S., 2017. Grammaticality, acceptability, and probability: a probabilistic view of linguistic knowledge. Cogn. Sci. 41, 1202–1241. http://dx.doi.org/10.1111/cogs.12414.
Lerner, Y., Honey, C.J., Silbert, L.J., Hasson, U., 2011. Topographic mapping of a hierarchy of temporal receptive windows using a narrated story. J. Neurosci. 31, 2906–2915. http://dx.doi.org/10.1523/JNEUROSCI.3684-10.2011.
Levinson, S.C., 2015. Turn-taking in human communication – origins and implications for language processing. Trends Cogn. Sci. 20, 6–14. http://dx.doi.org/10.1016/j.tics.2015.10.010.
Levy, R., 2008. Expectation-based syntactic comprehension. Cognition 106, 1126–1177. http://dx.doi.org/10.1016/j.cognition.2007.05.006.
Linzen, T., Jaeger, F., 2014. Investigating the role of entropy in sentence processing. In: Proceedings of the Fifth Workshop on Cognitive Modeling and Computational Linguistics. Association for Computational Linguistics, Baltimore, pp. 10–18. http://www.aclweb.org/anthology/W/W14/W14-2002.
Logothetis, N.K., 2008. What we can do and what we cannot do with fMRI. Nature 453, 869–878. http://dx.doi.org/10.1038/nature06976.
Lopopolo, A., Frank, S.L., van den Bosch, A., Willems, R.M., 2017. Using stochastic language models (SLM) to map lexical, syntactic, and phonological information processing in the brain. PLOS ONE 12, 1–18. http://dx.doi.org/10.1371/journal.pone.0177794.
Luck, S.J., 2005. An Introduction to the Event-Related Potential Technique. MIT Press, Cambridge, MA.
Marcus, G., Marblestone, A., Dean, T., 2014. The atoms of neural computation. Science 346, 551–552. http://dx.doi.org/10.1126/science.1261661.
Marcus, M., Marcinkiewicz, M., Santorini, B., 1993. Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19, 313–330.
Marr, D., 1982. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. MIT Press, Cambridge, MA. http://dx.doi.org/10.2307/2185011.
Mars, R.B., Shea, N.J., Kolling, N., Rushworth, M.F.S., 2012. Model-based analyses: promises, pitfalls, and example applications to the study of cognitive control. Q. J. Exp. Psychol. 65, 252–267. http://dx.doi.org/10.1080/17470211003668272.
Martin, A.E., 2016. Language processing as cue integration: grounding the psychology of language in perception and neurophysiology. Front. Psychol. 7, 120. http://dx.doi.org/10.3389/fpsyg.2016.00120.
Nelson, M.J., Karoui, I.E., Giber, K., Yang, X., Cohen, L., Koopman, H., Cash, S.S., Naccache, L., Hale, J.T., Pallier, C., Dehaene, S., 2017. Neurophysiological dynamics of phrase-structure building during sentence processing. Proc. Natl. Acad. Sci. U. S. A. 114, E3669–E3678. http://dx.doi.org/10.1073/pnas.1701590114.
Norris, D., McQueen, J.M., Cutler, A., 2000. Merging information in speech recognition: feedback is never necessary. Behav. Brain Sci. 23, 299–325 (discussion 325–370). http://dx.doi.org/10.1017/S0140525X00003241.
Padó, U., Crocker, M.W., Keller, F., 2009. A probabilistic model of semantic plausibility in sentence processing. Cogn. Sci. 33, 794–838. http://dx.doi.org/10.1111/j.1551-6709.2009.01033.x.
Pallier, C., Devauchelle, A.-D., Dehaene, S., 2011. Cortical representation of the constituent structure of sentences. Proc. Natl. Acad. Sci. U. S. A. 108, 2522–2527. http://dx.doi.org/10.1073/pnas.1018711108.
Palmeri, T.J., Love, B.C., Turner, B.M., 2016. Model-based cognitive neuroscience. J. Math. Psychol. 76B, 59–64. http://dx.doi.org/10.1016/j.jmp.2016.10.010.
Roark, B., Bachrach, A., Cardenas, C., Pallier, C., 2009. Deriving lexical and syntactic expectation-based measures for psycholinguistic modeling via incremental top-down parsing. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, vol. 1. Association for Computational Linguistics, Singapore, pp. 324–333. http://dx.doi.org/10.3115/1699510.1699553.
Rosenfeld, R., 2000. Two decades of statistical language modeling: where do we go from here? Proc. IEEE 88, 1270–1278. http://dx.doi.org/10.1109/5.880083.
Saffran, J.R., 2003. Statistical language learning. Curr. Dir. Psychol. Sci. 12, 110–114. http://dx.doi.org/10.1111/1467-8721.01243.
Seidenberg, M.S., 1997. Language acquisition and use: learning and applying probabilistic constraints. Science 275, 1599–1603. http://dx.doi.org/10.1126/science.275.5306.1599.
Small, S.L., Nusbaum, H.C., 2004. On the neurobiological investigation of language understanding in context. Brain Lang. 89, 300–311. http://dx.doi.org/10.1016/S0093-934X(03)00344-4.
Smith, N.J., Levy, R., 2011. Cloze but no cigar: the complex relationship between cloze, corpus, and subjective probabilities in language processing. In: Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, pp. 1637–1642.
Staub, A., 2015. The effect of lexical predictability on eye movements in reading: critical review and theoretical interpretation. Lang. Linguist. Compass 9, 311–327. http://dx.doi.org/10.1111/lnc3.12151.
Stolk, A., Verhagen, L., Toni, I., 2015. Conceptual alignment: how brains achieve mutual understanding. Trends Cogn. Sci. 20, 180–191. http://dx.doi.org/10.1016/j.tics.2015.11.007.
Taylor, W.L., 1953. ‘Cloze procedure’: a new tool for measuring readability. Journal. Mass Commun. Q. 30, 415–433.
van Schijndel, M., Murphy, B., Schuler, W., 2015. Evidence of syntactic working memory usage in MEG data. In: Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics. Association for Computational Linguistics, Denver, CO, pp. 79–88.
Wehbe, L., Ashish, V., Knight, K., Mitchell, T., 2015. Aligning context-based statistical
Poeppel, D., 2012. The maps problem and the mapping problem: two challenges for a models of language with brain activity during reading. In: Proceedings of the 2014
cognitive neuroscience of speech and language. Cogn. Neuropsychol. 29, 34–55. Conference on Empirical Methods in Natural Language Processing. Association for
http://dx.doi.org/10.1080/02643294.2012.710600. Computational Linguistics, Doha. pp. 233–243.
Poldrack, R.A., 2010. Mapping mental function to brain structure: how can cognitive Willems, R.M. (Ed.), 2015. Cognitive Neuroscience of Natural Language Use. Cambridge
neuroimaging succeed? Perspect. Psychol. Sci. 5, 753–761. http://dx.doi.org/10. University Press, Cambridge.
1177/1745691610388777. Willems, R.M., Frank, S.L., Nijhof, A.D., Hagoort, P., Van Den Bosch, A., 2016. Prediction
Poldrack, R.A., Gorgolewski, K.J., 2014. Making big data open: data sharing in neuroi- during natural language comprehension. Cereb. Cortex 26, 2506–2516. http://dx.
maging. Nat. Neurosci. 17, 1510–1517. http://dx.doi.org/10.1038/nn.3818. doi.org/10.1093/cercor/bhv075.
Pulvermüller, F., Garagnani, M., Wennekers, T., 2014. Thinking in circuits: toward neu- Wu, S., Bachrach, A., Cardenas, C., Schuler, W., 2010. Complexity metrics in an incre-
robiological explanation in cognitive neuroscience. Biol. Cybern. 108, 573–593. mental right-corner parser. In: Proceedings of the 48th Annual Meeting of the
http://dx.doi.org/10.1007/s00422-014-0603-9. Association for Computational Linguistics. Association for Computational Linguistics,
Rayner, K., 1998. Eye movements in reading and information processing: 20 years of Uppsala, Sweden. pp. 1189–1198. http://www.aclweb.org/anthology/P10-1121.
research. Psychol. Bull. 124, 372–422. http://dx.doi.org/10.1037/0033-2909.124.3. Yarkoni, T., Speer, N.K., Zacks, J.M., 2008. Neural substrates of narrative comprehension
372. and memory. NeuroImage 41, 1408–1425. http://dx.doi.org/10.1016/j.neuroimage.
Roark, B., 2001. Probabilistic top-down parsing and language modeling. Comput. 2008.03.062.
Linguist. 27, 249–276. http://dx.doi.org/10.1162/089120101750300526.
588