Computational Linguistics Notes
Corpus Linguistics
What's a corpus?
A finite collection of texts, stored in electronic form, collected in a systematic and controlled way,
homogeneous and representative (both qualitatively and quantitatively) wrt a certain linguistic domain.
Corpora can be classified according to different parameters: genericity, modality, time, language, coding, extension, representativity, closed vs monitored, integrity (full/partial texts).
Before using a corpus, text normalization must be carried out → convert the text into a more convenient form, with word expansion (splitting a string into words), tokenization (e.g., Byte-Pair Encoding), and lemmatization (determining roots) → the best alternative is the one requiring the fewest substitutions/deletions/insertions (minimum edit distance; see the sketch below).
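As a toy illustration of that criterion, the sketch below computes the minimum edit distance between two strings by dynamic programming; the function name and the unit costs are illustrative assumptions, not part of the notes.

```python
def edit_distance(source: str, target: str) -> int:
    """Minimum number of substitutions/deletions/insertions turning source into target."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # i deletions
    for j in range(m + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,            # deletion
                             dist[i][j - 1] + 1,            # insertion
                             dist[i - 1][j - 1] + sub)      # substitution (or match)
    return dist[n][m]

print(edit_distance("intention", "execution"))   # 5 with unit costs
```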
In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific linguistic domain.
Examples: CHILDES (1985): archive of spontaneous-speech transcriptions of child-directed speech with CHAT coding; Brown (1964): 1 million tokens, representative of written English, 15 categories, unannotated; Penn Treebank (1986): 1 million tokens, fully syntactically annotated with the standard TREEBANK II tagging style.
The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these
corpora are usually smaller, containing around one to three million words. But the depth of annotation
clearly depends on the intended use of the corpus.
Next-word probability: the probability of the next word is conditioned on the sequence of the previous words, and can be estimated from how often the next word follows that sequence in the corpus.
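A minimal sketch of this idea, assuming a toy corpus and bigram (two-word) counts; all names and the example sentence are illustrative:

```python
from collections import Counter

# Toy corpus; in practice this would be a large text collection.
corpus = "the house is red . the house is big . the garden is green .".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
contexts = Counter(corpus[:-1])              # counts of each word as a left context

def next_word_prob(prev: str, word: str) -> float:
    """P(word | prev), estimated as the relative frequency of the bigram."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

print(next_word_prob("the", "house"))   # 2/3 in this toy corpus
print(next_word_prob("is", "red"))      # 1/3
```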
Eg. PHRASE STRUCTURE GRAMMAR (Chomsky, 1965): ordered 4-tuple (Vt, Vn, →, {S}):
1. Vt = terminal vocabulary
2. Vn = non-terminal vocabulary; Vt ∪ Vn = V
3. {S} = subset of Vn, ROOT NODE
4. → = rewriting rule encoding both precedence and dominance relations; binary, asymmetric, transitive
Defined on V*: ∀A∈Vn, φAψ → φτψ for some φ, τ, ψ ∈ V*
DERIVATION: given two strings φ, ψ ∈ V*, there is a φ-derivation of ψ if φ →* ψ; if there is such a derivation, φ DOMINATES ψ (reflexive & transitive relation); the derivation is terminated if ψ ∈ Vt*
Given a grammar G, the language generated by it is L(G), the set of all possible strings φ for which a terminated S-derivation of φ exists
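A minimal sketch of a terminated S-derivation, assuming a toy grammar; the vocabulary and the leftmost-rewriting strategy are illustrative choices, not part of the notes:

```python
import random

# Toy PSG: Vt = {the, dog, cat, sees}; Vn = {S, NP, VP, Det, N, V}; root = S
rules = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["dog"], ["cat"]],
    "V":   [["sees"]],
}
terminals = {"the", "dog", "cat", "sees"}

def derive():
    """Rewrite the leftmost non-terminal until the string is in Vt* (terminated derivation)."""
    string = ["S"]
    while any(sym not in terminals for sym in string):
        i = next(i for i, sym in enumerate(string) if sym not in terminals)
        string[i:i + 1] = random.choice(rules[string[i]])   # apply a rewriting rule
    return " ".join(string)

print(derive())   # e.g. "the dog sees the cat" -- a string of L(G)
```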
Eg. Structural description (syntactic tree) ordered 5-tuple (V, I,D,P,A):
V= finite set of vertices
I= finite set of labels
D= dominance relation (weak) defined on V
P= precedence relation (strict order) defined on V
A= assignment function (from V to I)
REGULAR GRAMMARS: 4-tuples similar to PSGs (Vt, Vn, →, {S}) with some extra constraints:
all production rules have at most one non-terminal symbol on the right side of the rule
the position of the terminal symbol on the right side of the rule is consistent (because the elements there are ordered in a precedence relation, vs strict minimalism, which has only hierarchy), yielding left-recursive or right-recursive languages
Automata: mathematical computation models composed of states and transitions among states; conceivable in the form of graphs. A FINITE STATE AUTOMATON (FSA) is a machine composed of a set of states & possible transitions from one to the other; a 5-tuple <Q, Σ, Q0, F, δ> such that
Q → finite set of states
Σ → characters acceptable as input
Q0 → initial state such that Q0 ∈ Q
F → set of final states, F ⊆ Q
δ → transition function between states
For FS grammars, the language is a graph that connects words through transitions, without memory of the previous stages. They can be used to recognise strings/words or to represent a sentence/language.
RE = FSA = RG → they all describe REGULAR LANGUAGES
Are you able to define recursion? How do you implement it in Regular Expressions (RE), Finite State
Automata (FSA), Regular Grammars (RG)?
Recursion is the basic property of NL that allows us to make infinite use of finite means, or, in other words,
the ability to place one component inside another component of the same kind.
A RECURSIVE RULE is a rule in which the output of a first application can be fed as input to a second application of the rule. The same recursive mechanism is implemented
in RG: the same non-terminal element at the left and at the right of the rewriting rule: S → aR, R → bR, R → Ø
in RE: Kleene closure → all the possible concatenations of an element, including the null one: ab*
in FSA: inserting a loop on a node (Q0 —a→ Q1, Q1 —b→ Q1), as in the sketch below
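A minimal sketch of the same ab* language in both formats, assuming Python's re module for the regular expression and a dictionary-encoded transition function for the FSA (state names q0/q1 are illustrative):

```python
import re

# Recursion via Kleene closure in a RE: one a followed by any number of b's.
pattern = re.compile(r"^ab*$")
print(bool(pattern.match("abbb")), bool(pattern.match("ba")))   # True False

# The same language as an FSA: the b-loop on state q1 implements the recursion.
delta = {("q0", "a"): "q1", ("q1", "b"): "q1"}
initial, final = "q0", {"q1"}

def accepts(s: str) -> bool:
    state = initial
    for ch in s:
        if (state, ch) not in delta:
            return False                  # no transition defined -> reject
        state = delta[(state, ch)]
    return state in final

print(accepts("abbb"), accepts("ba"))     # True False
```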
CONTEXT FREE GRAMMARS (CFG) are RGs without restrictions on the right side. They only admit rules of the type A → γ, where γ is any sequence of terminal/non-terminal symbols.
They can express/describe SYNTACTIC AMBIGUITY (e.g. S → aSb, S → Ø)
CFG: an RG without restrictions on the right side. It has binary branching and allows for deep embeddings.
It is more powerful because it can capture counting recursion and it can describe syntactic ambiguity in a more straightforward way (rules with the same left-side symbol must be present to allow for ambiguity, and the problem is exponential: the more PPs, the more attachment possibilities), which requires the freedom of having nesting precedence not correlated with dominance (see the counting-recursion sketch below).
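A toy implementation of the counting-recursion CFG S → aSb, S → Ø (the language aⁿbⁿ, which no regular grammar/FSA can capture); the recognizer strategy is an illustrative choice:

```python
def generate(n: int) -> str:
    """Apply S -> aSb n times, then S -> Ø."""
    return "a" * n + "b" * n

def recognize(s: str) -> bool:
    """Accept a^n b^n by peeling off matched a/b pairs from the outside in."""
    while s:
        if s[0] == "a" and s[-1] == "b":
            s = s[1:-1]
        else:
            return False
    return True

print(generate(3))                               # aaabbb
print(recognize("aaabbb"), recognize("aabbb"))   # True False
```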
Discuss Chomsky's Hierarchy.
Restrictions on rewriting rules create formal grammar classes organized in an inclusion hierarchy:
Type 0 → unrestricted (Turing-equivalent) grammar → implemented by a Turing machine → recursively enumerable languages → no constraints on the rules
Type 1 → Context-Sensitive Grammar → linear-bounded automaton → production rules may be surrounded by a context of terminal and non-terminal symbols
Type 2 → Context-Free Grammar → pushdown automaton (PDA) → captures counting dependencies → rules of the form A → β (β = any sequence of terminal/non-terminal symbols)
Type 3 → Regular Grammar → FSA → at most one non-terminal symbol on the right side, in a consistent position
Natural Languages are considered by Chomsky to be Mildly Context-Sensitive Languages
Single-entry structure (ortho-phonetic, morphological, syntactic and semantic info; e.g. XML, DTD, TSV) vs
Global lexicon structure (idea: subcategorization might be related to semantic class – infer the semantic class on the basis of the hierarchical organization of items in an ontology; e.g. WORDNET, a semantic network organized by meaning, where each lexical concept is a synset, represented by its synonyms; polysemy is handled by creating two distinct synsets; not language-specific)
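A small sketch of querying WordNet through NLTK (assuming nltk is installed and the WordNet data has been downloaded); the word "bank" is just an illustrative polysemous example:

```python
# pip install nltk ; then: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Each sense of "bank" is a separate synset; its synonyms and hypernyms
# show the hierarchical organization of the lexicon.
for synset in wn.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())
    print("  synonyms :", synset.lemma_names())
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```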
Evidence for a structured lexicon from psycholinguistics: priming effects, pronunciation errors
Goal of morphology: recognize a well-formed string & decompose it into morphemes. Morphological
analysis can be applied for
1. INFORMATION EXTRACTION → roots, not tokens
2. KEYWORDS EXPANSION → strip inflection
3. STEMMING → retrieve the word root (stem); e.g. the PORTER STEMMING ALGORITHM, a set of cascaded FSTs
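A small usage sketch of the Porter stemmer as implemented in NLTK (assuming nltk is installed); the word list is illustrative:

```python
# pip install nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connected", "connecting", "relational", "studies"]:
    print(word, "->", stemmer.stem(word))
# connection / connected / connecting all reduce to the stem "connect"
```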
The URP (universal recognition problem) can be reduced to a SAT problem (the problem of determining if there exists an interpretation that satisfies a given Boolean formula), which is a non-deterministic (because of the problem space) polynomial-time problem (NP). It is one of those problems for which it is difficult to find a solution, but, given an oracle, the answer can be easily checked. Why is it an NP problem? Because of ambiguity! Several strings in NL can receive an ambiguous value (lexical: flies as V vs N; semantic: Italian vecchia, 'old' as adjective vs 'old woman' as noun; syntactic: PP attachment).
How is complexity expressed? Which dimensions are used for calculating the complexity of an algorithm?
The complexity of a problem is directly proportional to resource usage. In particular, given that a
computation is the relation between an input and an output, and that its completion is the reaching of a
final state, we have:
TIME COMPLEXITY → number of steps required to reach the output
SPACE COMPLEXITY → quantity of information to be stored at each step
As the complexity generally increases with the size of the input, the complexity order of the problem can be expressed in terms of input length (as representing the mapping between input & output).
Since the input can grow to infinity, as natural languages presumably do, the growth rate of the lexicon is crucial to determine the tractability/computability of the problem, that is, whether a procedure exists & terminates with an answer in a finite amount of time.
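A toy illustration (not from the notes) of how the number of steps grows with input length n for different complexity orders:

```python
# Step counts as a function of input length n for three complexity orders.
def linear_steps(n):        # O(n): e.g. scanning the input once
    return n

def quadratic_steps(n):     # O(n^2): e.g. comparing every pair of tokens
    return n * n

def exponential_steps(n):   # O(2^n): e.g. enumerating every possible reading
    return 2 ** n

for n in (5, 10, 20):
    print(n, linear_steps(n), quadratic_steps(n), exponential_steps(n))
# n=5: 5 / 25 / 32      n=10: 10 / 100 / 1024      n=20: 20 / 400 / 1048576
```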
Computational complexity doesn’t seem to be strictly related to psycholinguistic complexity, i.e., the
difficulty of processing a sentence. Rather, it might be due to a limited processing capacity (eg. you pay a
cost whenever you have to store an element and storing elements that have similar features might
generate confusion).
Limited-size stack (Yngve 1960): language processing uses a stack to store partial analyses; the more partial processings stored, the harder the processing (e.g. multiple embeddings)
Syntactic Prediction Locality Theory (Gibson 1998): the total memory load is proportional to the sum of required integrations + referentiality needs (a pronoun is less costly than a DP or a definite description).
In the INTERVENTION ACCOUNTS OF COMPLEXITY (Rizzi 1990), the processing difficulty is said to be proportional to the number and kind of relevant features shared between the moved item and any possible intervener. An example of long-distance dependency is that of Object Clefts. In relation to this, Warren & Gibson (2005) compare three types of DPs in Object Clefts (definite description vs proper name vs pronoun), and the reading times show that not only the type of intervener matters, but also its position: an integration cost alone is not enough (in Gibson's Linguistic Integration Cost, difficulty is proportional to the distance in terms of the number of intervening discourse referents, following a referentiality hierarchy). At the same time, intervention-based accounts are not gradable: standard bottom-up theories can only predict what creates complexity, but say nothing clear about the processing (e.g. why is the slowdown observed at the verb segment?).
SUB-SYMBOLIC (IMPLICIT) REPRESENTATION: its constituent entities are not representations themselves; it succeeds in having an idea of what a constituent is without having a rewriting rule; only input and output are coded as discrete/symbolic entities; no explicit notion of category
It owes to neurobiology: it comes from the analysis of our brain: we can parcel the brain into different functions (perceptive and performative), a dynamic functional system, and the system's complexity is an emergent property of simple interactions among parts
Semantics is highly non-compositional, but maximally affected by context
This idea was implemented in computer processing → PARALLEL DISTRIBUTED PROCESSING: no explicit representation of competence (what we know), rather a procedure (connections between simple entities), because our brain has memory, which allows us to have a processing model
The goal is to predict the complexity of the system as an EMERGENT PROPERTY → a complex behaviour carried out by simple elements
Useful for COMPLEX PROBLEMS → we don't have a representation of the problem space (no idea about the initial state etc.), algorithmic solutions are too complex → we find an approximation of the solution (shortcuts; the machine can figure it out by itself)
Just as neurons are the basic components of the CNS, ARTIFICIAL NEURONS are the elementary units in an
artificial neural network; their interaction might be extremely complex (emergent property), almost brain-
like. The artificial neuron is a simple processing unit linked by weighted connections. It receives one or
more inputs and sums them to produce an output. Usually each input is separately weighted, and the sum
is passed through an activation function.
a0 … an are independent input activations; the w are the weights (inhibitory or excitatory connections); the net function is the weighted sum of all the activations, net = Σi wi ai, which is then passed through the activation function.
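A minimal sketch of such a unit, assuming a logistic (sigmoid) activation function; the specific weights and inputs are illustrative:

```python
import math

def neuron(activations, weights, bias=0.0):
    """Weighted sum of the inputs (net input) passed through a sigmoid activation.
    Positive weights act as excitatory connections, negative weights as inhibitory."""
    net = sum(w * a for w, a in zip(weights, activations)) + bias
    return 1.0 / (1.0 + math.exp(-net))

print(neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.8]))   # ~0.79
```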
Learning proceeds by GRADIENT DESCENT: you have to minimize the error, finding a balance of the weights that yields the lowest possible error.
LOCAL MINIMUM: the best possible solution given a certain context (not the globally optimal solution); either you enlarge the modification span, or you stay in the local minimum. Local minima make the problem very complex; sometimes the network does not learn because of this.
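A minimal one-dimensional sketch of gradient descent on a toy error surface E(w) = (w − 3)²; the learning rate and number of steps are illustrative, and a surface with several valleys could trap this same search in a local minimum:

```python
def gradient(w):
    return 2 * (w - 3)          # dE/dw for E(w) = (w - 3)^2

w, learning_rate = 0.0, 0.1
for step in range(50):
    w -= learning_rate * gradient(w)   # move against the gradient to reduce the error

print(round(w, 3))              # close to 3.0, the weight with minimal error
```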
How do you code inputs and outputs with artificial neural networks?
The neural network is composed of a huge number of neurons capable of generating electrical pulses or
spikes at a great speed, and it seems reasonable to assume that the pattern of firing is used to code
information. How is the pattern activity used as a code?
LOCALIST CODING: separate representations (=nodes) coding for distinct pieces of information, so
that 1 word = 1 node (e.g. 4 input units: a (0001) b (0010) c (0100) . (1000))
o Each unit is assigned a particular meaning: identifying “apple” would involve the activation
of neurons that code “apple”
o Typical of a more symbolic approach: cognition as formal manipulation of symbols with
explicit symbols to represent WORDS AND CONCEPTS
DISTRIBUTED CODING: information is coded as a pattern of activation across many processing
units, with each unit contributing to many different representations: identifying “apple” requires
the activation of many neurons, each of which is also involved in the coding of other concepts
o using a binary coding, we can use 2 bits, for representing 4 elements (a, b, c, d) that is, 2
input neurons (a=00, b=01, c=10, d=11)
o Usually associated with connectionism and a more sub-symbolic approach: cognition does
not require explicit symbolic codes, bc information is distributed across the whole system
o Advantage: they require fewer units, but at a price: there is no clear way to code the simultaneous
presentation of two stimuli, since it is difficult to tell which features should be bound
together, and the same neuron firing for two different encodings may introduce a bias in the
network; in this case a localist coding is better in order not to have a bias
(One may also have cases in between, but the essential difference revolves around the question of whether
the activity of individual units is interpretable.)
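A small sketch contrasting the two coding schemes and the superposition problem mentioned above; the symbol sets mirror the examples in the notes, the rest is illustrative:

```python
# Localist (one-hot) coding: one unit per symbol, 4 units for 4 symbols.
localist = {"a": [0, 0, 0, 1], "b": [0, 0, 1, 0], "c": [0, 1, 0, 0], ".": [1, 0, 0, 0]}

# Distributed (compact binary) coding: 2 units for 4 symbols,
# but each unit now takes part in the code of several symbols.
distributed = {"a": [0, 0], "b": [0, 1], "c": [1, 0], "d": [1, 1]}

# Presenting two stimuli at once: superimposing localist patterns keeps both
# items recoverable, superimposing distributed patterns does not.
print([x + y for x, y in zip(localist["a"], localist["b"])])       # [0, 0, 1, 1] -> a AND b
print([x + y for x, y in zip(distributed["a"], distributed["b"])]) # [0, 1] -> same as "b" alone
```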
What kind of linguistic phenomena have been studied using artificial neural network simulations?
PAST TENSE (Rumelhart & McClelland, 1986): we see a clear linguistic pattern: 1) a few high-frequency verbs are learned in crystallized forms (break), 2) an over-regularization phase (break → breaked), 3) irregular verb inflection is learned, with a smooth coexistence of irregular and over-regular verbs until only correct forms are used (break → broke, breaked → broke)
However, there is a difference between the network and human learning: tuning in a supervised way requires a lot of examples, whereas the child (apparently) does not have them: it only takes a couple of exposures. We don't need supervised learning to learn the past tense; this NN learns the past tense, but it needs feedback on its performance.
How do you feed simple recurrent networks (SRN) and what do you expect as output?
SRN can deal with sentence processing if the input is revealed gradually over time, rather than being
presented at once. The input structure is localist (1 neuron = 1 input): the first token in the stream is
presented to the input layer. Activation is propagated forward. The target pattern for the output layer is
the next item in the stream of tokens: the output is compared to the target, delta terms are calculated,
and weights are updated before stepping along to the next item in the stream.
After processing the first item in the stream, the state of the hidden units is 'copied' to the context layer, so that it becomes available as part of the input to the network on the next time step (the same intuition found in state-of-the-art recurrent models: reuse the hidden-layer activation at the next step).
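A minimal sketch of the forward pass of such an Elman-style SRN, assuming NumPy, random weights, and localist input/output; the layer sizes and token stream are illustrative, and no training step is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 4                 # localist coding: 1 unit per token
W_ih = rng.normal(size=(n_hidden, n_in))        # input   -> hidden
W_ch = rng.normal(size=(n_hidden, n_hidden))    # context -> hidden
W_ho = rng.normal(size=(n_out, n_hidden))       # hidden  -> output

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

context = np.zeros(n_hidden)                    # context layer, initially empty
for token in [0, 2, 1, 3]:                      # tokens presented one at a time
    x = np.zeros(n_in); x[token] = 1.0          # localist input pattern
    hidden = sigmoid(W_ih @ x + W_ch @ context) # input + memory of last hidden state
    output = softmax(W_ho @ hidden)             # distribution over the next token
    context = hidden.copy()                     # 'copy' hidden state for the next step
    print(token, "->", output.round(2))
```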
SRNs are useful for discovering patterns in temporally extended data. They are trained to associate inputs, together with a memory of the last hidden-layer state, with output states. In this way, for example, the network can predict what item will occur next in a stream of patterns.
GUESSING NEXT WORD paradigm: the network has learnt to perform a task if it expects something related to the input; in this way the network can learn a category implicitly
e.g. "the house is red" (auto-supervised learning)
Input = the
Output = house
Can SRNs learn recursive structures? The answer is yes. Let's look at the demands of learning each of the recursive structures:
Counting recursion: should be easiest since it does not have agreement constraints imposed on it
Centre embedding: requires developing a last-in-first-out memory/stack to store agreement information
Cross-dependency: requires a first-in-first-out memory/queue
Right branching: does not involve unbounded memory load
i) Bach et al. (1986) found that cross-dependencies in Dutch were easier to process than centre-embeddings in German. This is interesting since cross-dependencies cannot be captured by PSG rules and are typically viewed as more complex because of this. SRN performance fits human performance on similar constructions, which tells us that Chomsky's hierarchy is a poor predictor of processing complexity
ii) Thomas (1997) suggests that some ungrammatical sentences involving doubly centre-embedded object relative clauses may be perceived as grammatical; SRNs replicate these human results
iii) processing difficulty increases with the depth of right recursion, although to a lesser degree than for complex recursive constructions; this is replicated in SRNs
iv) counting recursion: Christiansen & Chater argue that such structures may not exist in natural language; rather, they may be closer to centre-embedded constructions – e.g. if-then pairs are nested in a centre-embedding order
How can a computer learn to solve NLP tasks without being explicitly programmed?
In general, while in symbolic approaches we have a system which we explicitly program with a set of rules
in order to do something, with MACHINE LEARNING we need no explicit programming. ML characterizes
any mechanical device which changes behaviour on the basis of experience to better perform in the future
Data mining: from data (registration of facts) to information (patterns underlying the data); it is the
application of machine learning (algorithm)
A computer program is said to learn from experience E with respect to some task T and some
performance measure P, if its performance on T, as measured by P, improves with experience E.
With ANNs, a network can eventually learn by itself how to solve NLP tasks without explicit teaching.
Elman (1993) gives an example of how this might be done, by making the network learn how to predict the
next word given previous context:
COMPLEX REGIMEN: you start training the network on simple sentences and gradually
increase the number of complex sentences, so that in the last round of learning you only have
complex sentences.
It is efficient, able to generalize its performance to novel sentences.
However, it does not represent how human children learn: children DO start from the simplest structures and gradually work up to the adult language, but they ARE NOT placed in an environment in which they initially encounter only simple sentences; they hear samples of all aspects of human language from the beginning
LESIONING (modification of the memory of the network at every epoch): the child changes during the learning period: working memory and attention span are initially limited and increase over time. We can model this in an ANN that is trained on complex data from the beginning, but whose context units are reset to random values after every 2-3 words, with the "working memory" then gradually extended until there are no more resets.
We obtain results just as good as with all the data available from the beginning; it allows the network to focus on simpler/local relations first
Discuss the vectorial representation of a document (in our case study a picture description) adding extra
features representing hesitation, false starts and complexity cues in phrase structure
Assumption: words can be defined by their distribution in language use, that is their neighbouring words
or grammatical environments. We can use this idea to define meaning as a point in space based on
distribution: VECTORS are n-dimensional entities which can represent sentences in a measurable space,
which makes comparison possible.
We can include not only raw frequencies (as in the bag-of-words approach), but also n-grams, POS annotations, number of syllables, syntagmatic breaks (number of hesitations after a function word), repetitions (number of duplicated syllabic patterns) and false starts before content words. We can take these features into account and see if they introduce an Information Gain. If so, we can include them in the vector.
But where do we draw the line between classes? We can try to find the SUPPORT VECTORS (the points of the two classes closest to the boundary, at an equal distance from it) and the maximum-margin hyperplane they define, or rather use a DECISION TREE, splitting on one feature and then, on that basis, expanding the tree with other features until we get a classification (see the sketch below).
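A minimal sketch with scikit-learn, assuming entirely hypothetical feature vectors and labels for the picture descriptions (the feature names, values, and class labels below are invented for illustration):

```python
# pip install scikit-learn
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical vectors: [tokens, hesitations after function words, repetitions, false starts]
X = [[120, 2, 1, 0],
     [115, 3, 0, 1],
     [ 80, 9, 4, 5],
     [ 75, 8, 5, 4]]
y = ["group_A", "group_A", "group_B", "group_B"]    # illustrative labels

svm = SVC(kernel="linear").fit(X, y)        # maximum-margin separating hyperplane
tree = DecisionTreeClassifier().fit(X, y)   # feature-by-feature splits

new_doc = [[100, 6, 3, 2]]
print(svm.predict(new_doc), tree.predict(new_doc))
print(export_text(tree, feature_names=["tokens", "hesitations", "repetitions", "false_starts"]))
```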