0% found this document useful (0 votes)
34 views

Constraint-Based Learning of Phonological Processes

This document describes research on learning phonological processes from data in an unsupervised manner. The goal is to infer human-readable descriptions of phonological rules from collections of related words. The approach uses constraint-based program synthesis to encode the learning problem into logic constraints that can be solved efficiently. It decomposes the problem into inferring underlying forms, learning sound changes, and learning contextual conditions. Evaluation on textbook problems and lexical databases shows the system, called S Y P HON, can generate accurate rules an order of magnitude faster than previous approaches.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views

Constraint-Based Learning of Phonological Processes

This document describes research on learning phonological processes from data in an unsupervised manner. The goal is to infer human-readable descriptions of phonological rules from collections of related words. The approach uses constraint-based program synthesis to encode the learning problem into logic constraints that can be solved efficiently. It decomposes the problem into inferring underlying forms, learning sound changes, and learning contextual conditions. Evaluation on textbook problems and lexical databases shows the system, called S Y P HON, can generate accurate rules an order of magnitude faster than previous approaches.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Constraint-based Learning of Phonological Processes

Shraddha Barke, Rose Kunkel, Nadia Polikarpova, Eric Meinhardt, Eric Baković, and Leon Bergen

{sbarke,wkunkel,npolikarpova,emeinhar,ebakovic,lbergen} @ucsd.edu

University of California, San Diego

Abstract 2. Inference must be unsupervised: phonological


processes are formalized as transformations
Phonological processes are context-dependent from (latent) underlying forms to surface forms
sound changes in natural languages. We
(rather than between surface forms).
present an unsupervised approach to learning
human-readable descriptions of phonological 3. Inference must be data-efficient: typically
processes from collections of related utter-
only a handful of data points are available.
ances. Our approach builds upon a technique
from the programming languages community
4. Inference must be fast: we envision linguists
called constraint-based program synthesis. We
contribute a novel encoding of the learning using our system interactively, tweaking the
problem into Satisfiablity Modulo Theory data and being able to see the inferred rules
constraints, which enables both data efficiency within minutes.
and fast inference. We evaluate our system
on textbook phonology problems and lexical Recently program synthesis has emerged as
databases, and show that it achieves high a promising approach to interpretable and data-
accuracy at speeds two orders of magnitude efficient learning (Ellis et al., 2015; Singh et al., 2017;
faster than state-of-the-art approaches. Verma et al., 2018; Ellis et al., 2018). In program
synthesis, models are represented as programs in
1 Introduction a domain-specific language (DSL), which allows
Phonological processes govern the way speech domain experts to impose a strong prior by designing
sounds in natural languages change depending on the an appropriate DSL. Program synthesis uses powerful
context. For example, in English verbs, the past tense constraint solvers to perform combinatorial optimiza-
suffix /d/ turns into [t] after voiceless consonants tion and find the least-cost program in the DSL that
(so the word “zipped” is pronounced [zIpt], while fits the data. Program synthesis has been previously
“begged” is pronounced [bEgd]). Linguists routinely used to tackle the problem of phonological rule
face the task of inferring phonological processes by learning (Ellis et al., 2015), however their work uses
observing and contrasting surface forms (pronun- global inference which scales poorly and hence does
ciations) of morphologically related words. To aid not satisfy requirement 4 (their system takes an hour
linguists with this task, we consider the problem of on average to solve a phonology textbook problem).
learning phonological processes automatically from In this work, we propose a novel inference tech-
collections of related surface forms. nique that satisfies all four core requirements. Our
This problem setting imposes four core require- key insight is that the problem of learning conditional
ments, which guide the design of our approach: rewrite rules can be decomposed into three steps:
inference of the latent underlying forms, learning
1. Inference results must be fully interpretable: the changes (rewrites), and learning the conditions.
our goal is to explain phonological processes Moreover, each of these problems can be encoded as a
exhibited by the data, not merely predict pro- constrained optimization problem that can be solved
nunciations of unseen words. Hence, our model efficiently by modern satisfiability modulo theories
takes the form of discrete, conditional rewrite (SMT) solvers (de Moura and Bjørner, 2008). Both
rules from rule-based phonology (Chomsky and the decomposition and the encoding into constraints
Halle, 1968). are contributions of this work. We implement this

6176
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing, pages 6176–6186,
Hong Kong, China, November 3–7, 2019. 2019c Association for Computational Linguistics
approach in a system called S Y P HON and show that it omitted, because it is irrelevant to the rule’s applica-
is capable of generating accurate phonological rules tion; formally, A, L, and R may each be empty feature
in under a minute and from just 5–30 data points. vectors, which are defined to match any phone.
Hereafter, we refer to the sequence LAR of
2 Background and Problem Definition the target and the context as the condition of the
rule. If the condition is empty, the rule applies
In this section, we illustrate phonological processes unconditionally. In addition to + and −, the values of
and the problem of phonological rule induction using features in the condition of the rule may be variables,
our running example of English verbs. which enforce that features have the same value in
different parts of the condition. For example,
2.1 Rule-Based Phonology
A → B / [αconsonant] [αconsonant]
Phonological features. Phones (speech sounds) are describes a rule which applies between pairs of
described using a feature system that groups similar- consonants and pairs of vowels, but not between a
sounding phones together. For instance, voiced con- consonant and a vowel.
sonants (consonants produced with vibrating vo-
2.2 Problem Definition
cal cords, like [z], [d], [b]) possess the features
+consonant and +voice, while voiceless consonants The input to our problem is a matrix of surface forms,
(like [s], [t], [p]) possess the features +consonant such as the one shown in Fig. 1, left. These forms are
and −voice. Each phone can be uniquely identified arranged into rows, corresponding to different stems,
by a feature vector: for example [−voice −strident and columns, corresponding to different inflections
+anterior −distributed] uniquely identifies the sound (in this case, the third-person singular and past tense
[t]. However, some phones may be uniquely identi- of English verbs). In the interest of space, we only
fied by several feature vectors, and not all feature vec- show four rows from this data set, but a typical input
tors correspond to phones (the feature system is redun- in a phonology textbook problem is only slightly
dant). For example, the feature vector [+low +high] larger and ranges from 5 to 30 rows.
does not correspond to any phones, as no phone can Given these data, our task is to infer the latent
have both a raised and a lowered tongue body. underlying forms for each of the words in the input
such that the resulting matrix of underlying forms
Phonological rules. In rule-based phonology, a
factorizes into stems and suffixes, and to learn a
phonological process is formalized as a conditional
sequence of phonological rules which, when applied
rewrite rule that transforms an underlying form
elementwise to the matrix of underlying forms,
of a word (roughly, the unique stored form of
reproduces the matrix of surface forms.
the word) into its surface form (the word as it is
This learned sequence of phonological rules
intended to be pronounced). In our English past tense
is generative in the following sense: given the
example, the underlying form /zIpd/—formed by
underlying form for a new word, such as /æskz/, we
concatenating the stem /zIp/ and past tense suffix
can deterministically apply these rules to generate
/d/—is transformed into the surface form [zIpt] by a
the surface form of that word, [æsks]. We use this
rule that makes an obstruent voiceless when it occurs
property to evaluate the accuracy of the rule set we
after a voiceless obstruent:
learned by holding out a portion of the words from
[−sonorant] → [−voice] / [−voice] the data, and then applying the rule to the underlying
In general, phonological rules have the form forms of those words, which were determined
A → B / L R, where all of A, B, L, and R are through phonological research.
feature vectors. The rule means that any phone that
matches A and occurs between two phones that match 2.3 Phonological Intuition
L and R, respectively, will be rewritten to match B The design of our system is informed by how linguists
(leaving the features not mentioned by B intact). A is solve the problem of phonological rule induction.
called the target of the rule, B is called the structural When a phonologist analyzes these data, they begin
change, and L and R are the left and the right by positing underlying forms that are likely to
contexts.1 In the example above, the right context is result in the simplest set of rules. For example, they
1
observe that the substring shared in each row is most
In this work we only consider a subset of strictly local k=3
rules (Chandlee et al., 2014), where either side of the context likely the stem, which surfaces without change; the
is restricted to at most a single phone. underlying suffix in the first column in Fig. 1 is likely

6177
Validate

/æsk/+/z/
Train

[zIps] [zIpt]
 
 [bEgz] [bEgd]  Xij R

∅ → [@] / [αstrident] [αstrident]

[mIs@z] [mIst]  S Y P HON [−sonorant] → [−voice] / [−voice]
[nidz] [nid@d]
Uij model (phonological rules)
input (surface forms)

/zIp/+/z/ /zIp/+/d/
 
 /bEg/+/z/ /bEg/+/d/ 
/mIs/+/z/ /mIs/+/d/ [æsks]
/nid/+/z/ /nid/+/d/
latent underlying forms

Figure 1: The general structure of the problem, shown concretely for English verbs.

/z/, which sometimes surfaces as [s] and other times be formalized as a context-free grammar:
as [@z]; and similarly, the underlying suffix in the R ⇒ R∗ R ⇒ C →C /C C
second column is likely /d/, which can change to [t] ∗
C ⇒ (V F ) V ⇒ +|− (1)
or [@d]. The choice of /z/ and /d/ as the underlying
suffixes is preferred to, say, /s/ and /t/, because this F ⇒ consonant | voice | ...
choice lets us explain all the observed data using only According to this grammar, R is a sequence of rules
three edits: /z/ → [s], /d/ → [t], and ∅ → [@]. R; each R is defined in terms of four feature vectors
The next step is to merge and generalize individual C; each feature vector is a sequence of pairs of
edits: the first two edits are both devoicing an feature values V and feature names F .
obstruent, so they can be merged into [−sonorant] Rewriting. We use CR and BR to denote the condi-
→ [−voice], while the last edit is an insertion and tion and structural change of a rule R, respectively.
cannot be generalized. A feature vector C can be interpreted as a Boolean
The final step of the analysis is to infer the condi- formula that holds of a phone a if a possesses all
tions under which each of the two structural changes features in C; we denote by |C| the number of
occurs. By contrasting examples in the first column, models of this formula, i.e. phones in the inventory
we infer that the insertion happens when the suffix /z/ Φ that satisfy C. Similarly, CR is a Boolean formula
occurs after a strident (like /s/ in /mIs/); otherwise, over trigrams of phones. A rewrite of a trigram abc
/z/ and /d/ are devoiced whenever they occur after a by rule R is defined as:
(
voiceless obstruent (like /p/ in /zIp/). The full data BR (b) if CR (abc)
set can be explained using the two rules in Fig. 1, right. R(abc) =
b otherwise
Note that in order to capture the data in both columns,
the insertion rule says that [@] is inserted whenever The notion of rewrites can be extended to words and
the stridency of the left and right context matches. rule sets.
Note also that in this case the order of rules matters: Learning as constrained optimization. We can
for words like /mIsz/, insertion is applied first, which now formalize our problem as a hard correctness
prevents the devoicing rule from applying. constraint over rules and underlying forms Uij :
R(Uij ) = Xij where Uij , Aj [Si ] (2)
3 Learning Phonological Rules Here, Aj [Si ] denotes a concatenation of the
prefix/suffix Aj with the stem Si .
As illustrated in Fig. 1, the input to our learning
There might be many rule sets R that are consistent
problem is a matrix of surface forms Xij with I rows
with all the data, and what we would like is to pick
and J columns. The goal is to learn a discrete rule
one that generalizes to other data that exhibits the
set R, while jointly inferring the latent set of I stems
same phonological process (for example, the rule
Si and J affixes Aj .
inferred in Fig. 1 should generalize to other regular
Hypothesis space. The hypothesis space for R can English verbs). Hence we frame the learning problem

6178
R (subject to the hard constraint that they form a fac-
torizable matrix Uij ). Finally, we deterministically
compute xk , Rk (uk ). Hence we can define:
Rk 0 if R (u ) 6= xk
 P (b =>)k k


P (xk ,uk | Rk ) =
k
|CRk | if CRk (uk )
 P (bk =⊥) otherwise


|¬CRk |
bk uk xk
Our goal is to maximize
k ∈ 1..N P (R,R1 ,...,RN ,u1 ,...,uN | x1 ,...,xN )
N
Figure 2: Probabilistic model of a phonological process. Y
A rule set R is sampled from a description length prior. ∝ P (R) P (xk ,uk | Rk )P (Rk | R)
We observe a set of N surface phonemes xk ; each xk k=1
is generated by sampling a rule Rk from R and an Objective function. Taking logs, we can derive the
underlying trigram uk , and deterministically applying following approximate minimization objective for
Rk to uk (coin flip bk decides whether uk should match our constrained
X optimization problem:
Rk ’s condition). ws `(R)+NR+ ·log(|CR |) (3)
R∈R
where NR+ is the number of positive examples for this
as a constrained optimization problem and derive
rule. (Note that this objective ignores P (Rk | R) and
the objective function using a Bayesian model.
bk , which are assumed to be uniform. It also ignores
3.1 Bayesian Model the negative examples. This provides a reasonable ap-
proximation, under the assumption that |¬CR |  |CR |
Generative process. Intuitively, to generate surface for each rule R, which holds in the current setting.)
forms Xij , we must sample a single rule set R, I This function includes a simplicity term, which favors
stems Si , and J affixes Aj , and then deterministically rules with shorter (and hence, more general) condi-
apply R to each Aj [Si ]. Prior work on phonological tions, and a likelihood term, which favors more spe-
rule learning (Ellis et al., 2015) assumed that Si and cific conditions if there are sufficient positive exam-
Aj are sampled uniformly from the language and ples to support them. This likelihood term stems from
independently of R. We observe, however, that in our strong sampling assumption; we demonstrate its
most data sets of interest, underlying forms are in importance for inferring accurate rules in Sec. 5.
fact sampled to contrast the contexts in which R
does and does not apply. We model this intuition as 3.2 Inference by Program Synthesis
a strong sampling process depicted in Fig. 2. To solve the constrained optimization problem
For simplicity, in this model each observation we build upon a technique from programming
corresponds to an individual rule application to languages called constraint-based program
an underlying trigram u that produces a surface synthesis (Solar-Lezama, 2013).
phoneme x. For example, the rewrite /zIpz/ → Constraint-based synthesis. The input to (induc-
[zIps] is represented as four observations: /#zi/ tive) program synthesis is a DSL that defines the
→ [z], /zIp/ → [I], /ipz/ → [p], and /pz#/ → [s] space of possible programs and a set of input-output
(where # encodes word boundary). −−→
examples E = hi,oi; the goal is to find a program
Our generative process first samples a ruleset R whose behavior is consistent with the examples. In
from the description length prior over the hypothesis constraint-based synthesis, this search problem is
space (1): P reduced to solving a boolean constraint. To this end,
P (R) ∝ 2−ws · R∈R `(R) we index the DSL by a bitvector c, called a control
where `(R) is the length of rule R and ws > 0 is vector. We then define a mapping from control vec-
a model hyperparameter. For each observation tors to program behaviors via an evaluation relation
k ∈ 1..N , we pick a rule Rk uniformly from R. ϕ(c,y,z)—a boolean formula that holds if and only if
Before sampling the underlying trigram uk , we flip a program indexed by c produces output z on input y.
a coin bk to decide whether we want to sample a Given the evaluation relation, the synthesis problem
positive or a negative trigram, i.e. whether CRk (uk ) reduces to solving the^
following boolean constraint:
should hold true; we then sample uk uniformly ∃c. ϕ(c,i,o)
from the set of all positive (resp. negative) trigrams hi,oi∈E

6179
An SMT solver (de Moura and Bjørner, 2008) is the condition under which this change occurs
then used to find a satisfying assignment for c, which (Sec. 3.5). If this step fails, we go back to step
allows us to recover the corresponding program. For 1 and generate the next candidate matrix Uij .
this approach to succeed, the evaluation relation has
In the rest of this section we detail these three
to be designed carefully so that it only uses constraints
steps. For illustration purposes, in all examples we
that the solver can efficiently reason about.
will assume that our feature set has just three features:
Synthesis of phonological rules. In our setting, the voice v, sonorant s, and continuant c.
DSL is the space of all rule sets R (up to a certain
size), and the evaluation relation ϕ(c,U,X) is the 3.3 Underlying Form Inference
correctness condition (2). Importantly, our setting The input to this step is the matrix of surface forms
differs from traditional program synthesis in two Xij and the output is a set of aligned pairs hU,Xiij .
ways: first, we have to search for both the control Tab. 1 illustrates this for a 2 × 2 matrix of English
vector and the inputs, and second, in addition to verbs. For example, hU,Xi11 = h[zIpz],[zIps]i; we
satisfying the correctness condition, we also seek to use red to show alignment information (in this case,
minimize the objective function (3). If we encode the a single substitution z → s). Insertions and deletions
−−−→
objective function as ψ(c,hU,Xi), we can reduce rule are represented by alignment with null segments.
learning to the following constrained optimization:
−−−−−−−−→ Input Output
minimize ψ(c,hAj [Si ],Xij i)
N,M
[zIps] [zIpt] [zIpz] [zIpd]
[zIps] [zIpt]
^
subject to ϕ(c,Aj [Si ],Xij )
i,j=1,1
[nidz] [nid@d] [nidz] [nid d]
Given a proper encoding of ϕ and ψ, this constraint [nidz] [nid@d]
can be solved by an optimizing SMT solver (Bjørner
Table 1: Underlying form inference on English verbs.
et al., 2015); this is the approach used in prior
work (Ellis et al., 2015). However, this is a very
The output matrix hU, Xiij has to satisfy two
computationally intensive problem. The reason
properties: (i) the matrix Uij can be factorized into
is the astronomical size of the search space: for a
stems Si and affixes Aj , and (ii) each pair hU,Xi has
problem of factorizing a 10×2 matrix Xij into stems
a small edit distance. Our intuition is that underlying
of length `S = 3 and affixes of length `A = 2, if we
forms that have a small edit distance from surface
limit the maximum number of rules NR to 2 and
forms are likely to produce simpler rules. Hence we
consider an inventory Φ with 90 phones and a feature
generate candidate matrices hU,Xiij in the order of
set F with 30 features, we can estimate the size of
increasing edit distance, until rule inference succeeds
the search space as 3|F |NR |Φ|I`S +J`A ≈ 2600 .
for one of them. This strategy will always eventually
Decomposition. To achieve scalable inference, we find a matrix of underlying forms which can be
decompose the global constrained optimization related to the surface forms by a rule set we can infer
problem into three steps, inspired by phonological as long as one exists. This process is not guaranteed to
intuition we described in Sec. 2.3: find the global minimum of the objective function (3),
but we show empirically that it produces good results.
1. Underlying form inference. In the first step we
We can encode the properties (i) and (ii) as
use an SMT solver to generate likely underlying
a boolean constraint over unknown strings with
stems and suffixes. We rank them based on the
concatenation and length, which can be solved
heuristic that underlying forms Uij that have a
efficiently by the Z3 STR 2 solver (Zheng et al.,
smaller edit distance from surface forms Xij are
2017). From the solutions for those unknowns it
more likely to produce simple rules (Sec. 3.3).
is straightforward to recover not only the stems and
2. Change inference. Given the set of edits suffixes, but also the required alignment information
between each Uij and the corresponding Xij , between the underlying and surface forms.
we identify the smallest set of structural changes
3.4 Change Inference
B that can describe all the edits (Sec. 3.4).
The input to change inference is the set of all edits in
3. Condition inference. Finally, for each structural the aligned pairs hU,Xiij , computed in the previous
change B, we use program synthesis to infer step, and the output is a set of structural changes

6180
that captures all the edits. Tab. 2 illustrates this for u ` Features
the edits from Tab. 1; columns LHS and RHS show
/#zI/ ⊥ [+#] [+v −s][+v +s]
relevant features of the left- and and right-hand sides
of the edit. /zIp/ ⊥ [+v −s][+v +s][−v −s]
/Ipz/ ? [+v +s][−v −s][+v −s]
Edit LHS RHS Change
/pz#/ > [−v −s][+v −s] [+#]
/z/ → [s] [+v −s +c] [−v −s +c]
[−v]
/d/ → [t] [+v −s −c] [−v −s −c] Table 3: Input to condition inference for change [−v]
∅ → [@] ∅ [@] [@] on /zIpz/ → [zIps]

Table 2: Change inference on English verbs.


Inference by program synthesis. To frame condi-
For each edit, we compute the set of all possible tion inference as a program synthesis problem we
structural changes which are consistent with the edit. need to define the control vector that indexes the space
For example, the edit /z/ → [s] is consistent with of all possible conditions, and a corresponding eval-
four possible changes: [−v], [−v −s] [−v +c], and uation relation. In our control vector, for each feature
[−v −s +c]. Next, we greedily merge change-sets f , we use six control variables, which represent the
of different edits if their intersection is nonempty. three positions that a feature can appear in a rule (left
This merging step allows us to identify a small set context, target, and right context) and the two values
of distinct structural changes which together describe it can take on (+ and −). We denote these variables
all the edits. For example, the change-sets of the by fpv for v in V = {+,−} and p in P = {l,t,r}.
first two edits in Tab. 2 can be merged to produce Our evaluation relation takes the form
the change-set: {[−v],[−v −s]}. The third edit in ϕ(c,u,`) , matches(c,u) ⇔ `, where matches is
Tab. 2 is an insertion, which changes the values of all a relation specifying whether the condition indexed
features present in [@], and hence cannot be merged. by c matches the trigram u. The matches relation
When no more merges are possible, we pick the is further defined as follows:
^
simplest change from every change set (in this case, matches(c,u) , fpv ⇒ (up,f = v)
(f,v,p)∈F ×V ×P
we end up with changes B1 = [−v] and B2 = [@]).
where up,f is the value of feature f at position p in
This greedy process bounds the maximum number
trigram u.
of rules to the number of change sets.
3.6 Inductive Bias
3.5 Condition Inference
In addition to being consistent with the data, we also
For each structural change B inferred in the previous want the condition to minimize the objective function
step, we now attempt to determine the condition (3). We encode the objective function as
LAR under which the change applies. If successful, ws s(c)+l(c),
a rule A → B / L R is added to the inferred rule where s(c) encodes the simplicity of the condition
set R; otherwise we go back to underlying form indexed c (its size), l(c) encodes the likelihood, and
inference and try the next candidate matrix Uij . ws is a model hyperparameter which determines the
For a given change B, the input to condition infer- relative importance of simplicity.
ence is the set of pairs hu,`ik , where uk is a phone tri- The challenge is to encode the likelihood term in a
gram in some underlying form and the label `k can be solver-friendly way. To count the number of models
positive (>), negative (⊥), or unknown (?). Tab. 3 il- l ||C t ||C r |, i.e. we
of |CR |, we observe that |CR | = |CR R R
lustrates this for trigrams from U = /zIpz/. A trigram can independently count the models of the target, and
is labeled positive if its middle phone undergoes the the left and right contexts,
change B in the data, negative if it does not undergo Xso p
+
l(c) , N log(|CR |)
B, and unknown if B has no effect on this phone. In p∈P
our example, neither /I/ nor /p/ in /zIps/ actually p
We also observe that |CR | can be encoded efficiently
changed, however /zIp/ is labeled ⊥ while /Ipz/ using a constraint whose size is linear in the size of
is labeled ?, because /p/ is already [−v], and hence the phone inventory Φ:
devoicing has no effect on it. Our goal is to infer a con- X  ^ 
p
|CR | , if ¬fpv then 1 else 0
dition consistent with the labels of all the positive and
a∈Φ (f,v)∈F ×V
negative trigrams (unknown trigrams are ignored). af 6=v

6181
Finally, as the solver does not support logarithms, for both rule sets and underlying forms, provided by
we encode log using a lookup table. This is tractable, one of our coauthors, who is a phonologist.
p
since we only need to evaluate the log of each |CR |,
which is at most the size of our inventory, roughly 4.2 Textbook Problems
100 phones. For this category, we curated a set of 34 problems
from phonology textbooks (Gussenhoven and Jacobs,
3.7 Current limitations
2017; Odden, 2005; Roca and Johnson, 1999) by
S Y P HON currently leverages three simplifying selecting problems with local, non-interacting rules.
assumptions about rules for domain-specific problem For each problem, the input data set is tailored
decomposition and SMT encoding, which are crucial (by the textbook author) to illustrate a particular
to making learning computationally tractable. phonological process, and contains 20-50 surface
Conjunctive conditions. Rule conditions are con- forms. For all of these problems we have gold
junctions of equalities over feature values, and each standard solutions, either provided with the textbook
rule has a unique change. We can thus decompose or inferred by a phonologist. Gold standard solutions
the learning process into change inference and condi- range in complexity from one to two rules. Out of the
tion inference: change inference greedily groups all 34, 21 problems are shared with (Ellis et al., 2019),
observed edits into changes, and from then on we as- which we use as the baseline for inference times.
sume that each change uniquely corresponds to a rule. Following the textbooks, these problems are
Local context. The condition of each rule is only further subdivided into 10 matrix problems, 20
a function of the target phone and two surrounding alternation problems, and 4 supervised problems.
phones. This allows us to encode condition inference The matrix problems follow the format presented
as learning a formula over trigrams of phones, which in Sec. 2. The alternation and supervised problems
has a compact encoding as SMT constraints. are easier, as we are given more information about
Rule interaction. One rule’s change does not create the underlying form. For alternation problems,
conditions for another. This allows us to perform we are essentially given a set of choices for what
condition inference for each rule independently. the underlying form might be, and for supervised
problems the underlying form is given exactly. These
Many attested patterns in real languages go beyond
problems include examples of phones in comple-
these limitations. We believe that it is possible to
mentary distribution. Our problem decomposition
lift these restrictions, and still leverage the structure
allows us to switch out underlying forms inference
of conditional rewrite rules to retain most of the
to handle different kinds of input. According to this
benefits of our problem decomposition. We leave
classification, the flapping lexical database is an
this extension to future work.
alternation problem and verbs is a matrix problem.
4 Data
5 Experiments
We evaluate our system on two broad categories of
datasets: lexical databases and textbook problems. We evaluate our system on the two categories of data
sets discussed in Sec. 4. We split the 34 textbook
4.1 Lexical Databases problems into 24 development and 10 test problems.
We use large lexical databases to investigate two Our system has several free parameters (most
(morpho)phonological processes in English: flapping importantly, the simplicity weight ws ). These were
(6457 rows) and regular verb inflections (2756 rows). hand-tuned on all of the data except the test problems.
We process the CMU pronouncing dictionary (Weide, For the test problems, we only added missing sounds
2014) to create underlying and surface form pairs ex- to the inventory as needed. The 10 test problems are
emplifying flapping, as in Gildea and Jurafsky (1996). all alternation problems. We leave for future work
For verb inflections, we combine morphological the investigation of these hyperparameter settings
information extracted from CELEX-2 (Baayen et al., on new matrix problems.
1993) with CMU transcriptions to create a database
of regular verbs, where each row of the database 5.1 Lexical Database Experiments
contains the third-person singular present tense We evaluate our system on two large English datasets,
wordform and past tense wordform for a given verb. one demonstrating flapping, and the other verbs. For
For both datasets we have gold standard solutions each dataset, we learn a rule set from 20, 50 and 100

6182
Accuracy Rule Match Data Set   Inferred Rule
+cor
Precision Recall UF 1 Flap 20
−voi
→ [+approx] / [+1stress]
SP SP- SP SP- SP SP-
 
+cor  
+voi
Flap 20 76 52 50 66 31 25 100 2 Flap 50  −voi  → / [+stress] [+syll]
+approx
−del. rel.
Flap 50 93 79 86 71 86 71 100 
+ant

 
+voi
Flap 100 100 79 100 71 100 71 100 3 Flap 100  −voi  → / [+stress] [+syll]
+approx
Verb 20 86 73 48 42 83 61 100 −del. rel.

Verb 50 88 78 52 50 92 80 100 4 Russian [−son] → [−voi] / #


Verb 100 95 81 62 58 100 82 100 
+cons

5 Scottish [+syll] → [+long] /  +voi 


Table 4: Accuracy results for the English flapping     +cont
−cont −c.g.
and verbs corpora data sets on 20, 50 and 100 training 6 Korean
−voi
→ / [+c.g.]
 −s.g.
examples. S Y P HON (SP ) and S Y P HON - (SP- ) are two 
−cont
7 Farsi → ∅ / [+ATR] #
variants of our model, with and without likelihood, resp. +dors  
αvoi
Accuracy reports the generalization accuracy on unseen 8 Hungarian [−son] → [αvoi] /
−del. rel.
inputs, rule match and UF indicate how well the inferred
rule and underlying form resp. match the gold standard. 9 Kishambaa  [+nas] →[−voi] / [+s.g.]
+approx
10 Persian → [+voi] / [−nas]
−voi  
−lab
data points, and evaluate its accuracy on the remain- 11 Ganda [+lat] → [+cont] /
  +ATR 
ing data. We also perform a syntactic comparison of 12 Limbu
+back
→ [+rnd] /
+lab
+syll −cont
the rule set against the gold standard rules, which we  

−ant

 
−son −rnd
report as average precision and recall among the sets 13 Kongo
+cor
→  +dist  /
+high
+strid
of features in the two rules. Finally, we compare the
latent underlying forms we inferred for each problem Table 5: Selected inferred rules: English flapping trained
to the known correct underlying forms. Tab. 4 summa- on 20, 50 and 100 examples (1–3); textbook develop-
rizes the results. Tab. 5 (rows 1–3) shows the actual ment problems (4–8); textbook test problems (9–13).
rules inferred on the three flapping training sets.
To examine the importance of likelihood in our
system, we repeat this experiment for a variant of
our system S Y P HON -, which does not consider like- problems (rows 4–8) and test problems (rows 9–13).
lihood and simply optimizes our simplicity prior. As
the number of data points increases, the effect of the The accuracy of our inferred rules and underlying
likelihood grows, and so for S Y P HON the recall com- forms is 100% for all textbook problems. This is
pared to the gold standard quickly climbs. By contrast, not surprising: the combination of hard constraints
the recall of S Y P HON - plateaus, which shows that and a restrictive DSL makes inferring incorrect rules
likelihood is indeed important for finding good rules. or underlying forms very difficult. More interesting
is the syntactic comparison to the gold standard.
5.2 Textbook Problem Experiments
This measure is intended to estimate how well
We evaluate the textbook problems under the follow- the rules S Y P HON produces match phonologists’
ing three experimental conditions. First, to evaluate intuition. The results in Tab. 6 confirm that without
the generalization accuracy for unseen inputs, for the likelihood term, inference tends to exclude
each of the problems, we hold out a randomly important features from the rule condition: the recall
sampled 20% of the data. We then learn a rule set on for held out problems goes down by 21%.
the remaining data, and validate it against the held out
data. We repeat this process three times, and report Finally, we compare inference times of S Y P HON
the average accuracy for each class of problems in with the prior work of Ellis et al. (2019), which is also
Tab. 6. We also evaluate syntactic accuracy of the based on constraint-based program synthesis but does
rules and of underlying forms, in the same way as not perform problem decomposition, instead using
for the lexical databases. Additionally, we evaluate the global encoding outlined in Sec. 3.2. As shown
our system on 10 test problems, which were left out in Tab. 7, the decomposition makes S Y P HON at least
entirely when tuning the system hyperparameters. two orders of magnitude faster, with an average infer-
We report the same metrics for these problems. ence time of just 30 seconds for matrix problems, thus
Tab. 5 shows inferred rules for selected development enabling phonologists to use the tool interactively.

6183
Accuracy Rule Match such work either requires thousands of training
UF
Precision Recall observations (Gildea and Jurafsky, 1996) or has used
SP SP- SP SP- SP SP- SP abstracted and greatly simplified symbol inventories
M AT 100 100 70 69 77 69 100 and training data (Chandlee et al., 2014).
A LT 100 100 66 61 71 62 100 Hayes and Wilson (2008), Goldsmith and Riggle
S UP 100 100 63 60 71 64 - (2012), and Futrell et al. (2017) propose different
T EST 100 100 54 52 61 48 100 methods for learning probabilistic models of phono-
tactics, which represent gradient co-occurrence re-
Table 6: Accuracy of textbook problems. We use (-)
strictions between surface segments within a word.
for supervised problems without underlying form
inference. Unlike the current implementation of S Y P HON, these
models include representational structures that enable
Inference Time (secs) them to capture certain non-local phenomena. How-
S Y P HON Baseline Speedup ever, because these models focus on phonotactics,
M AT 30.0 3100 124.6 they do not infer underlying forms or rules which
A LT 10.7 N/A N/A relate underlying forms to surface forms.
S UP 5.3 6333 378.3 Finally, much work has focused on learning
T EST 8.3 N/A N/A representations of phonological processes as
mappings that minimally violate a set of ranked or
Table 7: Comparison of the inference times of textbook weighted constraints (Prince and Smolensky, 2004;
problems with baseline. We report the median execution Legendre et al., 1990), but such work has generally
times and geometric mean of the speedups. N/A indi- taken the constraint definitions as given and focused
cates examples where baseline results are unavailable.
on learning rankings or weights (see e.g. Goldwater
and Johnson, 2003; Tesar and Smolensky, 2000;
6 Related Work Boersma and Hayes, 2001), with some exceptions
(Doyle et al., 2014; Doyle and Levy, 2016).
Learning (morpho-)phonology is a rich and active
area of research; in this overview, we focus on 7 Conclusion
approaches that share our goal of inferring fully
interpretable models of phonological processes. We have presented a new approach to learning fully
Most closely related to ours is the work of Ellis et al. interpretable phonological rules from sets of related
(2015) and its (unpublished) follow-up (Ellis et al., surface forms. We have shown that our approach pro-
2019) on using program synthesis to infer phonologi- duces rules that largely match linguists’ intuition from
cal rules. As mentioned above, the main difference is a handful of examples and within minutes. The contri-
that S Y P HON is two orders of magnitude faster than butions of this paper are a novel decomposition of the
their system thanks to a novel decomposition and ef- global inference problem into three local problems, as
ficient SMT encoding. On the other hand, we impose well as an encoding of these problems into constraints
extra restrictions on the hypothesis space (i.e. we only that can be efficiently solved by an SMT solver.
support local rules), which means that S Y P HON is
unable to solve some of the harder textbook problems
References
that Ellis et al. (2019) can solve. In addition, Ellis et al.
(2019) propose a method for inducing phonological Adam Albright and Bruce Hayes. 2002. Model-
ing English past tense intuitions with minimal
representations which are universal across languages.
generalization. In Proceedings of the 2002 Work-
Beyond program synthesis, Rasin et al. (2017) shop on Morphological Learning, Association for
use a comparable description length-based approach Computational Linguistics.
to unsupervised joint inference of underlying
Rajeev Alur, Rastislav Bodı́k, Garvit Juniwal, Milo
phonological forms and rewrite rule representations M. K. Martin, Mukund Raghothaman, Sanjit A. Se-
of phonological processes, but use a genetic shia, Rishabh Singh, Armando Solar-Lezama, Emina
algorithm to find approximate solutions. Gildea and Torlak, and Abhishek Udupa. 2013. Syntax-guided
Jurafsky (1996) and Chandlee et al. (2014) discuss synthesis. In Formal Methods in Computer-Aided
Design, FMCAD 2013, pages 1–8.
supervised learning of restricted classes of finite-state
transducer representations of several phonological Rajeev Alur, Arjun Radhakrishna, and Abhishek Udupa.
processes (including English flapping). To date, 2017. Scaling enumerative program synthesis via

6184
divide and conquer. In International Conference Daniel Gildea and Daniel Jurafsky. 1996. Learning bias
on Tools and Algorithms for the Construction and and phonological-rule induction. Computational
Analysis of Systems, pages 319–336. Springer. Linguistics, 22(4):497–530.

R. Harald Baayen, Richard Piepenbrock, and H van Rijn. John Goldsmith and Jason Riggle. 2012. Information
1993. The CELEX lexical database on CD-ROM. theoretic approaches to phonological structure: the
case of Finnish vowel harmony. Natural Language
Nikolaj Bjørner, Anh-Dung Phan, and Lars Flecken- & Linguistic Theory, 30(3):859–896.
stein. 2015. νz - an optimizing SMT solver. In Tools
and Algorithms for the Construction and Analysis of Sharon Goldwater and Mark Johnson. 2003. Learning
Systems, pages 194–199. OT Constraint Rankings Using a Maximum Entropy
Model. Proceedings of the Stockholm Workshop on
Paul Boersma and Bruce Hayes. 2001. Empirical Variation within Optimality Theory, pages 111–120.
Tests of the Gradual Learning Algorithm. Linguistic
Inquiry, 32(1):45–86. Noah D Goodman, Joshua B Tenenbaum, Jacob
Feldman, and Thomas L Griffiths. 2008. A rational
Jane Chandlee, Rémi Eyraud, and Jeffrey Heinz. 2014. analysis of rule-based concept learning. Cognitive
Learning strictly local subsequential functions. science, 32(1):108–154.
Transactions of the Association for Computational
Linguistics, 2:491–504. Dan Gusfield. 1997. Algorithms on Strings, Trees and
Sequences. Cambridge University Press.
Noam Chomsky and Morris Halle. 1968. The Sound Pat-
tern of English. Studies in language. Harper & Row. Carlos Gussenhoven and Haike Jacobs. 2017. Under-
standing Phonology. Routledge.
Leonardo Mendonça de Moura and Nikolaj Bjørner.
2008. Z3: an efficient SMT solver. In TACAS, Bruce Hayes and Colin Wilson. 2008. A maximum
volume 4963 of LNCS, pages 337–340. Springer. entropy model of phonotactics and phonotactic
learning. Linguistic Inquiry, 39(3):379–440.
Gabriel Doyle, Klinton Bicknell, and Roger Levy.
2014. Nonparametric Learning of Phonological Géraldine Legendre, Yoshiro Miyata, and Paul
Constraints in Optimality Theory. In Proceedings Smolensky. 1990. Harmonic Grammar – A formal
of the 52nd Annual Meeting of the Association for multi-level connectionist theory of linguistic well-
Computational Linguistics, pages 1094–1103. formedness: Theoretical foundations. Technical
Report ICS # 90-5, CU-CS-465-90, University of
Gabriel Doyle and Roger Levy. 2016. Data-driven Colorado.
learning of symbolic constraints for a log-linear
David Odden. 2005. Introducing Phonology. Cam-
model in a phonological setting. In Proceedings of
bridge University Press.
COLING 2016, the 26th International Conference on
Computational Linguistics: Technical Papers, pages José Oncina, Pedro Garcı́a, and Enrique Vidal. 1993.
2217–2226. Learning Subsequential Transducers for Pattern
Recognition Interpretation Tasks. IEEE Transac-
Greg Durrett and John DeNero. 2013. Supervised
tions on Pattern Analysis and Machine Intelligence,
learning of complete morphological paradigms. In
15(5):448–458.
Proceedings of the North American Chapter of the
Association for Computational Linguistics. Alan Prince and Paul Smolensky. 2004. Optimality The-
ory: Constraint interaction in generative grammar.
Kevin Ellis, Adam Albright, Armando Solar-Lezama, Wiley-Blackwell.
Joshua B. Tenenbaum, and Timothy J. O’Donnell.
2019. Synthesizing theories of human language with Ezer Rasin, Iddo Berger, Nur Lan, and Roni Katzir.
Bayesian program induction. In Prep. 2017. Acquiring opaque phonological interactions
using Minimum Description Length. In Supple-
Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, mental Proceedings of the 2017 Annual Meeting on
and Josh Tenenbaum. 2018. Learning to infer Phonology.
graphics programs from hand-drawn images. In
Advances in Neural Information Processing Systems Iggy Roca and Wyn Johnson. 1999. A Workbook in
31, pages 6059–6068. Curran Associates, Inc. Phonology. Blackwell.
Kevin Ellis, Armando Solar-Lezama, and Josh Tenen- Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed K.
baum. 2015. Unsupervised learning by program Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-
synthesis. In Advances in neural information Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and
processing systems, pages 973–981. Nan Tang. 2017. Synthesizing entity matching rules
by examples. PVLDB, 11(2):189–202.
Richard Futrell, Adam Albright, Peter Graff, and
Timothy J. O’Donnell. 2017. A generative model Armando Solar-Lezama. 2013. Program sketch-
of phonotactics. Transactions of the Association for ing. International Journal on Software Tools for
Computational Linguistics, 5:73–86. Technology Transfer, 15(5-6):475–495.

6185
Bruce Tesar and Paul Smolensky. 2000. Learnability in
Optimality Theory. MIT Press.
Abhinav Verma, Vijayaraghavan Murali, Rishabh
Singh, Pushmeet Kohli, and Swarat Chaudhuri.
2018. Programmatically interpretable reinforcement
learning. In Proceedings of the 35th International
Conference on Machine Learning, pages 5052–5061.
R. L Weide. 2014. The CMU pronouncing dictionary.
Release 0.7b.

Yunhui Zheng, Vijay Ganesh, Sanu Subramanian, Omer


Tripp, Murphy Berzish, Julian Dolby, and Xiangyu
Zhang. 2017. Z3str2: an efficient solver for strings,
regular expressions, and length constraints. Formal
Methods in System Design, 50(2-3):249–288.

6186

You might also like