Stemming and Segmentation For Classical Tibetan
Sarat Chandra Das, “Life of Sum-pa mkhan-po, also styled Ye-śes dpal-’byor,
the author of Rehumig (Chronological Table)”, Journal of the Asiatic Society
of Bengal (1889)
1 Introduction
https://webcache.googleusercontent.com/search?q=cache:hNW8YcKtQRwJ:https://www.cs.tau.ac.il/~wolf/papers/TibetanStemming.pdf+&cd=1&hl=fr&ct=clnk&gl
17/03/2018 Stemming and Segmentation for Classical Tibetan
Some of the positions may remain empty or contain a special value to indicate
their absence.
In Fig. 1, stack A holds the prescript component, stack B holds the su-
perscript, core letter, subscript, and vowel components. Stack C holds the coda
(final letter), and D, the postscript. An additional position E holds the appended
particle(s).
Thus, for example, the future tense bsgrub and the imperative sgrubs of the
verb sgrub (to perform) would take the following forms:
bsgrub = 〈 b, s, g, r, u, b, –, – 〉
sgrubs = 〈 –, s, g, r, u, b, s, – 〉
The disyllabic contraction sgra’ang, to give another example, would take the
form:
sgra’ang = 〈 –, s, g, r, a, –, –, ’ang 〉
Each location in the tuple is governed by a different set of rules. Some combinations are possible, while others never occur; their appearance would suggest a transliteration error, a scribal error (or a damaged woodblock), or the presence of a non-Tibetan word, such as a Sanskrit word transliterated in Tibetan script.
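This representation lends itself to a direct encoding. The following is a minimal Python sketch of the tuple; the field names and the render helper are ours, not from the paper, and None stands in for the "–" placeholder:

```python
from collections import namedtuple

# Hypothetical 8-position representation of a Tibetan syllable in Wylie
# transliteration; None marks an empty position (shown as "-" in the text).
Syllable = namedtuple(
    "Syllable",
    ["prescript", "superscript", "core", "subscript",
     "vowel", "coda", "postscript", "particle"],
)

# Future tense of sgrub "to perform": <b, s, g, r, u, b, -, ->
bsgrub = Syllable("b", "s", "g", "r", "u", "b", None, None)

# Imperative: <-, s, g, r, u, b, s, ->
sgrubs = Syllable(None, "s", "g", "r", "u", "b", "s", None)

# Disyllabic contraction: <-, s, g, r, a, -, -, 'ang>
sgra_ang = Syllable(None, "s", "g", "r", "a", None, None, "'ang")

def render(syl):
    """Render a tuple in the angle-bracket notation used in the text."""
    return "<" + ", ".join(x if x is not None else "-" for x in syl) + ">"

print(render(bsgrub))  # <b, s, g, r, u, b, -, ->
```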
Table 1: Consonant distribution within the Tibetan Buddhist canon. (The remaining consonants each appear less than 3% of the time.)

s      d     g      b      r      n      y      p      m      ng     l
10%    8%    6.5%   6.5%   6.5%   5.6%   5.5%   5.1%   4.4%   4.4%   3.4%
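Such a distribution is a straightforward tally once each syllable has been split into its letters. A toy sketch (the miniature corpus below is a stand-in; Table 1 is computed over the full canon, and the splitting itself is discussed in Section 3):

```python
from collections import Counter

# Toy stand-in: each syllable already split into its Tibetan letters.
syllables = [
    ["b", "s", "g", "r", "u", "b", "s"],
    ["s", "g", "r", "u", "b"],
    ["s", "e", "m", "s"],
    ["d", "p", "a", "l"],
]
VOWELS = {"a", "i", "u", "e", "o"}

# Count consonant occurrences and report relative frequencies.
counts = Counter(l for syl in syllables for l in syl if l not in VOWELS)
total = sum(counts.values())
for letter, n in counts.most_common():
    print(f"{letter}: {100 * n / total:.1f}%")
```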
3 Stemming
Since syllables having the same stem may take many different forms, stemming
is a crucial stage in almost every text-processing task one would like to perform
in Tibetan. Usually, in Indo-European and Semitic languages, stemming is per-
formed on the word level. However, in Tibetan, in which words are not separated
by spaces or other marks, a syllable-based stemming mechanism is required even
in order to segment the text into lexical items. We should point out that (heuris-
tic) stemming does not mean the same thing as (grammatical) lemmatization,
and the stemming process can result in a stem that is not a lexical entry in a
dictionary. Moreover, unlike in Indo-European languages, stemming in Tibetan is mostly relevant to verbs and verbal nouns (which are common in the language). Despite being inaccurate in some cases, stemming (for Tibetan as for
other languages) can improve tasks such as word segmentation and intertextual
parallel detection [7]. Moreover, even for Tibetan words consisting of more than
one syllable, stemming each syllable makes sense, since all the inflections are embedded at the syllable level. For instance, the words brtag dpyad (analysis) and brtags dpyad (analyzed) are both stemmed to rtog dpyod (to analyze, analysis).
The following are the main rules that govern the structure of the syllable [3].
– There are 30 possibilities for the core letter: any of the 29 consonants or the
core letter a qua consonant.
– There are 5 vowels, one of which must be present: a, i, u, e, o.
– There are 3 possible superscripts: r (with core k, g, ng, j, ny, t, d, n, b, m,
ts, dz); l (with k, g, ng, c, j, t, d, p, b, h); s (with k, g, ng, ny, t, d, n, p, b,
m, ts).
– There are 4 subscripts: y (with k, kh, g, p, ph, b, m); r (with k, kh, g, t, th,
d, p, ph, b, m, sh, s, h); l (with k, g, b, z, r, s); w (with k, kh, g, c, ny, t, d,
ts, tsh, zh, z, r, l, sh, s, h). In rare cases, the combinations rw and yw may
also appear as subscripts, e.g. in the syllables grwa and phywa.
– There are 10 possible codas (final letters): g, ng, d, n, b, m, ’, r, l, s.
– There are 5 possible prescripts: the letters g (with c, ny, t, d, n, zh, z, y,
sh, s, ts), d (with k, g, ng, p, b, m, ky, gy, py, by, my, kr, gr, pr, br), b (with
k, g, c, t, d, zh, z, sh, s, ky, gy, kr, gr, kl, zl, rl, sl, rk, rg, rng, rj, rny, rt, rd,
rn, rts, rdz, lt, sk, sg, sng, sny, st, sd, sn, sts, rky, rgy, sky, sgy, skr, sgr),
m (with kh, g, ng, ch, j, ny, th, d, n, tsh, dz, ky, gy, khr, gr), ’ (with kh, g,
ch, j, th, d, ph, b, tsh, dz, khy, gy, phy, by, khr, gr, thr, dr, phr, br).
– There are 2 possible postscripts, which come after the coda: s, d (the suffix
d is archaic and seldom found).
– There are 6 particles that are appended at the end of syllables: ’am, ’ang, ’i,
’is, ’o, ’u. This is only possible with syllables ending with a vowel (i.e. lacking
a final letter, and thus by definition also a postscript), or with the final letter
’. Appending a particle results in a disyllabic contraction (the two vowels are often pronounced as a diphthong). Rarely, two particles can
also be appended (e.g. phre’u’i). However, since for the stemming we regard
the appended particle(s) as a single unit, which is not stemmed, these cases of
doubled-appended syllables do not affect stemming and thus are disregarded.
There are two additional possible particles that can be appended at the end
of a syllable: s and r. Since both s and r are also valid codas, this may cause
ambiguity. (The potential problem is partially solved for the letter s in the
normalization stage, but a full solution is difficult to achieve.)
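The co-occurrence constraints above are naturally expressed as lookup tables. A partial sketch of such an encoding (only the superscript rules are filled in here; the prescripts, subscripts, and codas would be handled the same way, and the function name is ours):

```python
# Superscript rules from the list above: which core letters each
# superscript may stack above.
SUPERSCRIPTS = {
    "r": {"k", "g", "ng", "j", "ny", "t", "d", "n", "b", "m", "ts", "dz"},
    "l": {"k", "g", "ng", "c", "j", "t", "d", "p", "b", "h"},
    "s": {"k", "g", "ng", "ny", "t", "d", "n", "p", "b", "m", "ts"},
}
CODAS = {"g", "ng", "d", "n", "b", "m", "'", "r", "l", "s"}
POSTSCRIPTS = {"s", "d"}
PARTICLES = {"'am", "'ang", "'i", "'is", "'o", "'u"}

def superscript_ok(sup, core):
    """Check one rule: may superscript `sup` stack above core letter `core`?
    None means the superscript position is empty, which is always valid."""
    return sup is None or core in SUPERSCRIPTS.get(sup, set())

print(superscript_ok("s", "g"))  # sg- as in sgrub: allowed
print(superscript_ok("l", "m"))  # lm- never occurs
```

A violation of such a table would, as the text notes, suggest a transliteration or scribal error, or a non-Tibetan word.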
[Figure: stem components — superscript, core letter + vowel, subscript, and coda.]
The deleted parts do not change the basic underlying semantics of the syllable.
The stemmer works in the following manner: first, we break the syllable into a list of Tibetan letters. This stage is required because Wylie transliteration represents some Tibetan letters by more than one character (e.g. zh, tsh). There is, fortunately, no ambiguity in the process of letter recognition; this is by design.
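Since Wylie letter recognition is unambiguous, a greedy longest-match split suffices for this stage. A sketch (the list covers the native Tibetan multi-character letters; transliterated Sanskrit would require more):

```python
# Multi-character Wylie letters, with "tsh" tried before "ts"; everything
# else is a single character.
MULTI = ["tsh", "ng", "ny", "ts", "dz", "zh", "sh", "kh", "ch", "th", "ph"]

def split_letters(wylie):
    """Split a Wylie-transliterated syllable into a list of Tibetan letters
    by greedy longest match."""
    letters, i = [], 0
    while i < len(wylie):
        for tok in MULTI:
            if wylie.startswith(tok, i):
                letters.append(tok)
                i += len(tok)
                break
        else:
            letters.append(wylie[i])
            i += 1
    return letters

print(split_letters("bsgrubs"))  # ['b', 's', 'g', 'r', 'u', 'b', 's']
print(split_letters("tshogs"))   # ['tsh', 'o', 'g', 's']
print(split_letters("zhwa"))     # ['zh', 'w', 'a']
```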
Each Tibetan syllable should contain one core letter and one vowel. The other positions (subscript, etc.) are not obligatory; at most one letter fits each of the seven letter places in the tuple, while the eighth place accommodates one of the 6 possible appended particles. We therefore start with the detection of all the vowels (by definition, each syllable contains one vowel). A contraction with an appended particle commonly contains two vowels. (As noted earlier, the rare case of a double-appended syllable has no effect on the stemming.) Syllabic contractions should contain two vowels at most.
The vowel (a, i, u, e, o) necessarily follows the core letter, or the subscript
(y, r, l, w, rarely also yw, rw) if there is one. Examples are bam (b is the core
letter); bsgrubs (g is the core letter); ’ga’ (g is the core letter); zhwa (zh is the
core letter); chen (ch is the core letter). If the syllable begins with a vowel, the
core letter in our representation is set to be a (meaning, we add an extra a),
which makes ag a valid syllable with core letter a, vowel a, and coda letter g.
Another example is the syllable e that would be represented as having core letter
a and vowel e.
The stem of the syllable consists of the core letter or the stacked letter (which, in turn, consists of the core letter and a superscript, or a subscript, or both), the vowel, and the final letter (if present). Two syllables are considered stemmically identical if these components match, regardless of the addition or omission of a prescript and/or a postscript.
Under certain circumstances (commonly inflection of verbs), the core letter
may be changed. However, the change is not arbitrary, and usually occurs among
phonetically “related” letters, such as k/kh/g; c/ch/j; t/th/d; p/ph/b.
The vowel may also change while the syllable retains the same basic meaning (and the same stem). Most commonly the vowel o in verbs changes to
a and vice-versa, reflecting a change in tense. Since other vowel changes are
unfortunately also possible, it seems impossible to identify a pattern. The only
viable solution would be to work with a list of verbs and their inflections, or
alternatively, to consider a vowel change as substantial (thus failing to recognize
the stemmic identity).
The final stage is normalization. As it turns out, there are groups of Ti-
betan letters that can be replaced one with the other without changing the basic
meaning of the syllable. Since we are interested in grouping all syllables that are
ultimately stemmically identical into one and the same stem, we normalized all
tuples according to the following rules:
[The list of normalization rules is not preserved in this copy.]
(We have glossed over a few additional special cases and peculiarities that are
dealt with in the stemmer code.)
Once we have the tuple corresponding to the syllable, we extract the compo-
nents 〈 superscript, . . . , final letter 〉 to obtain a quintuple that represents the
syllable’s stem. For bsgrub and sgrubs, future tense and imperative of the verb
sgrub, the stemming process will generate the same stem: sgrub.
For the similarity measures and word-segmentation tasks, described in the
following sections, each letter is encoded by a number. The particles, as previ-
ously mentioned, are encoded as themselves, so overall we have a total of 41
possible values for the various locations in the tuple (29 consonants [excluding
a], 5 vowels, 6 particles, blank).
4 Learning Similarity
The distance between two stemmed syllables, at each location in the tuple, is computed as the sum of two learned weights. One weight is associated with the letter in one syllable and the other with the parallel letter in the second. This way, instead of a quadratic number of parameters, we have only a linear number.
(Footnote: The reason for omitting the coda s for the sake of normalization is that, in cases where s is added to form the past tense, the result is a syllable that appears to have a stem with coda s; we treat this s as equivalent to the postscript s that is often added to form the past tense.)
Given the stemmer’s output for two Wylie-encoded syllables, x_i, y_i ∈ ℝ⁵, we first re-encode it as a more explicit tuple by using an encoding function. Three types of such functions are considered.
In the first type, the encoding is simply the identity function. This is a naïve
approach in which the rather arbitrary alphabetic distance affects the computed
metric.
The second type encodes each possible letter in each of the five locations
(superscript, ..., final letter) as one binary bit. The bit is 1 if the associated
location in the stemmed representation has the value of the letter associated with
the bit, and 0 otherwise. This representation learns one weight for each letter at
each location. If the two syllables xi and yi differ in three locations out of the
five, the learned model would sum up six weights: each of the three locations
would add one weight for each of the two letters involved in the substitution.
The third type of encoding is based on information regarding equivalence
groups of letters. In other words, substitutions within each group are considered
synonymous. There are five groups with more than one letter:
1. g, k, kh
2. c, ch, j, zh, sh
3. d, t, th
4. b, ph, p
5. z, dz, tsh, ts
The rest of the letters form singleton groups. The total number of groups is 21.
Let f be the encoding function. The learned model has a tuple of parameters w, which has the same dimension as f(x), and a bias parameter b. It has the form w · |f(x_i) − f(y_i)| + b, that is, a weighted sum of the absolute differences between the encodings of the two stemmed syllables.
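The three encoding functions and the resulting model can be sketched as follows; the alphabet and equivalence groups below are abbreviated toy stand-ins for the full 41-value inventory, and the function names are ours:

```python
# Abbreviated alphabet ("" marks an empty position) and partial
# equivalence groups; the real model covers all 41 values per position.
ALPHABET = ["", "b", "g", "k", "kh", "p", "ph", "r", "s", "u", "o"]
GROUPS = {"g": "G1", "k": "G1", "kh": "G1", "b": "G4", "p": "G4", "ph": "G4"}
GROUP_IDS = sorted({GROUPS.get(a, a) for a in ALPHABET})

def one_hot(value, inventory):
    return [1 if value == v else 0 for v in inventory]

def f_naive(stem):
    # Identity-style encoding: the (arbitrary) alphabet index of each
    # letter leaks into the computed metric.
    return [ALPHABET.index(l or "") for l in stem]

def f_binary(stem):
    # One bit per (position, letter) pair.
    return [b for l in stem for b in one_hot(l or "", ALPHABET)]

def f_groups(stem):
    # One bit per (position, equivalence-group) pair, so substitutions
    # within a group (e.g. b <-> p) cost nothing.
    return [b for l in stem
            for b in one_hot(GROUPS.get(l or "", l or ""), GROUP_IDS)]

def score(w, bias, f, x, y):
    # The learned model: w . |f(x) - f(y)| + bias.
    return sum(wi * abs(a - c) for wi, a, c in zip(w, f(x), f(y))) + bias

sgrub = ("s", "g", "r", "u", "b")
sgrup = ("s", "g", "r", "u", "p")  # hypothetical b -> p substitution
print(sum(abs(a - c) for a, c in zip(f_groups(sgrub), f_groups(sgrup))))  # 0
print(sum(abs(a - c) for a, c in zip(f_binary(sgrub), f_binary(sgrup))))  # 2
```

Under the group encoding the substitution is free, while the binary encoding charges two bits (one per letter involved), which the SVM then weights.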
During training, synonymous and non-synonymous pairs of syllables are
provided to the SVM algorithm [2]. Each pair is encoded as a single tuple
|f(xi) − f(yi)|, and an SVM with a linear kernel is used to learn the param-
eters w and b.
4.2 Evaluation
The dataset contains 1521 sets of verbs and their inflectional forms. The sets are
divided into three fixed groups in order to perform a cross validation accuracy
estimation. In each cross validation round, two splits are used for training and
one for testing. Within each group, all pairs of syllables from within the same set
(inflections of the same verb) are used as positive samples. There are 110–140
such pairs in each of the splits. Ten times as many negative samples are sampled.
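The pair-construction procedure can be sketched as follows (the three inflection sets here are toy stand-ins for the 1521 sets in the dataset):

```python
import random
from itertools import combinations

# Toy inflection sets: each inner list holds one verb's forms.
sets_ = [["sgrub", "bsgrub", "bsgrubs", "sgrubs"],
         ["rtog", "brtags", "brtag"],
         ["len", "blangs", "blang"]]

# Positive samples: all pairs drawn from within the same set.
positives = [pair for s in sets_ for pair in combinations(s, 2)]

# Negative samples: ten times as many cross-set pairs.
rng = random.Random(0)
all_forms = [w for s in sets_ for w in s]
set_of = {w: i for i, s in enumerate(sets_) for w in s}

negatives = []
while len(negatives) < 10 * len(positives):
    a, b = rng.sample(all_forms, 2)
    if set_of[a] != set_of[b]:
        negatives.append((a, b))

print(len(positives), len(negatives))  # 12 120
```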
Table 4 presents the results of the experiments. The area under the ROC curve (AUC) is used to measure classification success. We compare two methods: one does not employ learning and simply uses the Euclidean distance ‖f(x_i) − f(y_i)‖; the other is based on learning the weights w via SVM. We
Table 4: Comparison of the three encoding functions used for metric learning.
Results for both the Euclidean (L2) distance and SVM-based metric learning are
shown. The reported numbers are mean AUC±SD over three cross-validation
splits.
compare the three functions f described above: (i) Naïve, (ii) Binary, and (iii)
Equivalence groups.
As can be seen, the Equivalence group function significantly outperforms
the other functions. It is also evident that learning the weights with SVM is
preferable to employing a constant weight matrix (which results in a simple
Euclidean distance).
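For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive pair is ranked above a randomly chosen negative one. A minimal sketch (not the paper's evaluation code):

```python
def auc(pos_scores, neg_scores):
    """AUC = P(random positive scores higher than random negative),
    counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Perfect separation gives 1.0; full overlap sits at chance level, 0.5.
print(auc([0.9, 0.8], [0.1, 0.2]))  # 1.0
print(auc([0.6, 0.4], [0.6, 0.4]))  # 0.5
```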
5 Word Segmentation
The problem of word segmentation, viz. grouping the syllables into words, is of
major importance. Since no spaces or special characters are used to mark word
boundaries, the reader has to rely on language models so as to detect the word
boundaries.
5.1 Design
5.2 Evaluation
The training set consists of 36,958 sentences and the test set of 9239 sentences. Overall, there are 349,530 words, with an average of 1.46 syllables per word.
Training of the network used the cross-entropy loss and the Adam optimization algorithm [6]. A dropout layer with rate 50% is added before the LSTM layer.
Table 5 summarizes the performance of the neural network when trained with context-window sizes of 5 and 7. As can be seen, a context size of 5 syllables works somewhat better, and the proposed binary network outperforms the multilabel network.
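The binary formulation reduces to labeling each syllable with whether a word boundary follows it. A sketch of this label encoding (the sample segmentation is hypothetical; the network itself is not reproduced here):

```python
def boundary_labels(words):
    """Flatten a segmented sentence (a list of words, each a list of
    syllables) into per-syllable binary labels: 1 if a word ends after
    this syllable, 0 otherwise."""
    syllables, labels = [], []
    for word in words:
        for i, syl in enumerate(word):
            syllables.append(syl)
            labels.append(1 if i == len(word) - 1 else 0)
    return syllables, labels

# Hypothetical segmentation: a two-syllable word followed by a
# one-syllable word.
syls, labs = boundary_labels([["sangs", "rgyas"], ["kyi"]])
print(syls)  # ['sangs', 'rgyas', 'kyi']
print(labs)  # [0, 1, 1]
```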
The online segmentation tool is available at http://cs.tau.ac.il/~yairhoff/wordseg.html. A screenshot for a sample text is shown in Fig. 3.
6 Conclusion
We have seen the practicality of designing a rule-based stemmer for the syllables
of a monosyllabic language like Tibetan. This contributes to an analysis of the
morphology of the Tibetan syllable and provides a basis for the development of
additional linguistic tools.
We plan on experimenting with the possibility of semi-supervised learning of
such a stemmer, and comparing results with this rule-based approach.
The creation of a practical stemming tool for Tibetan made it possible for
us to build a reasonable word-segmentation algorithm. We also plan to use it for
the development of intelligent search and matching tools for classical Tibetan.
Acknowledgements
We would like to express our deep gratitude to the other participants in the
“Hackathon in the Arava” event (held in Kibbutz Lotan, Israel, February 2016),
who all contributed to the development of new digital tools for analyzing Tibetan
texts: Kfir Bar, Marco Büchler, Daniel Hershcovich, Marc W. Küter, Daniel
Labenski, Peter Naftaliev, Elad Shaked, Nadav Steiner, Lior Uzan, and Eric
Werner. We thank Paul Hacket for crucially providing the necessary data.
This research was supported in part by a Grant (#I-145-101.3-2013) from the
GIF, the German-Israeli Foundation for Scientific Research and Development,
and by the Khyentse Center for Tibetan Buddhist Textual Scholarship, Univer-
sität Hamburg, thanks to a grant by the Khyentse Foundation. N.D.’s and L.W.’s
research was supported in part by the Israeli Ministry of Science, Technology
and Space (Israel-Taiwan grant #3-10341). N.D.’s research benefitted from a fel-
lowship at the Paris Institute for Advanced Studies (France), with the financial
support of the French state, managed by the French National Research Agency’s
“Investissements d’avenir” program (ANR-11-LABX-0027-01 Labex RFIEA+).
References
1. Chen, X., Qiu, X., Zhu, C., Liu, P., Huang, X.: Long short-term memory neural networks for Chinese word segmentation. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 1197–1206. Association for Computational Linguistics, Lisbon, Portugal (September 2015), http://aclweb.org/anthology/D15-1141
2. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297
(1995), http://dx.doi.org/10.1007/BF00994018
3. Hahn, M.: Lehrbuch der klassischen Tibetischen Schriftsprache. IeT 10, Swisttal-
Odendorf: Indica et Tibetica Verlag (1996)
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8),
1735–1780 (November 1997), http://dx.doi.org/10.1162/neco.1997.9.8.1735
5. Huang, H., Da, F.: General structure based collation of Tibetan syllables. Journal
of Computational Information Systems 6(5), 1693–1703 (2010)
6. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings
of the 3rd International Conference on Learning Representations (ICLR, San Diego)
(May 2015), http://arxiv.org/pdf/1412.6980v8.pdf
7. Klein, B., Dershowitz, N., Wolf, L., Almogi, O., Wangchuk, D.: Finding inexact quotations within a Tibetan Buddhist corpus. In: Digital Humanities (DH) 2014. pp. 486–488. Lausanne, Switzerland (July 2014), http://www.cs.tau.ac.il/~nachumd/papers/textalignment.pdf
8. Wylie, T.V.: A standard system of Tibetan transcription. Harvard Journal of Asiatic
Studies (22), 261–267 (1959)
9. Zipf, G.K.: Human Behaviour and the Principle of Least Effort. Hafner Pub. Co.,
New York, NY (1949)