
Burmese Phrase Segmentation

May Thu Win, Moet Moet Win, Moh Moh Than
Research Programmer, Myanmar Natural Language Processing Lab

Dr. Myint Myint Than, Myanmar Computer Federation

Dr. Khin Aye, Member of Myanmar Language Commission

Abstract

Phrase segmentation is the process of determining phrase boundaries in a piece of text. For machine translation, phrase segmentation needs to be automated. This is the first attempt at automatic phrase segmentation in Burmese (Myanmar). This paper describes how to segment phrases in a Burmese sentence and how to formulate segmentation rules. The rules have been tested by developing a phrase segmentation system using CRF++.

1 Introduction

Burmese is the national and official language of Myanmar and a member of the Tibeto-Burman language family, which is a subfamily of the Sino-Tibetan family of languages. Its written form uses a script that consists of circular and semi-circular letters, adapted from the Mon script, which in turn was developed from a southern Indian script in the 8th century.

Burmese language users normally use spaces as they see fit; some write with no spaces at all. There is no fixed rule for phrase segmentation. In this paper, we propose phrase segmentation rules from a linguistic point of view, which will help Natural Language Processing tasks such as Machine Translation, Text Summarization, Text Categorization, Information Extraction and Information Retrieval.

2 Nature of Burmese Language

There are two language styles. One is the literary or written style, used in formal, literary works, official publications, radio broadcasts and formal speeches. The other is the colloquial or spoken style, used in daily communication, both conversation and writing, in literary works, radio and TV broadcasts, and weekly and monthly magazines. Literary Burmese is not very different from colloquial Burmese. The grammar pattern is the same in both, and so is the essential vocabulary. Some particles are used unchanged in both, but a few others are found in one style only. Regional variation is seen in both styles.

2.1 Sentence Construction

One morpheme or a combination of two or more morphemes gives rise to one word; a combination of two or more words becomes a phrase; a combination of two or more phrases becomes a sentence. The following figure shows the hierarchical structure of sentence construction.

Figure 1: Hierarchical structure of sentence construction

Sentence | သူက ပန်း လးကိ နမ်းတယ် ။
Phrase | သူက | ပန်း လးကိ | နမ်းတယ်
Word | သူ | က | ပန်း လး | ကိ | နမ်း | တယ်
Morpheme | သူ | က | ပန်း | လး | ကိ | နမ်း | တယ်

Table 1: Sentence construction of a Burmese sentence

In this table, သူက ပန်း လးကိ နမ်းတယ် means "she kisses the little flower". သူ is "she", က is the subject marker, ပန်း is "the flower", လး is "little", ကိ is the object marker, နမ်း is "kisses" and တယ် is the verb marker.

Morpheme: A morpheme is the smallest syntactic unit that has semantic meaning. The sentence shown in Table 1 has seven morphemes.

Word: The word is the basic item in a sentence.

Conference on Human Language Technology for Development, Alexandria, Egypt, 2-5 May 2011.
It consists of one or more morphemes that are linked by close juncture. A word consisting of two or more stems joined together is a compound word. Words that carry meaning are lexical words, and words that only show grammatical relations are grammatical words. The sentence shown in Table 1 has six words.

Phrase: Two or more words come together to form a phrase. A phrase is a syntactic unit that forms part of a complete sentence and has its particular place in that sentence. A phrase boundary is characterized by what is called a phrase marker or by a pause in speech (open juncture). The phrase marker can be omitted, in which case we say there is a zero marker. Markers show different functions, such as subject, verb, object, complement, qualifier, and time and place adverb. The sentence shown in Table 1 has three phrases.

Sentence: A sentence is organized with one or more phrases in Subject Object [Complement] Verb or Object Subject Verb order. It is a sequence of phrases capable of standing alone to make an assertion, a question, or a command.

2.2 Syntax

Syntax is the study of the rules and principles found in the construction of sentences in the Burmese language. A Burmese sentence is composed of NP+...+NP+VP (where NP = noun phrase and VP = verb phrase). Noun phrases and verb phrases are marked off by markers, but some markers can be omitted.

3 Parts of Speech

The Myanmar Language Commission holds that Burmese has nouns, pronouns, adjectives, verbs, adverbs, postpositions, particles, conjunctions and interjections. In fact, the four really important parts are nouns, verbs, qualifiers (or modifiers) and particles. Pronouns are just nouns. Qualifiers are the equivalents of adjectives and adverbs, obtained by subordinated use of nouns and verbs. Postpositions and affixes can be considered markers or particles. Interjections do not count among the parts of speech in Burmese.

4 Phrase Segmentation by Writer's Whim

In Figure 2, sentences are broken into phrases with spaces in a random way. Phrase segmentation is employed at the writer's whim; it is not guided by rules.

Figure 2: Phrase segmentation by writer's whim

Segmentation may be suggested by pause, by variety of length, or by clustering of words to bring about meaning. Segmentation in a casual, careless way will not be of any help. This paper tries to point out the places where we can break sentences consistently. The boxes shown in the figure are places to break. We explain how and why we should break at these places in section 6.

5 Particles

We do not normally use lexical words (nouns, verbs and qualifiers) by themselves; they have to be joined by grammatical words (particles or markers) to form a phrase in both literary and colloquial styles. There are three types of particles: formatives, markers and phrase particles.

5.1 Formatives

Formation derives a new word by attaching particles to root morphemes or stems. It may also change the grammatical class of a word by adding an affix (prefix or suffix). Adding "စရာ" to the verb "စား eat" gives rise to "စားစရာ food", a noun, and there are many derivational morphemes that change verbs to nouns, verbs to adverbs, and so on.

Reduplication is a word formation process in which some part of a base (a morpheme or two) is repeated, and it is found in a wide range of Burmese words. Example: လက်လဲှ warm ⇢ လက်လက်လဲှလှဲ warmly.

Formation is clearly a matter of word structure. It can sometimes be useful in phrase segmentation, as a formed word can easily be marked off as an independent phrase.

5.2 Markers

A marker or a particle is a grammatical word that indicates the grammatical function of the marked word, phrase, or sentence.
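Table 1 already shows the idea in miniature: each of its three phrases ends in a marker (က, ကိ, တယ်). A minimal sketch of marker-driven splitting follows, assuming the sentence is already divided into words and that a small marker list is available. The function name `split_on_markers` and the marker set are illustrative only; they are not part of the system described in this paper, which learns boundaries with CRF++ rather than by lookup.

```python
from typing import Iterable, List, Set


def split_on_markers(words: Iterable[str], markers: Set[str]) -> List[List[str]]:
    """Group words into phrases, closing a phrase after each marker word."""
    phrases: List[List[str]] = []
    current: List[str] = []
    for word in words:
        current.append(word)
        if word in markers:
            # A marker is the last word of its phrase, so the phrase ends here.
            phrases.append(current)
            current = []
    if current:
        # Words after the last marker form a final phrase with no visible marker.
        phrases.append(current)
    return phrases


# The six words of the sentence in Table 1, copied as printed there.
WORDS = ["သူ", "က", "ပန်း လး", "ကိ", "နမ်း", "တယ်"]
# Hypothetical marker list: subject, object and verb marker from Table 1.
MARKERS = {"က", "ကိ", "တယ်"}

PHRASES = split_on_markers(WORDS, MARKERS)
for phrase in PHRASES:
    print("".join(phrase))
```

For the Table 1 sentence this yields the same three phrases as the Phrase row of the table. A lookup of this kind cannot place a boundary where the marker is omitted (a zero marker), and it splits wrongly whenever a word merely looks like a marker, which is why a trained model is used in practice.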
When we break up a sentence, we first break it into noun phrases and verb phrases. Verb phrases must be followed by verb markers (sentence-ending or subordinating markers). A noun phrase is followed by one of various noun markers, also called postpositions, denoting its syntactic role in the sentence. If we want to show that a noun is the subject, a marker that indicates the subject function is attached to this noun. If we want to indicate that a noun is the object, a marker that indicates the object function is attached. The distinctive feature of markers is that they show the role of a phrase in the sentence.

noun phrase | noun phrase | noun phrase | verb phrase
ကျာင်းသားများသည် | ပဂသိ ့ | လ့လာ ရးခရီး | သွားသည်
The students | to Bagan | an excursion | make
"The students make an excursion to Bagan."

Table 2: A Burmese sentence with markers

In Table 2, we find three markers:
- သည် marking the subject of the sentence
- သိ ့ marking the place of destination in the sentence
- သည် marking the verb tense of the sentence
Sometimes we can construct a noun or verb phrase without adding any visible marker to it. In this case, we say we are using a zero marker, symbolized by ø after the noun:
- လ့လာ ရးခရီး suffixing ø marker
We therefore use markers as pointers to individual phrases in phrase segmentation of Burmese texts.

5.3 Phrase Particles

Phrase particles are suffixes that can be attached to a phrase in a sentence without having any effect on its role in the sentence. They serve only to add emphasis to particular phrases or to the whole sentence, or to indicate the relation of one sentence to another. They are attached to nouns or verbs, or even to phrases that already contain markers. Phrase particles are of two types: sentence medial and sentence final.
Example: မင်းက တာ့ လာမှာ ပါ့ နာ်။
You will come, right?
In this example, မင်း is "you" (subject), က is the subject marker, တာ့ is "as for" (sentence medial phrase particle), လာ is "come" (verb), မှာ is the future verb marker, ပါ့ is "of course" (sentence final phrase particle) and နာ် is "right?" (sentence final phrase particle).

6 Markers

Some suffixes mark the role of a phrase in the sentence. Suffixes that perform this function are called "markers". Markers can therefore be seen as phrase boundaries. Markers can be split into two groups: (1) noun markers and (2) verb markers.

6.1 Noun Markers

Markers that are attached to nouns are called "noun markers". A noun marker shows the function of the noun in the sentence, such as subject, object or complement, instrument, accompaniment, destination, departure, and others. We can sometimes construct a phrase without a noun marker. Such a phrase is said to carry a zero marker, symbolized by ø. Its meaning is the same as that of a phrase with a marker. A phrase is segmented wherever we consider there to be a zero marker in the sentence.

6.1.1 Subject Markers

A marker that marks a noun as the subject of a sentence is defined as a subject marker.

Literary Style | Colloquial Style | Translation
သည်၊ က၊ မှာ၊ø | က၊ ဟာ၊ ø | no English equivalent

Table 3: Subject markers and their meaning

Example: (with subject marker)
| က န် တာ်က | သချာမှ လပ်တယ်။
| I | like to be sure before I act.
Example: (with zero markers)
| က န် တာ် ø | သချာမှ လပ်တယ်။
| I | like to be sure before I act.

6.1.2 Object Markers

Markers that specify the object of the sentence are defined as object markers.

Literary Style | Colloquial Style | Translation
ကိ၊ အား၊ ø | ကိ၊ ø | no English equivalent

Table 4: Object markers and their meaning
Example: သူ | မိန်းက လးတစ်ဦးကိ | ချစ်ဖူးသည်။
He had once fallen in love with | a girl |.

6.1.3 Place Markers

Markers that specify place and direction are defined as place markers.

Place Markers | Literary Style | Colloquial Style | Translation
Location | ၌၊မှာ၊တွင်၊ဝယ်၊က | မှာ၊ က | at, on, in
Departure | မှ၊က | က | from
Destination | သိ ့၊ကိ၊ဆီ၊ ø... | ကိ၊ဆီ၊ဆီကိ၊ø | to
Continuation of place | တိင်တိင်၊ အထိ ၊ ø P ... | ထိ၊ အထိ၊ ø | until, till

Table 5: Place markers and their meaning

Example: (Departure)
| နြပည် တာ်မှ | ထွက်လာသည်။
I left | from NayPyiDaw | .

6.1.4 Time Markers

Markers that specify time are defined as time markers.

Time Markers | Literary Style | Colloquial Style | Translation
Time | မှာ၊တွင်၊ ø.. | မှာ၊ က၊ ø | at, on, in
Continuation of time | တိင်တိင်၊ ထက်တိင်၊ အထိ၊ øP... | ထိ၊အထိ၊ ø ... | up to, till

Table 6: Time markers and their meaning

Example: (Continuation of time)
| ယခထက်တိင် | လွမ်း နဆဲပါ ခိင်။
I miss you | up to the present, | Khaing.

6.1.5 Instrumentality Markers

Markers that specify how the action takes place, or indicate the manner or the background condition of the action, are defined as instrumentality markers.

Literary Style | Colloquial Style | Translation
ြဖင့်၊နှင့် | နဲ ့ | by, with

Table 7: Instrumentality markers and their meaning

Example: | ကားြဖင့် | သွားသည်။
They went | by bus | .

6.1.6 Cause Markers

Markers that specify the reason or cause are defined as cause markers.

Literary Style | Colloquial Style | Translation
ကာင့်၊ သြဖင့်၊ ... | ကာင့်၊နဲ ့ | because, because of

Table 8: Cause markers and their meaning

Example: | ဝမ်း ရာဂါ ကာင့် | သသည်။
He died | of cholera | .

6.1.7 Possessive Markers

Markers that show a possessive phrase or a modifier phrase are called possessive markers.

Literary Style | Colloquial Style | Translation
၏၊ ရဲ ့၊ ့ (tone mark) | ရဲ ့၊ ့ (tone mark) | 's

Table 9: Possessive markers and their meaning

Example: | မ မရဲ ့ | ကျးဇူးကိ အာက် မ့ပါ သည်။
I remember | mother's | kindness.

6.1.8 Accordance Markers

Markers that specify that an action or event occurs in accordance with something are defined as accordance markers.

Literary Style | Colloquial Style | Translation
အလိက်၊အရ၊ အ လျာက်၊ ... | အရ၊အတိင်း၊ အညီ၊ ... | as, according to

Table 10: Accordance markers and their meaning

Example: | ရစီးအလိက် | သွားြခင်းကိ ရစန်ဟ ခ သည်။
Going | according to the current | is called "downstream".

6.1.9 Accompaniment [coordinate] Markers

Markers that denote accompaniment, that is, two or more items being together with two or more other items, are accompaniment markers.
Literary Style | Colloquial Style | Translation
နှင့်၊နှင့်အတူ၊ နှင့်အညီ၊ ... | နဲ ့၊ နဲ ့အတူ၊ ရာ... ရာ၊ ... | and, with

Table 11: Coordinate markers and their meaning

Example: | မိဘနှင့်အတူ | နသည်။
She lives together | with her parents |.

6.1.10 Choice Markers

Markers that specify the number [of persons or things] from which a choice is made are defined as choice markers.

Literary Style | Colloquial Style | Translation
တွင်၊အနက်၊မှ၊ ထဲမှ၊ ... | တွင်၊အနက်၊အထဲမှ၊... | between, among

Table 12: Choice markers and their meaning

Example: | အဖွဲ ့ဝင်များထဲမှ | တစ် ယာက်ကိ ခါင်း ဆာင်အြဖစ် ရွးချယ်သည်။
One person | from among the members | is chosen as leader of the group.

6.1.11 Purpose Markers

Markers that specify purpose, with the sense "for" or "for the sake of", are defined as purpose markers.

Literary Style | Colloquial Style | Translation
အလိ ့ငှာ၊ဖိ ့၊အတွက်၊ ... | ဖိ ့၊အတွက်၊ရန်၊ ... | to, for

Table 13: Purpose markers and their meaning

Example: ကျာင်းသားများသည် | ဗဟသတအလိ ့ငှာ | လ့လာ ရးခရီးထွက် ကသည်။
The students set out on a study tour | to gain experience |.

6.1.12 Demonstratives and interrogatives

Demonstratives and interrogatives may be used in subordination to other nouns, as in သည်အိမ်, ဟိအိမ်, ဘယ်အိမ် (this, that, which house). Here they serve as adjectives followed by nouns. They can also be used as independent nouns that take noun markers, as in ဘာကိ, ဘယ်မှာ (what, where). They can be segmented as noun phrases.
Example: | ဘာ | လပ် ပးရမလဲ။
| What | can I do for you?

6.2 Verb Markers

Markers that are attached to verbs are called "verb markers".

6.2.1 Subordinating Markers

In simple sentences, verb markers are generally at the end of the sentence and can be seen as independent markers. We have no need to consider how to break the sentence into phrases with these markers because their position plainly shows it. But in complex sentences, they are in the middle of the sentence and are known as dependent or subordinating markers. Subordinating markers need to be considered before breaking a sentence into phrases. We can break off a verb together with the verb markers attached to it as a verb phrase. Some subordinating markers are လ င် (if), မ---လ င် (unless), ကတည်းက (since), သာ ကာင့် (because), သာအခါ (when) and so on.

6.2.2 Adjectival Markers

Adjectives are formed by attaching adjectival markers to verbs, and they can be segmented as noun modifier phrases.

Literary Style | Colloquial Style | Translation
သာ၊သည့်၊ မည့် | တဲ့၊မဲ့ | no English equivalent

Table 14: Adjectival markers and their meaning

Example: သူ | ြပာသည့် | စကားကိ က န်မ နားမလည် ချ။
I didn't understand the words | he spoke |.

6.2.3 Adverbial Marker

Adverbs are formed by adding the adverbial marker "စွာ -ly" to verbs, and they can be segmented as verb modifier phrases. Adverbs can also be obtained by derivation (prefix and suffix) and by reduplication of verbs.
Example: (adverbial marker)
| ငိမ်သက်စွာ | နား ထာင် န ကသည်။
Listen | quietly | .
Example: (reduplication)
| ငိမ် ငိမ်သက်သက် | နား ထာင် န ကသည်။
Listen | quietly | .

7 Other Techniques

We can break sentences into phrases with noun and verb markers. In addition, we segment the following items as phrases.

7.1 Complement

A word or a group of words that serves as the subject or object complement can be considered a phrase with a zero marker ø in Burmese.
Example: ဦးညိုြမက | သတင်းစာဆရာ ø | ြဖစ်တယ်။
U Nyo Mya is | a journalist | .

7.2 Time Phrase

A word or a group of words that shows the time is defined as a time phrase and can be segmented as a phrase (e.g., မ ကာမီ soon).

7.3 Sentence Connector

Grammatical words that are used for linking two or more sentences are called sentence connectors. They are generally placed at the beginning of the second sentence. Some are သိ ့ သာ် (but), ဒါ ကာင့် (therefore), သိ ့ရာတွင် (however), ထိ ့အြပင် (moreover) and so on. We regard them as sentence connectors and break them off as phrases.

7.4 Interjections

A lexical word or phrase used to express an isolated emotion is called an interjection, for example အလိ (Alas!), အမ လး (Oh God) and so on. Interjections are typically placed at the beginning of a sentence. They may be at word level, phrase level or sentence level. Whatever the level, an interjection can be considered a phrase and segmented as such.

8 Methodology

CRF++ is a simple, customizable, open source implementation of Conditional Random Fields (CRFs) for segmenting and labeling sequential data. CRF++ can be applied to a variety of NLP tasks.

Our system has two phases: an encoding phase and a decoding phase. In the encoding phase, we first collect and normalize raw text from online and offline journals, newspapers and e-books. When we have a sufficient corpus, as a preprocessing task we manually break the un-segmented text into phrases by the rules mentioned above. Next, we train on these segmented sentences with the CRF++ tool to obtain a Burmese phrase model. In keeping with the nature of the Burmese language, we employ the unigram template features of the CRF implementation.

In the decoding phase, un-segmented Burmese sentences are input to the system and automatically labeled with the Burmese phrase model. As a result, we obtain Burmese sentences that have been segmented into phrases.

9 Experimental Result

Phrase segmentation is most accurate when the test and training data come from the same category of corpus. Accuracy may be worse if we train on data from one category and test on data from another. Here we tested phrase segmentation of various types of corpus with the 5000 and 50000 phrase-models of the general corpus, respectively.

Figure 3: Result of Phrase Segmentation with 5000 phrase-model using CRF++ toolkit

Figure 4: Result of Phrase Segmentation with 50000 phrase-model using CRF++ toolkit

The more training data we have, the better the results. Average scores of phrase segmentation are above 70% according to the F-Measure. The corresponding scores are:

Corpus Type | Score
sport | 83%
newspaper | 72%
general | 70%
novel | 62%

Table 15: Various corpus types and their scores

10 Known Issues

Although the scores are reasonably high, we face some difficulties that we cannot yet solve. For example, when segmenting manually we can treat items such as complements, time phrases and adverbs formed by derivation as phrases whether or not a marker is attached. But in our system, it is difficult to achieve the best results because of zero markers. We need more and more training data to cover these zero-marker phrases. The boundaries of these phrases vary. So, we can only get about 50% accuracy for these types of phrases.

Another problem is homonyms. For example, ကိ ‘Ko’ is an object marker, but it may also be the title in a name like ကိချမ်း ြမ့ ‘Ko Chan Myae’. As the title in a name, ကိချမ်း ြမ့ ‘Ko Chan Myae’ does not need to be segmented. But the CRF++ tool will segment this phrase as ကိ ‘Ko’ and ချမ်း ြမ့ ‘Chan Myae’, depending on the probabilities learned from the training data.

11 Conclusion

In this study, we have developed an automatic phrase segmentation system for the Burmese language. The segmentation of sentences into phrases is an important task in NLP. In this paper we have described how to segment sentences into phrases with noun markers, verb markers, zero markers and other techniques. We hope this work will help accelerate NLP processing of the Burmese language, such as Machine Translation, Text Summarization, Text Categorization, Information Extraction and Information Retrieval.

12 Further Extension

As we mentioned in section 2.1, the combination of two or more words becomes a phrase. It is easier to segment words after decomposing a sentence into phrases. The result of phrase segmentation will therefore help word segmentation. Moreover, we can build a Burmese parser based on phrase segmentation.

Acknowledgment

The authors are happy to acknowledge their debt and offer grateful thanks to Mr. Tha Noe, linguist, and Mr. Ngwe Tun, CEO, Solveware Solution. The authors also sincerely thank their colleagues of the Myanmar Natural Language Processing Team and the technicians who helped and guided them on this paper.

References

J.A. Stewart. 1955. Manual of Colloquial Burmese. London.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pp. 282-289. USA.

John Okell and Anna Allott. 2001. Burmese/Myanmar Dictionary of Grammatical Form. Curzon Press, UK.

John Okell. 1969. A Reference Grammar of Colloquial Burmese. London. Oxford University Press.

Kathleen Forbes. 1969. The Parts of Speech in Burmese and the Burmese Qualifiers. JBRS, LIII, ii. Arts & Sciences University, Mandalay.

Pe Maung Tin, U. 1954. Some Features of the Burmese Language. Myanmar Book Centre & Book Promotion & Service Ltd. Bangkok, Thailand.

William Cornyn. 1944. Outline of Burmese Grammar, Language Dissertation No. 38, Supplement to Language, volume 20, No. 4.

http://crfpp.sourceforge.net

ဦးခင် အး. ြမန်မာသဒ္ဒါနှင့် ဝါ စဂရှစ်ပါးြပဿနာ, အတွဲ (၁၃), အပိင်း (၅), တက္ကသိလ်ပညာ ပ ဒသာ စာ စာင်.

ဦး ဖ မာင်တင်. ၁၉၆၅ အလယ်တန်းြမန်မာသဒ္ဒါ စာ ပဗိမာန်.

ြမန်မာသဒ္ဒါ. ၂၀၀၅.ြမန်မာစာအဖွဲ ့ဦးစီးဌာန.
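As an illustration of the encoding phase described in section 8, the sketch below turns one manually segmented sentence into the column format that CRF++ reads for training: one token per line, the label in the last column, and an empty line after each sentence. The B/I label scheme (B for the first token of a phrase, I for the rest), the use of the Table 1 morphemes as tokens, the three-line unigram template and the function name `to_crfpp_rows` are all assumptions made for illustration; the paper does not state which token unit, labels or template lines were actually used.

```python
from typing import List


def to_crfpp_rows(phrases: List[List[str]]) -> str:
    """Render one segmented sentence as CRF++ rows of the form 'token<TAB>label'."""
    rows = []
    for phrase in phrases:
        for position, token in enumerate(phrase):
            # Assumed label scheme: B opens a phrase, I continues it.
            label = "B" if position == 0 else "I"
            rows.append(f"{token}\t{label}")
    # CRF++ separates sentences with an empty line.
    return "\n".join(rows) + "\n\n"


# A unigram-only template in CRF++ template syntax: previous, current and
# next token (column 0). The window size is an assumption.
UNIGRAM_TEMPLATE = "\n".join([
    "U00:%x[-1,0]",
    "U01:%x[0,0]",
    "U02:%x[1,0]",
]) + "\n"

# The three phrases of the Table 1 sentence, as lists of its seven morphemes.
SENTENCE = [["သူ", "က"], ["ပန်း", "လး", "ကိ"], ["နမ်း", "တယ်"]]
TRAINING_TEXT = to_crfpp_rows(SENTENCE)

# Written to files, these would be passed to the CRF++ command-line tools:
#   crf_learn template train.data model      (encoding phase)
#   crf_test -m model test.data              (decoding phase)
print(TRAINING_TEXT, end="")
```

In the decoding phase the same column layout is used without the trusted labels, and each predicted B label marks the start of a new phrase.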
