Korean Language Engineering
Information Platform *
Kim, Seongyong and Choi, Key-Sun
Department of Computer Science
Korea Advanced Institute of Science and Technology
Taejon, Korea
[email protected], [email protected]
consists of systems for text interchange and compression, hypertext, multimedia, word processing and others. For knowledge processing, it will cover document paraphrasing, indexing and retrieval, computer-based instruction/education, etc.

3 Information Platform

For Korean language engineering, it is necessary to develop all the projects of each area systematically and integrate them into a uniform frame, called an information platform (IP).¹ KLE programs each project according to its priority and the state of the art of the technology. Consequently, IP reflects the status of ongoing projects and is an as-is framework on which further research and development work can be performed.

Figure 1 shows the conceptual diagram of IP. This platform does not integrate all the project outcomes but only some of the fundamental resources and basic tools, since it reflects the current configuration, which is not fixed but open to change. The whole integration of the project outcomes will be available at the end of the first phase in 1997.

[Figure 1: The Conceptual Diagram of the Information Platform. Legend: U: Unix, S: Solaris, W: Windows.]

This platform is different from ALEP (Advanced Language Engineering Platform) (Simpkins, 1994) in that ALEP is an environment that can be provided to users in the form of a (customizable) package, whereas our platform is a server-client model in pursuit of a web-based service for resources and tools.

The World Wide Web is composed of hyperdocuments and hyperlinks to handle multimedia data as well as to provide easy and timely access to electronic information. It uses the hypertext markup language (HTML), which is based on the Standard Generalized Markup Language (SGML). Therefore, it guarantees standardization and straightforward design characteristics, which lead to ease of system design and flexibility of the system configurations (Berners-Lee & Connolly, 1993). Its other characteristic lies in the common gateway interface (CGI), which makes it possible to interface with various shell scripts and program codes without difficulty. Yet another point is that the server-client model makes the platform transparent to the users.

IP consists of three parts. First, text corpora, voice and handwritten scripts DBs, dictionaries, and a set of terminological DBs constitute the information base. The information base may be distributed directly through the ftp server or accessed indirectly by the language tools on the higher layer of the http server configuration.

Secondly, language tools run on the http server with the aid of CGI and are also ftp-ed to users as executable codes. Since we aim to provide software versions on Unix, Solaris, and PC Windows altogether, initial hardware requirements for each tool may be different.² Finally, documentation will also be prepared as the project progresses.

4 Information Base

4.1 Text Corpus

Text corpora are essential to statistical modeling, in developing formal theories of the grammars, investigating prosodic phenomena in speech, and evaluating or comparing the adequacy of parsing models (Marcus et al., 1993). There are four sorts of corpora from contemporary Korean texts.

• Raw corpus
Two factors are considered: the genre of each source text, which is related to the objective(s) in using the corpus, and the category of the text, which represents the internal structure of the text. Major sources of the corpus include books, magazines, and newspapers; to date, three million word phrases have been gathered.

• Part-of-speech (POS) tagged corpus
The POS tagset for Korean originated from (Kim & Seo, 1994). In the version 1 platform we yielded 2.5 million automatically tagged word phrases and 1.5 million post-edited word phrases.

• Tree-tagged corpus
This can be produced by applying a syntactic tagset to the POS tagged corpus. The syntactic tagset is being studied using 100,000 sentences out of the POS tagged corpus, and the resultant tree-tagged corpus, produced with a tree tagger, will appear at the end of this year.
• Categorized corpus
Korean verbs and adjectives are classified into over seventy categories, and a set of sentence styles is investigated for 940 basic verbs of those categories. About thirty-five thousand sentences are available in the version 1 platform.

4.2 Voice Data Base

This resource can be used for speech recognition and synthesis applications. We initially focused on word-level voice data. It includes phonetically balanced words, phonemic sequences pronounced by four different speakers, and narration of sample stories. It also stores the sounds of single syllables, diphones, numerics, high-frequency words, gazetteers, functional words, and consecutive word sequences. The data are stored on server disks and CD-ROMs as wave forms. This effort will be extended to sentence-level collections such as phonetically balanced sentences, speech dialogues, and scenarios.

4.3 Handwritten Scripts Data Base

Since character recognition systems are under the control of applications engineers, the objective of this work is to provide well-formed data and evaluation criteria for those recognition systems. We divide our data collection into three phases: to scan, with 300 dpi resolution, one thousand sets of 590 high-frequency syllables in the first year, then of 990 syllables and 2,350 syllables in the following years.³ At each phase, we develop both the square-hand characters and free-style characters.

4.4 Dictionaries and Terminological Data Base

• Multilingual technical dictionary
The objective is to set up mappings between technical terms of Korean and other language(s) in both directions. The first work is done for the computer science domain, and it has 35,000 entries each for Korean and English. It will be extended to cover Chinese, Japanese, and German as well as more domains including electrical/electronic engineering, medical science, law, etc.

• Monolingual terminology data bank
Users need definitions and explanations of technical terms during their work on specific domains. This work provides users with such terminological details. We have compiled 15,000 entries each for culture/art and Korean classical literature.

• Ontology-based lexicon
Currently available dictionaries are semantically oriented. They do not provide pools of target language expressions but offer basic meanings for entries together with some syntactic and morphological information. The ontology-based lexicon is lexically oriented in that it guides the user to find a pragmatically or contextually equivalent expression corresponding to the source language expression. The work is in the feasibility study phase, with intensive focus on collecting Korean-English bilingual information sources and developing tools for lexicon construction.

• Lexicon for morphological analysis
The lexicon for Korean morphological analysis is currently being built to have 30,000 entries with off-line management tools, and will grow to 100,000 entries with on-line tools after two more years.⁴

5 Language Engineering Tools

Basically, the tools that we present here are for text corpus and dictionaries, except for voice and character recognizers. The latter two programs are currently under development and will be integrated later.

5.1 Morphological Analyzer

Morphological analysis is an important but difficult part of the analysis since Korean is an agglutinative language with sophisticated morpheme segmentation rules and morphotactic rules. The morphological analyzer is based on Korean chart parsing (Lee, 1993). Its current precision is over 92 percent for grammatical input sentences. It aims to achieve 98 percent accuracy in two more years. It will be extended to cover special symbols, alien strings, elliptical or abbreviated words, and spelling errors to earn higher accuracy.

5.2 Tagger

Because the output of morphological analysis is rather complex due to the characteristics of Korean, the use of a tagger to reduce ambiguities seems important for further processing. (Shin et al., 1995) adopts the hidden Markov model and takes into account the characteristics of Korean word phrase structures for more accurate tagging: a word phrase contains one or more morphemes, syntactic information (grammatical relations by bound morphemes), and semantic information (case roles by postpositions). The experiments revealed 98% accuracy for the test set of 5,500 word phrases out of 55,000 training data, and 94.7% for 5,500 untrained test data.

³ It is possible to compose up to 11,172 syllables from the Korean alphabet, but Korean Standard Code KSC-5601 prescribes 2,350 complete codes for Korean syllables.

⁴ We can conceive many more types of dictionaries: for example, lexicons for syntactic and semantic analyses, and dictionaries that are to be created or extracted from existing ones upon users' or developers' needs. These will be included after the first phase of the project, following the future direction of the project.
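The syllable count quoted in footnote 3 can be checked with a short sketch. It assumes the modern Hangul composition scheme used by Unicode (19 initial consonants, 21 vowels, and 28 final positions counting "no final", with precomposed syllables starting at U+AC00). These constants and the function names below are illustrative assumptions and do not come from the platform itself.

```python
from typing import Tuple

# Assumed jamo inventory of modern Hangul (as encoded in Unicode).
INITIALS = 19  # leading consonants
MEDIALS = 21   # vowels
FINALS = 28    # trailing consonants, including the "no final" case

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable


def syllable_count() -> int:
    """Number of syllables that can be composed from the jamo inventory."""
    return INITIALS * MEDIALS * FINALS


def compose(initial: int, medial: int, final: int = 0) -> str:
    """Build a precomposed syllable from zero-based jamo indices."""
    if not (0 <= initial < INITIALS and 0 <= medial < MEDIALS and 0 <= final < FINALS):
        raise ValueError("jamo index out of range")
    return chr(HANGUL_BASE + (initial * MEDIALS + medial) * FINALS + final)


def decompose(syllable: str) -> Tuple[int, int, int]:
    """Recover the zero-based jamo indices of a precomposed syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < syllable_count():
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, MEDIALS * FINALS)
    medial, final = divmod(rest, FINALS)
    return initial, medial, final


if __name__ == "__main__":
    print(syllable_count())
    print(decompose(compose(18, 0, 4)))
```

Under these assumptions the product 19 x 21 x 28 gives the 11,172 syllables mentioned in the footnote, so the 2,350 syllables prescribed by KSC-5601 cover roughly one fifth of the composable set.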
Another approach is based on the Markov random field (MRF) theory (Jung, 1996), whose Korean version will be added to IP this year.

5.3 Tree Tagger

(Kim, 1995) is a prototype using dependency grammar and adopting statistical methods for ranking the parse trees to get k-best parsing results. Its current accuracy is about 80% for the trained data. While this is a working prototype, we need a tree tagger with better performance, so another tree tagger using a partial parsing method (Abney, 1991) is at the experimental stage.

5.4 Korean/English Alignment System

An alignment system gathers correspondences between surface representations of both languages. (Shin, 1996) experimented with the expectation-maximization algorithm, achieving 68.7% accuracy at phrase level, and this will be incorporated into the version 2 platform.

5.5 KWIC Manager

The keyword-in-context (KWIC) manager deals with word usage in the text corpus. Its functions include indexing and searching word phrases, morphemes or unigrams, applying logic operations (AND, OR, NOT) to them, and sorting the results.

5.6 Text/Dictionary Management System

The goals of TDMS are twofold: to provide customizable information extraction/indexing/search tools and managerial functions for the text data base; and to provide an environment for dictionary development and management as well as for converting or merging existing dictionaries into the intended one according to the user's specification.

Because of the large size of each text to be stored and the many keywords to be indexed and searched for each text, it requires special storing and managing mechanisms. This is also the case for dictionary management. For extensibility and adaptability, we have devised a standard dictionary markup language based on SGML. Templates (dictionary features, text descriptors, and relations among those) and specifications for the text/dictionary editor and format translator have also been designed, and low-level design is under way. This work is being coded on PC Windows and will output the first draft version this year.

6 Conclusion

To this point we have described the motivation and current status of the Korean IP, and taken a brief look at resources and tools. We started the project in 1994 to yield the version 1 platform in 1995 and are working on the version 2 platform. The project will continue into the twenty-first century. Although the current status is just a starting point, the long-term goal is to build fully automatic servers for Korean language information. Since IP plays a key role in the effort, we hope that our endeavors will be well geared to the needs of nation-wide language engineering.

References

Abney, Steven. 1991. Parsing by Chunks. In Berwick, R., Abney, S., and Tenny, C. (eds.), Principle-Based Parsing. Kluwer Academic Publishers.

Berners-Lee, Tim, and Connolly, Daniel. 1993. Hypertext Markup Language: A Representation of Textual Information and Metainformation for Retrieval and Interchange. CERN, USA.

Jung, Sung-Young. 1996. A Markov Random Field based English Part-of-Speech Tagging System. M.S. Thesis, Korea Advanced Institute of Science and Technology. (to appear in COLING96.)

Kim, Hiongun. 1995. Korean Syntactic Analysis with Probabilistic Dependency Grammar. M.S. Thesis, Korea Advanced Institute of Science and Technology.

Kim, Jae-Hoon, and Seo, Jungyun. 1994. A Korean Part-of-Speech Tag Set for Natural Language Processing. Technical report no. CAIR-TR-94-55. KAIST: Center for Artificial Intelligence Research.

Lee, Eun-Chul. 1993. An Improved Method on Korean Morphological Analysis Based on CYK Algorithm. M.S. Thesis, Pohang Institute of Science and Technology.

Marcus, Mitchell P., Santorini, Beatrice, and Marcinkiewicz, Mary A. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2): 313-330.

Oh, Gil-Rok, Choi, Key-Sun, and Park, Se-Young. 1994. Hangul Engineering. Seoul, Korea: Daeyoungsa.

Shin, Jung-Ho, Han, Young-Seok, Park, Young-Chan, and Choi, Key-Sun. 1995. An HMM Part-of-Speech Tagger for Korean Based on Word-phrase. Recent Advances in Natural Language Processing, Bulgaria.

Shin, Jung-Ho. 1996. Aligning a Parallel Korean-English Corpus at Word and Phrase Level. M.S. Thesis, Korea Advanced Institute of Science and Technology. (to appear in COLING96.)

Simpkins, N. K. 1994. ALEP (Advanced Language Engineering Platform): An Open Architecture for Language Engineering. CEC and Cray Systems, Luxembourg.