Korean Language Engineering

This document discusses the current status of Korean language engineering and the development of an information platform. It describes how Korean language engineering aims to establish resources and tools to support researchers. The information platform will integrate projects in key areas like natural language processing, text analysis, and knowledge processing. It will provide linguistic databases, language tools, and documentation through various access methods like HTTP and FTP servers. This will help avoid duplicate work and advance the field of Korean language engineering.


Korean Language Engineering: Current Status of the Information Platform *

Kim, Seongyong and Choi, Key-Sun
Department of Computer Science
Korea Advanced Institute of Science and Technology
Taejon, Korea
[email protected], [email protected]

Abstract

Language engineering implements functions of a language and information
via computers. The need for language engineering platforms has been
generally recognized, and several research efforts are being undertaken
around the world. Our goal is to establish a Korean information platform
of linguistic resources and tools for the Korean language and information
communities. The platform will support researchers and engineers with
well-developed and standardized resources and application tools, thereby
avoiding duplicate activities from scratch and amplifying overall effort
in the domain. This paper reports the components and the current status
of the project, and the importance of the effort.

* This work is funded by the Ministry of Science and Technology and the
Ministry of Education and Athletics, as a part of a contract by the Center
for Korean Language Engineering.

1 Korean Language Engineering

1.1 Language Engineering

Language engineering is an activity that implements various functions
related to a language and builds up an information base. It realizes
linguistic activities of everyday life and the linguistic competence of
human beings with the aid of computer science, thereby supporting
people's intellectual linguistic productions. Language engineering not
only collects and disseminates the information and knowledge of a
language among the linguistic society but also serves as a foundation on
which linguistic culture and technologies can be based (Oh et al., 1994).

1.2 Korean Language Engineering

Korean language engineering is language engineering for the Korean
language. It was born in the early 1980s with the emergence of personal
computers (PCs). In the beginning, efforts focused on Korean alphabets
and some scrappy parts of character processing, lacking the global view
of the engineering approaches. Technical approaches to Korean began with
the formation of the special interest group on Korean information
processing under the Korea Information Science Society. In 1994 the
Center for Korean Language Engineering (KLE) was founded to serve as a
central organization for Korean language engineering, which aims to plan
and program related projects and works in a consistent, systematic way
with long-term goals. It also incorporates academic and research
institutes and industries into common goals: the efficient and harmonious
drive toward research and development, and establishment of long-range
policies and strategies for Korean language engineering.

2 Areas of Korean Language Engineering Research

According to the level of technologies, KLE partitioned its projects into
three classes.

Fundamental technology deals with radical and theoretical research,
collection and manipulation of data, and standardization. From the
linguistic viewpoint, these include language formalisms, text corpora,
and statistical information of a language. On the information engineering
side, the technology covers information interchange and compression
techniques, basic techniques of artificial intelligence such as knowledge
representation, searching, and tools for manipulating Korean alphabets.
From the cognitive engineering point of view, the research focuses on the
structure of Korean alphabets, fonts, command structures, and
interdisciplinary works of cognitive science. Another division handles
standardization issues for code schemes and vocabularies, keyboard
layout, standard text formats, and internationalization.

The second class is called basic technology, which is related to the
basic software libraries for Korean language processing. Included in this
class are natural language analysis, pattern recognition, multimedia data
base, and data conversion tools.

The third class is applications technology. It
consists of systems for text interchange and compression, hypertext,
multimedia, word processing and others. For knowledge processing, it will
cover document paraphrasing, indexing and retrieval, computer-based
instruction/education, etc.

3 Information Platform

For Korean language engineering, it is necessary to develop
systematically all the projects of each area and integrate them into a
uniform frame, called an information platform (IP).¹ KLE programs each
project according to its priority and state-of-the-art technology.
Consequently, IP reflects the status of ongoing projects and is an as-is
framework on which further research and development work can be
performed.

Figure 1 shows the conceptual diagram of IP. This platform doesn't
integrate all the project outcomes but some of the fundamental resources
and basic tools, since it reflects the current configuration that is not
concrete but open to changes. The whole integration of the project
outcomes will be available at the end of the first phase in 1997.

[Figure 1: The Conceptual Diagram of the Information Platform. Labels
recoverable from the diagram: network access, http server, ftp server,
Text/Dic Mgt System; legend U: Unix, S: Solaris, W: Windows; note:
developing for Unix first, then for S and W.]

This platform is different from ALEP (Advanced Language Engineering
Platform) (Simpkins, 1994) in that ALEP is an environment that can be
provided to users as a form of a (customizable) package whereas our
platform is a server-client model in pursuit of a web-based service for
resources and tools.

Worldwide web is composed of hyperdocuments and hyperlinks to handle
multimedia data as well as to provide easy and timely access to
electronic information. It uses hypertext markup language (HTML) based on
Standard Generalized Markup Language (SGML). Therefore, it guarantees the
standardization and straightforward design characteristics, which lead to
the ease of system design and flexibility of the system configurations
(Berners-Lee & Connolly, 1993). Its other characteristic lies in the
common gateway interface (CGI) which makes it possible to interface with
various shell scripts and program codes without difficulties. Yet another
point is that the server-client model makes the platform transparent to
the users.

IP consists of three parts. First, text corpora, voice and handwritten
scripts DBs, dictionaries and a set of terminological DBs constitute the
information base. The information base may directly be distributed
through ftp server or indirectly accessed by the language tools on the
higher layer of the http server configuration.

Secondly, language tools are running on the http server with the aid of
CGI as well as being ftp-ed to users as executable codes. Since we aim to
provide software versions on Unix, Solaris, and PC Windows altogether,
initial hardware requirements for each tool may be different.²

Finally, documentation will be prepared as the project progresses.

¹ "http://world.kaist.ac.kr/KLE/KIBS/" is on SunOS; the version 1
platform and web pages are only in Korean. The 2nd version will be
released on Solaris at the address "http://kibs.kaist.ac.kr/KLE/KIBS/".

² For example, the text and dictionary management system is currently
being built upon PC Windows so that Unix and Solaris executables are not
yet available.

4 Information Base

4.1 Text Corpus

Text corpora are essential to statistical modeling, in developing formal
theories of the grammars, investigating prosodic phenomena in speech, and
evaluating or comparing the adequacy of parsing models (Marcus et al.,
1993). There are four sorts of corpora from contemporary Korean texts.

• Raw corpus
  Two factors are considered: the genre of each source text that is
  related to the objective(s) in using the corpus, and the category of
  the text that represents the internal structure of the text. Major
  sources of the corpus include books, magazines, and newspapers; up to
  date three million word phrases are gathered.

• Part-of-speech (POS) tagged corpus
  POS tagset for Korean originated from (Kim & Seo, 1994). In version 1
  platform we yielded 2.5 million automatically tagged word phrases and
  1.5 million post-edited word phrases.

• Tree-tagged corpus
  This can be produced by applying syntactic tagset to the POS tagged
  corpus. The syntactic tagset is being studied using 100,000 sentences
  out of POS tagged corpus, and the resultant tree-tagged corpus using a
  tree tagger will appear at the end of this year.

• Categorized corpus
  Korean verbs and adjectives are classified into over seventy
  categories, and a set of sentence styles are investigated for 940 basic
  verbs of those categories. About thirty-five thousand sentences are
  available in version 1 platform.

4.2 Voice Data Base

This resource can be used for speech recognition and synthesis
applications. We initially focused on word-level voice data. It includes
phonetically balanced words, phonemic sequences pronounced by four
different speakers, and narration of sample stories. It also stores the
sounds of single syllables, diphones, numerics, high-frequency words,
gazetteers, functional words, and consecutive word sequences. The data
are stored on server disks and CD-ROMs as waveforms. This effort will be
extended to sentence-level collections such as phonetically balanced
sentences, speech dialogues, and scenarios.

4.3 Handwritten Scripts Data Base

Since character recognition systems are under the control of applications
engineers, the objective of this work is to provide well-formed data and
evaluation criteria for those recognition systems. We divide our data
collection into three phases: to scan, with 300 dpi resolution, one
thousand sets of 590 high-frequency syllables in the first year, then of
990 syllables and 2,350 syllables in the following years.³ At each phase,
we develop both the square-hand characters and free-style characters.

³ It is possible to compose up to 11,172 syllables out of the Korean
alphabets, but Korean Standard Code KSC-5601 prescribes 2,350 complete
codes for Korean syllables.

4.4 Dictionaries and Terminological Data Base

• Multilingual technical dictionary
  The objective is to set up mappings between technical terms of Korean
  and other language(s) in both directions. The first work is done for
  computer science domain, and it has 35,000 entries each for Korean and
  English. It will be extended to cover Chinese, Japanese, and German as
  well as more domains including electrical/electronic engineering,
  medical science, law, etc.

• Monolingual terminology data bank
  Users need definitions and explanations of technical terms during their
  work on specific domains. This work provides users such terminological
  details. We assorted 15,000 entries each for culture/art and Korean
  classical literature.

• Ontology-based lexicon
  Currently available dictionaries are semantically oriented. They don't
  provide pools of target language expressions but offer basic meanings
  for entries together with some syntactic and morphological information.
  Ontology-based lexicon is lexically oriented in that it guides the user
  to find a pragmatically or contextually equivalent expression
  corresponding to the source language expression. The work is on the
  phase of feasibility study with intensive focus on collecting
  Korean-English bilingual information sources and developing tools for
  lexicon construction.

• Lexicon for morphological analysis
  The lexicon for Korean morphological analysis is currently being built
  to have 30,000 entries with off-line management tools, and will grow to
  100,000 entries with on-line tools after two more years.⁴

⁴ We can conceive many more types of dictionaries: for example, lexicons
for syntactic and semantic analyses, and dictionaries that are to be
created or extracted from existing ones upon users' or developers' needs.
These will be included after the first phase of the project, following
future direction of the project.

5 Language Engineering Tools

Basically, the tools that we present here are for text corpus and
dictionaries, except for voice and character recognizers. The latter two
programs are currently under development and will be integrated later.

5.1 Morphological Analyzer

Morphological analysis is an important but difficult part of the analysis
since Korean is an agglutinative language with sophisticated morpheme
segmentation rules and morphotactic rules. The morphological analyzer is
based on the Korean chart parsing (Lee, 1993). Its current precision is
over 92 percent for the grammatical input sentences. It aims to achieve
98 percent accuracy in two more years. It will be extended to cover
special symbols, alien strings, elliptical or abbreviated words, and
spelling errors to earn higher accuracy.

5.2 Tagger

Because the output of morphological analysis is rather complex due to the
characteristics of Korean, the use of a tagger to reduce ambiguities
seems important for further processing. (Shin et al., 1995) adopts the
hidden Markov model and takes into account the characteristics of Korean
word phrase structures for more accurate tagging: a word phrase contains
one or more morphemes, syntactic information (grammatical relations by
bound morphemes), and semantic information (case roles by postpositions).
The experiments revealed 98 % accuracy for the test set of 5,500 word
phrases out of 55,000 training data, and 94.7 % for 5,500 untrained test
data.
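
The hidden Markov model tagging mentioned in Section 5.2 can be
illustrated with a minimal Viterbi decoder. This is only a sketch under
stated assumptions, not the tagger of (Shin et al., 1995): the two-tag
tagset, the romanized toy words, and every probability below are
hypothetical, and the word-phrase-specific modeling described in the
paper is not reproduced.

```python
import math

# Hypothetical toy model: tags, words, and probabilities are illustrative
# assumptions, not values taken from the paper.
TAGS = ["N", "V"]
START = {"N": 0.7, "V": 0.3}
TRANS = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.7, "V": 0.3}}
EMIT = {
    "N": {"hakgyo": 0.6, "ganda": 0.1, "bap": 0.3},
    "V": {"hakgyo": 0.05, "ganda": 0.8, "bap": 0.15},
}
UNSEEN = 1e-6  # floor probability for words missing from EMIT


def viterbi(words):
    """Return the most probable tag sequence for `words` (log-space Viterbi)."""
    if not words:
        return []
    # Each column maps tag -> (best log-probability so far, previous tag).
    columns = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], UNSEEN)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            # Pick the predecessor tag that maximizes the path score into t.
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(TRANS[p][t])) for p in TAGS),
                key=lambda item: item[1],
            )
            column[t] = (score + math.log(EMIT[t].get(word, UNSEEN)), prev)
        columns.append(column)
    # Trace the backpointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))
```

Calling `viterbi(["hakgyo", "ganda"])` decodes the two toy word phrases
jointly, so the transition probabilities can overrule a locally likely
tag. A real tagger would estimate the three tables from a tagged corpus.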

Another approach is based on the Markov random field (MRF) theory (Jung,
1996), whose Korean version will be added to IP this year.

5.3 Tree Tagger

(Kim, 1995) is a prototype using dependency grammar and adopting
statistical methods for ranking the parse trees to get k-best parsing
results. Its current accuracy is about 80 % for the trained data. While
this is a working prototype, we need a tree tagger with better
performance, so another tree tagger using partial parsing method (Abney,
1991) is on the breadboard.

5.4 Korean/English Alignment System

An alignment system gathers correspondences between surface
representations of both languages. (Shin, 1996) experimented with the
expectation-maximization algorithm with 68.7 % accuracy at phrase level,
and this will be incorporated into version 2 platform.

5.5 KWIC Manager

Keyword-in-context (KWIC) manager deals with word usage of text corpus.
Its functions include indexing and searching word phrases, morphemes or
unigrams, applying logic operations (AND, OR, NOT) to them, and sorting
the results.

5.6 Text/Dictionary Management System

TDMS' goals are twofold: to provide customizable information
extraction/indexing/search tools and managerial functions for text data
base; and to provide an environment for dictionary development and
management as well as converting or merging existing dictionaries to the
intended one according to user's specification.

Because of the big size of each text to be stored and lots of keywords to
be indexed and searched for each text, it requires special storing and
managing mechanisms. This is also the case for the dictionary management.
For the extensibility and adaptability, we have devised standard
dictionary markup language based on SGML. Templates (dictionary features,
text descriptors, and relations among those), specifications for
text/dictionary editor and format translator have also been designed, and
low-level design is being undertaken. This work is being coded on PC
Windows and will output the first draft version this year.

6 Conclusion

To this point we described the motivation and current status of the
Korean IP, and took a brief look at resources and tools. We started the
project in 1994 to yield version 1 platform in 1995 and are working on
version 2 platform. The project will continue into the twenty-first
century. Although the current status is just an opening spot, the
long-term goal is to build fully automatic servers for Korean language
information. Since IP plays a key role in the effort, we hope that our
endeavors would be well geared to the needs of nation-wide language
engineering.

References

Abney, Steven. 1991. Parsing by Chunks. Berwick, R., Abney, S., and
Tenny, C. (eds.), Principle-Based Parsing. Kluwer Academic Publishers.

Berners-Lee, Tim, and Connolly, Daniel. 1993. Hypertext Markup Language:
A Representation of Textual Information and Metainformation for Retrieval
and Interchange. CERN, USA.

Jung, Sung-Young. 1996. A Markov Random Field based English
Part-of-Speech Tagging System. M. S. Thesis, Korea Advanced Institute of
Science and Technology. (to appear in COLING96.)

Kim, Hiongun. 1995. Korean Syntactic Analysis with Probabilistic
Dependency Grammar. M. S. Thesis, Korea Advanced Institute of Science and
Technology.

Kim, Jae-Hoon, and Seo, Jungyun. 1994. A Korean Part-of-Speech Tag Set
for Natural Language Processing. Technical report no. CAIR-TR-94-55.
KAIST: Center for Artificial Intelligence Research.

Lee, Eun-Chul. 1993. An Improved Method on Korean Morphological Analysis
Based on CYK Algorithm. M. S. Thesis, Pohang Institute of Science and
Technology.

Marcus, Mitchell P., Santorini, Beatrice, and Marcinkiewicz, Mary A.
1993. Building a Large Annotated Corpus of English: The Penn Treebank.
Computational Linguistics, 19(2): 313-330.

Oh, Gil-Rok, Choi, Key-Sun, and Park, Se-Young. 1994. Hangul Engineering.
Seoul, Korea: Daeyoungsa.

Shin, Jung-Ho, Han, Young-Seok, Park, Young-Chan, and Choi, Key-Sun.
1995. An HMM Part-of-Speech Tagger for Korean Based on Word-phrase.
Recent Advances in Natural Language Processing, Bulgaria.

Shin, Jung-Ho. 1996. Aligning a Parallel Korean-English Corpus at Word
and Phrase Level. M. S. Thesis, Korea Advanced Institute of Science and
Technology. (to appear in COLING96.)

Simpkins, N. K. 1994. ALEP (Advanced Language Engineering Platform): An
Open Architecture for Language Engineering. CEC and Cray Systems,
Luxembourg.
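
As an illustration of the KWIC manager functions listed in Section 5.5
(indexing, searching, AND/OR/NOT operations, and sorting), the sketch
below builds a small inverted index. It is not the platform's actual
tool: the class name `KwicIndex`, its methods, and the English sample
sentences are hypothetical, and whitespace-delimited tokens stand in for
Korean word phrases.

```python
from collections import defaultdict


class KwicIndex:
    """Toy keyword-in-context index over whitespace-delimited tokens."""

    def __init__(self, sentences):
        self.sentences = [s.split() for s in sentences]
        self.postings = defaultdict(set)  # token -> ids of sentences containing it
        for sid, tokens in enumerate(self.sentences):
            for tok in tokens:
                self.postings[tok].add(sid)

    def ids(self, token):
        """Ids of the sentences that contain `token`."""
        return set(self.postings.get(token, set()))

    def and_(self, a, b):
        return self.ids(a) & self.ids(b)

    def or_(self, a, b):
        return self.ids(a) | self.ids(b)

    def not_(self, a):
        return set(range(len(self.sentences))) - self.ids(a)

    def concordance(self, token, width=2):
        """(left context, keyword, right context) tuples, sorted by right context."""
        lines = []
        for sid in sorted(self.ids(token)):
            toks = self.sentences[sid]
            for i, tok in enumerate(toks):
                if tok == token:
                    left = " ".join(toks[max(0, i - width):i])
                    right = " ".join(toks[i + 1:i + 1 + width])
                    lines.append((left, tok, right))
        return sorted(lines, key=lambda line: line[2])


# Hypothetical sample corpus.
index = KwicIndex([
    "the tagger reads the corpus",
    "the corpus feeds the parser",
    "a parser builds trees",
])
```

Here `index.and_("corpus", "parser")` returns the ids of sentences
containing both tokens, and `index.concordance("corpus")` lists each
occurrence with up to two tokens of context on either side. Sorting by
right context is one common concordance ordering; the paper does not say
which sort keys its tool supports.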