Shewa - NLP Project Report PDF
By Shewandires Menan
ID: BDU1100033PR
August, 2020
NATURAL LANGUAGE PROCESSING PROJECT PAPER
Contents
1. Introduction
2. Approaches used for this work
3. Methodology
4. Implementation
5. Challenges
6. References
1. Introduction
Parsing is one of the steps in designing a functional NLP application. Its output can serve as
input to many other NLP applications, such as grammar checkers, spell checkers, and spelling
correctors. Parsing centers on breaking text down into manageable components and understanding
their context and their relations to each other in order to judge their correctness. Sentences are
the starting point for analyzing written material or documents [1]. Syntax refers to the way words
are related to each other in a sentence. Sentence parsing, also called syntactic parsing, is
therefore the process of identifying how words can be put together to form a correct sentence,
determining what structural role (lexical category) each word plays in the sentence, which phrases
are subparts of which other phrases, and which words modify which other words. A sentence parser
outputs a parse structure that can be used as a component in many applications, including semantic
analysis, machine translation, and information storage and retrieval of textual data [2]. Today,
parsers of different kinds (e.g., probabilistic, rule-based) have been developed for languages that
are relatively widely used nationally and/or internationally (e.g., English, German, Chinese) [3].
My project work focuses on the implementation of an Amharic sentence parser that displays the
parse tree for a sentence. There are different methods for sentence parsing, among them
Context-Free Grammar (CFG) from the rule-based approach and Probabilistic Context-Free Grammar
(PCFG) from the statistical approach. My work uses these two approaches, CFG and PCFG [4].
2. Approaches used for this work
A PCFG is a context-free grammar that associates a probability with each of its productions. It
generates the same set of parses for a text as the corresponding context-free grammar does, and
assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the
product of the probabilities of the productions used to generate it [1]. PCFGs produce a model of
a language based on real data, and can therefore tolerate things like grammatical mistakes, which
occur in real-life text. Although PCFGs have many advantages, a critical disadvantage is that
context is not taken into account at all. In fact, a trigram model of a language (here, a model
over sequences of three words) would probably achieve better results, even though it takes no
account of the internal structure of the language, and may be more applicable to a language like
Amharic [3].
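The product rule stated above can be written compactly as follows. The notation is an assumption introduced here for illustration: T is a parse tree, and A_i -> beta_i (for i = 1..n) are the productions used to build it.

```latex
P(T) = \prod_{i=1}^{n} P(A_i \rightarrow \beta_i),
\qquad
\sum_{\beta} P(A \rightarrow \beta) = 1 \quad \text{for every nonterminal } A
```

The second condition says that the probabilities of all productions sharing a left-hand side sum to 1. When a sentence has several parses, a probabilistic parser can pick the tree T with the largest P(T).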
3. Methodology
The methodology I used to implement the Amharic parse tree application is as follows: I took a
set of 4 sample grammars, ranging from simple to complex production rules, assigned probabilities
to their rules for the probabilistic parsing approach, drew their parse trees, and specified their
parse structures based on the grammar.
On the source-code side, I used a collection of tools that support the main application for
different purposes [2]. They are listed below.
❖ Python 3.7
❖ NLTK 3.2, a Python-based natural language processing toolkit (www.nltk.org)
❖ KeyMan Keyboard, for typing Amharic in Unicode
❖ PyScripter 3.7, an interactive IDE for Python
To set up the implementation in a local environment, Python 3.7 must be installed first. NLTK 3.2
is then downloaded and installed under the Python directory, because it is used as a library
inside the Python code. Finally, the NLTK data is downloaded from within Python itself.
4. Implementation
The first sample implementation of my work is the CFG approach to Amharic sentence parsing. The
source code and the output of the implementation are as follows. An example of a CFG is given below.
A sentence like "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ" can be represented using the following grammar.
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመደ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
The syntactic parse structure for the above example, and its parse tree as produced by the
developed application, look like the following, respectively:
(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (V አየ)))
Output is:
[Figure: parse tree of the example sentence]
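A minimal sketch of how the CFG and the example sentence above could be parsed with NLTK. The use of `nltk.CFG.fromstring` and `nltk.ChartParser` is an assumption; the report does not show which parser class the application uses. The third verb is spelled as in the PCFG listing.

```python
import nltk

# Grammar rules copied from the report. Loading them with CFG.fromstring
# is an assumption about how the application was built.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመደ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
""")

# The sentence is already space-separated, so split() is enough to tokenize it.
tokens = "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ".split()

# ChartParser yields every parse tree the grammar licenses for the tokens.
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(tokens))

for tree in trees:
    # A wide margin keeps the bracketed structure on a single line.
    print(tree.pformat(margin=200))
```

Calling `tree.draw()` on a result opens a window with the tree drawn graphically, which needs a display and the Tk toolkit.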
The second implementation of my work is the PCFG approach to Amharic sentence parsing. The
source code and the output of the implementation are as follows.
An example PCFG is shown below, and the approach is explained after it.
S -> NP VP [1.0]
VP -> V NP [0.2]
VP -> V NP PP [0.3]
VP -> NP V [0.1]
VP -> NP Adj V [0.4]
PP -> P NP [0.2]
PP -> P P [0.8]
V -> "አየ" [0.8]
V -> "በላ" [0.1]
V -> "ተራመደ" [0.1]
NP -> "አበበ" [0.2]
NP -> "ከበደ" [0.1]
NP -> "ጫላ" [0.1]
NP -> Det N [0.1]
NP -> Det N N [0.1]
NP -> Det N PP [0.1]
NP -> N N [0.1]
NP -> Det N N PP [0.2]
Det -> "የ" [0.9]
Det -> "ለ" [0.1]
N -> "ሰዉ" [0.4]
N -> "ውሻ" [0.1]
N -> "አጥር" [0.2]
N -> "ድመት" [0.1]
N -> "መናፈሻ" [0.1]
P -> "በ" [0.1]
P -> "ላይ" [0.4]
P -> "በኩል" [0.1]
P -> "ሆኖ" [0.3]
P -> "ከ" [0.1]
Adj -> "ትንሽ" [1.0]
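In a PCFG, the probabilities of all productions with the same left-hand side should sum to 1. The sketch below checks this for the probabilities listed above, using plain Python. The dictionary layout and the function name `unnormalized` are illustrative choices, not part of the original application.

```python
# Rule probabilities copied from the PCFG listing above, grouped by the
# left-hand side of each production.
RULE_PROBS = {
    "S": [1.0],
    "VP": [0.2, 0.3, 0.1, 0.4],
    "PP": [0.2, 0.8],
    "V": [0.8, 0.1, 0.1],
    "NP": [0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2],
    "Det": [0.9, 0.1],
    "N": [0.4, 0.1, 0.2, 0.1, 0.1],
    "P": [0.1, 0.4, 0.1, 0.3, 0.1],
    "Adj": [1.0],
}


def unnormalized(rule_probs, tol=1e-9):
    """Return {lhs: total} for each left-hand side whose total is not 1."""
    result = {}
    for lhs, probs in rule_probs.items():
        total = sum(probs)
        if abs(total - 1.0) > tol:
            # Round only for display, to hide floating-point noise.
            result[lhs] = round(total, 10)
    return result


print(unnormalized(RULE_PROBS))
```

With the values as listed, every left-hand side sums to 1 except N, whose five productions total 0.9. To my understanding, NLTK's `PCFG` class performs a similar check when a grammar is built, so one of the N probabilities in the listing may differ from the value that was actually run.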
The parse structure produced by the Viterbi algorithm with the above grammar is shown below,
together with the probability of the parse, which is the product of the probabilities of the
productions used.
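import nltk
# `grammar` is assumed to hold the PCFG above, e.g. loaded with nltk.PCFG.fromstring(...)
viterbi_parser = nltk.ViterbiParser(grammar)
sent = "አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()
for tree in viterbi_parser.parse(sent):  # parse() returns an iterator over trees
    print(tree)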
(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (Adj ትንሽ) (V አየ)))
(p=8.84736e-05)
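The reported probability can be checked by hand. The sketch below multiplies the probabilities of the twelve productions that appear in the parse above, with the values taken from the PCFG listing. The variable names are illustrative only.

```python
# Productions used in the reported parse, with their probabilities as
# listed in the PCFG. The rule strings are labels for readability only.
used_productions = [
    ("S -> NP VP", 1.0),
    ("NP -> 'አበበ'", 0.2),
    ("VP -> NP Adj V", 0.4),
    ("NP -> Det N N PP", 0.2),
    ("Det -> 'የ'", 0.9),
    ("N -> 'ሰዉ'", 0.4),
    ("N -> 'አጥር'", 0.2),
    ("PP -> P P", 0.8),
    ("P -> 'ላይ'", 0.4),
    ("P -> 'ሆኖ'", 0.3),
    ("Adj -> 'ትንሽ'", 1.0),
    ("V -> 'አየ'", 0.8),
]

# The probability of a parse is the product of its production probabilities.
parse_probability = 1.0
for _rule, prob in used_productions:
    parse_probability *= prob

print(f"{parse_probability:.5e}")
```

The product agrees with the value p=8.84736e-05 printed by the parser.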
5. Challenges
Some challenges arose while doing the project.
1. This study uses a very small sample prepared for the purpose of the work, due to lack of
time and the difficulty of finding a well-organized corpus, a machine-readable dictionary,
POS-tagged words and, in particular, a POS tagger application for Amharic.
2. The prototype developed in this study is assumed to support Amharic sentences of 10 or
more words, but its real performance could not be established, again mainly due to time
constraints and the lack of linguistic expertise needed to determine grammar rules and
rule probabilities.
3. This report does not cover more advanced topics such as ambiguity resolution, but shows
sample parsing using probabilistic approaches.
6. References
[1] A. Alemu, "Automatic Sentence Parsing for Amharic Text: An Experiment Using
Probabilistic Context Free Grammars," thesis submitted in partial fulfilment of the
requirements for the degree of Master of Science in Information Science, 2002.
[2] "Natural language processing toolkit," accessed from http://www.nltk.org/.
[3] D. Jurafsky and J. H. Martin, "Speech and Language Processing: An Introduction
to Natural Language Processing, Computational Linguistics, and Speech Recognition," 2007.
[4] A. Bayou, "Design and Development of Word Parser for Amharic Language,"
Master's thesis, Addis Ababa University, 2000.