Bahir Dar University

Bahir Dar Institute of Technology


Faculty of Computing
MSc in Information Technology 1st Year (Regular)

Natural Language Processing (NLP)

Project Report paper on


“Amharic sentence Parse Tree”

By Shewandires Menan

ID: BDU1100033PR

Submitted to: Dr. Yaregal A.

August, 2020
NATURAL LANGUAGE PROCESSING PROJECT PAPER

Contents

1. Introduction
2. Approaches used for this work
3. Methodology
4. Implementation
5. Challenges
6. References

BAHIR DAR UNIVERSITY | BIT i



1. Introduction
Parsing is one of the steps in designing a functional NLP application. Its output can serve as
input to many other NLP applications, such as grammar checkers, spell checkers, and spelling
correction. Parsing involves breaking text down into manageable components and understanding
their context and their relations to each other in order to judge whether the text is well
formed. Sentences are the starting point for analyzing written material or documents [1]. Syntax
refers to the way words are related to each other in a sentence. Sentence parsing, also called
syntactic parsing, is therefore the process of identifying how words can be put together to form
a correct sentence, determining what structural role (lexical category) each word plays in the
sentence, which phrases are subparts of which other phrases, and which words modify which other
words. A sentence parser outputs a parse structure that can be used as a component in many
applications, including semantic analysis, machine translation, and information storage and
retrieval of textual data [2]. Today, parsers of different kinds (e.g., probabilistic and
rule-based) have been developed for languages that are widely used nationally or
internationally (e.g., English, German, Chinese) [3]. My project work focuses on implementing a
parser that displays the parse tree of an Amharic sentence. There are different methods of
sentence parsing; among them are Context-Free Grammar (CFG), a rule-based approach, and
Probabilistic Context-Free Grammar (PCFG), a statistical approach. My work uses these two
approaches, CFG and PCFG [4].

2. Approaches used for this work


As mentioned in the section above, the approaches I used for this implementation are CFG, a
non-statistical (rule-based) method, and PCFG, a statistical method.
Context-free Grammar
A context-free grammar (CFG) is a formal system that describes a language by specifying how
any legal text can be derived from a distinguished symbol called the axiom, or sentence symbol
[2]. CFGs are a very important class of grammars for two reasons: the formalism is powerful
enough to describe most of the structure in natural languages, yet it is restricted enough that
efficient parsers can be built to analyze sentences [3].
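
The idea of deriving a sentence from the sentence symbol can be sketched in a few lines of plain Python. This is only an illustration: the toy grammar below is a hypothetical subset of rules built from words used later in this report, and the function name `leftmost_derivation` is an assumption made for this sketch, not part of the project code.

```python
# Toy CFG (hypothetical subset, for illustration only): each nonterminal
# maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["NP", "V"]],
    "NP": [["አበበ"], ["Det", "N"]],
    "Det": [["የ"]],
    "N": [["ውሻ"]],
    "V": [["አየ"]],
}


def leftmost_derivation(choices, start="S"):
    """Rewrite the leftmost nonterminal at each step.

    choices[i] is the index of the alternative applied at step i.
    Returns every sentential form, from the start symbol to the end.
    """
    form = [start]
    steps = [list(form)]
    for choice in choices:
        # Position of the leftmost symbol that is still a nonterminal.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form[i:i + 1] = GRAMMAR[form[i]][choice]
        steps.append(list(form))
    return steps


steps = leftmost_derivation([0, 0, 0, 1, 0, 0, 0])
for step in steps:
    print(" ".join(step))
```

Each printed line is one sentential form; the last line contains only terminals, which is the derived sentence.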


Probabilistic Context-Free Grammars (PCFG) Parsing

A PCFG is a context-free grammar that associates a probability with each of its productions. It
generates the same set of parses for a text that the corresponding context-free grammar does, and
assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the
product of the probabilities of the productions used to generate it [1]. PCFGs produce a model of
a language based on real data, and are therefore tolerant of things like the grammatical
mistakes that occur in real-life text. Although PCFGs have many advantages, a critical
disadvantage is that context is not taken into account at all. In fact, a trigram model (a model
over sequences of three words) would probably achieve better results, even though it takes no
account of the internal structure of the language, and such a model may be more applicable to a
language like Amharic [3].

3. Methodology
To develop the implementation of the Amharic parse tree, I took a set of sample grammars 4,
ranging from simple to complex production rules, assigned probabilities to them for the
probabilistic approach, drew their parse trees, and specified their parse structures based on
the grammar.

For the source code, I used a collection of tools that support the main application for
different purposes [2]. They are listed below.
❖ Python 3.7
❖ NLTK 3.2 Python Based Natural Language Processing Toolkit. (www.nltk.org)
❖ KeyMan Keyboard, a Unicode keyboard for writing Amharic
❖ PyScripter 3.7, an interactive IDE for Python.
To set up the implementation in a local environment, Python 3.7 must be installed first. NLTK 3.2
is then downloaded and installed under the Python directory, because it is used as a library
inside the Python code. Finally, the NLTK data is downloaded from within Python itself.


4. Implementation
The first sample implementation of my work is the CFG approach to Amharic sentence parsing. The
source code and the output of the implementation are as follows. An example of a CFG is given
below: a sentence like "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ" can be represented using the following grammar.

S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመደ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"

The syntactic parse structure for the above example, followed by its parse tree as produced by
the developed application, is:

(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (V አየ)))

Output is: [Figure: parse tree for the sentence]
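
A minimal sketch of how the CFG and the parse structure above could be reproduced with NLTK. It assumes `nltk.CFG.fromstring` and `nltk.ChartParser`; the report does not state which NLTK parser class the application used, so the choice of `ChartParser` is an assumption. The verb "ተራመደ" is spelled as in the PCFG listing.

```python
import nltk

# The CFG from this section, written in NLTK's grammar notation.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP | NP V
PP -> P NP | P P
V -> "አየ" | "በላ" | "ተራመደ"
NP -> "አበበ" | "ከበደ" | "ጫላ" | Det N | Det N N | Det N PP | N N | Det N N PP
Det -> "የ" | "ለ"
N -> "ሰዉ" | "ውሻ" | "አጥር" | "ድመት" | "መናፈሻ"
P -> "በ" | "ላይ" | "በኩል" | "ሆኖ" | "ከ"
""")

# Assumption: a chart parser; the report does not name the parser it used.
parser = nltk.ChartParser(grammar)

# The sentence is already separated into tokens by spaces.
sent = "አበበ የ ሰዉ አጥር ላይ ሆኖ አየ".split()
trees = list(parser.parse(sent))

for tree in trees:
    print(tree)          # bracketed parse structure
    tree.pretty_print()  # text rendering of the tree in the console
```

Calling `tree.draw()` instead of `tree.pretty_print()` opens a graphical window with the tree, which needs a display and Tk support.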

The second implementation of my work is the PCFG approach to Amharic sentence parsing. The
source code and the output of the implementation are as follows.

An example PCFG grammar is shown below, and the approach is explained after it.


S -> NP VP [1.0]
VP -> V NP [0.2] | V NP PP [0.3] | NP V [0.1] | NP Adj V [0.4]
PP -> P NP [0.2] | P P [0.8]
V -> "አየ" [0.8] | "በላ" [0.1] | "ተራመደ" [0.1]
NP -> "አበበ" [0.2] | "ከበደ" [0.1] | "ጫላ" [0.1] | Det N [0.1] | Det N N [0.1] | Det N PP [0.1] | N N [0.1] | Det N N PP [0.2]
Det -> "የ" [0.9] | "ለ" [0.1]
N -> "ሰዉ" [0.4] | "ውሻ" [0.1] | "አጥር" [0.2] | "ድመት" [0.1] | "መናፈሻ" [0.1]
P -> "በ" [0.1] | "ላይ" [0.4] | "በኩል" [0.1] | "ሆኖ" [0.3] | "ከ" [0.1]
Adj -> "ትንሽ" [1.0]
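
In a PCFG, the probabilities of all productions that share a left-hand side are expected to sum to 1. The following is a small plain-Python sketch of that check, applied to the rules as they are listed above. The names `RULES` and `lhs_totals` are chosen for this illustration only.

```python
from collections import defaultdict

# (left-hand side, right-hand side, probability), copied from the listing.
RULES = [
    ("S", "NP VP", 1.0),
    ("VP", "V NP", 0.2), ("VP", "V NP PP", 0.3),
    ("VP", "NP V", 0.1), ("VP", "NP Adj V", 0.4),
    ("PP", "P NP", 0.2), ("PP", "P P", 0.8),
    ("V", "አየ", 0.8), ("V", "በላ", 0.1), ("V", "ተራመደ", 0.1),
    ("NP", "አበበ", 0.2), ("NP", "ከበደ", 0.1), ("NP", "ጫላ", 0.1),
    ("NP", "Det N", 0.1), ("NP", "Det N N", 0.1), ("NP", "Det N PP", 0.1),
    ("NP", "N N", 0.1), ("NP", "Det N N PP", 0.2),
    ("Det", "የ", 0.9), ("Det", "ለ", 0.1),
    ("N", "ሰዉ", 0.4), ("N", "ውሻ", 0.1), ("N", "አጥር", 0.2),
    ("N", "ድመት", 0.1), ("N", "መናፈሻ", 0.1),
    ("P", "በ", 0.1), ("P", "ላይ", 0.4), ("P", "በኩል", 0.1),
    ("P", "ሆኖ", 0.3), ("P", "ከ", 0.1),
    ("Adj", "ትንሽ", 1.0),
]


def lhs_totals(rules):
    """Sum the rule probabilities for each left-hand side."""
    totals = defaultdict(float)
    for lhs, _rhs, prob in rules:
        totals[lhs] += prob
    # Round away floating-point noise such as 0.9000000000000001.
    return {lhs: round(total, 10) for lhs, total in totals.items()}


totals = lhs_totals(RULES)
for lhs, total in totals.items():
    flag = "" if abs(total - 1.0) < 1e-9 else "  <-- does not sum to 1"
    print(f"{lhs}: {total}{flag}")
```

Run on the rules as listed here, every left-hand side totals 1 except `N`, which totals 0.9. This suggests that one `N` production or probability may be missing from the listing. NLTK's `PCFG` class raises a `ValueError` when the productions for a left-hand side do not sum to 1 (within a small tolerance), so the listing would need that gap resolved before being loaded with `nltk.PCFG.fromstring`.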
The parse structure produced by the Viterbi algorithm using the above grammar is shown below,
together with its probability, which is the product of the probabilities of the productions used.

Code Example Using Python

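import nltk
# `grammar` is the PCFG listed above, loaded with nltk.PCFG.fromstring(...)
viterbi_parser = nltk.ViterbiParser(grammar)
sent = "አበበ የ ሰዉ አጥር ላይ ሆኖ ትንሽ አየ".split()
for tree in viterbi_parser.parse(sent):  # parse() yields the most probable tree
    print(tree)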

Output of the above grammar and the Viterbi parser in my application:

(S (NP አበበ) (VP (NP (Det የ) (N ሰዉ) (N አጥር) (PP (P ላይ) (P ሆኖ))) (Adj ትንሽ) (V አየ)))
(p=8.84736e-05)
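
The reported probability can be checked by hand: it is the product of the probabilities of the twelve productions that appear in the parse above. A short plain-Python sketch of that arithmetic follows; the list name `used` is chosen for this illustration.

```python
# Productions used in the parse above, with their probabilities
# taken from the PCFG listing.
used = [
    ("S -> NP VP", 1.0),
    ("NP -> አበበ", 0.2),
    ("VP -> NP Adj V", 0.4),
    ("NP -> Det N N PP", 0.2),
    ("Det -> የ", 0.9),
    ("N -> ሰዉ", 0.4),
    ("N -> አጥር", 0.2),
    ("PP -> P P", 0.8),
    ("P -> ላይ", 0.4),
    ("P -> ሆኖ", 0.3),
    ("Adj -> ትንሽ", 1.0),
    ("V -> አየ", 0.8),
]

# Multiply the probabilities together (a plain loop keeps this
# compatible with Python 3.7, which has no math.prod).
p = 1.0
for _rule, prob in used:
    p *= prob

print(f"p = {p:.5e}")
```

The product agrees with the value printed by the parser, 8.84736e-05.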

5. Challenges
Some challenges arose while doing the project.
1. This study uses a very small sample prepared for the purpose of the work, due to lack of
time and the difficulty of finding a well-organized corpus, a machine-editable dictionary,
POS-tagged words, and especially a POS tagger application for Amharic.
2. The prototype developed in this study is assumed to support Amharic sentences of 10 or
more words, but this could not be tested to see the real outcome of the prototype,
mainly due to time constraints and a lack of the linguistic ability needed to determine
grammar rules and their probabilities.
3. This report does not cover more advanced topics such as ambiguity resolution; it only shows
sample parsing using probabilistic approaches.


6. References
[1] A. Alemu, "Automatic Sentence Parsing for Amharic Text: An Experiment Using
Probabilistic Context Free Grammars", thesis submitted in partial fulfilment of the
requirement for the degree of Master of Science in Information Science, 2002.
[2] "Natural language processing toolkit", accessed from http://www.nltk.org/.

[3] Daniel Jurafsky & James H. Martin, "Speech and Language Processing: An introduction
to natural language processing, Computational linguistics, and speech recognition", 2007.

[4] Abiyot Bayou, "Design and Development of Word Parser for Amharic Language",
Masters Thesis, Addis Ababa University. 2000.
