Laboratory Assignment Two
UNIVERSITY
ILÉ-IFÈ., NIGERIA
FACULTY OF TECHNOLOGY
COMPUTER SCIENCE AND ENGINEERING DEPARTMENT
THIS DOCUMENT IS NOT FOR SALE!
CPE 510: Introduction to Natural Language Processing and Its
Applications
Rain Semester, 2021-2022 Session
GROUP LABORATORY ASSIGNMENT 02: Application
This document is meant to support your laboratory work in CPE510. Attempts have been made
to present the content of this document as accurately as possible. However, typographical,
grammatical or other errors are likely. These are unintentional, and I will be happy if you
could draw my attention to such errors (email to: [email protected]) as soon as you find them.
MAY, 2023
CONTENTS
1 Introduction
1.1 HLP System Development Process
1.1.1 Understand the Problem
1.1.2 State Assumptions
1.1.3 Behaviour Analysis
1.1.4 System Design
1.1.5 Implementation
1.1.6 Evaluation
1.2 Laboratory Reports
1.3 Assignment Submission Dates
CHAPTER 1
INTRODUCTION
The practical classes in this course will, among other things, allow you to further explore the ideas
and concepts discussed during the lectures. Specifically, the laboratory assignments in this course
have three goals:
1. They help you to better understand the general concepts and principles discussed in CPE-510
course lectures. They also help to guide your private studies, by allowing you to explore specific
concepts in the subject matter of Human (Natural) Language Processing (HLP) techniques and
their applications.
2. They help to improve your proficiency in HLP systems development and computational
problem solving in general.
3. They allow your instructor(s) to evaluate and assess your understanding of the subject matter
in this course.
Accordingly, you are strongly advised to make sure that all the work you submit for assessment
arises out of your own efforts. You are permitted and encouraged to discuss general solution design
(i.e., heuristics/algorithms/grammar) with the course lecturers as well as the course and laboratory
instructors. You may also receive help with specific debugging problems during the practical
sessions from the laboratory instructors. However, you are expected to work independently, or only
within your own team (where applicable), when in the laboratory. Remember that the motto of this
university is for learning and culture.
investigate the call and response system used by non-human natural entities such as whales, birds,
monkeys and so on.
The development process of an HLP system is similar to that of conventional computational
systems. During lectures, we discussed the possibility of designing and implementing abstract
machines that model the language of a system. We identified six steps in the design process:
is generated. This document contains a representation of the behaviour of an HLP system and the
desired solution. This task corresponds to the design task in engineering.
1.1.5 Implementation
Reduce the system behaviour designed in Step 4 to a computing artefact, using appropriate software
or programming language code on appropriate hardware. The choice of programming language and
tools will be informed by Step 4; they must have features appropriate for implementing your
HLP system. We will be adopting Python as the programming language in this course. Tools such
as the Natural Language Toolkit (NLTK) (Python based), JFLAP (system design) and Praat (speech
data collection, analysis and simulation) will also be used.
1.1.6 Evaluation
This involves testing the system implemented in Step 5 to ensure compliance with the problem
specification. An HLP system is tested in two stages: (i) the inside test and (ii) the outside test.
In the inside test, the performance of the system is assessed using data from the case examples used
for the system design. When the system has done well on the inside test, the outside test is applied.
In the outside test, the performance of the system is assessed using data that are NOT from the case
examples used for the system design. Based on the outcomes of the evaluation, a decision to correct,
modify or deploy the system is taken as appropriate. The goal of the evaluation process is to
determine how closely the system mimics the target human language behaviour. Speed of operation,
memory used and other parameters for evaluating conventional systems are desirable but not
sufficient criteria.
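The two-stage evaluation can be sketched in Python as shown below. This is only a minimal illustration under stated assumptions: the function names, the 0.9 pass threshold, the toy system and the example data are all hypothetical and are not part of the course material.

```python
def accuracy(system, examples):
    """Fraction of (input, expected) pairs for which the system is right."""
    if not examples:
        raise ValueError("at least one example is required")
    correct = sum(1 for text, expected in examples if system(text) == expected)
    return correct / len(examples)


def evaluate(system, design_examples, unseen_examples, threshold=0.9):
    """Apply the inside test first; apply the outside test only if it passes.

    The threshold of 0.9 is an arbitrary assumption for illustration.
    """
    inside = accuracy(system, design_examples)
    if inside < threshold:
        return {"inside": inside, "outside": None, "decision": "correct or modify"}
    outside = accuracy(system, unseen_examples)
    decision = "deploy" if outside >= threshold else "correct or modify"
    return {"inside": inside, "outside": outside, "decision": decision}


# Hypothetical toy system: decides whether a word is a kinship term.
KINSHIP = {"father", "mother", "sister"}


def toy_system(word):
    return word.lower() in KINSHIP


# Inside test data come from the design examples; outside test data do not.
design = [("father", True), ("mother", True), ("table", False)]
unseen = [("sister", True), ("brother", True), ("chair", False)]
report = evaluate(toy_system, design, unseen)
```

Here the toy system passes the inside test but misses "brother" in the outside test, so the decision is to correct or modify it. Note that accuracy alone is a conventional measure; how closely the system mimics human language behaviour still has to be judged separately.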
Table 1.1: Experiment Document Contents
CHAPTER 2
2.1 Preamble
The process ascribed to the activities of the human speech organs is the object of study in speech
production. Figure 2.1 depicts the anatomy of the principal organs in the human speech production
process. These organs are responsible for the articulation of speech. The scientific study of how these
organs work together during the speech production process, as well as the characterisation of the
acoustic manifestation of the speech produced, is the subject matter of Phonology and Phonetics.
The data and narratives from the fields of Phonology and Phonetics are essential to the
development of some HLP applications such as speech synthesis and speech recognition systems. For
example, an essential aspect of developing a text-to-speech synthesis system is figuring out
how to convert digital texts into speech sound waveforms. This development process is informed by
data and procedures grounded in phonology and phonetics. This is also the case in the design of a
speech recognition system.
In this part of the laboratory, we are concerned with the waveforms produced by the human speech
organs. We want to interrogate how the data of these waveforms can be used to characterise spoken
language expression. The aim is to develop appropriate computing applications for selected human
languages to the extent that technology permits. The waveforms in reference are those that are
digitally captured through a microphone and then represented, manipulated and stored. Such
waveforms will normally be represented as annotated streams of binary signals.
In order to investigate spoken human language, we need a simplified model with which we can
represent the important features of the human speech organs. Such a model is depicted in Figure 2.2.
Speech is the physical, sensorily accessible waveform and signal. How the activities of the faculty
of language influence the organs of language, and how the activities of the organs of language
culminate to manifest as speech, remain a subject of debate and mystery. These are outside the
purview of human language processing. Note that there is a difference between the sound of speech
and the symbol of sound. The sound of speech is what the human sensory organs perceive or capture
in spoken language. The symbols of sound are ascribed by consensus and used to represent, for
example, letters in an orthography (writing system). The digital or binary signal of the symbols of
sound is what is represented and manipulated by electronic instruments and computing machines.
Figure 2.1: The Human speech organ
The aim of the set of experiments in this laboratory is to explain our observations about the
correlation between speech, based on sounds recorded as waveforms, and the features ascribed to them.
Most indigenous West African languages are tonal. This implies that tone and phone together
constitute how their speech is perceived. Therefore, tone and phone are mutually inclusive and
inextricably intertwined in the sound of a spoken tone language.
Consider the phone and tone in the Standard Yorùbá two-syllable (bi-syllabic) word Koko
depicted in Table 2.1. You will observe that all the words comprise the string Koko, which itself
contains two phones, Ko and ko. However, the tone accompanying each phone determines the word.
The possible word and non-word strings are listed in Table 2.1. The English language gloss of each
word is in the last column of Table 2.1. In summary, the information
encoded in each word is a function of the phone and tone in its constituent syllables.
The phones in the word are represented with the symbols ko and ko. The concatenation of these
symbols, that is koko, represents the phone of the word. There are three (3) discrete tones in Yorùbá:
Neutral, Low and High. These are represented with the symbols M, L, and H, respectively. The
neutral tone is often called the Mid tone in conventional discourse on Yorùbá. The Low (L) and High
(H) tones emerge from the Neutral tone (M) in the spoken Yorùbá language. The structures of the two
syllables are drawn from the same symbols. However, each of the syllables in a word carries a tone.
In the word in Item No. 1 in Table 2.1, the two syllables carry the Neutral (or Mid) tone. The two
syllables in the word in Item No. 4 carry the Low (L) and High (H) tones, respectively.
Each of the spoken words is a continuous stream of sound in which the “tone” and “phone” are
intertwined. In the written form, however, each syllable and word is composed through a discrete
combination of symbols representing tones and phones. In the written Yorùbá language, the symbol
for each syllable is composed by stacking a tone on its corresponding phone.
As discussed during lectures, spoken language is a happening while written language is a process.
In fluent Yorùbá speech, tones and phones emerge through mutually inclusive, complementary
and intertwined language phenomena. In written Yorùbá, tones and phones are constructed
through mutually exclusive and independent sets of rules. It is important to note that the sensory
experience of human speech is a continuum, whereas quasi-continuum is ascribed to the sound
of spoken words. The discreteness and/or letters ascribed to spoken words are for the purpose of
orthography (written text). In this regard, the written text is a circumscription (an approximation)
of its spoken expression. As discussed during the lecture, there is no letter in spoken human
language. More fundamentally, there is no quantity or number in spoken human language.
Speech signals are recorded through various instruments. Modern electronic devices make it
possible to ascribe numbers to recorded spoken language, and modern computing machines make it
possible to represent and store recorded speech digitally. A sample of such a recording is the spoken
sentence “Bàbá àgbẹ̀ ti ta kòkó” depicted in Figure 2.3. The following are the numerical cues for
various features of spoken language:
1. To develop a tone recognition/synthesis system, the F0 data in the speech waveform are
required.
2. To develop a phone recognition/synthesis system, the F1 and F2 data in the speech signal are
required.
3. To develop a syllable or speech recognition/synthesis system, the F0 together with the F1 and
F2 data in the speech signal are required.
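To make the idea of F0 as a numerical cue concrete, the sketch below estimates the fundamental frequency of a short voiced frame by picking the peak of its autocorrelation. It is a simplified illustration only, not the pitch algorithm that Praat uses; the function name, the search range of 75 to 400 Hz and the synthetic 200 Hz tone are assumptions made for the example.

```python
import math


def estimate_f0(samples, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate F0 (in Hz) of a voiced frame from its autocorrelation peak.

    The lag with the largest autocorrelation inside the search range is
    taken as the pitch period. Returns None if no positive peak is found.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(samples) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


# Synthetic stand-in for a voiced frame: a pure 200 Hz tone, 50 ms at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
```

Real speech is far less regular than a pure tone, so in the laboratory the F0, F1 and F2 values should be read from Praat rather than from a sketch of this kind.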
2.2.2 Task 1
1. Download the Praat software, install and explore its features, particularly those relating to
speech signal analysis.
2. Select any six words, with not more than three syllables, from the vocabulary of kinship terms
generated in Laboratory I (cf. Table 2.2). Record the English words and their equivalents in the
African indigenous language of your choice. For example, “Father” and “Baba” for English and
Yorùbá speech. You are expected to create two sets of data samples: one with a FEMALE voice and
the other with a MALE voice. Each word must be recorded such that the speech signal is clear and
clean (no background noise).
3. Explore and study the waveforms corresponding to syllables in each of the words you recorded.
Figure 2.3: A Praat screenshot of the phrase “Bàbá àgbẹ̀ ti ta kòkó”
4. Observe and extract the fundamental frequency F0 in the speech signals of the first and last
syllables in each word. Discuss the average F0 for the male and female voices. Also observe
the pattern of the first two formants, i.e. F1 and F2.
5. Observe and extract the third and fourth formants, i.e. F3 and F4. Discuss the average F0 for
individual members of your Laboratory group.
6. Record at least two (2) isolated syllables that make up any of the words in item 2 above and
discuss the features of their F0 vis-a-vis the F0 in the word sample. HINT: study the beginning,
middle and end of the F0 waveform.
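Once F0 values have been extracted, the hint in item 6 can be followed with a small summary such as the one sketched below. The function name, the convention that 0 marks an unvoiced frame, and the F0 values themselves are hypothetical; they are placeholders, not real measurements.

```python
from statistics import mean


def summarise_f0(contour):
    """Mean F0 of the beginning, middle and end of a contour, plus the average.

    Frames with a value of 0 are treated as unvoiced and skipped
    (an assumption about how the values were exported).
    """
    voiced = [f for f in contour if f > 0]
    if len(voiced) < 3:
        raise ValueError("need at least three voiced frames")
    third = len(voiced) // 3
    return {
        "beginning": mean(voiced[:third]),
        "middle": mean(voiced[third:len(voiced) - third]),
        "end": mean(voiced[len(voiced) - third:]),
        "average": mean(voiced),
    }


# Hypothetical F0 values in Hz for one syllable; not real measurements.
contour = [0, 118, 120, 122, 125, 127, 130, 0]
summary = summarise_f0(contour)
```

Running the same summary on the male and female recordings, and on the isolated syllable versus the syllable inside the word, gives comparable figures for the discussion.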
CHAPTER 3
As discussed in the lecture, the faculty of human language is outside the ambit of meaningful
discussion. The activities of the faculty of language and the expressions generated through the organs
of language are a happening. The object of study in Human Language Processing is the habitual
instrument of human communication. This is called the instrument of language in this course.
A theoretically based categorisation of these languages has been presented in the conventional
literature.
Table 3.1: Chomsky hierarchy
A → b; C D (Non-monotonic logic)
G ::= ⟨Σ, V, P, S⟩
where:
Σ is the alphabet of the grammar. It is also called the set of Terminal symbols.
P is a set of rules called Productions. A production is written in the form A → b, meaning that A
can be rewritten as b. P is also called the set of rewrite rules.
S is the Start symbol. It is used to indicate the beginning of an expression (or sentence).
The type of production determines the grammar, which in turn determines the language type.
Most of the grammars we shall be working with in this course fall into two categories: (i) linear
and (ii) non-linear.
(i.) A → BC (Context-neutral)
In the Chomsky hierarchy, the language formulated with context-neutral grammar is more
powerful than that formulated with regular grammar. This implies that any computing process that can
be expressed (i.e. any strings that can be generated) by an agency of regular grammar can also be
expressed by an agency of context-neutral grammar. In the formulation of the instrument of language
presented during this course, regular language is subsumed in context-neutral language.
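The subsumption can be illustrated with a small check on production rules. The sketch below assumes a hypothetical representation in which a grammar is a dictionary from a non-terminal to its alternative right-hand sides, with upper-case symbols as non-terminals and lower-case symbols as terminals. It also assumes the right-linear form (A → a or A → aB) as the definition of a regular production; other equivalent definitions exist.

```python
def is_context_neutral(productions):
    """True if every left-hand side is a single non-terminal (upper case)."""
    return all(len(lhs) == 1 and lhs.isupper() for lhs in productions)


def is_regular(productions):
    """True if every production has the right-linear form A -> a or A -> aB."""
    if not is_context_neutral(productions):
        return False
    for alternatives in productions.values():
        for rhs in alternatives:
            if len(rhs) == 1 and rhs[0].islower():
                continue  # A -> a
            if len(rhs) == 2 and rhs[0].islower() and rhs[1].isupper():
                continue  # A -> aB
            return False
    return True


# Hypothetical example grammars.
regular_grammar = {"S": [("a", "S"), ("b",)]}             # strings a...ab
nested_grammar = {"S": [("a", "S", "b"), ("a", "b")]}     # strings a^n b^n

checks = (
    is_regular(regular_grammar),
    is_context_neutral(regular_grammar),
    is_regular(nested_grammar),
    is_context_neutral(nested_grammar),
)
```

Every grammar that passes `is_regular` also passes `is_context_neutral`, but not the other way round: the second grammar needs the matched nesting that right-linear rules cannot express.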
language can be formulated and expressed in human language. A fundamental criterion for the
formulation of a grammar is an alphabet. There is no alphabet in spoken human language. The set of
letters ascribed to the sounds in spoken human language is for the purpose of creating an orthography
(written language).
Therefore, the idea of a grammar, as used in this course, applies to the written human language
only. An expression in a written language comprises (i) Assertions and (ii) Relations. Assertions are
ascribed to instances in the universe of discourse of the expression. A relation establishes the logical
connection between two (2) or more instances to which an assertion has been ascribed. A relation
can also give expression to the transition reckoned in the universe of discourse.
A sentence is the Prime-axiom of language expression. A sentence comprises any, or all, of the
following three (3) items:
(iii.) The object or agency that suffered the action (object) (Elés.e);
Ade is the agency of the action. The action performed is Carry, and the Chair is the object that
suffered the action. In this expression, Ade is also called the Subject and the Chair the Object, while
Carry is the Verb. This is the basis of the (SVO) formulation ascribed to the structure of valid
expressions in the English language.
Subject-Verb-Object
Computationally, the verb is the relation which serves the role of an operator or function. Relation
is an operation (verb) that an agency gives expression to by manipulating operands (nouns). In this
case the subject and object are the operands.
Based on the above analysis, the structure of a valid sentence can be formulated as comprising a
Subject followed by a Verb followed by an Object.
Verb(Subject, Object)
Carry(Ade, chair)
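The operator view of the verb can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes a sentence that has already been reduced to exactly three tokens in Subject, Verb, Object order.

```python
def from_svo(tokens):
    """Split a three-token SVO sequence into (subject, verb, object)."""
    if len(tokens) != 3:
        raise ValueError("expected exactly three tokens: subject, verb, object")
    subject, verb, obj = tokens
    return subject, verb, obj


def to_predicate(subject, verb, obj):
    """Write the verb as an operator applied to its two operands."""
    return f"{verb}({subject}, {obj})"


# The subject and object are the operands; the verb is the operator.
expression = to_predicate(*from_svo(["Ade", "Carry", "chair"]))
```

Keeping the operands separate from the operator in this way means that the surface order of a sentence can change without changing the relation it expresses.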
NOTE that the SVO structure is NOT universal to all instruments of human language. Some
languages, such as Bambara, use the VSO structure. Indeed, the SVO structure is not strictly observed
in most human language expressions. Most sentences in the English and Yorùbá languages conform
to the SVO structure. However, there are differences in the treatment of assertion placement in the
structure of valid expressions.
The above English language sentence can be translated into Yorùbá as:
Gbé(Adé, àga)
Note that, whereas the object, that is “Chair”, is located at the end of the structure of the English
sentence, the Yorùbá equivalent “Àga” is not.
G = ⟨VT, VN, S, P⟩
The symbols in this formulation are further explained in the context of the definition of the in-
strument of human language as discussed during our lectures.
3.3.2 The VN symbol
VN is the finite set of symbols representing instances of Auxiliary-axioms. Each symbol corresponds
to a sub-expression (sub-sentence) in the language formulated with the grammar. Each
Auxiliary-axiom symbol can be further expanded or reduced into a simpler composition of
Auxiliary-axioms and/or Primitive-terms. The simplest expansion of an Auxiliary-axiom is obtained
with symbols drawn from the Primitive-terms. Auxiliary-axioms are the strings, sub-strings, patterns,
sub-patterns and constants in the grammatical formulation of regular languages. Auxiliary-axioms are
the variables and parts of speech, such as verb, noun, phrase, adjective, adverb and determiner, in the
grammatical formulation of context-neutral languages. Auxiliary-axioms form the stem of a tree
representation of an expression.
1. An instance of a Prime-axiom is represented in all capital letters. For example: ⟨DATA⟩.
3. An instance of a Primitive-term is represented with a small letter, except for a digit, which is
represented as is. For example: a.
4. An instance of a rule is represented as ⟨B⟩ ::⇒ b. The symbol ::⇒ means “can be rewritten
as”.
5. The option or choice notation is represented with a vertical line, that is |. For example,
⟨B⟩ ::⇒ b|q implies that B can be replaced by b or q in an expression.
6. The concatenation operation is implied in the consecutive arrangement of symbols. For
example, ⟨B⟩ ::⇒ ⟨C⟩⟨D⟩ implies that the axiom B can be written as a C followed by a D.
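The notation can be read mechanically. The sketch below parses one rule into its left-hand axiom and its list of alternatives. It assumes that axioms are written in angle brackets, as in ⟨B⟩, that ::⇒ separates the two sides, and that every Primitive-term is a single character; the function and pattern names are hypothetical.

```python
import re

# One rule: an axiom in angle brackets, the rewrite symbol, then the body.
RULE = re.compile(r"^\s*⟨(\w+)⟩\s*::⇒\s*(.+?)\s*$")
# A symbol is either a bracketed axiom or a single-character primitive term.
SYMBOL = re.compile(r"⟨\w+⟩|\w")


def parse_rule(text):
    """Return (axiom, alternatives); each alternative is a list of symbols."""
    match = RULE.match(text)
    if match is None:
        raise ValueError(f"not a rule: {text!r}")
    axiom, body = match.groups()
    # The vertical line separates choices; adjacency means concatenation.
    alternatives = [SYMBOL.findall(part) for part in body.split("|")]
    return axiom, alternatives


choice_rule = parse_rule("⟨B⟩ ::⇒ b|q")
concat_rule = parse_rule("⟨B⟩ ::⇒ ⟨C⟩⟨D⟩")
```

The first rule yields two alternatives of one symbol each, while the second yields a single alternative made of two axioms in sequence.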
3.4.1 Task 1
Compose any six (6) sentences from the kinship vocabulary you generated in Laboratory II (cf.
Table 2.2). Each member of the group should generate a sentence. Your sentences should have the
following features:
3. Using the above grammar, analyse and discuss the six (6) English sentences using parse trees.
4. Using the NLTK grammar tool in Python, explore the correctness of the English language
sentences.
3.4.2 Task 2
1. Based on the indigenous African language selected in Laboratory II, discuss the grammar for
sentence formulation in Table 3.2.
2. Repeat the tasks you executed for the English language using the indigenous language selected.
4. Discuss your observation and reflections on the grammars and processes of the two languages.
3.4.3 Task 3
Using the Python programming language (you could use NLTK toolkit):
1. Develop software for checking the correctness of English sentences, based on the grammar
defined above.
2. Develop software for checking the correctness of sentences in the indigenous language you
selected, based on the grammar you defined above.
3. Test your system with at least six examples of correct and incorrect sentences. Your evaluation
should be limited to the database generated in Laboratory II. Observe and document the kinds
of sentences whose grammar your system fails to check correctly.
4. Discuss your observations and reflections on how the laboratory activity will inform the
development of a translation system between English and the indigenous language you chose.
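As a starting point for this task, the sketch below shows one possible shape of a grammar checker: a CYK recogniser written in plain Python. The grammar and lexicon are hypothetical toy examples, not the grammar defined in this laboratory, and every rule must be either binary (A → B C) or lexical (A → word). NLTK offers `nltk.CFG.fromstring` and `nltk.ChartParser` for the same purpose; the plain version is used here only so that the logic is visible.

```python
from itertools import product

# Hypothetical toy grammar: binary rules map a pair of symbols to parents.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Hypothetical lexicon: each word maps to its possible categories.
LEXICON = {
    "the": {"Det"},
    "my": {"Det"},
    "father": {"N"},
    "sister": {"N"},
    "chair": {"N"},
    "loves": {"V"},
    "carries": {"V"},
}


def is_grammatical(sentence, start="S"):
    """CYK recognition: True if the whole sentence derives from `start`."""
    tokens = sentence.lower().split()
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the symbols that derive tokens[i : i + j + 1].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, token in enumerate(tokens):
        table[i][0] = set(LEXICON.get(token, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for pair in product(left, right):
                    table[i][span - 1] |= BINARY.get(pair, set())
    return start in table[0][n - 1]


accepted = is_grammatical("My father carries the chair")
rejected = is_grammatical("father my carries chair the")
```

A checker of this kind rejects any sentence containing a word outside its lexicon, which is one of the failure cases worth documenting in item 3.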
3.4.4 Task 4
Review the ELIZA chatbot in Laboratory II.
1. Design and implement a chatbot based on the English data you collected in Laboratory II.
2. Your chatbot machine should be able to answer questions such as “Who is a father?”. If the
machine responds “A male parent”, you should be able to follow with the question “Who is a
parent?”, and so forth.
4. Reflect on and explain your observations in respect of the system you developed and ELIZA.
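The question-and-follow-up behaviour described in item 2 can be sketched with a lookup table and a single question pattern. The definitions below are hypothetical placeholders for the data collected in Laboratory II, and the pattern handles only questions of the form “Who is a ...?”; it is a sketch of the idea, not a full chatbot.

```python
import re

# Hypothetical definitions; replace with the data collected in Laboratory II.
DEFINITIONS = {
    "father": "A male parent",
    "mother": "A female parent",
    "parent": "A person who has a child",
}
# Matches "Who is a father?" or "who is an uncle", with optional "?".
QUESTION = re.compile(r"^\s*who\s+is\s+an?\s+(\w+)\s*\??\s*$", re.IGNORECASE)


def reply(question):
    """Answer a 'Who is a <term>?' question from the definitions table."""
    match = QUESTION.match(question)
    if match is None:
        return "Please ask a question of the form: Who is a <term>?"
    term = match.group(1).lower()
    return DEFINITIONS.get(term, f"I do not know who a {term} is.")


first = reply("Who is a father?")
follow_up = reply("Who is a parent")
```

Because each answer is built from terms that are themselves in the table, the user can keep following up until a term without a definition is reached.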
3.5 Bibliography
1. Searle, J. (1999) Mind, Language, and Society: Doing Philosophy in the Real World, Weiden-
feld & Nicolson, London, 1999.
2. Campbell, J. (1982) Grammatical Man: Information, Entropy, Language, and Life. Simon and
Schuster, New York.
3. Dreyfus, H. L. and Dreyfus, S. E. (2004) From Socrates to Expert Systems: The Limits and
Dangers of Calculative Rationality. URL: http://socrates.berkeley.edu/~hdreyfus/html/paper_socrates.html
(Last visited July, 2011)
6. E. Cambria and B. White (2014) Jumping NLP Curves: A Review of Natural Language Pro-
cessing Research; IEEE Computational Intelligence Magazine, May 2014, pp. 48–57.
7. Boroditsky, L. (2011) How Language Shapes Thought: The languages we speak affect our
perceptions of the world Scientific American, pp 63–65.
8. Chomsky, N. (1956) Three Models for the Description of Language, IRE Trans. on Info.
Theory, Vol. 2, No. 3, pp. 113–124.
9. Jäger, G. and Rogers, J. (2012) Formal language theory: refining the Chomsky hierarchy,
Phil. Trans. R. Soc. B, Vol. 367, pp. 1956–1970.
10. Crystal, D. (2000) Language Death, Cambridge University Press, ISBN: 978-0-521-01271-3.
11. Joshi, A. K. (1991) Natural Language Processing, Science, New Series, Vol. 253, No. 5025, pp.
1242–1249.