The 2nd International Conference on Advances in Engineering Sciences and Technology 2022 (AEST-2022)

University of Babylon, Iraq.

Text Analysis Based on Natural Language Processing (NLP)
2022 2nd International Conference on Advances in Engineering Science and Technology (AEST) | 979-8-3503-3490-6/22/$31.00 ©2022 IEEE | DOI: 10.1109/AEST55805.2022.10413039

1st Azhar Kassem Flayeh 2nd Yaser Issam Hamodi 3rd Nashwan Dheyaa Zaki
department of computer engineering & Media Technology Engineering
Computer Engineering - College of
information Technology 3University of information Technology
Ministry of Higher Education& Scientific Electrical And Electronic Techniques
Ministry of Higher Education& and Communications/College of
Research, Baghdad, Iraq
Scientific Research , Baghdad, Iraq Engineering, Baghdad,Iraq

Abstract— In the modern era of information explosion, an enormous volume of information, including millions of texts, is loaded on the Internet, amounting to over 1.2 million terabytes, yet there are no accurate statistics on the size of the texts within this huge amount of data. Websites have multiplied and social networking services have spread; for Twitter, for example, it is stated that about 6000 tweets were posted per second in 2019. This work aims to analyze such texts by resorting to artificial intelligence. In this article, a review is presented of the latest concepts in the field of text processing and information extraction, and an interface for text analysis is implemented using the C++ programming language and Visual Studio 2015. A comparison is made between two articles in terms of the number of verbs, prepositions, article strength, and number of letters, with the possibility of sorting the text into verbs, nouns, objects, and adjectives, as displayed in the program interface.

Keywords— Natural Language Processing NLP, Ambiguity, Information retrieval (IR), Sentiment Analysis.

I. INTRODUCTION
The term natural language is given to the languages that humans use for human-human interaction, such as Arabic, English, and French. These languages are called natural languages because they are the result of the natural development of society. For example, the English language has developed over many years without planned rules or words, and the same is true for other languages [1]. As for constructed (planned, artificial, or invented) languages, their terms and linguistic characteristics are set before people use them, as is the case with programming languages such as Java and Python. In addition, there are constructed languages intended for human communication, among them the "Lojban" language [2]. It is considered one of the fully developed constructed languages that aim to improve human communication because it is free from linguistic ambiguity. Natural Language Processing (NLP) is a field of computer science and linguistics that deals with the interactions between computers and natural languages. It originated as a branch of artificial intelligence, which in turn branched out from informatics. The field of natural language processing overlaps with, yet differs from, the field of computational linguistics, which the Society for Computational Linguistics describes as the science that focuses on the theoretical aspects of natural language processing. Modern algorithms for natural language processing are based on machine learning, which requires an understanding of several disparate areas, including linguistics, computer science, and statistics. There are many challenges in the field of natural language processing, including speech recognition, natural language understanding, and natural language generation. It can be said that the history of natural language processing began in the 1950s. Its starting point is what is now called the Turing test, proposed by Alan Turing in his 1950 article "Computing Machinery and Intelligence". In 1954, the Georgetown experiment was carried out, which involved the fully automatic translation of more than sixty sentences from Russian into English. The authors claimed that the machine translation problem would be solved within three to five years. However, real progress remained much slower, so funding for machine translation was reduced after the publication of the ALPAC Report in 1966, which showed that ten years of research had failed to meet expectations. During the 1970s, many programmers wrote "conceptual ontologies," which organize real-world information into computer-understandable data. These works include MARGIE [3], TaleSpin [4], QUALM [5], PAM [6], SAM [6-8], and Plot Units [9]. During this era, many chatterbots were released, including PARRY, Racter, and Jabberwacky. Despite this reduction in funding, research into machine translation continued until the late 1980s.

In the field of machine translation, there were many notable early successes, thanks to IBM Research, which developed successively more complex statistical models. These systems were able to take advantage of the existing multilingual textual corpora produced by the Parliament of Canada and the European Union, as laws called for the translation of all government proceedings into all the official languages of the corresponding systems of government. Most of the other systems depended on corpora that had been developed specifically for the tasks
that these systems perform, which was a major constraint on the success of these systems. Therefore, a large body of research has turned to ways of learning more effectively from limited amounts of data.

Recently, researchers have increasingly focused on unsupervised and semi-supervised learning algorithms that can learn from data that have not been manually annotated with the required answers, or from a combination of annotated and unannotated data. One of the successful systems developed in the 1960s was the SHRDLU system, a natural language system that operates in "finite worlds" with a restricted vocabulary. This program remained famous in the history of artificial intelligence, and it was originally developed by Terry Winograd at MIT [3]. The program simulates a robot that can identify several objects of various shapes, such as cubes and pyramids, placed on a table. In addition, it can converse in natural language with a person and answer his or her questions about these objects. Winograd added a grammar derived from the systemic grammar of Halliday. This is a system of logical networks that express certain properties of the structural units, especially the states of verbs, their tenses, and their voice (active and passive). Some semantic information is used during linguistic analysis, which reduces the number of possible structures of a sentence compared with applying the syntactic criteria alone. ELIZA, a simulation of a Rogerian psychotherapist, was written by Joseph Weizenbaum between 1964 and 1966. Although it involves no information about human thought or emotion, ELIZA sometimes offers a human-like reaction. Another program is LUNAR, which uses a method known as the Augmented Transition Network (ATN) to exchange information with a database in natural language. This database contains information on moon rock samples collected by NASA astronauts. The questions directed to the program are analyzed in two stages: syntactic analysis (constructing a parse tree, or several trees where possible) and semantic analysis of the tree or trees (building an internal representation of the inquiry). In response to the internal representation of the question, the required information is searched for and the answers are prepared in natural language. The program includes approximately 3,500 words in addition to its grammar. LUNAR had a very great impact on the development of natural language programs because of its ATN technology, which became one of the most popular methods of processing natural languages over the following decade. MARGIE is a program that converts sentences from natural language into a conceptual dependency representation. The main part of the program can either paraphrase the sentences entered or draw conclusions about the conceptual world that they describe [4].

The most important characteristic of constructed languages is that they are often devoid of ambiguity. Natural languages cannot be free from this linguistic ambiguity because of their complexity and natural development.

A few decades ago, the NLP field matured, having been limited in its early beginnings to collecting data from a limited set of digital documents. After the emergence and rapid growth of the World Wide Web, an explosion of information occurred in many different languages. A major breakthrough has been made in the field of information retrieval (IR), which is an application of natural language processing. Several applications use NLP techniques, and perhaps the most important are Information Extraction, Question Answering, and Machine Translation, which are addressed in this work. NLP has many uses in several areas, including industry, data, technology, healthcare, and therapeutic expertise.

As for the health field, during the 1960s the inventor and engineer Vladimir Zworykin emphasized that the challenge doctors face is not only the lack of data and information about the state of health, but also the insufficiency of traditional techniques of summarization and indexing to provide the doctor with the knowledge required to reach the right decision. This prompted the National Library of Medicine (NLM) to pay attention to the topic and start the Medical Literature Analysis and Retrieval System (MEDLARS) project in 1964. The project involved the creation of a database of the most common and popular medical books and literature in the world (PubMed), which includes more than 30 million medical articles and over 800,000 new citations. This has helped many clinicians and researchers to track the rapidly growing medical literature. However, the sheer volume of medical articles and their dramatic growth in recent decades still challenges the limits of human perception despite advances in information technology [5].

As for the uses of NLP in industry, business, and banking, a bank may, for example, conduct customer opinion surveys. The purpose of such a survey is to examine customer feedback and inform the company of the actions needed to improve the customer experience. Organizations can use text analytics software, leveraging machine learning and natural language processing algorithms, to find meaning in massive amounts of text by analyzing the free-text comments provided in customer feedback forms. Text analytics then helps companies uncover hidden customer insights and answer questions easily. Additionally, with the help of text analytics software, companies can find new and emerging topics, track trends and issues, and provide visual reports to managers to keep track of what customers think [6]. Many tools are used in research in the field of natural language processing, including:

NLTK: A large collection of natural language tools covering several natural languages, available for Python [7], http://www.nltk.org/ .

Scikit-learn: It may not have a direct relationship with natural language processing, but it is a vast library of machine learning tools available for Python, http://scikit-learn.org

HTK: A set of tools specialized in the processing of phonemes, available at http://htk.eng.cam.ac.uk
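As a minimal sketch of the feedback analysis described above, the following Python example tokenizes a few free-text comments and reports the most frequent terms. It assumes only the Python standard library rather than any of the toolkits listed; the comments, the stop-word list, and the function names `tokenize` and `top_terms` are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list; toolkits such as NLTK ship fuller lists.
STOP_WORDS = {
    "the", "a", "an", "is", "was", "and", "to", "of",
    "in", "it", "my", "i", "for", "very", "but",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(comments, n=3):
    """Return the n most frequent non-stop-word terms across all comments."""
    counts = Counter(
        token
        for comment in comments
        for token in tokenize(comment)
        if token not in STOP_WORDS
    )
    return counts.most_common(n)


# Invented feedback comments, used only to exercise the functions.
feedback = [
    "The mobile app is slow and the login fails.",
    "Login was easy, but the app crashed.",
    "Very slow transfer in the app.",
]
print(top_terms(feedback, n=3))
```

Frequent terms of this kind are only a starting point; tracking trends or sentiment, as the text describes, would need the richer models that the listed toolkits provide.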

II. RELATED WORK
Most researchers have focused on building the tools and systems that made NLP what it is today. Tools such as sentiment analyzers, Parts of Speech (POS) taggers, chunking, Named Entity Recognition (NER), emotion detection, and Semantic Role Labelling made NLP a rich topic for research.

The sentiment analyzer (Jeonghee et al., 2003) works by extracting sentiments about a given topic. Sentiment analysis consists of topic-specific feature term extraction, sentiment extraction, and association by relationship analysis. It uses two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. It analyses the documents for positive and negative words and tries to give ratings on a scale from -5 to +5. As for part-of-speech taggers, a number of programs currently exist for European languages, whereas research is still being conducted on taggers for other languages such as Arabic, Sanskrit [9], and Hindi [10]. They can efficiently tag and classify words into nouns, adjectives, and verbs. Most part-of-speech procedures work efficiently on European languages, but not on Asian or Middle-Eastern languages. Arabic, for example, has been processed using the Support Vector Machine (SVM) [11] approach to automatically tokenize, tag, and annotate base phrases in Arabic text. Chunking, also known as shallow parsing, works by labelling segments of sentences with syntactically correlated keywords such as Noun Phrase and Verb Phrase (NP or VP).

III. NATURAL LANGUAGE PROCESSING NLP
NLP is the processing of natural languages by computers. It falls within computer science and artificial intelligence and is one of their most important and difficult applications. It also connects language processing with broad scientific branches, including linguistics, computer engineering, electronics, and statistics [12].

A. Applications of natural language processing
The most important applications that use NLP techniques are summarized below.

1) Information Extraction:
It is the process of extracting data from language and then converting it into abstract data so that it can be stored easily. This is what happens in the stock market, where news about the shares of a particular company is relied upon. The information is extracted from the news, the new share price and the company name are verified, and then the extracted information is stored in a database.

2) Question Answering
One of the most important applications in use, available on the iOS system, is the Siri application. In this application, questions are answered in complete natural language. In other words, the answer is formulated as a full sentence: if the question is (Where is the Children's Hospital?), the answer will be (The Children's Hospital is on the left side, near the gas station).

3) Machine Translation
It aims to translate from one language to another, for example from Arabic to English. It is considered one of the most difficult applications in the field of language processing, because languages differ in their grammatical and morphological structure. Some languages have terms with no translation in other languages, which may cause confusion about their meaning. For example, there are three forms of the verb in the Arabic language (past, present, and imperative), while there are several forms of the verb in English, including the simple past, past continuous, present simple, present continuous, and others. Here, the translation process becomes more difficult, and perhaps the most important evidence of this difficulty is that translations of the same topic by two language experts differ in wording even though the content is the same [13].

A characteristic of natural languages is that a single form can carry more than one meaning. This ambiguity arises inevitably from the natural development of languages and can cause confusion. It can be divided into several levels: lexical (or morphological), syntactic, and semantic. Constructed languages, by contrast, are free from these forms of ambiguity. The levels are illustrated in the following subsections.

4) Lexical Ambiguity
This level is primarily based on the word and the context of the sentence in which it appears. For example, an Arabic word with one spelling has two different meanings, beauty and camels, depending on its context. In one sentence it means beauty (the beauty of nature), while in another the same letters appear with a different meaning (I saw the camels of the Bedouin), where they refer to the animal (the camel). The same applies to the English language. There are words that have several meanings, for example the word desk, which can mean an information desk or a student desk. The most relevant topic in the field of natural languages is resolving the meaning of words in context, also known as Word Sense Disambiguation (WSD).

a) Syntactic Ambiguity
This level depends on the presence of more than one analysis (parse) of a single sentence. It varies from one language to another according to the grammatical characteristics of the language, but it is present in all languages.

b) Semantic Ambiguity
Often, the meaning of a sentence is not clear even though the syntax is clear and the meanings of the words are unambiguous. That is, the whole sentence bears two interpretations. For example, when someone says (a colleague of mine has died), it is not specified whether this colleague is a classmate or a coworker.

B. Sub-domains of natural language processing
There are several branches and applications within the field of natural language processing, which can be arranged into levels:
1. Lexical level
2. Syntax level
3. Semantic level of meaning

IV. PART-OF-SPEECH TAGGING
This process belongs to the morphological level and is intended to extract the parts of speech from the text (where speech is divided, for example, into past tense verb, present tense verb, imperative verb, object, noun, and adjective). Thus, each word in the text is assigned to the category that it represents. This task is one of the basic building blocks used in most NLP applications. Among the most important algorithms and techniques used in this field are the Hidden Markov Model (HMM) and the Viterbi algorithm.

Syntactic parsing belongs to the grammatical level and is intended to analyze the sentence automatically. Among the most famous algorithms in this field are CKY parsing, Earley parsing, and the Collins parser, in addition to the Arc-Eager dependency parsing algorithm. It should be noted that these algorithms differ from one another in how they operate.

As for the semantic level, it comprises the representation of the meaning of a sentence in a way that is easy for the computer to understand. One of the most popular methods is the use of first-order logic [14].

V. THE ADVANTAGES AND DISADVANTAGES OF NLP
NLP is the ability of artificial intelligence to understand language from the general context. With the help of NLP, data and information can be extracted from text documents by artificial intelligence software that uses NLP to simplify the work. Every internet user encounters some form of NLP application. For example, search engines such as Google or Bing use natural language processing to offer suggestions for search queries. The search engine tries to complete the request, so that users can either choose from the suggestions or continue to enter their inquiries [15].

A. Advantages of NLP
a. Extensive analysis.
b. More objective and accurate analysis.
c. Lower cost with simplified operations.
d. Obtains customer satisfaction.
e. Discovers and understands the private and public market.
f. Increases employees' ability to work.

B. Disadvantages of NLP
a. Loss of context.
b. Unpredictable, with the possibility of errors.
c. Limited function, restricted to one specific task.
d. The development of new models needs a long time to obtain a high level of performance.

VI. IMPLEMENTATION AND DISCUSSION
In this section, the three aforementioned levels are applied to compare two texts in a single interface using Visual Studio 2015. The programming was done in the C# language, as shown in the following figures [Fig. 2, Fig. 3, Fig. 4 and Fig. 5]. Using NLP, the numbers of verbs, prepositions, and nouns were obtained for each article, as shown in the following figures:

Fig. 1: Interface analysis.

Fig. 2: Interface before text analysis.

Fig. 3: Interface after text1 analysis.
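The paper's C# source code is not reproduced in the text. As a rough sketch of the comparison it describes, the following Python example counts letters, verbs, prepositions, and nouns in two short texts and sets the counts side by side. Everything here is an assumption made for illustration: the tiny `LEXICON` lookup table stands in for a real part-of-speech tagger, and the names `analyze` and `compare` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

# Tiny hand-made lexicon, purely illustrative; a real system would use a
# trained part-of-speech tagger instead of a lookup table.
LEXICON = {
    "reads": "verb", "writes": "verb", "sat": "verb",
    "in": "preposition", "on": "preposition", "at": "preposition",
    "student": "noun", "book": "noun", "library": "noun",
    "cat": "noun", "mat": "noun", "letter": "noun", "desk": "noun",
}


def analyze(text):
    """Count letters and lexicon-tagged verbs, prepositions, and nouns."""
    words = re.findall(r"[a-z]+", text.lower())
    tags = Counter(LEXICON.get(word, "other") for word in words)
    return {
        "letters": sum(len(word) for word in words),
        "verbs": tags["verb"],
        "prepositions": tags["preposition"],
        "nouns": tags["noun"],
    }


def compare(text1, text2):
    """Return per-feature (text1, text2) count pairs for side-by-side display."""
    first, second = analyze(text1), analyze(text2)
    return {feature: (first[feature], second[feature]) for feature in first}


text1 = "The student reads a book in the library."
text2 = "The cat sat on the mat at the desk."
for feature, (left, right) in compare(text1, text2).items():
    print(f"{feature:>12}: text1={left} text2={right}")
```

Words missing from the lexicon are counted as "other", so the totals depend entirely on the lookup table; replacing it with a statistical tagger, such as the HMM approach mentioned in Section IV, would change only the tagging step.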

Fig. 4: Interface after text2 analysis.

VII. CONCLUSIONS
In this article, one of the applications of natural language processing has been implemented by taking advantage of the characteristics of the fields of computer science and artificial intelligence. An explanation is provided of the levels at which texts are analyzed, as well as of how to benefit from NLP in the fields of industry, health, banking, and others.

REFERENCES
[1] Qiu, X., Tianxiang S., Yige X., Yunfan S., Ning D., and Xuanjing H., "Pre-trained models for natural language processing: A survey." Science China Technological Sciences 63, no. 10, 1872-1897, 2020.
[2] Cowan, J. W., "The complete Lojban language", Vol. 15. Logical Language Group, 1997.
[3] Wilks, Y., "R. Schank, Conceptual Information Processing." Studies in Language. International Journal sponsored by the Foundation "Foundations of Language" 2, no. 3: 397-412, 1978.
[4] Meehan, J. R., "TALE-SPIN, An Interactive Program that Writes Stories." In IJCAI, vol. 77, pp. 91-98, 1977.
[5] Lehnert, W., "Human and computational question answering." Cognitive Science 1, no. 1, 47-73, 1977.
[6] Wilensky, A. J., Levy, R. H., Troupin, A. S., Moretti-Ojemann, L. and Friel, P., "Clorazepate kinetics in treated epileptics". Clinical Pharmacology & Therapeutics, 24(1), pp. 22-30, 1978.
[7] Schank, R. C., "Computer understanding of natural language." Behavior Research Methods & Instrumentation, 1978.
[8] Carbonell, J. G., "Subjective understanding: computer models of belief systems", Yale University, 1979.
[9] Lehnert, W. G., "Plot units and narrative summarization." Cognitive Science, 5(4), pp. 293-331, 1981.
[10] Otter, D. W., Medina, J. R. and Kalita, J. K., "A survey of the usages of deep learning for natural language processing." IEEE Transactions on Neural Networks and Learning Systems, 32(2), pp. 604-624, 2020.
[11] Rajput, A., "Natural language processing, sentiment analysis, and clinical analytics." In Innovation in Health Informatics, pp. 79-97. Academic Press, 2020.
[12] Ruder, S., "Neural transfer learning for natural language processing." Diss. NUI Galway, 2019.
[13] Tapaswi, N., and Suresh J., "Treebank based deep grammar acquisition and Part-Of-Speech Tagging for Sanskrit sentences." 2012 CSI Sixth International Conference on Software Engineering (CONSEG). IEEE, 2012.
[14] Ranjan, P., and H. V. S. S. A. Basu, "Part of speech tagging and local word grouping techniques for natural language parsing in Hindi." Proceedings of the 1st International Conference on Natural Language Processing (ICON 2003). 2003.
[15] Diab, M., Kadri, H., and Dan, J., "Automatic tagging of Arabic text: From raw text to base phrase chunks." Proceedings of HLT-NAACL 2004: Short papers. 2004.
