Text Analysis Based On Natural Language Processing NLP
1st Azhar Kassem Flayeh, Department of Computer Engineering & Information Technology, Ministry of Higher Education & Scientific Research, Baghdad, Iraq
2nd Yaser Issam Hamodi, Computer Engineering - College of Electrical and Electronic Techniques, Ministry of Higher Education & Scientific Research, Baghdad, Iraq
3rd Nashwan Dheyaa Zaki, Media Technology Engineering, University of Information Technology and Communications / College of Engineering, Baghdad, Iraq
[email protected]
[email protected] [email protected]
Abstract— In the modern era of information explosion, an enormous amount of information, including thousands and millions of texts, is loaded on the Internet, amounting to over 1.2 million terabytes, yet there are no accurate statistics on the size of the texts within this huge amount of data. Websites on the Internet are multiplying and social networking services are spreading; on Twitter, for example, it is stated that about 6000 tweets were posted per second in 2019. This work aims to analyze these texts by resorting to artificial intelligence to help solve this problem. In this article, a review is presented of the latest concepts in the field of word processing and information extraction, and an interface for text analysis is implemented using the C++ programming language and Visual Studio 2015. A comparison is made between two articles in terms of the number of verbs, prepositions, article strength, and number of letters, with the possibility of sorting the text into verbs, nouns, objects, and adjectives, as will be displayed in the program interface.

Keywords— Natural Language Processing (NLP), Ambiguity, Information Retrieval (IR), Sentiment Analysis.

I. INTRODUCTION
The term natural language is given to the languages that humans use for human-to-human interaction, such as Arabic, English, French and other languages. These languages are called natural languages because they are the result of the natural development of society. For example, the English language has developed over the years without planned rules or words, and the same is true of other languages [1]. As for constructed languages (also called planned, artificial or invented languages), their terms and linguistic characteristics are laid down before they are used by people, as is the case with programming languages including Java and Python. In addition, there are constructed languages intended for human communication, among them the "Lojban" language [2]. It is considered one of the fully invented languages that aim to improve human communication because it is free from linguistic ambiguity. Natural Language Processing (NLP) is a field of computer science and linguistics that deals with the interactions between computers and natural languages. It originated as a branch of artificial intelligence, which in turn branched out from informatics. It can be said that there is both convergence and contrast between the field of natural language processing and the field of computational linguistics. This is how the Society for Computational Linguistics has defined computational linguistics, describing it as the science that focuses on the theoretical aspects of natural language processing. Modern algorithms for natural language processing are based on machine learning, which requires an understanding of several disparate areas, including linguistics, computer science, and statistics. There are many challenges in the field of natural language processing, including speech recognition, natural language understanding, and natural language generation. The history of natural language processing can be said to have begun in the 1950s (the earliest work dates to 1950). The starting point is what is now called the Turing test, published by the scientist Alan Turing in his article titled "Computing Machinery and Intelligence". In 1954, the Georgetown experiment was carried out, which involved the fully automatic translation of more than sixty sentences from Russian into English. The authors claimed that the machine translation problem would be solved within three to five years. However, real progress was much slower, so funding for machine translation was reduced after the publication of the ALPAC report in 1966, which showed that ten years of research had failed to meet expectations. During the 1970s, many programmers wrote about the "conceptual theory," which involves organizing real-world information into computer-understandable data. These works include MARGIE [3], TaleSpin [4], QUALM [5], PAM [6], SAM [6-8], and Plot Units [9]. During this era, many chatterbots were released, including PARRY, Racter, and Jabberwacky. Despite the reduction in funding, research into machine translation continued until the late 1980s.
In the field of machine translation, there were many notable early successes, thanks to IBM Research, which worked on developing successively more complex statistical models. These systems were able to take advantage of the existing multilingual text corpora produced by the Parliament of Canada and the European Union, since laws called for the translation of all government proceedings into all the official languages of the corresponding systems of government. Most other systems depended on corpora that had been developed specifically for the tasks
that these systems perform, which was a major constraint on the success of these systems. Therefore, a large body of research has turned to ways of learning more effectively from limited amounts of data.
Recently, researchers have increasingly focused on unsupervised and semi-supervised learning algorithms that can learn from data that have not been manually annotated with the required answers, or from a combination of annotated and unannotated data. One of the successful systems developed in the 1960s was SHRDLU, a natural language system that operates in "finite worlds" with a restricted vocabulary. This program remains famous in the history of artificial intelligence, and it was originally developed by Terry Winograd at MIT [3]. The program simulates a robot that can identify several objects of various shapes, such as cubes and pyramids, placed on a table. In addition, it can converse in natural language with a person and answer his or her questions about these objects. Winograd added a grammar derived from the systemic grammar of Halliday. This is a system of logical networks that express certain properties of the structural units, especially the states of verbs, their tenses, and their voice (active and passive). Some semantic information is used during linguistic analysis, which reduces the number of possible structures of a sentence compared with applying syntactic criteria alone. ELIZA, a simulation of a Rogerian psychotherapist, was written by Joseph Weizenbaum between 1964 and 1966. It uses no information about human thought or emotion, yet ELIZA sometimes offers a human-like reaction. Another program is LUNAR, which uses a method known as the Augmented Transition Network (ATN) to exchange information with a database in natural language. This database contains information on samples of moon rocks collected by NASA astronauts. Questions directed to the program are analyzed in two stages: syntactic analysis (constructing a tree, or several trees if possible), and semantic analysis of the tree or trees (building an internal representation of the query). In response to the internal representation of the question, the required information is searched for and the answers are prepared in natural language. The program includes approximately 3,500 words in addition to a grammar. The LUNAR program has had a very great impact on the development of natural language programs because of its ATN technology, which has become one of the most popular methods of processing natural languages over the past ten years. MARGIE is a program that converts natural language phrases into a conceptual dependency representation. The main part of the program can either paraphrase the sentences entered or draw conclusions about the conceptual world that they describe [4].
The most important characteristic of constructed languages is that they are often devoid of ambiguity. Natural languages cannot be free of this linguistic ambiguity because of their complexity and natural development.
A few decades ago, the NLP field matured; in its early days it had been limited to collecting data from a limited set of digital documents. After the great development and emergence of the World Wide Web, an explosion of information occurred in many different languages. A major breakthrough has been made in the field of information retrieval (IR), which is an application of natural language processing. There are several applications that use NLP techniques, and perhaps the most important are Information Extraction, Question Answering and Machine Translation, which will be addressed in this work. NLP has many uses in several areas, including industry, data, technology, healthcare, and therapeutic expertise.
As for the health field, during the 1960s the inventor and engineer Vladimir Zworykin emphasized that the challenge facing doctors is not only the lack of data and information about the state of health, but also the inability of traditional summarization and indexing techniques to provide the doctor with the knowledge required to reach the right decision. This prompted the National Library of Medicine (NLM) to take up the topic and start the Medical Literature Analysis and Retrieval System (MEDLARS) project in 1964. The project involved the creation of a database of the most common and popular medical books and literature in the world (PubMed), which includes more than 30 million medical articles and over 800,000 new citations. This has helped many clinicians and researchers to track the rapidly growing medical literature. However, the sheer volume of medical articles and their dramatic growth in recent decades still challenge the limits of human perception despite advances in information technology [5].
As for the uses of NLP in industry, business, and banking, a bank, for example, conducts customer opinion surveys. The idea of such a survey is to examine customer feedback and inform the company so that it can take the necessary actions to improve the customer experience. This is done by using text analysis programs that leverage machine learning and natural language processing algorithms to extract information from massive amounts of text more efficiently. Organizations can use text analytics software to find meaning in the free-text comments provided in customer feedback forms. Text analytics then helps companies uncover hidden customer insights and answer questions easily. Additionally, with the help of text analytics software, companies can find new and emerging topics, track trends and issues, and provide visual reports to managers to keep track of what customers think [6]. There are many tools used in research in the field of natural language processing, including:
NLTK: A large collection of tools for processing several natural languages, available for Python [7] http://www.nltk.org/ .
Scikit-learn: It is not directly related to natural language processing, but it is a vast library of machine learning tools available for Python http://scikit-learn.org
HTK: A set of tools specialized in the processing of phonemes, available at http://htk.eng.cam.ac.uk
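The comparison described in the abstract, in which two articles are compared by surface counts such as letters and prepositions, can be sketched as follows. This is a minimal illustration in Python (the language of the NLTK and scikit-learn tools listed above) using only the standard library; it is not the authors' C++ interface. The names `text_profile` and `compare_texts` and the short `PREPOSITIONS` list are assumptions made for this example. Counting verbs, nouns, or adjectives would require a part-of-speech tagger, such as the one NLTK provides, and is left out here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny preposition list used only for illustration;
# a real system would rely on a part-of-speech tagger instead of a fixed list.
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from"}


def text_profile(text):
    """Return simple surface statistics (letters, words, prepositions) for one text."""
    # Lowercase the text and keep alphabetic tokens, allowing one inner apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words)
    return {
        "letters": sum(ch.isalpha() for ch in text),
        "words": len(words),
        "prepositions": sum(counts[p] for p in PREPOSITIONS),
    }


def compare_texts(text_a, text_b):
    """Return {statistic: (value_for_a, value_for_b)} for two texts."""
    profile_a = text_profile(text_a)
    profile_b = text_profile(text_b)
    return {key: (profile_a[key], profile_b[key]) for key in profile_a}


if __name__ == "__main__":
    # Two made-up sentences standing in for the two articles being compared.
    article_a = "The robot moved the cube to the table."
    article_b = "Samples of moon rocks were collected by astronauts from NASA."
    for name, (a, b) in compare_texts(article_a, article_b).items():
        print(f"{name}: {a} vs {b}")
```

A fixed word list is used here only because it keeps the sketch self-contained; it cannot resolve the ambiguity discussed above (for example, "to" as a preposition versus an infinitive marker), which is one reason tagging tools are normally used for this step.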
The 2nd International Conference on Advances in Engineering Sciences and Technology 2022 (AEST-2022)
University of Babylon, Iraq.