NLP - Notes
NLP (Natural Language Processing) is dedicated to making it possible for computers
to comprehend and process human languages. It is a subfield of linguistics,
computer science, information engineering, and artificial intelligence that studies
how computers interact with human languages, particularly how to train computers
to handle and analyze massive volumes of natural language data.
Automatic summarization -- useful for condensing documents and other written
material, as well as for gathering information from social media and other online
sources.
Sentiment analysis -- to better understand what Internet users are saying about a
company's goods and services, businesses use natural language processing tools
such as sentiment analysis to identify customer requirements.
Virtual assistants -- these days digital assistants such as Google Assistant,
Cortana, Siri, and Alexa play a significant role in our lives; not only can we
communicate with them, they can also make everyday tasks easier.
*Chatbots*
A chatbot is one of the most widely used NLP applications. Many chatbots on the
market today employ the same strategy as in the example above.
The computer, on the other hand, understands only machine language. All input must
be converted to numbers before being fed to the machine, and if a single typing
error is made, the machine throws an error and skips over that input. Machines use
only extremely simple and elementary forms of communication.
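To make this concrete, here is a minimal Python sketch of the idea that text must
be encoded as numbers before a machine can work with it. The four-word vocabulary
and the encode function are invented for illustration, not taken from any real
chatbot.

```python
# Toy vocabulary mapping each known word to an integer id (made-up example).
vocabulary = {"hello": 0, "how": 1, "are": 2, "you": 3}

def encode(sentence):
    """Map each known word to its integer id. An unknown word raises a
    KeyError, mirroring how a strict system rejects unexpected input."""
    return [vocabulary[word] for word in sentence.lower().split()]

print(encode("Hello how are you"))  # [0, 1, 2, 3]
```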
*Data Processing*
Data processing is a method of manipulating data: the conversion of raw data into
meaningful, machine-readable information.
Since human languages are complex, we first need to simplify them to make
understanding possible. Text normalization cleans up textual data and brings it
down to a level where its complexity is lower than that of the raw data. Let us go
through text normalization in detail.
*Text normalization*
The process of converting a text into a canonical (standard) form is known as text
normalization. For instance, the canonical form of the words "goood" and "gud" is
"good".
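A tiny Python illustration of this idea, assuming a hand-made lookup table (the
table itself is invented for this example):

```python
# Map noisy spellings to a canonical form using a small lookup table.
canonical = {"goood": "good", "gud": "good"}

def normalize(word):
    # Fall back to the lowercased word when no canonical form is known.
    return canonical.get(word.lower(), word.lower())

print(normalize("Goood"))  # good
print(normalize("gud"))    # good
```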
*Sentence segmentation*
Under sentence segmentation, the whole corpus is divided into sentences. Each
sentence is treated as a separate piece of data, so the whole corpus is reduced to
a list of sentences.
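A simple rule-based sketch in Python that splits on sentence-ending punctuation;
real segmenters also handle abbreviations and other edge cases:

```python
import re

def segment_sentences(corpus):
    # Split after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', corpus.strip())
    return [s for s in sentences if s]

corpus = "NLP is fun. Machines read numbers! Can they understand us?"
print(segment_sentences(corpus))
# ['NLP is fun.', 'Machines read numbers!', 'Can they understand us?']
```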
*Tokenisation*
Sentences are first broken into segments, and each segment is then divided into
tokens. Any word, number, or special character that appears in a sentence is
referred to as a token. Tokenisation treats each word, number, and special
character as a separate entity and creates a token for each of them.
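A minimal Python sketch using a regular expression that treats each word, number,
and special character as its own token:

```python
import re

def tokenize(sentence):
    # \w+ captures words and numbers; [^\w\s] captures single special characters.
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("I paid $20 for 2 books!"))
# ['I', 'paid', '$', '20', 'for', '2', 'books', '!']
```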
*Stemming*
The remaining words are reduced to their root words in this step. In other words,
stemming is the process of stripping words of their affixes and reducing them to
their root forms; the resulting root is not always a meaningful word.
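For example, with NLTK's PorterStemmer (assuming `nltk` is installed; any
affix-stripping stemmer would illustrate the same idea):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "played", "plays", "studies"]:
    print(word, "->", stemmer.stem(word))
# playing -> play, played -> play, plays -> play, studies -> studi
```

Note that "studi" is not a real word, which is exactly the shortcoming that
lemmatisation (next section) addresses.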
*Lemmatisation*
Stemming and lemmatisation are alternative techniques to one another, since both
function by removing affixes. However, lemmatisation differs from stemming in that
the word resulting from the removal of the affix (known as the lemma) is always
meaningful.
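For example, with NLTK's WordNetLemmatizer (assuming `nltk` is installed and the
WordNet data has been downloaded with `nltk.download('wordnet')`); unlike the
stemmer above, the output is a real word:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # study
print(lemmatizer.lemmatize("better", pos="a"))  # good
```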
*Bag of Words*
A bag of words is a representation of text that describes the occurrence of words
within a document. It has two components: a vocabulary of known words, and a
measure of the presence of those known words.
Bag of words is a natural language processing model that helps extract features
from text that machine learning techniques can use. We count the occurrences of
each word in the bag of words and build the corpus's vocabulary.
Step-by-step approach to implementing the bag of words algorithm (see the sketch
after this list):
1. Text Normalisation: Collect the data and pre-process it.
2. Create Dictionary: Make a list of all the unique words occurring in the corpus
(the vocabulary).
3. Create document vectors: For each document in the corpus, count how many times
each word from the unique list occurs.
4. Repeat this to create document vectors for all the documents.
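A short Python sketch of these four steps on a made-up two-document corpus:

```python
def bag_of_words(documents):
    # Step 1: text normalisation (here just lowercasing and splitting).
    tokenized = [doc.lower().split() for doc in documents]

    # Step 2: build the vocabulary of unique words.
    vocabulary = sorted({word for doc in tokenized for word in doc})

    # Steps 3-4: one count vector per document.
    vectors = [[doc.count(word) for word in vocabulary] for doc in tokenized]
    return vocabulary, vectors

docs = ["the cat sat on the mat", "the dog sat"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```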
*Term Frequency*
The measurement of how frequently a term occurs within a document is called term
frequency. The simplest calculation is to count the occurrences of each word.
However, there are ways to adjust that value based on the length of the document
or on the frequency of the most frequent term, for example by dividing the raw
count by the total number of words in the document.
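A minimal Python sketch of term frequency, with the raw count normalised by
document length so long documents do not dominate:

```python
def term_frequency(document):
    tokens = document.lower().split()
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # Normalised TF: raw count divided by the total number of tokens.
    return {token: count / len(tokens) for token, count in counts.items()}

print(term_frequency("the cat sat on the mat"))
# 'the' appears 2 times out of 6 tokens, so its TF is about 0.33;
# every other word has TF of about 0.17.
```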