
NATURAL LANGUAGE PROCESSING - Class 10 Notes

NLP (Natural Language Processing) is dedicated to making it possible for computers
to comprehend and process human languages. It is a subfield of linguistics, computer
science, information engineering and artificial intelligence that studies how
computers interact with human languages, particularly how to train computers to
handle and analyze massive volumes of natural language data.

*Applications of natural language processing*

Most people use NLP applications regularly in their daily lives. The following are a
few examples of real-world uses of natural language processing:

Automatic summarization -- is useful for gathering data from social media and other
online sources, as well as for summarizing the meaning of documents and other
written material.

Sentiment analysis -- to better understand what Internet users are saying about a
company's goods and services, businesses use natural language processing tools such
as sentiment analysis to understand customers' requirements.

Text classification -- enables you to classify a document and organize it to make
it easier to find the information you need or to carry out certain tasks. Spam
screening in email is one example of how text classification is used.

Virtual assistants -- these days, digital assistants like Google Assistant, Cortana,
Siri and Alexa play a significant role in our lives. Not only can we communicate
with them, but they can also make our lives easier.

*Chatbots*
A chatbot is one of the most widely used NLP applications. Many chatbots on the
market today employ the same basic strategy.

There are two types of chatbots:

1. Script-bot
2. Smart-bot

Script bots are easy to make.
Script bots work around a script which is programmed into them.
Mostly they are free and easy to integrate into a messaging platform.
They have little or no language-processing skill.
They have limited functionality.

Smart bots are flexible and powerful.
Smart bots work on bigger databases and other resources directly.
Smart bots learn with more data.
Coding is required to get them up and running.
They have wide functionality.

*Human language vs. computer language*


Humans need language to communicate, and we process it constantly. Our brain
continuously processes the sounds it hears around us and works to make sense of
them. Even as the teacher is delivering a lesson in the classroom, our brain
continuously processes and stores everything.

The computer, on the other hand, understands only computer language. All input must
be converted to numbers before being fed into the machine, and if a single mistake
is made while typing, the machine throws an error and skips over that part. Machines
only use extremely simple and elementary forms of communication.

*Data Processing*
Data processing is a method of manipulating data. It means the conversion of raw
data into meaningful, machine-readable content; it is basically the process of
converting raw data into meaningful information.
Since human languages are complex, we first of all need to simplify them in order to
make sure that understanding becomes possible. Text normalization helps in cleaning
up the textual data in such a way that its complexity comes down to a level lower
than that of the actual data. Let us go through text normalization in detail.

*Text normalization*
The process of converting a text into a canonical (standard) form is known as text
normalization. For instance, the canonical form of the word "good" can be created
from the words "goood" and "gud".

*Sentence segmentation*
Under sentence segmentation, the whole corpus is divided into sentences. Each
sentence is taken as a separate piece of data, so the whole corpus gets reduced to
sentences.

*Tokenisation*
After sentence segmentation, each sentence is further divided into tokens. Any word,
number, or special character that appears in a sentence is referred to as a token.
Tokenisation treats each word, number, and special character as a separate entity
and creates a token for each of them.
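
A minimal Python sketch of these two steps, assuming a small sample corpus and
simple regular-expression rules (real projects often use a library such as NLTK):

import re

corpus = "Raj plays cricket. He scored 100 runs!"  # assumed sample corpus

# Sentence segmentation: split on end-of-sentence punctuation.
sentences = re.split(r"(?<=[.!?])\s+", corpus)

# Tokenisation: pull out words, numbers and special characters as tokens.
tokens = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

print(sentences)  # ['Raj plays cricket.', 'He scored 100 runs!']
print(tokens)     # [['Raj', 'plays', 'cricket', '.'], ['He', 'scored', '100', 'runs', '!']]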

*Removing stop words, special characters and numbers*

In this step, the tokens which are not necessary are removed from the token list.
What can be the possible words which we might not require?
Stop words are words that are used frequently in a corpus but provide nothing
useful. Humans use grammar to make their sentences clear and understandable for the
other person. However, grammatical terms fall under the category of stop words
because they do not add any significance to the information that is communicated
through the statement. Stop words include a, an, and, or, for, it, is, etc.
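
A small sketch of stop-word removal, using an assumed stop-word list and assumed
example tokens (real projects use much larger lists, such as NLTK's):

# Assumed, deliberately tiny stop-word list.
stop_words = {"a", "an", "and", "or", "for", "it", "is", "the", "to", "in"}

tokens = ["the", "children", "are", "playing", "in", "the", "garden"]

# Keep only the tokens that are not stop words.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['children', 'are', 'playing', 'garden']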

*Converting text to a common case*

After eliminating the stop words, we convert the whole text to a common case,
usually lower case. This makes sure that the machine does not treat the same word
as different terms solely because of varied case usage.
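
A one-line sketch of this step, using assumed example tokens:

tokens = ["Children", "PLAYING", "Garden"]
# Convert every token to lower case so "Children" and "children" look the same.
tokens = [t.lower() for t in tokens]
print(tokens)  # ['children', 'playing', 'garden']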

*Stemming*
The remaining words are reduced to their root words in this step. In other words,
stemming is the process of stripping words of their affixes and reducing them to
their root form. The stem that is produced may not itself be a meaningful word.
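
A minimal sketch of stemming, assuming the NLTK library is installed (the stemming
rules themselves are not part of these notes):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "healed", "studies"]:
    # Affixes are stripped by rule, so the stem may not be a dictionary word.
    print(word, "->", stemmer.stem(word))
# playing -> play, healed -> heal, studies -> studi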

*Lemmatisation*
Stemming and lemmatisation are alternative techniques to one another, as both
function to remove affixes. However, lemmatisation differs from stemming in that
the word which results from the removal of the affix (known as the lemma) is always
a meaningful word.
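
A minimal sketch of lemmatisation, again assuming NLTK is installed and that its
WordNet data has been downloaded (nltk.download("wordnet")):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Unlike the stems above, the lemmas are real dictionary words.
print(lemmatizer.lemmatize("studies"))           # study
print(lemmatizer.lemmatize("playing", pos="v"))  # play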

*Bag of Words*
A bag of words is a representation of text that describes the occurrence of words
within a document, without regard to where they appear. There are two components: a
vocabulary of known words, and a measure of the presence of those known words.
Bag of words is a natural language processing model that aids in the extraction of
textual features that can be used by machine learning techniques. We gather the
occurrences of each term from the bag of words and create the corpus's vocabulary.
Step-by-step approach to implement bag of words algorithm:
1. Text Normalisation: Collect data and pre-process it
2. Create Dictionary: Make a list of all unique words occurring in the corpus.
(Vocabulary)
3. Create document vectors: For each document in the corpus, find out how many
times each word from the unique list of words has occurred.
4. Create document vectors for all the documents
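
A minimal sketch of steps 2 to 4, assuming two tiny, already-normalised example
documents (the documents are illustrative, not from these notes):

# Assumed example documents, already tokenised and normalised.
docs = [["aman", "and", "anil", "are", "stressed"],
        ["aman", "went", "to", "a", "therapist"]]

# Step 2: create the dictionary (vocabulary) of unique words in the corpus.
vocabulary = sorted({word for doc in docs for word in doc})

# Steps 3 and 4: create a document vector of word counts for every document.
vectors = [[doc.count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)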

*Term Frequency*
The measurement of a term's frequency inside a document is called term frequency.
The simplest calculation is to count the occurrences of each word. However, this
value is often adjusted, for example by the length of the document or by the
frequency of the term that appears most often in it.
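
A small sketch of term frequency, assuming the common "raw count divided by document
length" adjustment (the notes themselves only require counting):

def term_frequency(term, document):
    # Count of the term, normalised by the length of the document.
    return document.count(term) / len(document)

doc = ["aman", "and", "anil", "are", "stressed"]  # assumed example document
print(term_frequency("aman", doc))  # 1 occurrence out of 5 tokens = 0.2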

*Inverse Document Frequency*

Inverse document frequency measures how common or rare a term is across the whole
corpus of documents. It is calculated by dividing the total number of documents in
the corpus by the number of documents that contain the term; multiplying term
frequency by inverse document frequency gives the TF-IDF score of a term.
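
A minimal sketch of IDF and the combined TF-IDF score, assuming three tiny example
documents and the commonly used logarithm of the ratio (the notes describe only the
plain ratio):

import math

# Assumed example documents, already tokenised and normalised.
docs = [["aman", "and", "anil", "are", "stressed"],
        ["aman", "went", "to", "a", "therapist"],
        ["anil", "went", "to", "download", "a", "health", "chatbot"]]

def inverse_document_frequency(term, documents):
    # Total number of documents divided by the number of documents containing the term.
    containing = sum(1 for d in documents if term in d)
    return math.log10(len(documents) / containing)

def tf_idf(term, document, documents):
    tf = document.count(term) / len(document)
    return tf * inverse_document_frequency(term, documents)

print(inverse_document_frequency("aman", docs))  # log10(3/2): "aman" is in 2 of 3 docs
print(tf_idf("stressed", docs[0], docs))         # rarer words get a higher weight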
