
Natural Language Processing

Overview of Natural Language Processing (NLP)

 Domain of AI: NLP is a branch of AI focused on enabling computers to process and understand human (natural) languages.
 Data Type: NLP works with text data, unlike Data Science (numeric/tabular data) and Computer Vision (visual data like images/videos).

Key Applications of NLP

1. Automatic Summarization: Helps condense information from large datasets, reducing redundancy and enhancing content diversity, especially for news or social media.
2. Sentiment Analysis: Detects opinions and emotions in text, useful for businesses to
gauge customer sentiment and make data-driven decisions.
3. Text Classification: Assigns categories to documents for efficient data organization, e.g.,
email spam filtering.
4. Virtual Assistants: NLP enables assistants (e.g., Siri, Alexa) to understand and perform
tasks based on voice commands using speech recognition.

Developing an NLP Project

 Problem Scoping: Define the problem (e.g., helping those with mental health issues via a
chatbot).
 Data Acquisition: Collect conversational data via surveys, databases, therapist sessions,
and interviews.
 Data Exploration and Cleaning: Process raw text data into a simpler, more machine-
readable format by normalizing and reducing vocabulary.
 Modeling: Feed preprocessed data into an NLP model (e.g., chatbot) to interpret and
respond to user queries.
 Evaluation: Test model accuracy by comparing chatbot responses with expected answers
to identify underfitting, perfect fit, or overfitting.

Types of Chatbots

 Script-bot: Operates on a predefined script, limited in flexibility, and usually has minimal language processing.
 Smart-bot: AI-powered, flexible, and capable of learning from data, e.g., Google Assistant and Alexa.

Challenges in NLP

1. Syntax and Semantics: Computers must interpret grammatical structure (syntax) and
contextual meaning (semantics).
o Syntax issues: Different parts of speech (nouns, verbs) complicate sentence
structure.
o Meaning: Words can have multiple meanings depending on context.
2. Ambiguity in Language: Phrases with correct syntax may lack clear meaning or convey
multiple interpretations.
o Example: "Chickens feed extravagantly while the moon drinks tea."
Grammatical, yet nonsensical.

Human vs. Computer Language Processing

 Human Processing: The brain continuously interprets and prioritizes spoken language.
 Computer Processing: Computers need text converted to numeric form; even minor
errors cause failures in processing.

1. Data Processing in NLP: Why It’s Important

 Natural language, as used by humans, is full of variations and ambiguities, making it challenging
for computers, which interpret everything numerically.
 Data processing in NLP involves cleaning, organizing, and transforming raw text data into a
structured format that algorithms can work with. It’s an essential first step for applications like
sentiment analysis, chatbots, machine translation, and more.

2. Text Normalisation

 Text normalization simplifies text to make it consistent across documents, reducing complexity
and making processing easier. This often includes cleaning, formatting, and unifying case across
text.

Key Steps in Text Normalisation (see the combined code sketch after this list):

 Sentence Segmentation: Dividing a body of text (or corpus) into individual sentences.
o Useful for models that process one sentence at a time.
o Achieved by splitting text based on punctuation marks or predefined sentence
boundaries.

 Tokenisation: Breaking down each sentence into smaller units called tokens.
o Tokens can be words, sub-words, characters, or symbols.
o Tokenisation enables more granular analysis of text, like understanding individual words
in a sentence.
o Different tokenization approaches:
 Word tokenisation: Splits text by spaces and punctuation.
 Sub-word tokenisation: Splits at the level of sub-words, useful for rare words or
morphological analysis.
 Character tokenisation: Splits text into individual characters, often used for
languages with complex morphology.

 Stopwords Removal: Commonly used words that don’t contribute significant meaning
(e.g., "the," "is," "and") are removed to reduce noise.
o Stopwords vary by language and context and can be defined manually or with pre-built
libraries.
o Helps focus on meaningful words that carry sentiment or key information.

 Removing Special Characters and Numbers: Strips away characters like punctuation,
symbols, and numbers if they’re irrelevant to analysis.
o This step is context-dependent. For example, punctuation might be kept in sentiment
analysis as it can convey emotion (like “!” for emphasis).

 Case Conversion: Converts all text to a common case, typically lowercase.
o Prevents case differences from misleading the model (e.g., "Apple" and "apple" are treated as the same word after this step).

3. Stemming

 Stemming is the process of reducing words to their root form by stripping affixes (e.g., "running"
to "run").
 The result is often not a real word but a base form, helping reduce vocabulary size.
 Example: "playing," "played," and "plays" all become "play."
 Stemming speeds up processing by simplifying words, but it may produce stems that are not real words (e.g., "studi" from "studies"), which can reduce accuracy.

4. Lemmatization

 Lemmatization is similar to stemming but focuses on obtaining actual root words (lemmas) that
are valid in the language.
 Lemmatization considers the context and part of speech to yield meaningful base forms.
 Example: with part-of-speech information, "better" is lemmatized to the valid word "good," whereas a stemmer would leave it unchanged or truncate it to a non-word.
 Though slower than stemming, lemmatization is more accurate and useful for applications
requiring linguistic accuracy, such as language translation and document classification.
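A lemmatization sketch using NLTK's WordNet lemmatizer; note how the part-of-speech argument ("a" = adjective, "v" = verb, "n" = noun) affects the result:

    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # lexical database behind the lemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("better", pos="a"))   # -> good   (adjective)
    print(lemmatizer.lemmatize("running", pos="v"))  # -> run    (verb)
    print(lemmatizer.lemmatize("studies", pos="n"))  # -> study  (noun)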

5. Bag of Words (BoW)

 Bag of Words is a simple method for extracting features from text for machine learning models.
 Represents text data in terms of a collection of words and their frequencies, disregarding
grammar and word order.

Steps in Creating Bag of Words:

 Vocabulary Extraction: Building a list of the unique words in the corpus.
 Vectorization: Creating a document-word matrix where each row represents a document, each column represents a word, and each cell records the count of that word in the document.
 Example:
o Text corpus: ["I love NLP", "I love AI", "AI is fun"]
o Vocabulary: ["I", "love", "NLP", "AI", "is", "fun"]
o Document vectors:
 ["I love NLP"] → [1, 1, 1, 0, 0, 0]
 ["I love AI"] → [1, 1, 0, 1, 0, 0]
 ["AI is fun"] → [0, 0, 0, 1, 1, 1]
 Bag of Words simplifies data but ignores word order, which can sometimes lead to loss of
contextual information.
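A from-scratch sketch in Python that reproduces the document vectors above. (A library such as scikit-learn's CountVectorizer would also work, but its default tokeniser drops single-character tokens like "I", so the matrix is built manually here to match the example.)

    corpus = ["I love NLP", "I love AI", "AI is fun"]

    # Vocabulary extraction: unique words in order of first appearance.
    vocabulary = []
    for document in corpus:
        for word in document.split():
            if word not in vocabulary:
                vocabulary.append(word)
    print(vocabulary)  # ['I', 'love', 'NLP', 'AI', 'is', 'fun']

    # Vectorization: one row per document, one column per vocabulary word,
    # each cell holding the count of that word in the document.
    for document in corpus:
        tokens = document.split()
        print(document, "->", [tokens.count(word) for word in vocabulary])

    # I love NLP -> [1, 1, 1, 0, 0, 0]
    # I love AI  -> [1, 1, 0, 1, 0, 0]
    # AI is fun  -> [0, 0, 0, 1, 1, 1]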

6. TFIDF (Term Frequency & Inverse Document Frequency)

 TFIDF enhances the Bag of Words model by assigning importance to words based on their
frequency in a document compared to the entire corpus.
 Words that are common in all documents receive lower scores, while words that are unique to a
document are given higher importance.

Components:

 Term Frequency (TF): Measures how often a word appears in a specific document.
o Formula: TF = (Number of occurrences of a word in the document) / (Total number of
words in the document).
 Inverse Document Frequency (IDF): Decreases the weight of words that appear frequently
across many documents.
o Formula: IDF = log(Total number of documents / Number of documents containing the
word).
 TFIDF Calculation: Multiplies TF and IDF values to balance between term frequency and global
rarity.
o TFIDF(w) = TF(w) * IDF(w).
 Example:
o Suppose "AI" appears in most documents of a corpus, while "NLP" appears in only one. "AI" receives a low IDF, and therefore a low TFIDF score, while "NLP" receives a high IDF, so the document containing it is flagged as distinctive. In this way TFIDF downweights terms that occur across all documents and highlights words that are unique or rare within specific documents.
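A worked sketch of the formulas above, applied to the Bag of Words corpus from section 5. Note that libraries such as scikit-learn use a smoothed IDF variant, so their scores differ slightly.

    import math

    corpus = [doc.lower().split() for doc in ["I love NLP", "I love AI", "AI is fun"]]
    N = len(corpus)  # total number of documents

    def tfidf(word, document):
        tf = document.count(word) / len(document)     # term frequency
        df = sum(1 for doc in corpus if word in doc)  # documents containing the word
        idf = math.log(N / df)                        # inverse document frequency
        return tf * idf

    print(round(tfidf("nlp", corpus[0]), 3))  # 0.366 - unique to one document, high score
    print(round(tfidf("i", corpus[0]), 3))    # 0.135 - shared by two documents, lower score
    print(round(tfidf("ai", corpus[2]), 3))   # 0.135 - shared by two documents, lower score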

Applications of TFIDF:

 Document Classification: Assigns documents to categories based on keyword relevance.
 Topic Modeling: Helps identify prominent themes in large corpora.
 Information Retrieval: Improves the accuracy of search engines and recommendation systems.
 Stop Word Filtering: Automatically reduces the significance of frequent, non-specific words.

In essence, data processing in NLP prepares raw text for advanced analysis and machine
learning, converting language into structured formats that emphasize meaningful content and
reduce redundancies. This pre-processed data can then be used for various applications,
including sentiment analysis, text classification, chatbots, and more.
