Natural Language Processing_NOTES
Problem Scoping: Define the problem (e.g., helping those with mental health issues via a
chatbot).
Data Acquisition: Collect conversational data via surveys, databases, therapist sessions,
and interviews.
Data Exploration and Cleaning: Process raw text data into a simpler, more machine-
readable format by normalizing and reducing vocabulary.
Modeling: Feed preprocessed data into an NLP model (e.g., chatbot) to interpret and
respond to user queries.
Evaluation: Test model accuracy by comparing chatbot responses with expected answers
to identify underfitting, perfect fit, or overfitting.
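As a rough illustration of the evaluation step, the following Python sketch compares a chatbot's replies against expected answers and reports an exact-match accuracy. The question/answer pairs and the matching rule are illustrative assumptions, not part of any particular chatbot.

```python
# Minimal evaluation sketch: compare chatbot replies with expected answers.
# The test questions, replies, and exact-match criterion are illustrative assumptions.

expected = {
    "how are you feeling today?": "i'm here to listen. tell me more.",
    "i feel anxious": "that sounds hard. what usually helps you relax?",
    "thank you": "you're welcome. i'm always here to talk.",
}

chatbot_replies = {
    "how are you feeling today?": "i'm here to listen. tell me more.",
    "i feel anxious": "have you tried turning it off and on again?",   # a wrong reply
    "thank you": "you're welcome. i'm always here to talk.",
}

correct = sum(
    1 for question, answer in expected.items()
    if chatbot_replies.get(question, "").strip().lower() == answer
)
accuracy = correct / len(expected)
print(f"Accuracy: {accuracy:.0%}")   # 67% here; a very low score suggests underfitting
```

Real conversational models are evaluated with softer criteria than exact string matching, but the idea of scoring predictions against expected outputs is the same.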
Types of Chatbots
o Script bot: follows a fixed, predefined script of questions and responses; simple to build, limited in functionality, and does not learn.
o Smart bot: uses NLP and machine learning to understand free-form input and improve with data (e.g., voice assistants such as Alexa or Google Assistant).
Challenges in NLP
1. Syntax and Semantics: Computers must interpret grammatical structure (syntax) and
contextual meaning (semantics).
o Syntax issues: Different parts of speech (nouns, verbs) complicate sentence
structure.
o Meaning: Words can have multiple meanings depending on context.
2. Ambiguity in Language: Phrases with correct syntax may lack clear meaning or convey
multiple interpretations.
o Example: "Chickens feed extravagantly while the moon drinks tea."
Grammatical, yet nonsensical.
Human Processing: The brain continuously interprets and prioritizes spoken language.
Computer Processing: Computers need text converted to numeric form; even minor
errors cause failures in processing.
Natural language, as used by humans, is full of variations and ambiguities, making it challenging
for computers, which interpret everything numerically.
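To make the "numeric form" point concrete, here is a minimal plain-Python sketch that maps each word to an integer ID before any further processing; the sentence and the resulting vocabulary are purely illustrative.

```python
# Minimal sketch: map words to integer IDs so text becomes numeric.
sentence = "natural language is full of variations and ambiguities"
words = sentence.split()

vocab = {}                               # word -> integer ID
for word in words:
    vocab.setdefault(word, len(vocab))   # assign the next free ID on first sight

encoded = [vocab[word] for word in words]
print(vocab)     # {'natural': 0, 'language': 1, ...}
print(encoded)   # [0, 1, 2, 3, 4, 5, 6, 7]
```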
1. Data Processing
Data processing in NLP involves cleaning, organizing, and transforming raw text data into a structured format that algorithms can work with. It is an essential first step for applications like sentiment analysis, chatbots, machine translation, and more.
2. Text Normalisation
Text normalisation simplifies text to make it consistent across documents, reducing complexity and making processing easier. This often includes cleaning, formatting, and unifying case across text (the individual steps below are illustrated together in a code sketch after the list).
Sentence Segmentation: Dividing a body of text (or corpus) into individual sentences.
o Useful for models that process one sentence at a time.
o Achieved by splitting text based on punctuation marks or predefined sentence
boundaries.
Tokenisation: Breaking down each sentence into smaller units called tokens.
o Tokens can be words, sub-words, characters, or symbols.
o Tokenisation enables more granular analysis of text, like understanding individual words
in a sentence.
o Different tokenization approaches:
Word tokenisation: Splits text by spaces and punctuation.
Sub-word tokenisation: Splits at the level of sub-words, useful for rare words or
morphological analysis.
Character tokenisation: Splits text into individual characters, often used for
languages with complex morphology.
Stopwords Removal: Commonly used words that don’t contribute significant meaning
(e.g., "the," "is," "and") are removed to reduce noise.
o Stopwords vary by language and context and can be defined manually or with pre-built
libraries.
o Helps focus on meaningful words that carry sentiment or key information.
Removing Special Characters and Numbers: Strips away characters like punctuation,
symbols, and numbers if they’re irrelevant to analysis.
o This step is context-dependent. For example, punctuation might be kept in sentiment
analysis as it can convey emotion (like “!” for emphasis).
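A combined sketch of these normalisation steps, assuming NLTK is installed and its "punkt" and "stopwords" data have been downloaded (depending on your NLTK version you may also need the "punkt_tab" package); the sample corpus is illustrative.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokenizer model and the stopword list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

corpus = "NLP is fascinating! It powers chatbots, translation and more. Let's clean this text."

# 1. Sentence segmentation: split the corpus into sentences.
sentences = sent_tokenize(corpus)

# 2. Unify case and strip special characters / numbers (a context-dependent step).
cleaned = [re.sub(r"[^a-z\s]", " ", s.lower()) for s in sentences]

# 3. Word tokenisation: break each sentence into tokens.
tokens = [word_tokenize(s) for s in cleaned]

# 4. Stopword removal: drop common words that carry little meaning.
stop_words = set(stopwords.words("english"))
filtered = [[t for t in sent if t not in stop_words] for sent in tokens]

print(filtered)   # remaining tokens per sentence, e.g. [['nlp', 'fascinating'], ...]
```

Whether punctuation and numbers are stripped in step 2 depends on the application, as noted above.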
3. Stemming
Stemming is the process of reducing words to their root form by stripping affixes (e.g., "running"
to "run").
The result is often not a real word but a base form, helping reduce vocabulary size.
Example: "playing," "played," and "plays" all become "play."
Stemming speeds up processing by simplifying words, but because it relies on crude suffix-stripping rules it can produce forms that are not real words (e.g., a naive rule may turn "running" into "runn" rather than "run").
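A small sketch of stemming with NLTK's PorterStemmer (assuming NLTK is installed); the word lists are illustrative, and exact outputs can vary slightly between stemmers.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The same root is recovered for different inflected forms...
for word in ["playing", "played", "plays"]:
    print(word, "->", stemmer.stem(word))    # all three reduce to "play"

# ...but the output is not always a real word.
for word in ["studies", "happily", "cries"]:
    print(word, "->", stemmer.stem(word))    # e.g. "studi", "happili", "cri"
```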
4. Lemmatization
Lemmatization is similar to stemming but focuses on obtaining actual root words (lemmas) that
are valid in the language.
Lemmatization considers the context and part of speech to yield meaningful base forms.
Example: "better" is lemmatized to "good" (when treated as an adjective), whereas a stemmer might leave it unchanged or crudely truncate it to "bet."
Though slower than stemming, lemmatization is more accurate and useful for applications
requiring linguistic accuracy, such as language translation and document classification.
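A comparable sketch using NLTK's WordNetLemmatizer (assuming NLTK and its "wordnet" data are available); note that supplying the part of speech changes the result.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # one-time download of the WordNet data

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("better", pos="a"))   # -> "good"  (adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"   (verb)
print(lemmatizer.lemmatize("mice"))              # -> "mouse" (noun is the default POS)
print(lemmatizer.lemmatize("better"))            # -> "better" (wrong POS, so nothing changes)
```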
5. Bag of Words
Bag of Words (BoW) is a simple method for extracting features from text for machine learning models.
It represents text as a collection of words and their frequencies, disregarding grammar and word order.
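A minimal Bag of Words sketch using scikit-learn's CountVectorizer (assuming scikit-learn 1.0+ is installed); the two documents are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two illustrative documents.
docs = [
    "AI chatbots use NLP",
    "NLP helps chatbots understand text, and text is data",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)          # document-term count matrix

print(vectorizer.get_feature_names_out())     # the learned vocabulary (word order is ignored)
print(bow.toarray())                          # word counts per document
```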
6. TFIDF (Term Frequency - Inverse Document Frequency)
TFIDF enhances the Bag of Words model by assigning importance to words based on their frequency in a document compared to the entire corpus.
Words that are common in all documents receive lower scores, while words that are unique to a document are given higher importance.
Components:
Term Frequency (TF): Measures how often a word appears in a specific document.
o Formula: TF = (Number of occurrences of a word in the document) / (Total number of
words in the document).
Inverse Document Frequency (IDF): Decreases the weight of words that appear frequently
across many documents.
o Formula: IDF = log(Total number of documents / Number of documents containing the
word).
TFIDF Calculation: Multiplies TF and IDF values to balance between term frequency and global
rarity.
o TFIDF(w) = TF(w) * IDF(w).
Example:
o Suppose "AI" appears in almost every document of the corpus, while "NLP" appears in only a few.
o "AI" then receives a low IDF, so its TFIDF score stays low even in documents where it is frequent; "NLP" receives a high IDF, so the documents that do contain it get a high TFIDF score for that word.
o In this way TFIDF downweights terms that are common across all documents and highlights words that are distinctive to particular documents.
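The following plain-Python sketch applies the formulas above to a small illustrative corpus; note that libraries such as scikit-learn's TfidfVectorizer use smoothed variants of the same idea.

```python
import math

# Illustrative corpus of three tokenised documents.
docs = [
    ["ai", "is", "everywhere", "ai", "helps"],
    ["nlp", "is", "a", "branch", "of", "ai"],
    ["chatbots", "use", "ai", "and", "nlp"],
]

def tf(word, doc):
    # Term Frequency: occurrences of the word / total words in the document.
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # Inverse Document Frequency: log(total documents / documents containing the word).
    containing = sum(1 for d in corpus if word in d)
    return math.log(len(corpus) / containing)

def tfidf(word, doc, corpus):
    return tf(word, doc) * idf(word, corpus)

# "ai" appears in every document, so its IDF (and therefore its TFIDF) is 0.
print(round(tfidf("ai", docs[0], docs), 3))      # 0.0
# "nlp" appears in only two of the three documents, so it scores higher.
print(round(tfidf("nlp", docs[1], docs), 3))     # ~0.068
```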
Applications of TFIDF:
o Document classification and topic modelling.
o Information retrieval and search-engine ranking.
o Keyword extraction and stop-word filtering.
In essence, data processing in NLP prepares raw text for advanced analysis and machine
learning, converting language into structured formats that emphasize meaningful content and
reduce redundancies. This pre-processed data can then be used for various applications,
including sentiment analysis, text classification, chatbots, and more.