Week 8-Module 7 NLP
Natural Language Processing
Natural Language Processing (NLP)
NLP has a vast range of applications that are woven into our daily lives:
• Machine Translation: Breaking down language barriers by translating text or speech from one language to another [e.g., Google Translate].
• Smart Assistants: Responding to voice commands and questions in a natural way [e.g., Siri, Alexa, Google Assistant].
• Sentiment Analysis: Extracting opinions and emotions from text data [e.g., social
media monitoring].
• Text Summarization: Condensing large amounts of text into key points.
• Autocorrect and Predictive Text: Suggesting corrections and completions as you
type.
• Spam Filtering: Identifying and blocking unwanted emails.
• Search Engines: Ranking search results based on relevance to your query.
Fundamental NLP Tasks
Here's a glimpse into some fundamental NLP tasks that form the building blocks
for many applications:
• Tokenization: Breaking down text into smaller units like words, punctuation
marks, or phrases.
• Part-of-Speech (POS) tagging: Identifying the grammatical function of each
word in a sentence (e.g., noun, verb, adjective).
• Named Entity Recognition (NER): Recognizing and classifying named
entities in text, such as people, organizations, locations, dates, monetary
values, etc.
1. Tokenization:
Imagine you're dissecting a sentence. Tokenization is the first step, where you
break the sentence down into its individual building blocks. These blocks
can be:
• Words: "The", "quick", "brown", "fox"
• Punctuation marks: ".", ",", "?"
• Sometimes even phrases: "New York City" (depending on the application)
2. POS Tagging:
After you have your tokens, POS tagging assigns a grammatical role
(part-of-speech) to each one. Here's an example:
Sentence: "The quick brown fox jumps over the lazy dog."
3. Named Entity Recognition (NER):
This focuses on identifying and classifying specific entities within the tokens. Imagine circling important names on a page. NER does something similar, recognizing entities like:
• People: "Albert Einstein"
• Organizations: "Google"
• Locations: "Paris"
• Dates: "July 4th, 2024"
• Monetary values: "$100"
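A small illustration of NER using spaCy (assuming spaCy and its small English model en_core_web_sm are installed; the sentence is invented just to show several entity types):

```python
import spacy

# Assumes the small English model has been installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Albert Einstein visited Paris and gave a talk at Google on July 4th, 2024 for $100.")

# Each entity is a text span plus a label such as PERSON, ORG, GPE, DATE, MONEY
for ent in doc.ents:
    print(ent.text, ent.label_)
```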
Practical Examples
1. Search Engines:
Tokenization: When you search for "best restaurants NYC", the search engine
breaks it down into tokens like "best", "restaurants", "NYC".
NER: This helps the search engine understand you're looking for highly-rated
restaurants in New York City and refines the search results accordingly.
2. Social Media Analysis:
Tokenization: Analyzing a tweet like "Feeling POS Tagging: It can identify "Feeling" as a verb, NER: This might not be relevant here, but NER
great after winning the game #GoTeam! "great" as an adjective, "winning" as a verb could be used to identify the team mentioned in the
#Champions". (participle), "game" as a noun, and hashtags as hashtags for further analysis.
proper nouns.
3. Spam Filtering:
Tokenization: Breaking down a spam email with subject line "Free $$$ for you!".
NER: This might not have much role here, but tokenization and POS
tagging help identify the generic and promotional nature of the email,
potentially flagging it as spam.
4. Machine Translation:
• Tokenization: Breaking down a sentence in one language (e.g., Spanish) into individual words.
Text Cleaning and Normalization
• Text data often comes in a raw and messy format. It can contain inconsistencies, irrelevant information, and variations in how words are written.
• Cleaning and normalization are crucial steps in NLP to prepare the text for further processing. Here's a breakdown of some common techniques:
1. Removing Stopwords: Stopwords are very common words that carry little meaning on their own (e.g., "the", "a", "is"). Removing them can improve processing efficiency and focus the analysis on more content-rich words.
2. Removing Special Characters: Punctuation marks, symbols, and emojis can add noise to the data. Depending on the task, you might choose to remove them entirely or convert them to a standard format.
3. Lowercasing/Uppercasing: Text data can be written in different cases (uppercase, lowercase). Converting everything to lowercase or uppercase ensures consistency and simplifies further processing.
4. Normalizing Text:
These techniques aim to reduce words to their base forms. However, they have subtle
differences:
Lemmatization: This process tries to convert a word to its dictionary form (lemma),
considering its grammatical role in the sentence (e.g., "running" becomes "run", "better"
becomes "good"). It requires a morphological analysis of the word.
Stemming: This process chops off suffixes to arrive at a base form (stem) that might not
always be a real word (e.g., "running" becomes "run", "better" becomes "bet"). It's a
simpler and faster approach but can sometimes lead to incorrect base forms.
The choice between lemmatization and stemming depends on your specific application. Lemmatization is generally preferred for tasks where preserving meaning and grammatical accuracy is crucial. Stemming can be faster and sufficient for simpler tasks where the exact meaning of the base form isn't critical.
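A quick comparison of the two using NLTK (assuming the WordNet data is available; the outputs shown as comments are typical but depend on the stemmer and lemmatizer used):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # dictionary data used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "better"]

# Stemming: rule-based suffix stripping; the result may not be a real word
print([stemmer.stem(w) for w in words])                       # e.g. ['run', 'studi', 'better']

# Lemmatization: dictionary lookup that uses the word's part of speech
print([lemmatizer.lemmatize(w, pos="v") for w in words[:2]])  # e.g. ['run', 'study']
print(lemmatizer.lemmatize("better", pos="a"))                # 'good'
```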
Additional Considerations
• Text Normalization Libraries: Libraries like NLTK (Python) and spaCy (Python) offer functionality for many of these text cleaning and normalization tasks.
• Context-Specific Normalization: The specific techniques you apply might vary depending on your NLP task and the nature of your text data.
• Trade-offs: There can be trade-offs between cleaning too aggressively and losing information, and cleaning too lightly and leaving noise in. Finding the right balance depends on your specific needs.
Some Examples
1. Social Media Sentiment Analysis:
Imagine analyzing tweets to understand public sentiment towards a new
product launch. You'd want to clean the text by:
• Removing stopwords: Words like "a", "the", "is" don't contribute much to
sentiment.
• Removing special characters: Emojis, hashtags, and punctuation can be
removed or converted for consistency.
• Lowercasing: Case variations shouldn't affect sentiment analysis.
• Normalizing slang and abbreviations: "OMG" could be converted to "oh
my god" for better understanding.
2. Web Scraping and Text Summarization:
You might scrape news articles to summarize the main points.
Here, cleaning involves:
• Removing HTML tags and code: Irrelevant for textual content.
• Correcting typos and misspellings: Users might make mistakes while typing.
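One common way to strip HTML before summarization is BeautifulSoup; a minimal sketch (assuming the beautifulsoup4 package is installed, with a made-up HTML snippet):

```python
from bs4 import BeautifulSoup   # third-party: beautifulsoup4 (assumed installed)

html = "<article><h1>Breaking News</h1><p>Markets <b>rallied</b> today.</p><script>track()</script></article>"

soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):   # drop embedded code that is irrelevant for the text
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text)   # "Breaking News Markets rallied today."
```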
3. Word Embeddings (e.g., Word2Vec, GloVe)
Concept: Represent words as numerical vectors that capture semantic relationships, so that words with similar meanings end up close together in the vector space.
Techniques:
Word2Vec: Two popular architectures are Skip-gram and CBOW. They predict surrounding words based on a given
word (Skip-gram) or vice versa (CBOW). Words used for prediction and the target word become closer in the vector
space.
GloVe: Analyzes word co-occurrence statistics from a large corpus to learn word vectors. Words that frequently co-occur
are positioned closer in the vector space.
Benefits:
• Captures semantic relationships between words.
• Enables tasks like word similarity detection and analogy completion.
• Can be used as input features for various NLP models.
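A toy Word2Vec example with Gensim (the three-sentence corpus is far too small for meaningful vectors and is only meant to show the API; sg=1 selects Skip-gram, sg=0 CBOW):

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens (a real corpus would be much larger)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects the Skip-gram architecture; sg=0 would use CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50, seed=1)

print(model.wv["king"].shape)                 # (50,) — each word is a dense vector
print(model.wv.most_similar("king", topn=2))  # nearest neighbours in the vector space
```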
4. Language Models and Pre-trained
Transformers:
Concept: Language models are statistical methods that predict the next word in a sequence based on the preceding
words. Pre-trained transformers are powerful language models trained on massive amounts of text data.
Techniques:
Traditional Language Models (e.g., n-grams): Predict the next word based on the n preceding words (e.g., bigrams,
trigrams).
Pre-trained Transformers (e.g., BERT, GPT-3): These are complex neural network architectures trained on massive text
corpora. They learn contextual representations of words and can be fine-tuned for various NLP tasks like text
classification, question answering, and summarization.
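A brief sketch of using a pre-trained transformer through the Hugging Face pipeline API (assuming the transformers library is installed; gpt2 and the library's default sentiment model are used purely as illustrations):

```python
from transformers import pipeline   # Hugging Face Transformers (assumed installed)

# Load a small pre-trained language model; "gpt2" is used here only as an example
generator = pipeline("text-generation", model="gpt2")

# The model predicts the next words in the sequence based on the preceding context
print(generator("Natural language processing lets computers", max_new_tokens=20)[0]["generated_text"])

# The same library exposes fine-tuned models for other tasks, e.g. sentiment classification
classifier = pipeline("sentiment-analysis")
print(classifier("I really enjoyed this module!"))
```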
Benefits:
• They learn rich contextual representations of words.
• A single pre-trained model can be fine-tuned for many downstream tasks such as text classification, question answering, and summarization.
Sentiment Analysis
Sentiment analysis, also known as opinion mining, is the process of computationally identifying and classifying the emotional tone behind a
piece of text. It aims to understand whether the sentiment expressed is positive, negative, or neutral.
Applications:
• Social media monitoring: Analyze public opinion towards brands, products, or events.
• Customer reviews: Understand customer satisfaction and identify areas for improvement.
• Market research: Gauge audience sentiment towards specific topics or products.
• Spam filtering: Identify and filter out spam emails with negative or promotional tones.
Techniques:
• Lexicon-based approach: Uses pre-defined dictionaries of words with positive, negative, and neutral sentiment scores. The overall sentiment is calculated based on the sentiment scores of the words in the text.
• Machine learning: Trains models on labeled data (text with known sentiment) to automatically classify new text. Popular algorithms include Naive Bayes, Support Vector Machines (SVM), and Logistic Regression.
• Deep learning: Utilizes neural networks like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks to capture complex relationships between words and improve sentiment classification accuracy.
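As a small example of the lexicon-based approach, NLTK ships the VADER lexicon (assuming the vader_lexicon data is downloaded):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # lexicon of word-level sentiment scores

sia = SentimentIntensityAnalyzer()

# The compound score ranges from -1 (most negative) to +1 (most positive)
print(sia.polarity_scores("I love this product, it works great!"))
print(sia.polarity_scores("Terrible service, I want a refund."))
```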
Building Sentiment Analysis Models
1. Data Preparation:
Preprocess the text by cleaning it (removing noise, punctuation, stop words) and
potentially normalizing it (lowercasing, stemming/lemmatization).
2. Feature Engineering:
For machine learning models, create features that represent the text. This could involve:
• Bag-of-Words (BoW): Represent the text as a vector where each element indicates the frequency of a word in the vocabulary.
• TF-IDF: Assigns weights to words based on their importance within the document and across the corpus.
• Word Embeddings: Represent words as numerical vectors capturing semantic relationships.
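A short scikit-learn sketch of the first two feature types (the three documents are made up; word embeddings would typically come from a separate model such as Word2Vec, shown earlier):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the movie was great", "the movie was terrible", "great acting, great story"]

# Bag-of-Words: each column is a vocabulary word, each value a raw count
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: counts are re-weighted by how informative a word is across the corpus
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray().round(2))
```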
3. Model Training:
• Choose a suitable machine learning or deep learning algorithm for sentiment classification.
• Train the model on your labeled data.
4. Evaluation:
• Evaluate the model's performance on a separate test dataset.
• Use metrics like accuracy, precision, recall, and F1-score to assess the model's performance.
• Fine-tune the model or explore different algorithms if performance is not satisfactory.
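A compact end-to-end sketch with scikit-learn, combining TF-IDF features, a Logistic Regression classifier, and the usual metrics (the eight labeled examples are invented and far too few for a real model):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset; a real project would use thousands of labeled examples
texts = ["great product", "awful experience", "loved it", "worst purchase ever",
         "really happy with this", "completely disappointed", "works perfectly", "broke after a day"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# TF-IDF features + Logistic Regression classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Accuracy, precision, recall, and F1 on the held-out test set
print(classification_report(y_test, model.predict(X_test)))
```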
Interpreting Sentiment Analysis Results
• Sentiment analysis models assign a sentiment score or class (positive, negative, neutral) to a piece of text.
• It's crucial to understand the limitations: models might misclassify sarcasm, irony, or complex emotions.
Theoretical Explanation
• Machine Learning: Algorithms learn patterns from labeled data to classify new text samples.
2. Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm. Here's the basic idea:
• Each document is assumed to be a mixture
of various topics in different proportions.
• Each topic is represented by a probability
distribution over words in the vocabulary.
LDA analyzes the documents in a corpus and tries to discover these underlying topics and their distribution across documents.
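A minimal LDA example with Gensim (the four tokenized documents are invented; a real corpus would be much larger):

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus of tokenized documents (already cleaned and stopword-filtered)
texts = [
    ["bank", "loan", "interest", "credit"],
    ["loan", "credit", "bank", "mortgage"],
    ["match", "goal", "team", "score"],
    ["team", "score", "season", "goal"],
]

dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]    # bag-of-words representation

# Fit LDA with 2 topics: each topic is a distribution over words,
# each document a mixture of topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=42)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)

print(lda.get_document_topics(corpus[0]))   # topic proportions for the first document
```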
3. Evaluating Topic Models and Selecting the Optimal Number of Topics
There's no single "best" number of topics for LDA. Here are some approaches to guide your selection:
• Perplexity: LDA calculates perplexity, a measure of how well the model fits unseen data. Lower perplexity often indicates a better fit. However, it can be sensitive to model parameters.
• Topic Coherence: Evaluate how well the words within a topic are semantically related. Various metrics like the coherence score (CoherenceModel in Gensim) can help assess this.
• Domain Knowledge: Consider your understanding of the domain and the expected number of relevant themes within the documents.
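A sketch of comparing candidate topic counts with perplexity and coherence in Gensim (it reuses the texts, dictionary, and corpus variables from the previous sketch):

```python
from gensim.models import CoherenceModel, LdaModel

# Assumes `texts`, `dictionary`, and `corpus` were built as in the previous sketch
for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=20, random_state=42)

    # Perplexity on the corpus (gensim reports a log value; lower perplexity suggests a better fit)
    print(k, "log perplexity:", lda.log_perplexity(corpus))

    # Topic coherence (c_v): higher values mean the top words in each topic are more related
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(k, "coherence:", coherence.get_coherence())
```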
4. Introduction to Text Generation Techniques
Text generation aims to create coherent and realistic sequences of words, similar to
human-written text. Here are two common approaches:
1. Markov Chains:
A Markov chain is a statistical model that predicts the next word based on the
probability of it appearing after a specific sequence of preceding words (n-grams).
Simple and computationally efficient, but generated text can be repetitive and lack
long-range coherence.
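A tiny bigram Markov chain generator in plain Python (the corpus sentence is made up; real models use much larger corpora and longer n-grams):

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat slept on the mat".split()

# Build the bigram transition table: word -> list of words observed to follow it
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

def generate(start, length=8, seed=0):
    random.seed(seed)
    word, output = start, [start]
    for _ in range(length):
        followers = transitions.get(word)
        if not followers:          # dead end: no observed continuation
            break
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

print(generate("the"))
```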
2. Recurrent Neural Networks (RNNs):
RNNs are a type of neural network architecture specifically designed for sequential data like text. They can learn complex relationships between words across longer sequences, leading to more sophisticated and grammatically correct text generation.
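A minimal Keras sketch of such a model: an Embedding layer feeding an LSTM that predicts the next word id (the vocabulary size, sequence length, and random training data are placeholders, only meant to show the shapes involved):

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len = 1000, 20   # illustrative sizes

# A minimal LSTM language model: given a sequence of word ids,
# predict the probability distribution of the next word
model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=64),
    layers.LSTM(128),
    layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Dummy data just to show the expected shapes: X is (samples, seq_len), y holds next-word ids
X = np.random.randint(0, vocab_size, size=(32, seq_len))
y = np.random.randint(0, vocab_size, size=(32,))
model.fit(X, y, epochs=1, verbose=0)

print(model.predict(X[:1]).shape)   # (1, vocab_size): probabilities for the next word
```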