SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB
AIM: To perform sentiment analysis using an SVM classifier with TF-IDF vectorization.
PROCEDURE:
Data Preparation: Downloading the dataset, converting it into a suitable format (words and
sentiments), and structuring it into a DataFrame.
Splitting Data: Dividing the dataset into training and testing sets to train the model on a
portion and evaluate it on another.
TF-IDF Vectorization: Converting text data into numerical vectors using the TF-IDF (Term
Frequency-Inverse Document Frequency) representation (the weighting formula is given after this list).
SVM Initialization and Training: Setting up an SVM classifier and training it using the TF-
IDF vectors obtained from the training text data.
Prediction and Evaluation: Transforming test data into TF-IDF vectors, predicting sentiment
labels, and evaluating the model's performance by comparing predicted labels with actual
labels using accuracy and a classification report.
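For reference, the TF-IDF weight of a term t in a document d, over a corpus of N documents, is commonly defined as

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the frequency of t in d and df(t) is the number of documents containing t. scikit-learn's TfidfVectorizer computes a smoothed, L2-normalized variant of this weight.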
The following algorithm outlines the process of building a sentiment analysis model using an
SVM classifier with TF-IDF vectorization in Python. Adjustments can be made to use
different datasets, vectorization techniques, or machine learning models based on specific
requirements.
ALGORITHM:
1. Library Installation and Import: Install required libraries (scikit-learn and nltk).
Import necessary modules from these libraries.
2. Download NLTK Resources: Download the movie_reviews dataset from NLTK.
3. Load and Prepare Dataset: Load the movie_reviews dataset.
Convert the dataset into a suitable format (list of words and corresponding sentiments)
and create a DataFrame.
4. Split Data into Train and Test Sets: Split the dataset into training and testing sets (e.g., 80% for training and 20% for testing).
PROGRAM:
# Install necessary libraries
!pip install scikit-learn
!pip install nltk
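The body of the program between installation and evaluation (data loading, splitting, TF-IDF vectorization, and SVM training) is sketched below, following the algorithm above. The movie_reviews loading lines mirror the full listing later in this manual; the DataFrame construction, the 80/20 split, the max_features cap, and the linear-kernel SVC are illustrative choices.
# Import required libraries
import pandas as pd
import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Download the movie_reviews dataset (run only once)
nltk.download('movie_reviews')

# Load the dataset as (text, sentiment) pairs and build a DataFrame
documents = [(" ".join(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
df = pd.DataFrame(documents, columns=['text', 'sentiment'])

# Split the data into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['sentiment'], test_size=0.2, random_state=42)

# Convert the text into TF-IDF vectors
vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train an SVM classifier and predict sentiment labels for the test set
classifier = SVC(kernel='linear')
classifier.fit(X_train_tfidf, y_train)
y_pred = classifier.predict(X_test_tfidf)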
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(classification_report(y_test, y_pred))
PROCEDURE:
Library Installation and Import: Ensures the NLTK library is available for use and imports
the necessary modules for text processing.
Download NLTK Resources: Downloads essential resources (punkt for tokenization,
averaged_perceptron_tagger for POS tagging) required by NLTK.
Sample Text: Defines a piece of text to demonstrate POS tagging.
Tokenization: Divides the text into individual words or tokens, making it suitable for further
analysis.
POS Tagging: Assigns each word in the text its respective grammatical category or POS tag
using NLTK's POS tagging functionality.
Display POS Tags: Prints or displays the words along with their associated POS tags obtained
from the tagging process.
The following algorithm outlines the steps involved in performing Parts of Speech (POS)
tagging using NLTK in Python. It demonstrates how to tokenize a text and assign
grammatical categories to individual words, providing insight into the linguistic structure of
the text.
ALGORITHM:
1. Library Installation and Import: Install NLTK library if not already installed.
Import the necessary NLTK library for text processing and POS tagging.
2. Download NLTK Resources: Download NLTK resources required for tokenization
and POS tagging (punkt for tokenization, averaged_perceptron_tagger for POS
tagging).
3. Sample Text: Define a sample text for POS tagging.
4. Tokenization: Break down the provided text into individual words (tokens) using NLTK's word_tokenize() function.
PROGRAM:
# Install NLTK (if not already installed)
!pip install nltk
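The body of the program (imports, resource downloads, tokenization, tagging, and display) follows; it matches the full listing in the notebook printout later in this manual.
# Import necessary libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text for POS tagging
text = "Parts of speech tagging helps to understand the function of each word in a sentence."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
# Display the POS tags
print("POS tags:", pos_tags)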
OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
LAB 7: CHUNKING
PROCEDURE:
In Natural Language Processing (NLP), chunking is the process of extracting short, meaningful
phrases (chunks) from a sentence based on specific patterns of parts of speech (POS). Python
provides tools like NLTK (Natural Language Toolkit) to perform chunking. This example
demonstrates a basic noun phrase (NP) and verb phrase (VP) chunking using NLTK. You can
adjust the chunk grammar patterns to capture different types of phrases or entities based on
your specific needs.
The chunk_grammar variable contains patterns defined using regular expressions for
identifying noun phrases and verb phrases. Adjusting these patterns can help extract different
types of chunks like prepositional phrases, named entities, etc.
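For example, a grammar that captures both noun phrases and simple verb phrases might be written as follows; the exact tag patterns are illustrative and can be tuned to the phrases of interest.
# Illustrative chunk grammar with NP and VP rules
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN.*>+}     # optional determiner, adjectives, then one or more nouns
VP: {<MD>?<VB.*>+<RB.*>*}   # optional modal, one or more verbs, optional adverbs
"""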
Tokenization: Breaking the sentence into individual tokens or words.
POS Tagging: Assigning part-of-speech tags to each token (identifying whether it's a noun,
verb, adjective, etc.).
Chunking: Grouping tokens into larger structures (noun phrases, verb phrases) based on
defined grammar rules.
Chunk Grammar: Regular expressions defining patterns for identifying specific chunk
structures (like noun phrases).
Chunk Parser: Utilizing the chunk grammar to parse and extract chunks based on the
provided POS-tagged tokens.
The following algorithm outlines the steps involved in the noun phrase chunking process
using NLTK in Python, highlighting the key processes and the role of chunk grammar in
identifying and extracting specific syntactic structures from text data.
ALGORITHM:
1. Import Necessary Libraries: Import required modules from NLTK for tokenization,
POS tagging, and chunking.
2. Download NLTK Resources (if needed): Ensure NLTK resources like tokenizers and
POS taggers are downloaded (nltk.download('punkt'),
nltk.download('averaged_perceptron_tagger')).
3. Define a Sample Sentence: Set a sample sentence that will be used for chunking.
4. Tokenization: Break the sentence into individual words or tokens using NLTK's
word_tokenize() function.
5. Part-of-Speech (POS) Tagging: Tag each token with its corresponding part-of-speech
using NLTK's pos_tag() function.
6. Chunk Grammar Definition: Define a chunk grammar using regular expressions (for example, NP: {<DT>?<JJ>*<NN>}) that describes the POS-tag patterns of the chunks to extract.
7. Chunk Parser Creation: Create a chunk parser using RegexpParser() and provide the defined chunk grammar.
8. Chunking: Parse the tagged sentence using the created chunk parser to extract chunks
based on the defined grammar.
9. Display Chunks: Iterate through the parsed chunks and print the subtrees labeled as
'NP', which represent the identified noun phrases.
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# POS tagging
tagged = pos_tag(tokens)
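The remainder of the program (defining the chunk grammar, building the parser, and displaying the noun phrases) is given below; it matches the full listing later in this manual and assumes RegexpParser, word_tokenize, and pos_tag have been imported from NLTK and the required resources downloaded.
# Define a chunk grammar using regular expressions
# NP (noun phrase): optional determiner (DT), adjectives (JJ), and a noun (NN)
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""
# Create a chunk parser with the defined grammar
chunk_parser = RegexpParser(chunk_grammar)
# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)
# Display the noun-phrase chunks
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print(subtree)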
OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
(NP the/DT lazy/JJ dog/NN)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
Dataset:
The dataset consists of a collection of news articles in text format. Each article is labeled with
its category (e.g., politics, sports, entertainment) and contains textual content for analysis.
Approach:
1. Preprocess the dataset by tokenizing the text into words and sentences.
2. Perform parts of speech tagging using a pre-trained model or a custom-trained model.
3. Extract relevant parts of speech such as nouns, verbs, adjectives, and adverbs from the
tagged text.
4. Analyze the distribution of different parts of speech across the articles to understand
their linguistic characteristics.
5. Integrate the extracted information into the recommendation system to improve the
relevance of recommended articles for users.
Program:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    sentences = sent_tokenize(text)
    tagged_tokens = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens.extend(nltk.pos_tag(tokens))
    return tagged_tokens

def main():
    # Example news article
    article_text = """
    Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
    Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
    Chelsea's only goal came from Pulisic in the first half.
    The victory boosts United's chances in the Premier League title race.
    """
    tagged_tokens = pos_tagging(article_text)
    print("Original Article Text:\n", article_text)
    print("\nParts of Speech Tagging:")
    for token, pos_tag in tagged_tokens:
        print(f"{token}: {pos_tag}")

if __name__ == "__main__":
    main()
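To illustrate step 4 of the approach (analyzing the distribution of parts of speech), a small extension, not part of the original listing, could tally tag frequencies with collections.Counter:
from collections import Counter

def pos_distribution(tagged_tokens):
    # Count how often each POS tag occurs in the tagged article
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(tag_counts.values())
    for tag, count in tag_counts.most_common():
        print(f"{tag}: {count} ({count / total:.1%})")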
Output:
Original Article Text:
AIM: The aim of this case study is to demonstrate the extraction of noun phrases from a
given text using chunking, a technique in Natural Language Processing (NLP). We will
utilize Python's NLTK library to implement chunking and extract meaningful noun phrases
from the text.
Problem Statement:
Given a sample text, our goal is to identify and extract noun phrases, which are
sequences of words containing a noun and optionally other words like adjectives or
determiners. The problem involves implementing a program that tokenizes the text, performs
part-of-speech tagging, applies chunking to identify noun phrases, and finally outputs the
extracted noun phrases.
Objectives:
1. Tokenize the input text into words.
2. Perform part-of-speech tagging to assign grammatical tags to each word.
3. Define a chunk grammar to identify noun phrases.
4. Apply chunking to extract noun phrases from the text.
5. Display the extracted noun phrases.
Dataset:
For this case study, we will use a sample text: "The quick brown fox jumps over the lazy
dog."
Approach:
The approach involves several steps to extract noun phrases from the given text using
chunking in Natural Language Processing (NLP). Firstly, the input text is tokenized into
individual words to prepare it for further processing. Following tokenization, each word is
tagged with its part-of-speech using NLTK's pos_tag function, which assigns grammatical
tags to each word based on its context. Next, a chunk grammar is defined to specify the
patterns that identify noun phrases. This grammar is then utilized to apply chunking, which
groups consecutive words that match the defined patterns into noun phrases. Finally, the
extracted noun phrases are outputted, providing meaningful insights into the structure and
content of the text. This approach allows for the identification and extraction of important
linguistic units, facilitating various NLP tasks such as information extraction, text
summarization, and sentiment analysis.
Program:
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
# Define a noun-phrase chunk grammar and create the chunk parser
chunk_parser = RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
# Apply chunking
chunked_text = chunk_parser.parse(pos_tags)
# Collect the words of each NP subtree as a phrase
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in chunked_text.subtrees() if subtree.label() == "NP"]
# Output
print("Original Text:", text)
print("Noun Phrases:")
for phrase in noun_phrases:
    print("-", phrase)
Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Original Text: The quick brown fox jumps over the lazy dog.
Noun Phrases:
- The quick brown
- fox
- the lazy dog
Result:
Chunking is a valuable technique in NLP for identifying and extracting meaningful
phrases from text. In this case study, we successfully implemented chunking using Python's
NLTK library to extract noun phrases from a given text. By identifying and extracting noun
phrases, we gained insights into the structure and semantics of the text, which can be
beneficial for various NLP applications such as information extraction, sentiment analysis,
and text summarization.
Lab 2 – Case Study
Objectives:
The objectives of the provided program are to implement a simple email autocomplete system using the
GPT-2 language model. The program aims to facilitate user interaction by suggesting autocompletions
based on the context provided and the user's input. Key objectives include initializing and integrating
the GPT-2 model and tokenizer from the Hugging Face Transformers library, defining a class structure
(EmailAutocompleteSystem) to encapsulate the autocomplete system, and creating a method
(generate_suggestions) to generate context-aware suggestions. The program encourages user
engagement by incorporating a user input loop, allowing continuous interaction until the user chooses
to exit. The ultimate goal is to demonstrate the practical use of a pre-trained language model for
generating relevant suggestions in the context of email composition, showcasing the capabilities of the
GPT-2 model for natural language processing tasks.
Approach:
1. Data Collection:
Collect a diverse dataset of emails, including different writing styles, topics, and
formality levels.
Annotate the dataset with proper context information, such as sender, recipient,
subject, and the body of the email.
2. Data Preprocessing:
Clean and tokenize the text data.
Handle issues like punctuation, capitalization, and special characters.
Split the dataset into training and testing sets.
3. Model Selection:
Choose a suitable NLP model for word generation. Options may include recurrent
neural networks (RNNs), long short-term memory networks (LSTMs), or
transformer models like GPT-3.
Fine-tune or train the model on the email dataset to understand the specific
language patterns used in emails.
4. Context Integration:
Design a mechanism to incorporate contextual information from the email, such
as the subject, previous sentences, and the relationship between the sender and
recipient.
Implement a way for the model to understand the context shift within the email
body.
5. User Interface:
Develop a user-friendly interface that integrates with popular email clients or
standalone applications.
Allow users to enable or disable the autocomplete feature as needed.
Provide visual cues to indicate suggested words or phrases.
6. Model Evaluation:
Evaluate the model's performance on the test dataset using metrics like perplexity,
accuracy, and precision (a perplexity sketch is given after this list).
Gather user feedback on the effectiveness and usability of the autocomplete
system.
7. Fine-Tuning and Iteration:
Analyze user feedback and performance metrics to identify areas for
improvement.
Consider refining the model based on user suggestions and addressing any
limitations.
8. Deployment:
Deploy the trained model as a service that can be accessed by the email
application.
Ensure scalability and reliability of the autocomplete system.
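As a sketch of the perplexity metric mentioned in the evaluation step, the following snippet estimates the perplexity of the base (not fine-tuned) GPT-2 model on a single held-out email; the sample email text is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical held-out email text
email_text = "Hi team, please find the updated project proposal attached."
input_ids = tokenizer.encode(email_text, return_tensors="pt")
with torch.no_grad():
    # Passing labels makes the model return the average cross-entropy loss
    loss = model(input_ids, labels=input_ids).loss
perplexity = torch.exp(loss)
print(f"Perplexity: {perplexity.item():.2f}")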
Potential Challenges:
Context Understanding: Ensuring the model effectively understands and incorporates
the context of the email.
Ambiguity Handling: Dealing with ambiguous phrases and understanding the user's
intended meaning.
Personalization: Tailoring the system to individual writing styles and preferences.
Success Criteria:
Improved email composition efficiency and speed.
Positive user feedback on the accuracy and relevance of autocomplete suggestions.
Reduction in typing errors and improved overall user experience.
By successfully developing and implementing this word generation program, the company aims
to enhance the productivity and user experience of individuals engaged in email communication.
Program:
!pip install transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class EmailAutocompleteSystem:
    def __init__(self):
        # Load the pre-trained GPT-2 model and tokenizer
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context):
        # Combine the email context with the user's partial sentence
        input_text = f"{context} {user_input}"
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,
                                         no_repeat_ngram_size=2)
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        # Return only the words generated beyond the user's input
        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

# Example usage
if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()
    email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"
    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")
        if user_input.lower() == 'exit':
            break
        suggestions = autocomplete_system.generate_suggestions(user_input, email_context)
        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")
Output:
Enter your sentence (type 'exit' to end): hello, how are you ? How's
everything going on !
The attention mask and the pad token id were not set. As a consequence, you
may observe unexpected behavior. Please pass your input's `attention_mask` to
obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Autocomplete Suggestions: ["How's", 'everything', 'going', 'on!', "I'm", 'a',
'programmer', 'and', "I've", 'been', 'working', 'on', 'a', 'project', 'for',
'a', 'while', 'now.', 'I', 'have', 'a', 'lot', 'of', 'ideas', 'for', 'the']
Enter your sentence (type 'exit' to end): exit
Result:
The result demonstrates the integration of a powerful language model for enhancing user experience in
composing emails through intelligent autocomplete suggestions.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB
PROCEDURE:
This algorithm outlines the steps involved in the text classification task using
LinearSVC on the 20 Newsgroups dataset. It provides a structured approach to implementing
the program and understanding the workflow.
ALGORITHM:
Algorithm: Text Classification using LinearSVC
Step 1: Load the 20 Newsgroups dataset (training and testing subsets) for the selected categories.
Step 2: Convert the text documents into TF-IDF feature vectors.
Step 3: Train a LinearSVC classifier on the training vectors.
Step 4: Predict category labels for the test documents.
Step 5: Evaluate the predictions using accuracy and a classification report.
End Algorithm
PROGRAM:
# Install scikit-learn if not already installed
!pip install scikit-learn
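The body of the program follows; it matches the full listing in the notebook printout later in this manual and produces the output shown below.
# Import required libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Load the 20 Newsgroups dataset (train and test splits) for selected categories
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
X_train, y_train = newsgroups_train.data, newsgroups_train.target
X_test, y_test = newsgroups_test.data, newsgroups_test.target

# Build a pipeline of TF-IDF vectorization followed by a LinearSVC classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())

# Train the model and predict labels for the test set
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions))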
OUTPUT:
Requirement already satisfied: scikit-learn in
/usr/local/lib/python3.10/dist-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.2.0)
Accuracy: 0.9504823151125402
Classification Report:
precision recall f1-score support
Problem Statement:
A customer support company receives a large volume of incoming emails from customers
with various inquiries, complaints, and feedback. Manually categorizing and prioritizing
these emails is time-consuming and inefficient. The company wants to develop a text
classification system to automatically classify incoming emails into predefined categories,
allowing for faster response times and better customer service.
Objectives:
The text classification system successfully categorizes incoming customer emails into
predefined categories.
It improves the efficiency of the customer support team by automating email
classification and prioritization.
The company can respond to customer inquiries and issues more promptly, leading to
higher customer satisfaction and retention.
Dataset:
The company has a dataset of past customer emails along with their corresponding
categories. Each email is labeled with one or more categories, indicating the type of inquiry
or issue raised by the customer. For demonstration purposes, we will use the
fetch_20newsgroups dataset from scikit-learn, which contains a collection of newsgroup
documents, spanning 20 different newsgroups. We'll simulate this dataset as if it were
customer support emails categorized into predefined categories.
Approach:
Data Preparation:
Load the 20 Newsgroups dataset as a proxy for customer support emails.
Select a subset of categories that represent different types of customer inquiries,
complaints, and feedback.
Prepare the data and target labels from the dataset.
Data Preprocessing:
Clean the email text data by removing unnecessary information such as email headers,
signatures, and HTML tags.
Tokenize the text and convert it to lowercase.
Remove stopwords and apply techniques like stemming or lemmatization to reduce
words to their base forms.
Feature Extraction:
Use TF-IDF Vectorizer to convert text data into numerical features, limiting the maximum
number of features to 10,000 and removing English stopwords.
Model Selection:
Choose a suitable classification algorithm such as Linear Support Vector Classifier
(LinearSVC) for text classification.
Train the chosen model on the training data.
Model Evaluation:
Predict labels for the test set using the trained model.
Evaluate the classifier's performance using accuracy and a classification report, which
includes precision, recall, and F1-score for each category.
Future Enhancements:
Continuous monitoring and updating of the model to adapt to evolving customer
inquiries and language patterns.
Integration of sentiment analysis to assess the sentiment of customer emails and
prioritize urgent or critical issues.
Expansion of the model to handle multiclass classification and a wider range of
customer inquiry categories.
Program:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcycles', 'sci.electronics'])
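The program continues with data preparation, the train/test split, TF-IDF feature extraction, LinearSVC training, and evaluation, as in the full listing later in this manual:
# Prepare data and target labels
X = newsgroups.data
y = newsgroups.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the TF-IDF vectorizer and transform the text
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Predict labels for the test set and evaluate
predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))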
Classification Report:
precision recall f1-score support
Result:
This case study outlines the problem statement, dataset, approach, expected outcome,
and future enhancements for developing a text classification system for customer support
email classification. It demonstrates the application of machine learning techniques to
automate and improve customer service processes.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB
PROCEDURE:
Semantic analysis is a broad area in NLP. This program demonstrates semantic analysis by
leveraging pre-trained word vectors using Word2Vec from Gensim. It utilizes word
embeddings to find words similar to each word in the provided sentences.
Library Installation: Ensure the necessary libraries (Gensim and NLTK) are installed.
Library Import: Import the required libraries (gensim for word vectors and nltk for
tokenization).
Pre-trained Word Vectors: Load pre-trained word vectors (Word2Vec) using Gensim's
api.load() method.
Sample Sentences: Define sample sentences for semantic analysis.
Tokenization: Break down the sentences into individual words using NLTK's word_tokenize()
method.
Semantic Analysis: Iterate through each word in the tokenized sentences and:
Check if the word exists in the pre-trained Word2Vec model.
If the word exists, find similar words using the most_similar() method from the
word vectors model.
Display or store the similar words for each word in the sentence.
If the word doesn't exist in the pre-trained model, indicate that it's not present.
The following algorithm outlines the steps involved in performing semantic analysis using pre-
trained word vectors (Word2Vec) in Python, demonstrating how to find similar words for each
word in the provided sentences based on the loaded word vectors.
PROGRAM:
# Install necessary libraries
!pip install gensim
!pip install nltk
# Import required libraries
import gensim.downloader as api
from nltk.tokenize import word_tokenize
# Sample sentences
sentences = [
    "Natural language processing is a challenging but fascinating field.",
    "Word embeddings capture semantic meanings of words in a vector space."
]
OUTPUT:
Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)
Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages
(from gensim) (1.23.5)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from
gensim) (1.11.3)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages
(from gensim) (6.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
[==================================================] 100.0%
1662.8/1662.8MB downloaded
[similar-word listings for each token truncated in this printout]
import nltk
from nltk.corpus import wordnet, stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Function to perform semantic analysis on a customer query
def semantic_analysis(text):
    # Tokenize the query
    tokens = word_tokenize(text)
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Synonyms generation using WordNet
    synonyms = set()
    for token in lemmatized_tokens:
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())
    return list(synonyms)
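The function is then applied to a set of sample customer queries, mirroring the full listing later in this manual and producing output like that shown below.
# Example customer queries
customer_queries = [
    "I received a damaged product. Can I get a refund?",
    "I'm having trouble accessing my account.",
    "How can I track my order status?",
    "The item I received doesn't match the description.",
    "Is there a discount available for bulk orders?"
]
# Semantic analysis for each query
for query in customer_queries:
    print("Customer Query:", query)
    print("Semantic Analysis (Synonyms):", semantic_analysis(query))
    print("\n")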
Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Customer Query: I received a damaged product. Can I get a refund?
Semantic Analysis (Synonyms): ['refund', 'grow', 'baffle', 'pay_off', 'Cartesian_product',
'arrive', 'engender', 'standard', 'have', 'damaged', 'experience', 'develop', 'sustain', 'product',
'acquire', 'encounter', 'take_in', 'find', 'stupefy', 'bugger_off', 'draw', 'pose', 'aim', 'nonplus',
'induce', 'mother', 'stimulate', 'make', 'repayment', 'convey', 'cause', 'mathematical_product',
'get', 'damage', 'produce', 'set_out', 'merchandise', 'buzz_off', 'beat', 'meet', 'start', 'commence',
'return', 'pick_up', 'production', 'fix', 'stick', "get_under_one's_skin", 'go', 'mystify', 'take',
'perplex', 'welcome', 'vex', 'begin', 'come', 'fuck_off', 'bring', 'contract', 'capture', 'generate',
'give_back', 'incur', 'repay', 'let', 'become', 'start_out', 'gravel', 'scram', 'obtain', 'pay_back',
'amaze', 'catch', 'beget', 'get_down', 'set_about', 'invite', 'bring_forth', 'drive', 'sire', 'intersection',
'discredited', 'suffer', 'received', 'ware', 'dumbfound', 'fetch', 'father', 'arrest', 'flummox', 'puzzle',
'bewilder', 'receive']
Result:
By following this approach, the program aims to achieve the objectives of performing
semantic analysis on customer queries and improving customer service operations by providing
valuable insights into the semantics of the queries.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB
PROBLEM STATEMENT:
A company receives a large volume of customer feedback across various channels such as
emails, social media, and surveys. Understanding and categorizing this feedback manually is
time-consuming and inefficient. The goal is to develop an NLP-based program to automatically
process and analyze customer feedback to extract valuable insights.
OBJECTIVE:
Utilize spaCy and NLP techniques to process customer feedback text, extract tokens, perform
lemmatization, and conduct dependency parsing to uncover underlying relationships between
words.
APPROACH:
Data Collection:
Gather a dataset containing customer feedback from different sources, including emails, social
media comments, and survey responses.
Insight Generation:
Categorize the feedback based on sentiment, identify frequently occurring topics, or extract
key phrases related to specific issues or praises mentioned by customers.
Implementation:
Use Python and spaCy to develop the program for text processing and analysis.
Incorporate visualization techniques (e.g., graphs, word clouds) to represent the findings and
insights derived from the processed feedback.
Evaluation:
Evaluate the accuracy and efficiency of tokenization, lemmatization, and dependency parsing
in handling different types of customer feedback.
Measure the program's ability to extract meaningful insights and categorize feedback
accurately.
PROGRAM:
import spacy
# Load the small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
    "The product is amazing! I love the quality.",
    "The customer service was terrible, very disappointed.",
    "Great experience overall, highly recommended.",
    "The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)
        print("Tokens:", [token.text for token in doc])
        print("Lemmas:", [token.lemma_ for token in doc])
        # Dependency parsing
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])
if __name__ == "__main__":
    analyze_feedback(customer_feedback)
OUTPUT:
Analyzing Feedback 1: 'The product is amazing! I love the quality.'
Tokens: ['The', 'product', 'is', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']
Lemmas: ['the', 'product', 'be', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']
Dependency Parsing:
The det product NOUN []
Dependency Parsing:
The det service NOUN []
customer compound service NOUN []
service nsubj was AUX [The, customer]
was ROOT was AUX [service, disappointed, .]
terrible amod disappointed ADJ []
, punct disappointed ADJ []
very advmod disappointed ADJ []
disappointed acomp was AUX [terrible, ,, very]
. punct was AUX []
Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []
CONCLUSION:
The developed NLP-based program utilizing spaCy proves to be an efficient solution for
processing and analyzing customer feedback. Its capability to extract tokens, perform
lemmatization, and conduct dependency parsing aids in understanding the sentiment,
identifying key topics, and establishing relationships within the feedback data. This enables
companies to derive actionable insights, prioritize issues, and enhance customer satisfaction
based on the analysis of their feedback.
RESULT:
This case study demonstrates the practical application of the provided code snippet using spaCy in a business context, specifically for customer feedback analysis, showcasing how NLP techniques can be employed to extract valuable insights from unstructured text data.
#1
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample text for analysis
text = "Natural Language Processing is a fascinating field of study."
# Process the text with spaCy
doc = nlp(text)
# Extracting tokens and lemmatization
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']
Lemmas: ['Natural', 'Language', 'Processing', 'be', 'a', 'fascinating', 'field', 'of', 'study', '.']
Dependency Parsing:
Natural compound Language PROPN []
Language compound Processing PROPN [Natural]
Processing nsubj is AUX [Language]
is ROOT is AUX [Processing, field, .]
a det field NOUN []
fascinating amod field NOUN []
field attr is AUX [a, fascinating, of]
of prep field NOUN [study]
study pobj of ADP []
. punct is AUX []
#1 case study
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
"The product is amazing! I love the quality.",
"The customer service was terrible, very disappointed.",
"Great experience overall, highly recommended.",
"The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)
        tokens = [token.text for token in doc]
        lemmas = [token.lemma_ for token in doc]
        print("Tokens:", tokens)
        print("Lemmas:", lemmas)
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])

if __name__ == "__main__":
    analyze_feedback(customer_feedback)
Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []
#2
import nltk
import random
nltk.download('punkt')
nltk.download('gutenberg')
words = nltk.corpus.gutenberg.words()
bigrams = list(nltk.bigrams(words))
starting_word = "the"
generated_text = [starting_word]
for _ in range(20):
    # Candidate next words: all words that follow the current word in the corpus bigrams
    possible_words = [b for (a, b) in bigrams if a == generated_text[-1]]
    next_word = random.choice(possible_words)
    generated_text.append(next_word)
print(' '.join(generated_text))
#2 Case study
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class EmailAutocompleteSystem:
    def __init__(self):
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context):
        input_text = f"{context} {user_input}"
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,
                                         no_repeat_ngram_size=2)
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()
    email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"
    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")
        if user_input.lower() == 'exit':
            break
        suggestions = autocomplete_system.generate_suggestions(user_input, email_context)
        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarni
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public mo
warnings.warn(
tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 636B/s]
#3
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# Split the data into training and testing sets
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier
model = make_pipeline(
TfidfVectorizer(),
LinearSVC()
)
# Train the model
model.fit(X_train, y_train)
# Predict labels for the test set
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))
Accuracy: 0.9504823151125402
Classification Report:
precision recall f1-score support
#3 Case study
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware',
    'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcycles', 'sci.electronics'])
# Prepare data and target labels
X = newsgroups.data
y = newsgroups.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)
# Predict labels for the test set
predictions = classifier.predict(X_test)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))
Accuracy: 0.9389623601220752
Classification Report:
precision recall f1-score support
rec.motorcycles 0.96 0.99 0.97 205
sci.electronics 0.92 0.93 0.92 189
#4
# Install necessary libraries
!pip install gensim
!pip install nltk
# Import required libraries
import gensim.downloader as api
from nltk.tokenize import word_tokenize
# Download pre-trained word vectors (Word2Vec)
word_vectors = api.load("word2vec-google-news-300")
# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Perform semantic analysis using pre-trained word vectors
for tokenized_sentence in tokenized_sentences:
    for word in tokenized_sentence:
        if word in word_vectors:
            similar_words = word_vectors.most_similar(word)
            print(f"Words similar to '{word}': {similar_words}")
        else:
            print(f"'{word}' is not in the pre-trained Word2Vec model.")
#4 case study
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Initialize NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Function to perform semantic analysis
def semantic_analysis(text):
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    synonyms = set()
    for token in lemmatized_tokens:
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())
    return list(synonyms)
# Example customer queries
customer_queries = [
"I received a damaged product. Can I get a refund?",
"I'm having trouble accessing my account.",
"How can I track my order status?",
"The item I received doesn't match the description.",
"Is there a discount available for bulk orders?"
]
# Semantic analysis for each query
for query in customer_queries:
    print("Customer Query:", query)
    synonyms = semantic_analysis(query)
    print("Semantic Analysis (Synonyms):", synonyms)
    print("\n")
#5
# Install necessary libraries
!pip install scikit-learn
!pip install nltk
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import movie_reviews # Sample dataset from NLTK
# Download NLTK resources (run only once if not downloaded)
import nltk
nltk.download('movie_reviews')
# Load the movie_reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
#5 case study
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Download NLTK resources (only required once)
nltk.download('vader_lexicon')
# Sample reviews
reviews = [
"This product is amazing! I love it.",
"The product was good, but the packaging was damaged.",
"Very disappointing experience. Would not recommend.",
"Neutral feedback on the product.",
]
# Initialize Sentiment Intensity Analyzer
sid = SentimentIntensityAnalyzer()
# Analyze sentiment for each review
for review in reviews:
    print("Review:", review)
    scores = sid.polarity_scores(review)
    print("Sentiment:", end=' ')
    if scores['compound'] > 0.05:
        print("Positive")
    elif scores['compound'] < -0.05:
        print("Negative")
    else:
        print("Neutral")
    print()
#6
# Install NLTK (if not already installed)
!pip install nltk
# Import necessary libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text for POS tagging
text = "Parts of speech tagging helps to understand the function of each word in a sentence."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
# Display the POS tags
print("POS tags:", pos_tags)
#6 Case study
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def pos_tagging(text):
    sentences = sent_tokenize(text)
    tagged_tokens = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens.extend(nltk.pos_tag(tokens))
    return tagged_tokens

def main():
    article_text = """Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.
"""
    tagged_tokens = pos_tagging(article_text)
    print("Original Article Text:\n", article_text)
    print("\nParts of Speech Tagging:")
    for token, pos_tag in tagged_tokens:
        print(f"{token}: {pos_tag}")

if __name__ == "__main__":
    main()
#7
!pip install nltk
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# Download NLTK resources (run only once if not downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# POS tagging
tagged = pos_tag(tokens)
# Define a chunk grammar using regular expressions
# NP (noun phrase) chunking: "NP: {<DT>?<JJ>*<NN>}"
# This grammar captures optional determiner (DT), adjectives (JJ), and nouns (NN) as a noun phrase
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""
# Create a chunk parser with the defined grammar
chunk_parser = RegexpParser(chunk_grammar)
# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)
# Display the chunks
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print(subtree)
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 9/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#1
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample text for analysis
text = "Natural Language Processing is a fascinating field of study."
# Process the text with spaCy
doc = nlp(text)
# Extracting tokens and lemmatization
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])
Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']
Lemmas: ['Natural', 'Language', 'Processing', 'be', 'a', 'fascinating', 'field', 'of', 'study', '.']
Dependency Parsing:
Natural compound Language PROPN []
Language compound Processing PROPN [Natural]
Processing nsubj is AUX [Language]
is ROOT is AUX [Processing, field, .]
a det field NOUN []
fascinating amod field NOUN []
field attr is AUX [a, fascinating, of]
of prep field NOUN [study]
study pobj of ADP []
. punct is AUX []
#1 case study
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
"The product is amazing! I love the quality.",
"The customer service was terrible, very disappointed.",
"Great experience overall, highly recommended.",
"The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
for idx, text in enumerate(feedback, start=1):
print(f"\nAnalyzing Feedback {idx}: '{text}'")
doc = nlp(text)
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
print("\nDependency Parsing:")
for token in doc:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])
if __name__ == "__main__":
analyze_feedback(customer_feedback)
Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 1/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#2
import nltk
import random
nltk.download('punkt')
nltk.download('gutenberg')
words = nltk.corpus.gutenberg.words()
bigrams = list(nltk.bigrams(words))
starting_word = "the"
generated_text = [starting_word]
for _ in range(20):
next_word = random.choice(possible_words)
generated_text.append(next_word)
print(' '.join(generated_text))
#2 Case study
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class EmailAutocompleteSystem:
def __init__(self):
self.model_name = "gpt2"
self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
self.model = GPT2LMHeadModel.from_pretrained(self.model_name)
def generate_suggestions(self, user_input, context):
input_text = f"{context} {user_input}"
input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
with torch.no_grad():
output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,no_repeat_ngram_size=2)
generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
suggestions = generated_text.split()[len(user_input.split()):]
return suggestions
if __name__ == "__main__":
autocomplete_system = EmailAutocompleteSystem()
email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"
while True:
user_input = input("Enter your sentence (type 'exit' to end): ")
if user_input.lower() == 'exit':
break
suggestions = autocomplete_system.generate_suggestions(user_input, email_context)
if suggestions:
print("Autocomplete Suggestions:", suggestions)
else:
print("No suggestions available.")
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarni
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public mo
warnings.warn(
tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 636B/s]
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 2/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#3
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# Split the data into training and testing sets
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier
model = make_pipeline(
TfidfVectorizer(),
LinearSVC()
)
# Train the model
model.fit(X_train, y_train)
# Predict labels for the test set
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))
Accuracy: 0.9504823151125402
Classification Report:
precision recall f1-score support
#3 Case study
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcy
# Prepare data and target labels
X = newsgroups.data
y = newsgroups.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)
# Predict labels for the test set
predictions = classifier.predict(X_test)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))
Accuracy: 0.9389623601220752
Classification Report:
precision recall f1-score support
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 3/9
4/23/24, 2:20 PM nlp.ipynb - Colab
rec.motorcycles 0.96 0.99 0.97 205
sci.electronics 0.92 0.93 0.92 189
#4
# Install necessary libraries
!pip install gensim
!pip install nltk
# Import required libraries
import gensim.downloader as api
from nltk.tokenize import word_tokenize
# Download pre-trained word vectors (Word2Vec)
word_vectors = api.load("word2vec-google-news-300")
# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Perform semantic analysis using pre-trained word vectors
for tokenized_sentence in tokenized_sentences:
for word in tokenized_sentence:
if word in word_vectors:
similar_words = word_vectors.most_similar(word)
print(f"Words similar to '{word}': {similar_words}")
else:
print(f"'{word}' is not in the pre-trained Word2Vec model.")
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 4/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#4 case study
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Initialize NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Function to perform semantic analysis
def semantic_analysis(text):
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
synonyms = set()
for token in lemmatized_tokens:
for syn in wordnet.synsets(token):
for lemma in syn.lemmas():
synonyms.add(lemma.name())
return list(synonyms)
# Example customer queries
customer_queries = [
"I received a damaged product. Can I get a refund?",
"I'm having trouble accessing my account.",
"How can I track my order status?",
"The item I received doesn't match the description.",
"Is there a discount available for bulk orders?"
]
# Semantic analysis for each query
for query in customer_queries:
print("Customer Query:", query)
synonyms = semantic_analysis(query)
print("Semantic Analysis (Synonyms):", synonyms)
print("\n")
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 5/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#5
# Install necessary libraries
!pip install scikit-learn
!pip install nltk
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import movie_reviews # Sample dataset from NLTK
# Download NLTK resources (run only once if not downloaded)
import nltk
nltk.download('movie_reviews')
# Load the movie_reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 6/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#5 case study
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Download NLTK resources (only required once)
nltk.download('vader_lexicon')
# Sample reviews
reviews = [
"This product is amazing! I love it.",
"The product was good, but the packaging was damaged.",
"Very disappointing experience. Would not recommend.",
"Neutral feedback on the product.",
]
# Initialize Sentiment Intensity Analyzer
sid = SentimentIntensityAnalyzer()
# Analyze sentiment for each review
for review in reviews:
print("Review:", review)
scores = sid.polarity_scores(review)
print("Sentiment:", end=' ')
if scores['compound'] > 0.05:
print("Positive")
elif scores['compound'] < -0.05:
print("Negative")
else:
print("Neutral")
print()
#6
# Install NLTK (if not already installed)
!pip install nltk
# Import necessary libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text for POS tagging
text = "Parts of speech tagging helps to understand the function of each word in a sentence."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
# Display the POS tags
print("POS tags:", pos_tags)
https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 7/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#6 Case study
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def pos_tagging(text):
    sentences = sent_tokenize(text)
    tagged_tokens = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens.extend(nltk.pos_tag(tokens))
    return tagged_tokens

def main():
    article_text = """Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.
"""
    tagged_tokens = pos_tagging(article_text)
    print("Original Article Text:\n", article_text)
    print("\nParts of Speech Tagging:")
    for token, pos_tag in tagged_tokens:
        print(f"{token}: {pos_tag}")

if __name__ == "__main__":
    main()
#7
!pip install nltk
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# Download NLTK resources (run only once if not downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# POS tagging
tagged = pos_tag(tokens)
# Define a chunk grammar using regular expressions
# NP (noun phrase) chunking: "NP: {<DT>?<JJ>*<NN>}"
# This grammar captures optional determiner (DT), adjectives (JJ), and nouns (NN) as a noun phrase
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""
# Create a chunk parser with the defined grammar
chunk_parser = RegexpParser(chunk_grammar)
# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)
# Display the chunks
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print(subtree)
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB
PROBLEM STATEMENT:
A company receives a large volume of customer feedback across various channels such as
emails, social media, and surveys. Understanding and categorizing this feedback manually is
time-consuming and inefficient. The goal is to develop an NLP-based program to automatically
process and analyze customer feedback to extract valuable insights.
OBJECTIVE:
Utilize spaCy and NLP techniques to process customer feedback text, extract tokens, perform
lemmatization, and conduct dependency parsing to uncover underlying relationships between
words.
APPROACH:
Data Collection:
Gather a dataset containing customer feedback from different sources, including emails, social
media comments, and survey responses.
Insight Generation:
Categorize the feedback based on sentiment, identify frequently occurring topics, or extract
key phrases related to specific issues or praises mentioned by customers.
Implementation:
Use Python and spaCy to develop the program for text processing and analysis.
Incorporate visualization techniques (e.g., graphs, word clouds) to represent the findings and
insights derived from the processed feedback.
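As a concrete illustration of the insight-generation step, noun chunks can serve as candidate key phrases and be counted across the feedback. The sketch below is illustrative only; it assumes the same en_core_web_sm pipeline used in the program and a small customer_feedback list standing in for the collected dataset.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

# Stand-in for the collected feedback dataset
customer_feedback = [
    "The product is amazing! I love the quality.",
    "The delivery was late, very frustrating."
]

# Count lemmatized noun chunks as candidate key phrases / frequent topics
phrase_counts = Counter()
for text in customer_feedback:
    doc = nlp(text)
    for chunk in doc.noun_chunks:
        phrase_counts[chunk.lemma_.lower()] += 1

print(phrase_counts.most_common(5))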
Evaluation:
Evaluate the accuracy and efficiency of tokenization, lemmatization, and dependency parsing
in handling different types of customer feedback.
Measure the program's ability to extract meaningful insights and categorize feedback
accurately.
PROGRAM:
import spacy
# Load the English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
    "The product is amazing! I love the quality.",
    "The customer service was terrible, very disappointed.",
    "Great experience overall, highly recommended.",
    "The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)
        print("Tokens:", [token.text for token in doc])
        print("Lemmas:", [token.lemma_ for token in doc])
        # Dependency parsing
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])
if __name__ == "__main__":
    analyze_feedback(customer_feedback)
OUTPUT:
Analyzing Feedback 1: 'The product is amazing! I love the quality.'
Tokens: ['The', 'product', 'is', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']
Lemmas: ['the', 'product', 'be', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']
Dependency Parsing:
The det product NOUN []
Dependency Parsing:
The det service NOUN []
customer compound service NOUN []
service nsubj was AUX [The, customer]
was ROOT was AUX [service, disappointed, .]
terrible amod disappointed ADJ []
, punct disappointed ADJ []
very advmod disappointed ADJ []
disappointed acomp was AUX [terrible, ,, very]
. punct was AUX []
Dependency Parsing:
Dependency Parsing:
The det delivery NOUN [ ]
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ [ ]
, punct frustrating ADJ [ ]
very advmod frustrating ADJ [ ]
frustrating acomp was AUX [late, ,, very]
. punct was AUX [ ]
CONCLUSION:
The developed NLP-based program utilizing spaCy proves to be an efficient solution for
processing and analyzing customer feedback. Its capability to extract tokens, perform
lemmatization, and conduct dependency parsing aids in understanding the sentiment,
identifying key topics, and establishing relationships within the feedback data. This enables
companies to derive actionable insights, prioritize issues, and enhance customer satisfaction
based on the analysis of their feedback.
RESULT:
This case study demonstrates the practical application of the provided code snippet using
spaCy in a business context, specifically for customer feedback analysis, showcasing how
NLP techniques can be employed to extract valuable insights from unstructured text data.
AIM: The aim of this case study is to demonstrate the extraction of noun phrases from a
given text using chunking, a technique in Natural Language Processing (NLP). We will
utilize Python's NLTK library to implement chunking and extract meaningful noun phrases
from the text.
Problem Statement:
Given a sample text, our goal is to identify and extract noun phrases, which are
sequences of words containing a noun and optionally other words like adjectives or
determiners. The problem involves implementing a program that tokenizes the text, performs
part-of-speech tagging, applies chunking to identify noun phrases, and finally outputs the
extracted noun phrases.
Objectives :
1. Tokenize the input text into words.
2. Perform part-of-speech tagging to assign grammatical tags to each word.
3. Define a chunk grammar to identify noun phrases.
4. Apply chunking to extract noun phrases from the text.
5. Display the extracted noun phrases.
Dataset:
For this case study, we will use a sample text: "The quick brown fox jumps over the lazy
dog."
Approach:
The approach involves several steps to extract noun phrases from the given text using
chunking in Natural Language Processing (NLP). Firstly, the input text is tokenized into
individual words to prepare it for further processing. Following tokenization, each word is
tagged with its part-of-speech using NLTK's pos_tag function, which assigns grammatical
tags to each word based on its context. Next, a chunk grammar is defined to specify the
patterns that identify noun phrases. This grammar is then utilized to apply chunking, which
groups consecutive words that match the defined patterns into noun phrases. Finally, the
extracted noun phrases are outputted, providing meaningful insights into the structure and
content of the text. This approach allows for the identification and extraction of important
linguistic units, facilitating various NLP tasks such as information extraction, text
summarization, and sentiment analysis.
Program :
import nltk
# Download required resources (tokenizer and POS tagger)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "The quick brown fox jumps over the lazy dog."
# Tokenize and POS-tag the text
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
# Chunk grammar: optional determiner, adjectives, then a noun
chunk_parser = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
# Apply chunking
chunked_text = chunk_parser.parse(pos_tags)
# Collect the noun phrases from the parse tree
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in chunked_text.subtrees() if subtree.label() == 'NP']
# Output
print("Original Text:", text)
print("Noun Phrases:")
for phrase in noun_phrases:
    print("-", phrase)
Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Original Text: The quick brown fox jumps over the lazy dog.
Noun Phrases:
- The quick brown
- fox
- the lazy dog
Result:
Chunking is a valuable technique in NLP for identifying and extracting meaningful
phrases from text. In this case study, we successfully implemented chunking using Python's
NLTK library to extract noun phrases from a given text. By identifying and extracting noun
phrases, we gained insights into the structure and semantics of the text, which can be
beneficial for various NLP applications such as information extraction, sentiment analysis,
and text summarization.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB
AIM: To perform sentiment analysis program using an SVM classifier with TF-IDF
vectorization.
PROCEDURE:
Data Preparation: Downloading the dataset, converting it into a suitable format (words and
sentiments), and structuring it into a DataFrame.
Splitting Data: Dividing the dataset into training and testing sets to train the model on a
portion and evaluate it on another.
TF-IDF Vectorization: Converting text data into numerical vectors using TF-IDF (Term
Frequency-Inverse Document Frequency) representation.
SVM Initialization and Training: Setting up an SVM classifier and training it using the TF-
IDF vectors obtained from the training text data.
Prediction and Evaluation: Transforming test data into TF-IDF vectors, predicting sentiment
labels, and evaluating the model's performance by comparing predicted labels with actual
labels using accuracy and a classification report.
The following algorithm outlines the process of building a sentiment analysis model using an
SVM classifier with TF-IDF vectorization in Python. Adjustments can be made to use
different datasets, vectorization techniques, or machine learning models based on specific
requirements.
ALGORITHM:
1. Library Installation and Import: Install required libraries (scikit-learn and nltk).
Import necessary modules from these libraries.
2. Download NLTK Resources: Download the movie_reviews dataset from NLTK.
3. Load and Prepare Dataset: Load the movie_reviews dataset.
Convert the dataset into a suitable format (list of words and corresponding sentiments)
and create a DataFrame.
4. Split Data into Train and Test Sets: Split the dataset into training and testing sets (e.g.,
PROGRAM:
# Install necessary libraries
!pip install scikit-learn
!pip install nltk
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(classification_report(y_test, y_pred))
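Only the install commands and the evaluation lines survive in this printout. A minimal end-to-end sketch of the steps described in the procedure is given below; it joins each movie_reviews document into a single string before vectorization, and the 80/20 split and linear kernel are assumptions rather than values taken from the original program.
import nltk
import pandas as pd
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

nltk.download('movie_reviews')

# Build a DataFrame of review text and sentiment label
data = [(" ".join(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]
df = pd.DataFrame(data, columns=['text', 'sentiment'])

# Split into training and testing sets (an 80/20 split is assumed here)
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['sentiment'], test_size=0.2, random_state=42)

# Convert text to TF-IDF vectors
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train an SVM classifier (linear kernel assumed) and evaluate it
classifier = SVC(kernel='linear')
classifier.fit(X_train_tfidf, y_train)
y_pred = classifier.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print(classification_report(y_test, y_pred))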
PROCEDURE:
This algorithm outlines the steps involved in the text classification task using
LinearSVC on the 20 Newsgroups dataset. It provides a structured approach to implementing
the program and understanding the workflow.
ALGORITHM:
Algorithm: Text Classification using LinearSVC
1. Load the 20 Newsgroups dataset for the selected categories.
2. Split the data into training and testing sets.
3. Convert the text into TF-IDF feature vectors.
4. Train a LinearSVC classifier on the training vectors.
5. Predict labels for the test set.
6. Evaluate the predictions using accuracy and a classification report.
End Algorithm
PROGRAM:
# Install scikit-learn if not already installed
!pip install scikit-learn
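The listing above shows only the install step. The remaining pipeline that produces the output below appears in the notebook printout later in this document; in outline it is:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Load four categories of the 20 Newsgroups dataset (train and test subsets)
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# TF-IDF vectorizer and LinearSVC combined in a single pipeline
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(newsgroups_train.data, newsgroups_train.target)

# Predict on the held-out test subset and report the results
predictions = model.predict(newsgroups_test.data)
print("Accuracy:", accuracy_score(newsgroups_test.target, predictions))
print("\nClassification Report:")
print(classification_report(newsgroups_test.target, predictions))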
OUTPUT:
Requirement already satisfied: scikit-learn in
/usr/local/lib/python3.10/dist-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.2.0)
Accuracy: 0.9504823151125402
Classification Report:
precision recall f1-score support
Dataset:
The dataset consists of a collection of news articles in text format. Each article is labeled with
its category (e.g., politics, sports, entertainment) and contains textual content for analysis.
Approach:
1. Preprocess the dataset by tokenizing the text into words and sentences.
2. Perform parts of speech tagging using a pre-trained model or a custom-trained model.
3. Extract relevant parts of speech such as nouns, verbs, adjectives, and adverbs from the
tagged text.
4. Analyze the distribution of different parts of speech across the articles to understand
their linguistic characteristics.
5. Integrate the extracted information into the recommendation system to improve the
relevance of recommended articles for users.
Program :
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    sentences = sent_tokenize(text)
    tagged_tokens = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens.extend(nltk.pos_tag(tokens))
    return tagged_tokens

def main():
    # Example news article
    article_text = """
Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.
"""
    tagged_tokens = pos_tagging(article_text)
    print("Original Article Text:\n", article_text)
    print("\nParts of Speech Tagging:")
    for token, pos_tag in tagged_tokens:
        print(f"{token}: {pos_tag}")

if __name__ == "__main__":
    main()
Output:
Original Article Text:
LAB 7: CHUNKING
PROCEDURE:
In Natural Language Processing (NLP), chunking is the process of extracting short, meaningful
phrases (chunks) from a sentence based on specific patterns of parts of speech (POS). Python
provides tools like NLTK (Natural Language Toolkit) to perform chunking. This example
demonstrates a basic noun phrase (NP) and verb phrase (VP) chunking using NLTK. You can
adjust the chunk grammar patterns to capture different types of phrases or entities based on
your specific needs.
The chunk_grammar variable contains patterns defined using regular expressions for
identifying noun phrases and verb phrases. Adjusting these patterns can help extract different
types of chunks like prepositional phrases, named entities, etc.
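As an illustration of such adjustments (this grammar is an example, not part of the lab program), a multi-rule grammar can chunk prepositional and verb phrases alongside noun phrases:
import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Example multi-rule grammar; the rules are applied in sequence
extended_grammar = r"""
NP: {<DT>?<JJ>*<NN.*>+}   # noun phrase: optional determiner, adjectives, nouns
PP: {<IN><NP>}            # prepositional phrase: preposition followed by an NP
VP: {<VB.*><NP|PP>*}      # verb phrase: verb followed by NPs/PPs
"""
parser = nltk.RegexpParser(extended_grammar)

tagged = nltk.pos_tag(nltk.word_tokenize("The quick brown fox jumps over the lazy dog"))
print(parser.parse(tagged))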
Tokenization: Breaking the sentence into individual tokens or words.
POS Tagging: Assigning part-of-speech tags to each token (identifying whether it's a noun,
verb, adjective, etc.).
Chunking: Grouping tokens into larger structures (noun phrases, verb phrases) based on
defined grammar rules.
Chunk Grammar: Regular expressions defining patterns for identifying specific chunk
structures (like noun phrases).
Chunk Parser: Utilizing the chunk grammar to parse and extract chunks based on the
provided POS-tagged tokens.
The following algorithm outlines the steps involved in the noun phrase chunking process
using NLTK in Python, highlighting the key processes and the role of chunk grammar in
identifying and extracting specific syntactic structures from text data.
1. Import Necessary Libraries: Import required modules from NLTK for tokenization,
POS tagging, and chunking.
2. Download NLTK Resources (if needed): Ensure NLTK resources like tokenizers and
POS taggers are downloaded (nltk.download('punkt'),
nltk.download('averaged_perceptron_tagger')).
3. Define a Sample Sentence: Set a sample sentence that will be used for chunking.
4. Tokenization: Break the sentence into individual words or tokens using NLTK's
word_tokenize() function.
5. Part-of-Speech (POS) Tagging: Tag each token with its corresponding part-of-speech
using NLTK's pos_tag() function.
6. Chunk Grammar Definition: Define a chunk grammar using regular expressions (for example, NP: {<DT>?<JJ>*<NN>}) describing the phrase patterns to extract.
7. Chunk Parser Creation: Create a chunk parser using RegexpParser() and provide the
defined chunk grammar.
8. Chunking: Parse the tagged sentence using the created chunk parser to extract chunks
based on the defined grammar.
9. Display Chunks: Iterate through the parsed chunks and print the subtrees labeled as
'NP', which represent the identified noun phrases.
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# POS tagging
tagged = pos_tag(tokens)
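Only two fragments of the program survive in this printout; a self-contained version of the chunking steps described in the algorithm, matching the notebook code printed elsewhere in this document, is:
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)

# Chunk grammar: optional determiner (DT), adjectives (JJ), then a noun (NN)
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""
chunk_parser = RegexpParser(chunk_grammar)
chunks = chunk_parser.parse(tagged)

# Display only the noun-phrase subtrees
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print(subtree)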
OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
(NP the/DT lazy/JJ dog/NN)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
PROCEDURE:
Library Installation and Import: Ensures the NLTK library is available for use and imports
the necessary modules for text processing.
Download NLTK Resources: Downloads essential resources (punkt for tokenization,
averaged_perceptron_tagger for POS tagging) required by NLTK.
Sample Text: Defines a piece of text to demonstrate POS tagging.
Tokenization: Divides the text into individual words or tokens, making it suitable for further
analysis.
POS Tagging: Assigns each word in the text its respective grammatical category or POS tag
using NLTK's POS tagging functionality.
Display POS Tags: Prints or displays the words along with their associated POS tags obtained
from the tagging process.
The following algorithm outlines the steps involved in performing Parts of Speech (POS)
tagging using NLTK in Python. It demonstrates how to tokenize a text and assign
grammatical categories to individual words, providing insight into the linguistic structure of
the text.
ALGORITHM:
1. Library Installation and Import: Install NLTK library if not already installed.
Import the necessary NLTK library for text processing and POS tagging.
2. Download NLTK Resources: Download NLTK resources required for tokenization
and POS tagging (punkt for tokenization, averaged_perceptron_tagger for POS
tagging).
3. Sample Text: Define a sample text for POS tagging.
4. Tokenization: Break down the provided text into individual words (tokens) using
PROGRAM:
# Install NLTK (if not already installed)
!pip install nltk
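Only the install command survives in this printout; the tagging steps described in the procedure, matching the notebook version printed elsewhere in this document, amount to:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text for POS tagging
text = "Parts of speech tagging helps to understand the function of each word in a sentence."

# Tokenize the text, tag each token, and display the result
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
print("POS tags:", pos_tags)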
OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Problem Statement:
A customer support company receives a large volume of incoming emails from customers
with various inquiries, complaints, and feedback. Manually categorizing and prioritizing
these emails is time-consuming and inefficient. The company wants to develop a text
classification system to automatically classify incoming emails into predefined categories,
allowing for faster response times and better customer service.
Objectives :
The text classification system successfully categorizes incoming customer emails into
predefined categories.
It improves the efficiency of the customer support team by automating email
classification and prioritization.
The company can respond to customer inquiries and issues more promptly, leading to
higher customer satisfaction and retention.
Dataset:
The company has a dataset of past customer emails along with their corresponding
categories. Each email is labeled with one or more categories, indicating the type of inquiry
or issue raised by the customer. For demonstration purposes, we will use the
fetch_20newsgroups dataset from scikit-learn, which contains a collection of newsgroup
documents, spanning 20 different newsgroups. We'll simulate this dataset as if it were
customer support emails categorized into predefined categories.
Approach:
Data Preparation:
Load the 20 Newsgroups dataset as a proxy for customer support emails.
Select a subset of categories that represent different types of customer inquiries,
complaints, and feedback.
Prepare the data and target labels from the dataset.
Data Preprocessing:
Clean the email text data by removing unnecessary information such as email headers,
signatures, and HTML tags.
Tokenize the text and convert it to lowercase.
Remove stopwords and apply techniques like stemming or lemmatization to reduce
words to their base forms.
Feature Extraction:
Use TF-IDF Vectorizer to convert text data into numerical features, limiting the maximum
number of features to 10,000 and removing English stopwords.
Model Selection:
Choose a suitable classification algorithm such as Linear Support Vector Classifier
(LinearSVC) for text classification.
Train the chosen model on the training data.
Model Evaluation:
Predict labels for the test set using the trained model.
Evaluate the classifier's performance using accuracy and a classification report, which
includes precision, recall, and F1-score for each category.
Future Enhancements:
Continuous monitoring and updating of the model to adapt to evolving customer
inquiries and language patterns.
Integration of sentiment analysis to assess the sentiment of customer emails and
prioritize urgent or critical issues.
Expansion of the model to handle multiclass classification and a wider range of
customer inquiry categories.
Program :
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcycles', 'sci.electronics'])
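The listing breaks off after loading the data. A self-contained sketch of the remaining steps that produce the report below (train/test split, TF-IDF features capped at 10,000 terms, LinearSVC training, and evaluation), matching the notebook version printed later in this document, is:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# Load the same five categories as a proxy for customer support emails
categories = ['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
              'rec.autos', 'rec.motorcycles', 'sci.electronics']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)

# 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    newsgroups.data, newsgroups.target, test_size=0.2, random_state=42)

# TF-IDF features capped at 10,000 terms, English stopwords removed
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the LinearSVC classifier and evaluate it
classifier = LinearSVC()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))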
Classification Report:
precision recall f1-score support
Result:
This case study outlines the problem statement, dataset, approach, expected outcome,
and future enhancements for developing a text classification system for customer support
email classification. It demonstrates the application of machine learning techniques to
automate and improve customer service processes.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB
PROCEDURE:
Semantic analysis is a broad area in NLP. This program demonstrates semantic analysis by
leveraging pre-trained word vectors using Word2Vec from Gensim. It utilizes word
embeddings to find words similar to each word in the provided sentences.
Library Installation: Ensure the necessary libraries (Gensim and NLTK) are installed.
Library Import: Import the required libraries (gensim for word vectors and nltk for
tokenization).
Pre-trained Word Vectors: Load pre-trained word vectors (Word2Vec) using Gensim's
api.load() method.
Sample Sentences: Define sample sentences for semantic analysis.
Tokenization: Break down the sentences into individual words using NLTK's word_tokenize()
method.
Semantic Analysis: Iterate through each word in the tokenized sentences and:
Check if the word exists in the pre-trained Word2Vec model.
If the word exists, find similar words using the most_similar() method from the
word vectors model.
Display or store the similar words for each word in the sentence.
If the word doesn't exist in the pre-trained model, indicate that it's not present.
The following algorithm outlines the steps involved in performing semantic analysis using pre-
trained word vectors (Word2Vec) in Python, demonstrating how to find similar words for each
word in the provided sentences based on the loaded word vectors.
PROGRAM:
# Install necessary libraries
!pip install gensim
!pip install nltk
# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]
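The printed program stops at the sample sentences; the loading, tokenization, and similarity lookup that produce the output below match the notebook version printed later in this document and are, in full:
import nltk
import gensim.downloader as api
from nltk.tokenize import word_tokenize

nltk.download('punkt')

# Pre-trained Word2Vec vectors (roughly 1.6 GB on first download)
word_vectors = api.load("word2vec-google-news-300")

# Sample sentences from the program above
sentences = [
    "Natural language processing is a challenging but fascinating field.",
    "Word embeddings capture semantic meanings of words in a vector space."
]

# Tokenize and look up the most similar words for each token
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
for tokenized_sentence in tokenized_sentences:
    for word in tokenized_sentence:
        if word in word_vectors:
            print(f"Words similar to '{word}': {word_vectors.most_similar(word)}")
        else:
            print(f"'{word}' is not in the pre-trained Word2Vec model.")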
OUTPUT:
Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)
Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages
(from gensim) (1.23.5)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from
gensim) (1.11.3)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages
(from gensim) (6.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
[==================================================] 100.0%
1662.8/1662.8MB downloaded
('Chris_Manfredini_kicked', 0.47327715158462524)]
0.5015827417373657)]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    # Synonyms generation
    synonyms = set()
    for token in lemmatized_tokens:
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())
    return list(synonyms)
Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Customer Query: I received a damaged product. Can I get a refund?
Semantic Analysis (Synonyms): ['refund', 'grow', 'baffle', 'pay_off', 'Cartesian_product',
'arrive', 'engender', 'standard', 'have', 'damaged', 'experience', 'develop', 'sustain', 'product',
'acquire', 'encounter', 'take_in', 'find', 'stupefy', 'bugger_off', 'draw', 'pose', 'aim', 'nonplus',
'induce', 'mother', 'stimulate', 'make', 'repayment', 'convey', 'cause', 'mathematical_product',
'get', 'damage', 'produce', 'set_out', 'merchandise', 'buzz_off', 'beat', 'meet', 'start', 'commence',
'return', 'pick_up', 'production', 'fix', 'stick', "get_under_one's_skin", 'go', 'mystify', 'take',
'perplex', 'welcome', 'vex', 'begin', 'come', 'fuck_off', 'bring', 'contract', 'capture', 'generate',
'give_back', 'incur', 'repay', 'let', 'become', 'start_out', 'gravel', 'scram', 'obtain', 'pay_back',
'amaze', 'catch', 'beget', 'get_down', 'set_about', 'invite', 'bring_forth', 'drive', 'sire', 'intersection',
'discredited', 'suffer', 'received', 'ware', 'dumbfound', 'fetch', 'father', 'arrest', 'flummox', 'puzzle',
'bewilder', 'receive']
Result:
By following this approach, the program aims to achieve the objectives of performing
semantic analysis on customer queries and improving customer service operations by providing
valuable insights into the semantics of the queries.
Lab 2 – Case Study
Objectives:
The objectives of the provided program are to implement a simple email autocomplete system using the
GPT-2 language model. The program aims to facilitate user interaction by suggesting autocompletions
based on the context provided and the user's input. Key objectives include initializing and integrating
the GPT-2 model and tokenizer from the Hugging Face Transformers library, defining a class structure
(EmailAutocompleteSystem) to encapsulate the autocomplete system, and creating a method
(generate_suggestions) to generate context-aware suggestions. The program encourages user
engagement by incorporating a user input loop, allowing continuous interaction until the user chooses
to exit. The ultimate goal is to demonstrate the practical use of a pre-trained language model for
generating relevant suggestions in the context of email composition, showcasing the capabilities of the
GPT-2 model for natural language processing tasks.
Approach:
1. Data Collection:
Collect a diverse dataset of emails, including different writing styles, topics, and
formality levels.
Annotate the dataset with proper context information, such as sender, recipient,
subject, and the body of the email.
2. Data Preprocessing:
Clean and tokenize the text data.
Handle issues like punctuation, capitalization, and special characters.
Split the dataset into training and testing sets.
3. Model Selection:
Choose a suitable NLP model for word generation. Options may include recurrent
neural networks (RNNs), long short-term memory networks (LSTMs), or
transformer models like GPT-3.
Fine-tune or train the model on the email dataset to understand the specific
language patterns used in emails.
4. Context Integration:
Design a mechanism to incorporate contextual information from the email, such
as the subject, previous sentences, and the relationship between the sender and
recipient.
Implement a way for the model to understand the context shift within the email
body.
5. User Interface:
Develop a user-friendly interface that integrates with popular email clients or
standalone applications.
Allow users to enable or disable the autocomplete feature as needed.
Provide visual cues to indicate suggested words or phrases.
6. Model Evaluation:
Evaluate the model's performance on the test dataset using metrics like perplexity,
accuracy, and precision.
Gather user feedback on the effectiveness and usability of the autocomplete
system.
7. Fine-Tuning and Iteration:
Analyze user feedback and performance metrics to identify areas for
improvement.
Consider refining the model based on user suggestions and addressing any
limitations.
8. Deployment:
Deploy the trained model as a service that can be accessed by the email
application.
Ensure scalability and reliability of the autocomplete system.
Potential Challenges:
Context Understanding: Ensuring the model effectively understands and incorporates
the context of the email.
Ambiguity Handling: Dealing with ambiguous phrases and understanding the user's
intended meaning.
Personalization: Tailoring the system to individual writing styles and preferences.
Success Criteria:
Improved email composition efficiency and speed.
Positive user feedback on the accuracy and relevance of autocomplete suggestions.
Reduction in typing errors and improved overall user experience.
By successfully developing and implementing this word generation program, the company aims
to enhance the productivity and user experience of individuals engaged in email communication.
Program :
!pip install transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class EmailAutocompleteSystem:
    def __init__(self):
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context=""):
        # Combine any email context with the user's partial sentence
        input_text = f"{context} {user_input}".strip()
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,
                                         no_repeat_ngram_size=2)
        # Decode and keep only the words generated after the user's input
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

# Example usage
if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()
    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")
        if user_input.lower() == 'exit':
            break
        suggestions = autocomplete_system.generate_suggestions(user_input)
        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")
Output:
Enter your sentence (type 'exit' to end): hello, how are you ? How's
everything going on !
The attention mask and the pad token id were not set. As a consequence, you
may observe unexpected behavior. Please pass your input's `attention_mask` to
obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Autocomplete Suggestions: ["How's", 'everything', 'going', 'on!', "I'm", 'a',
'programmer', 'and', "I've", 'been', 'working', 'on', 'a', 'project', 'for',
'a', 'while', 'now.', 'I', 'have', 'a', 'lot', 'of', 'ideas', 'for', 'the']
Enter your sentence (type 'exit' to end): exit
Result:
The result demonstrates the integration of a powerful language model for enhancing user experience in
composing emails through intelligent autocomplete suggestions.
#1
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample text for analysis
text = "Natural Language Processing is a fascinating field of study."
# Process the text with spaCy
doc = nlp(text)
# Extracting tokens and lemmatization
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']
Lemmas: ['Natural', 'Language', 'Processing', 'be', 'a', 'fascinating', 'field', 'of', 'study', '.']
Dependency Parsing:
Natural compound Language PROPN []
Language compound Processing PROPN [Natural]
Processing nsubj is AUX [Language]
is ROOT is AUX [Processing, field, .]
a det field NOUN []
fascinating amod field NOUN []
field attr is AUX [a, fascinating, of]
of prep field NOUN [study]
study pobj of ADP []
. punct is AUX []
#1 case study
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
"The product is amazing! I love the quality.",
"The customer service was terrible, very disappointed.",
"Great experience overall, highly recommended.",
"The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)
        tokens = [token.text for token in doc]
        lemmas = [token.lemma_ for token in doc]
        print("Tokens:", tokens)
        print("Lemmas:", lemmas)
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])

if __name__ == "__main__":
    analyze_feedback(customer_feedback)
Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []
#2
import nltk
import random
nltk.download('punkt')
nltk.download('gutenberg')
words = nltk.corpus.gutenberg.words()
bigrams = list(nltk.bigrams(words))
starting_word = "the"
generated_text = [starting_word]
for _ in range(20):
    # Candidate next words: all words that follow the current last word in the corpus bigrams
    possible_words = [second for (first, second) in bigrams if first == generated_text[-1]]
    next_word = random.choice(possible_words)
    generated_text.append(next_word)
print(' '.join(generated_text))
#2 Case study
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class EmailAutocompleteSystem:
    def __init__(self):
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context):
        input_text = f"{context} {user_input}"
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,
                                         no_repeat_ngram_size=2)
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()
    email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"
    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")
        if user_input.lower() == 'exit':
            break
        suggestions = autocomplete_system.generate_suggestions(user_input, email_context)
        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarni
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public mo
warnings.warn(
tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 636B/s]
#3
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# Split the data into training and testing sets
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier
model = make_pipeline(
TfidfVectorizer(),
LinearSVC()
)
# Train the model
model.fit(X_train, y_train)
# Predict labels for the test set
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))
Accuracy: 0.9504823151125402
Classification Report:
precision recall f1-score support
#3 Case study
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcycles', 'sci.electronics'])
# Prepare data and target labels
X = newsgroups.data
y = newsgroups.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)
# Predict labels for the test set
predictions = classifier.predict(X_test)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))
Accuracy: 0.9389623601220752
Classification Report:
precision recall f1-score support
rec.motorcycles 0.96 0.99 0.97 205
sci.electronics 0.92 0.93 0.92 189
#4
# Install necessary libraries
!pip install gensim
!pip install nltk
# Import required libraries
import gensim.downloader as api
from nltk.tokenize import word_tokenize
# Download pre-trained word vectors (Word2Vec)
word_vectors = api.load("word2vec-google-news-300")
# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Perform semantic analysis using pre-trained word vectors
for tokenized_sentence in tokenized_sentences:
    for word in tokenized_sentence:
        if word in word_vectors:
            similar_words = word_vectors.most_similar(word)
            print(f"Words similar to '{word}': {similar_words}")
        else:
            print(f"'{word}' is not in the pre-trained Word2Vec model.")