
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY

SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 5: SENTIMENT ANALYSIS

AIM: To perform sentiment analysis using an SVM classifier with TF-IDF vectorization.

PROCEDURE:
Data Preparation: Downloading the dataset, converting it into a suitable format (words and
sentiments), and structuring it into a DataFrame.
Splitting Data: Dividing the dataset into training and testing sets to train the model on a
portion and evaluate it on another.
TF-IDF Vectorization: Converting text data into numerical vectors using TF-IDF (Term
Frequency-Inverse Document Frequency) representation.
SVM Initialization and Training: Setting up an SVM classifier and training it using the TF-
IDF vectors obtained from the training text data.
Prediction and Evaluation: Transforming test data into TF-IDF vectors, predicting sentiment
labels, and evaluating the model's performance by comparing predicted labels with actual
labels using accuracy and a classification report.
The following algorithm outlines the process of building a sentiment analysis model using an
SVM classifier with TF-IDF vectorization in Python. Adjustments can be made to use
different datasets, vectorization techniques, or machine learning models based on specific
requirements.
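
For example, a different classifier can be dropped into the same TF-IDF workflow with
minimal changes. The short sketch below is illustrative only: it assumes the X_train_tfidf,
X_test_tfidf, y_train and y_test objects prepared in the program further down and swaps the
SVM for scikit-learn's LogisticRegression.

# Illustrative variation: same TF-IDF features, different classifier
# (assumes X_train_tfidf, X_test_tfidf, y_train, y_test from the program below)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)  # higher max_iter helps convergence on TF-IDF features
logreg.fit(X_train_tfidf, y_train)
print("LogisticRegression accuracy:", accuracy_score(y_test, logreg.predict(X_test_tfidf)))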

ALGORITHM:
1. Library Installation and Import: Install required libraries (scikit-learn and nltk).
Import necessary modules from these libraries.
2. Download NLTK Resources: Download the movie_reviews dataset from NLTK.
3. Load and Prepare Dataset: Load the movie_reviews dataset.
Convert the dataset into a suitable format (list of words and corresponding sentiments)
and create a DataFrame.
4. Split Data into Train and Test Sets: Split the dataset into training and testing sets
(e.g., 80% training, 20% testing).

5. TF-IDF Vectorization: Initialize a TF-IDF vectorizer.
Fit and transform the training text data to convert it into numerical TF-IDF vectors.
6. Initialize and Train SVM Classifier: Initialize an SVM classifier (using a linear kernel
for this example).
Train the SVM classifier using the TF-IDF vectors and corresponding sentiment
labels.
7. Prediction and Evaluation: Transform the test text data into TF-IDF vectors using the
trained vectorizer.
Predict sentiment labels for the test data using the trained SVM classifier.
Calculate the accuracy score to evaluate the model's performance.
Generate a classification report showing precision, recall, and F1-score for each class.

PROGRAM:
# Install necessary libraries
!pip install scikit-learn
!pip install nltk

# Import required libraries


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import movie_reviews # Sample dataset from NLTK

# Download NLTK resources (run only once if not downloaded)


import nltk
nltk.download('movie_reviews')

# Load the movie_reviews dataset


documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]

# Convert data to DataFrame
df = pd.DataFrame(documents, columns=['text', 'sentiment'])

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.2,
random_state=42)

# Initialize TF-IDF vectorizer


tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the training data


X_train_tfidf = tfidf_vectorizer.fit_transform(X_train.apply(' '.join))

# Initialize SVM classifier


svm_classifier = SVC(kernel='linear')

# Train the classifier


svm_classifier.fit(X_train_tfidf, y_train)

# Transform the test data


X_test_tfidf = tfidf_vectorizer.transform(X_test.apply(' '.join))

# Predict on the test data


y_pred = svm_classifier.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Display classification report


print(classification_report(y_test, y_pred))


OUTPUT:
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages
(from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from
scikit-learn) (1.11.3)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from
scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-
packages (from scikit-learn) (3.2.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data] Unzipping corpora/movie_reviews.zip.
Accuracy: 0.84
              precision    recall  f1-score   support

         neg       0.83      0.85      0.84       199
         pos       0.85      0.82      0.84       201

    accuracy                           0.84       400
   macro avg       0.84      0.84      0.84       400
weighted avg       0.84      0.84      0.84       400
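
Beyond the held-out test split, the fitted vectorizer and classifier can also score a brand-new
review. A minimal sketch (the review text is made up; variable names follow the program
above):

# Classify a hypothetical unseen review with the fitted vectorizer and classifier
new_review = "An engaging plot with brilliant performances throughout."
new_review_tfidf = tfidf_vectorizer.transform([new_review])
print("Predicted sentiment:", svm_classifier.predict(new_review_tfidf)[0])  # e.g. 'pos' or 'neg'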



SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 6: PARTS OF SPEECH TAGGING

AIM: To perform Parts of Speech (POS) tagging using NLTK

PROCEDURE:
Library Installation and Import: Ensures the NLTK library is available for use and imports
the necessary modules for text processing.
Download NLTK Resources: Downloads essential resources (punkt for tokenization,
averaged_perceptron_tagger for POS tagging) required by NLTK.
Sample Text: Defines a piece of text to demonstrate POS tagging.
Tokenization: Divides the text into individual words or tokens, making it suitable for further
analysis.
POS Tagging: Assigns each word in the text its respective grammatical category or POS tag
using NLTK's POS tagging functionality.
Display POS Tags: Prints or displays the words along with their associated POS tags obtained
from the tagging process.
The following algorithm outlines the steps involved in performing Parts of Speech (POS)
tagging using NLTK in Python. It demonstrates how to tokenize a text and assign
grammatical categories to individual words, providing insight into the linguistic structure of
the text.

ALGORITHM:
1. Library Installation and Import: Install NLTK library if not already installed.
Import the necessary NLTK library for text processing and POS tagging.
2. Download NLTK Resources: Download NLTK resources required for tokenization
and POS tagging (punkt for tokenization, averaged_perceptron_tagger for POS
tagging).
3. Sample Text: Define a sample text for POS tagging.
4. Tokenization: Break down the provided text into individual words (tokens) using
NLTK's word_tokenize() method.

5. POS Tagging: Perform POS tagging on the tokens obtained from the text using
NLTK's pos_tag() method.
Assign POS tags to each word in the text based on its grammatical category (noun,
verb, adjective, etc.).
6. Display POS Tags: Print or display the words along with their respective POS tags
generated by the POS tagging process.

PROGRAM:
# Install NLTK (if not already installed)
!pip install nltk

# Import necessary libraries


import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text for POS tagging


text = "Parts of speech tagging helps to understand the function of each word in a sentence."

# Tokenize the text into words


tokens = nltk.word_tokenize(text)

# Perform POS tagging


pos_tags = nltk.pos_tag(tokens)

# Display the POS tags


print("POS tags:", pos_tags)

OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)

Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.2)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
POS tags: [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('tagging', 'VBG'), ('helps', 'NNS'), ('to',
'TO'), ('understand', 'VB'), ('the', 'DT'), ('function', 'NN'), ('of', 'IN'), ('each', 'DT'), ('word',
'NN'), ('in', 'IN'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]



SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 7: CHUNKING

AIM: To perform Noun Phrase chunking

PROCEDURE:
In Natural Language Processing (NLP), chunking is the process of extracting short, meaningful
phrases (chunks) from a sentence based on specific patterns of parts of speech (POS). Python
provides tools like NLTK (Natural Language Toolkit) to perform chunking. This example
demonstrates basic noun phrase (NP) chunking using NLTK. You can adjust the chunk
grammar patterns to capture different types of phrases or entities based on your specific
needs.
The chunk_grammar variable contains a pattern defined using regular expressions for
identifying noun phrases. Adjusting this pattern can help extract different types of chunks
like verb phrases, prepositional phrases, named entities, etc.
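
As an illustration of such an adjustment, the grammar below (a sketch only, not part of the
original lab program) adds prepositional-phrase (PP) and verb-phrase (VP) rules alongside
the NP rule; the tag patterns are reasonable defaults rather than the only possible choice.

# Illustrative extended grammar: NP, PP, and VP chunks (tag patterns are assumptions)
import nltk

extended_grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # optional determiner, adjectives, one or more nouns
  PP: {<IN><NP>}            # preposition followed by a noun phrase
  VP: {<VB.*><NP|PP>*}      # verb followed by noun/prepositional phrases
"""
extended_parser = nltk.RegexpParser(extended_grammar)
# extended_parser.parse(tagged) can then be used exactly like chunk_parser in the program below.
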
Tokenization: Breaking the sentence into individual tokens or words.
POS Tagging: Assigning part-of-speech tags to each token (identifying whether it's a noun,
verb, adjective, etc.).
Chunking: Grouping tokens into larger structures (noun phrases, verb phrases) based on
defined grammar rules.
Chunk Grammar: Regular expressions defining patterns for identifying specific chunk
structures (like noun phrases).
Chunk Parser: Utilizing the chunk grammar to parse and extract chunks based on the
provided POS-tagged tokens.
The following algorithm outlines the steps involved in the noun phrase chunking process
using NLTK in Python, highlighting the key processes and the role of chunk grammar in
identifying and extracting specific syntactic structures from text data.


ALGORITHM:

1. Import Necessary Libraries: Import required modules from NLTK for tokenization,
POS tagging, and chunking.

2. Download NLTK Resources (if needed): Ensure NLTK resources like tokenizers and
POS taggers are downloaded (nltk.download('punkt'),
nltk.download('averaged_perceptron_tagger')).

3. Define a Sample Sentence: Set a sample sentence that will be used for chunking.

4. Tokenization: Break the sentence into individual words or tokens using NLTK's
word_tokenize() function.

5. Part-of-Speech (POS) Tagging: Tag each token with its corresponding part-of-speech
using NLTK's pos_tag() function.

6. Chunk Grammar Definition: Define a chunk grammar using regular expressions to
identify noun phrases (NP). For example, NP: {<DT>?<JJ>*<NN>} captures
sequences with optional determiners (DT), adjectives (JJ), and nouns (NN) as noun
phrases.

7. Chunk Parser Creation: Create a chunk parser using RegexpParser() and provide the
defined chunk grammar.

8. Chunking: Parse the tagged sentence using the created chunk parser to extract chunks
based on the defined grammar.

9. Display Chunks: Iterate through the parsed chunks and print the subtrees labeled as
'NP', which represent the identified noun phrases.


PROGRAM:
!pip install nltk
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Download NLTK resources (run only once if not downloaded)


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence


tokens = word_tokenize(sentence)

# POS tagging
tagged = pos_tag(tokens)

# Define a chunk grammar using regular expressions


# NP (noun phrase) chunking: "NP: {<DT>?<JJ>*<NN>}"
# This grammar captures optional determiner (DT), adjectives (JJ), and nouns (NN) as a noun
phrase
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""

# Create a chunk parser with the defined grammar


chunk_parser = RegexpParser(chunk_grammar)

# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)

# Display the chunks


for subtree in chunks.subtrees():
    if subtree.label() == 'NP':  # Print only noun phrases
        print(subtree)

OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
(NP the/DT lazy/JJ dog/NN)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!


SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 6: CASE STUDY

AIM: Parts of Speech Tagging


Problem Statement:
An online news aggregator wants to improve its recommendation system by analyzing the
content of news articles. To achieve this, they need to perform parts of speech tagging on the
article text to extract relevant information such as key topics, sentiments, and entities
mentioned.
Objectives :
1. Develop a parts of speech tagging system to analyze the content of news articles.
2. Extract key information such as nouns, verbs, adjectives, and other parts of speech to
understand the structure of the articles.
3. Enhance the recommendation system by incorporating the extracted information to
provide more accurate and personalized recommendations to users.

Dataset:
The dataset consists of a collection of news articles in text format. Each article is labeled with
its category (e.g., politics, sports, entertainment) and contains textual content for analysis.

Approach:
1. Preprocess the dataset by tokenizing the text into words and sentences.
2. Perform parts of speech tagging using a pre-trained model or a custom-trained model.
3. Extract relevant parts of speech such as nouns, verbs, adjectives, and adverbs from the
tagged text.
4. Analyze the distribution of different parts of speech across the articles to understand
their linguistic characteristics (see the sketch after this list).
5. Integrate the extracted information into the recommendation system to improve the
relevance of recommended articles for users.
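
As referenced in step 4, the tag distribution can be summarized once tagging is done. A
minimal sketch, assuming the tagged_tokens list returned by pos_tagging in the program
below:

# Sketch for step 4: frequency of each POS tag in a tagged article
# (assumes tagged_tokens as produced by pos_tagging in the program below)
from collections import Counter

tag_counts = Counter(tag for _, tag in tagged_tokens)
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
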
Program :
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK resources (if not already downloaded)


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
    sentences = sent_tokenize(text)
    tagged_tokens = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens.extend(nltk.pos_tag(tokens))
    return tagged_tokens

def main():
    # Example news article
    article_text = """
    Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
    Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
    Chelsea's only goal came from Pulisic in the first half.
    The victory boosts United's chances in the Premier League title race.
    """

    tagged_tokens = pos_tagging(article_text)
    print("Original Article Text:\n", article_text)
    print("\nParts of Speech Tagging:")
    for token, pos_tag in tagged_tokens:
        print(f"{token}: {pos_tag}")

if __name__ == "__main__":
    main()

Output:
Original Article Text:

Manchester United secured a 3-1 victory over Chelsea in yesterday's match.


Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.

Parts of Speech Tagging:


Manchester: NNP
United: NNP
secured: VBD
a: DT
3-1: JJ
victory: NN
over: IN
Chelsea: NNP
in: IN
yesterday: NN
's: POS
match: NN
.: .
Goals: NNS
from: IN
Rashford: NNP
,: ,
Greenwood: NNP
,: ,
and: CC
Fernandes: NNP
sealed: VBD
the: DT
win: NN
for: IN
United: NNP
.: .
Chelsea: NN
's: POS
only: JJ
goal: NN
came: VBD
from: IN
Pulisic: NNP
in: IN
the: DT
first: JJ
half: NN
.: .
The: DT
victory: NN
boosts: VBZ
United: NNP
's: POS
chances: NNS
in: IN
the: DT
Premier: NNP
League: NNP
title: NN
race: NN
.: .
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
Result:
This program demonstrates the parts of speech tagging process on a sample news
article. Each word in the article is followed by its corresponding part of speech tag. This
information can be further utilized for analysis and decision-making in the recommendation
system.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 7: CASE STUDY

AIM: The aim of this case study is to demonstrate the extraction of noun phrases from a
given text using chunking, a technique in Natural Language Processing (NLP). We will
utilize Python's NLTK library to implement chunking and extract meaningful noun phrases
from the text.
Problem Statement:
Given a sample text, our goal is to identify and extract noun phrases, which are
sequences of words containing a noun and optionally other words like adjectives or
determiners. The problem involves implementing a program that tokenizes the text, performs
part-of-speech tagging, applies chunking to identify noun phrases, and finally outputs the
extracted noun phrases.
Objectives :
1. Tokenize the input text into words.
2. Perform part-of-speech tagging to assign grammatical tags to each word.
3. Define a chunk grammar to identify noun phrases.
4. Apply chunking to extract noun phrases from the text.
5. Display the extracted noun phrases.
Dataset:
For this case study, we will use a sample text: "The quick brown fox jumps over the lazy
dog."
Approach:
The approach involves several steps to extract noun phrases from the given text using
chunking in Natural Language Processing (NLP). Firstly, the input text is tokenized into
individual words to prepare it for further processing. Following tokenization, each word is
tagged with its part-of-speech using NLTK's pos_tag function, which assigns grammatical
tags to each word based on its context. Next, a chunk grammar is defined to specify the
patterns that identify noun phrases. This grammar is then utilized to apply chunking, which
groups consecutive words that match the defined patterns into noun phrases. Finally, the
extracted noun phrases are outputted, providing meaningful insights into the structure and
content of the text. This approach allows for the identification and extraction of important
linguistic units, facilitating various NLP tasks such as information extraction, text
summarization, and sentiment analysis.

Program :
import nltk
import os

# Set NLTK data path


nltk.data.path.append("/usr/local/share/nltk_data")

# Download the 'punkt' tokenizer model


nltk.download('punkt')

# Download the 'averaged_perceptron_tagger' model


nltk.download('averaged_perceptron_tagger')

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text into words


words = nltk.word_tokenize(text)

# Perform part-of-speech tagging


pos_tags = nltk.pos_tag(words)

# Define chunk grammar


chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>} # Chunk sequences of DT, JJ, NN
"""

# Create chunk parser


chunk_parser = nltk.RegexpParser(chunk_grammar)

# Apply chunking
chunked_text = chunk_parser.parse(pos_tags)

# Extract noun phrases


noun_phrases = []
for subtree in chunked_text.subtrees(filter=lambda t: t.label() == 'NP'):
    noun_phrases.append(' '.join(word for word, tag in subtree.leaves()))

# Output
print("Original Text:", text)
print("Noun Phrases:")
for phrase in noun_phrases:
    print("-", phrase)

Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Original Text: The quick brown fox jumps over the lazy dog.
Noun Phrases:
- The quick brown
- fox
- the lazy dog

Result:
Chunking is a valuable technique in NLP for identifying and extracting meaningful
phrases from text. In this case study, we successfully implemented chunking using Python's
NLTK library to extract noun phrases from a given text. By identifying and extracting noun
phrases, we gained insights into the structure and semantics of the text, which can be
beneficial for various NLP applications such as information extraction, sentiment analysis,
and text summarization.
Lab 2 – Case Study

Aim: Autocomplete System for Email Composition


Problem Statement:
A software company wants to develop an intelligent autocomplete system for email composition.
The goal is to assist users in generating coherent and contextually appropriate sentences while
composing emails. The system should predict the next word or phrase based on the user's input
and the context of the email.

Objectives:

The objectives of the provided program are to implement a simple email autocomplete system using the
GPT-2 language model. The program aims to facilitate user interaction by suggesting autocompletions
based on the context provided and the user's input. Key objectives include initializing and integrating
the GPT-2 model and tokenizer from the Hugging Face Transformers library, defining a class structure
(EmailAutocompleteSystem) to encapsulate the autocomplete system, and creating a method
(generate_suggestions) to generate context-aware suggestions. The program encourages user
engagement by incorporating a user input loop, allowing continuous interaction until the user chooses
to exit. The ultimate goal is to demonstrate the practical use of a pre-trained language model for
generating relevant suggestions in the context of email composition, showcasing the capabilities of the
GPT-2 model for natural language processing tasks.

Approach:
1. Data Collection:
 Collect a diverse dataset of emails, including different writing styles, topics, and
formality levels.
 Annotate the dataset with proper context information, such as sender, recipient,
subject, and the body of the email.
2. Data Preprocessing:
 Clean and tokenize the text data.
 Handle issues like punctuation, capitalization, and special characters.
 Split the dataset into training and testing sets.
3. Model Selection:
 Choose a suitable NLP model for word generation. Options may include recurrent
neural networks (RNNs), long short-term memory networks (LSTMs), or
transformer models like GPT-3.
 Fine-tune or train the model on the email dataset to understand the specific
language patterns used in emails.
4. Context Integration:
 Design a mechanism to incorporate contextual information from the email, such
as the subject, previous sentences, and the relationship between the sender and
recipient.
 Implement a way for the model to understand the context shift within the email
body.
5. User Interface:
 Develop a user-friendly interface that integrates with popular email clients or
standalone applications.
 Allow users to enable or disable the autocomplete feature as needed.
 Provide visual cues to indicate suggested words or phrases.
6. Model Evaluation:
 Evaluate the model's performance on the test dataset using metrics like perplexity,
accuracy, and precision.
 Gather user feedback on the effectiveness and usability of the autocomplete
system.
7. Fine-Tuning and Iteration:
 Analyze user feedback and performance metrics to identify areas for
improvement.
 Consider refining the model based on user suggestions and addressing any
limitations.
8. Deployment:
 Deploy the trained model as a service that can be accessed by the email
application.
 Ensure scalability and reliability of the autocomplete system.
Potential Challenges:
 Context Understanding: Ensuring the model effectively understands and incorporates
the context of the email.
 Ambiguity Handling: Dealing with ambiguous phrases and understanding the user's
intended meaning.
 Personalization: Tailoring the system to individual writing styles and preferences.
Success Criteria:
 Improved email composition efficiency and speed.
 Positive user feedback on the accuracy and relevance of autocomplete suggestions.
 Reduction in typing errors and improved overall user experience.
By successfully developing and implementing this word generation program, the company aims
to enhance the productivity and user experience of individuals engaged in email communication.
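
The Model Evaluation step above mentions perplexity, which the program below does not
compute. The following is a minimal, self-contained sketch of measuring it with the same
GPT-2 model; the sample sentence is a hypothetical held-out email fragment.

# Sketch: perplexity of GPT-2 on a held-out sample (illustrative only)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

sample = "Please find the updated project proposal attached."  # hypothetical held-out text
input_ids = tokenizer.encode(sample, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)  # labels trigger the cross-entropy loss

perplexity = torch.exp(outputs.loss)
print(f"Perplexity: {perplexity.item():.2f}")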

Program :
!pip install transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class EmailAutocompleteSystem:
    def __init__(self):
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context):
        input_text = f"{context} {user_input}"
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")

        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,
                                         no_repeat_ngram_size=2)

        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)

        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

# Example usage
if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()

    # Assume user is composing an email with some context
    email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"

    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")

        if user_input.lower() == 'exit':
            break

        suggestions = autocomplete_system.generate_suggestions(user_input, email_context)

        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")

Output:
Enter your sentence (type 'exit' to end): hello, how are you ? How's
everything going on !
The attention mask and the pad token id were not set. As a consequence, you
may observe unexpected behavior. Please pass your input's `attention_mask` to
obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Autocomplete Suggestions: ["How's", 'everything', 'going', 'on!', "I'm", 'a',
'programmer', 'and', "I've", 'been', 'working', 'on', 'a', 'project', 'for',
'a', 'while', 'now.', 'I', 'have', 'a', 'lot', 'of', 'ideas', 'for', 'the']
Enter your sentence (type 'exit' to end): exit

Result:
The result demonstrates the integration of a powerful language model for enhancing user experience in
composing emails through intelligent autocomplete suggestions.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 3: TEXT CLASSIFICATION

AIM: To perform Text classification using python and scikit-learn

PROCEDURE:
This algorithm outlines the steps involved in the text classification task using
LinearSVC on the 20 Newsgroups dataset. It provides a structured approach to implementing
the program and understanding the workflow.
ALGORITHM:
Algorithm: Text Classification using LinearSVC

1. Load the 20 Newsgroups dataset with specified categories.


- Import the necessary libraries: fetch_20newsgroups from sklearn.datasets.
- Specify the categories of interest for classification.
- Use fetch_20newsgroups to load the dataset for both training and testing sets.

2. Split the dataset into training and testing sets.


- Import train_test_split from sklearn.model_selection.
- Split the dataset into X_train, X_test, y_train, and y_test.

3. Create a pipeline for text classification.


- Import make_pipeline from sklearn.pipeline.
- Create a pipeline with TF-IDF Vectorizer and LinearSVC classifier.

4. Train the model on the training data.


- Call the fit method on the pipeline with X_train and y_train as input.

5. Predict labels for the testing data.


- Use the trained model to predict labels for X_test.

6. Evaluate the model's performance.


- Calculate accuracy_score to measure the accuracy of the model.
- Print classification_report to see precision, recall, and F1-score for each class.

End Algorithm
PROGRAM:
# Install scikit-learn if not already installed
!pip install scikit-learn

# Import necessary libraries


import pandas as pd

from sklearn.datasets import fetch_20newsgroups


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the 20 Newsgroups dataset


categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Split the data into training and testing sets


X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier


model = make_pipeline(
TfidfVectorizer(),
LinearSVC()
)

# Train the model


model.fit(X_train, y_train)

# Predict labels for the test set


predictions = model.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))

OUTPUT:
Requirement already satisfied: scikit-learn in
/usr/local/lib/python3.10/dist-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.2.0)

Accuracy: 0.9504823151125402
Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.97      0.93       389
           1       0.96      0.91      0.94       396
           2       0.98      0.94      0.96       394
           3       0.98      0.98      0.98       376

    accuracy                           0.95      1555
   macro avg       0.95      0.95      0.95      1555
weighted avg       0.95      0.95      0.95      1555
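
As with any fitted scikit-learn pipeline, the trained model can also label new, unseen text.
A small sketch (the example sentence is made up; target_names maps the numeric prediction
back to a category name):

# Classify a hypothetical unseen document with the trained pipeline
sample = "NASA plans another mission to study the outer planets."
pred = model.predict([sample])[0]
print("Predicted category:", newsgroups_train.target_names[pred])
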
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 3: CASE STUDY

AIM: Customer Support Email Classification

Problem Statement:
A customer support company receives a large volume of incoming emails from customers
with various inquiries, complaints, and feedback. Manually categorizing and prioritizing
these emails is time-consuming and inefficient. The company wants to develop a text
classification system to automatically classify incoming emails into predefined categories,
allowing for faster response times and better customer service.

Objectives :
 The text classification system successfully categorizes incoming customer emails into
predefined categories.
 It improves the efficiency of the customer support team by automating email
classification and prioritization.
 The company can respond to customer inquiries and issues more promptly, leading to
higher customer satisfaction and retention.

Dataset:
The company has a dataset of past customer emails along with their corresponding
categories. Each email is labeled with one or more categories, indicating the type of inquiry
or issue raised by the customer. For demonstration purposes, we will use the
fetch_20newsgroups dataset from scikit-learn, which contains a collection of newsgroup
documents, spanning 20 different newsgroups. We'll simulate this dataset as if it were
customer support emails categorized into predefined categories.

Approach:
Data Preparation:
 Load the 20 Newsgroups dataset as a proxy for customer support emails.
 Select a subset of categories that represent different types of customer inquiries,
complaints, and feedback.
 Prepare the data and target labels from the dataset.
Data Preprocessing:
 Clean the email text data by removing unnecessary information such as email headers,
signatures, and HTML tags.
 Tokenize the text and convert it to lowercase.
 Remove stopwords and apply techniques like stemming or lemmatization to reduce
words to their base forms.
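
The program further below relies on TfidfVectorizer's built-in lowercasing and English
stop-word removal, so the remaining cleaning steps are only described here. A minimal
sketch of such a cleaning helper, assuming NLTK's stopword list and WordNetLemmatizer
(a hypothetical function, not part of the original program):

# Hypothetical cleaning helper for the preprocessing steps described above
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_email(text):
    text = re.sub(r'<[^>]+>', ' ', text)      # strip HTML tags
    tokens = word_tokenize(text.lower())      # tokenize and lowercase
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)
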
Feature Extraction:
Use TF-IDF Vectorizer to convert text data into numerical features, limiting the maximum
number of features to 10,000 and removing English stopwords.
Model Selection:
 Choose a suitable classification algorithm such as Linear Support Vector Classifier
(LinearSVC) for text classification.
 Train the chosen model on the training data.
Model Evaluation:
 Predict labels for the test set using the trained model.
 Evaluate the classifier's performance using accuracy and a classification report, which
includes precision, recall, and F1-score for each category.
Future Enhancements:
 Continuous monitoring and updating of the model to adapt to evolving customer
inquiries and language patterns.
 Integration of sentiment analysis to assess the sentiment of customer emails and
prioritize urgent or critical issues.
 Expansion of the model to handle multiclass classification and a wider range of
customer inquiry categories.
Program :
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcycles', 'sci.electronics'])

# Prepare data and target labels


X = newsgroups.data
y = newsgroups.target

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectorizer


vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the LinearSVC classifier


classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Predict labels for the test set


predictions = classifier.predict(X_test)

# Evaluate the classifier


accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))
Output:
Accuracy: 0.9389623601220752

Classification Report:
                          precision    recall  f1-score   support

comp.sys.ibm.pc.hardware       0.92      0.91      0.91       212
   comp.sys.mac.hardware       0.94      0.93      0.94       198
               rec.autos       0.97      0.93      0.95       179
         rec.motorcycles       0.96      0.99      0.97       205
         sci.electronics       0.92      0.93      0.92       189

                accuracy                           0.94       983
               macro avg       0.94      0.94      0.94       983
            weighted avg       0.94      0.94      0.94       983

Result:
This case study outlines the problem statement, dataset, approach, expected outcome,
and future enhancements for developing a text classification system for customer support
email classification. It demonstrates the application of machine learning techniques to
automate and improve customer service processes.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 4: SEMANTIC ANALYSIS

AIM: To perform Semantic Analysis using Gensim

PROCEDURE:
Semantic analysis is a broad area in NLP. This program demonstrates semantic analysis by
leveraging pre-trained word vectors using Word2Vec from Gensim. It utilizes word
embeddings to find words similar to each word in the provided sentences.
Library Installation: Ensure the necessary libraries (Gensim and NLTK) are installed.
Library Import: Import the required libraries (gensim for word vectors and nltk for
tokenization).
Pre-trained Word Vectors: Load pre-trained word vectors (Word2Vec) using Gensim's
api.load() method.
Sample Sentences: Define sample sentences for semantic analysis.
Tokenization: Break down the sentences into individual words using NLTK's word_tokenize()
method.
Semantic Analysis: Iterate through each word in the tokenized sentences and:
Check if the word exists in the pre-trained Word2Vec model.
If the word exists, find similar words using the most_similar() method from the
word vectors model.
Display or store the similar words for each word in the sentence.
If the word doesn't exist in the pre-trained model, indicate that it's not present.
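
Besides most_similar(), the loaded KeyedVectors object also exposes a similarity() method
for comparing two words directly, which is handy for quick semantic checks. A small sketch,
assuming the same word_vectors object loaded in the program below (the word pairs are
arbitrary examples):

# Sketch: pairwise cosine similarity with the loaded vectors
# (assumes word_vectors = api.load("word2vec-google-news-300") as in the program below)
for w1, w2 in [("language", "speech"), ("language", "banana")]:
    if w1 in word_vectors and w2 in word_vectors:
        print(f"similarity({w1!r}, {w2!r}) = {word_vectors.similarity(w1, w2):.3f}")
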
The following algorithm outlines the steps involved in performing semantic analysis using pre-
trained word vectors (Word2Vec) in Python, demonstrating how to find similar words for each
word in the provided sentences based on the loaded word vectors.


ALGORITHM:
1. Install Necessary Libraries: Install Gensim and NLTK libraries (!pip install gensim,
!pip install nltk).
2. Import Libraries: Import required libraries: gensim for word vectors and nltk for
tokenization.
3. Download Pre-trained Word Vectors: Download pre-trained word vectors (Word2Vec)
using Gensim's api.load() method.
4. Define Sample Sentences: Create sample sentences for semantic analysis.
5. Tokenization: Tokenize the sentences into words using NLTK's word_tokenize()
method.
6. Semantic Analysis with Word Vectors: Iterate through each tokenized sentence.
For each word in the sentence:
Check if the word exists in the pre-trained Word2Vec model.
If the word exists:
Find words similar to the current word using word_vectors.most_similar(word).
Display or store the similar words.
If the word doesn't exist in the model:
Print a message indicating that the word is not in the pre-trained model.

PROGRAM:
# Install necessary libraries
!pip install gensim
!pip install nltk

# Import required libraries


import gensim.downloader as api
from nltk.tokenize import word_tokenize

# Download pre-trained word vectors (Word2Vec)


word_vectors = api.load("word2vec-google-news-300")

# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Perform semantic analysis using pre-trained word vectors


for tokenized_sentence in tokenized_sentences:
    for word in tokenized_sentence:
        if word in word_vectors:
            similar_words = word_vectors.most_similar(word)
            print(f"Words similar to '{word}': {similar_words}")
        else:
            print(f"'{word}' is not in the pre-trained Word2Vec model.")

OUTPUT:
Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)
Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages
(from gensim) (1.23.5)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from
gensim) (1.11.3)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages
(from gensim) (6.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
[==================================================] 100.0%
1662.8/1662.8MB downloaded
Words similar to 'natural': [('Splittorff_lacked', 0.636509358882904), ('Natural',
0.58078932762146), ('Mike_Taugher_covers', 0.577259361743927), ('manmade',
0.5276211500167847), ('shell_salted_pistachios', 0.5084421634674072), ('unnatural',
0.5030758380889893), ('naturally', 0.49992606043815613), ('Intraparty_squabbles',
0.4988228678703308), ('Burt_Bees_®', 0.49539363384246826), ('causes_Buxeda',
0.4935200810432434)]
Words similar to 'language': [('langauge', 0.7476695775985718), ('Language',
0.6695356369018555), ('languages', 0.6341332197189331), ('English',
0.6120712757110596), ('CMPB_Spanish', 0.6083104610443115), ('nonnative_speakers',
0.6063109636306763), ('idiomatic_expressions', 0.5889801979064941), ('verb_tenses',
0.58415687084198), ('Kumeyaay_Diegueno', 0.5798824429512024), ('dialect',
0.5724600553512573)]
Words similar to 'processing': [('Processing', 0.7285515666007996), ('processed',
0.6519132852554321), ('processor', 0.636760413646698), ('warden_Dominick_DeRose',
0.6166526675224304), ('processors', 0.5953895449638367),
('Discoverer_Enterprise_resumed', 0.5376213192939758), ('LSI_Tarari',
0.520267903804779), ('processer', 0.5166687369346619), ('remittance_processing',
0.5144169926643372), ('Farmland_Foods_pork', 0.5071728825569153)]
Words similar to 'is': [('was', 0.6549733281135559), ("isn'ta", 0.6439523100852966), ('seems',
0.634029746055603), ('Is', 0.6085968613624573), ('becomes', 0.5841935276985168),
('appears', 0.5822900533676147), ('remains', 0.5796942114830017), ('іѕ',
0.5695518255233765), ('makes', 0.5567088723182678), ('isn_`_t', 0.5513144135475159)]
'a' is not in the pre-trained Word2Vec model.
Words similar to 'challenging': [('difficult', 0.6388775110244751), ('challenge',
0.5953003764152527), ('daunting', 0.569800615310669), ('tough', 0.5689979791641235),
('challenges', 0.5471934676170349), ('challenged', 0.5449535846710205), ('Challenging',
0.5242965817451477), ('tricky', 0.5236554741859436), ('toughest', 0.5169045329093933),
('diffi_cult', 0.5010539889335632)]
Words similar to 'but': [('although', 0.8104525804519653), ('though', 0.7285684943199158),
('because', 0.7225914597511292), ('so', 0.6865807771682739), ('But', 0.6826984882354736),
('Although', 0.6188263297080994), ('Though', 0.6153667569160461), ('Unfortunately',
0.6031029224395752), ('Of_course', 0.593142032623291), ('anyway', 0.5869061350822449)]
Words similar to 'fascinating': [('interesting', 0.7623067498207092), ('intriguing',
0.7245113253593445), ('enlightening', 0.6644250154495239), ('captivating',
0.6459898352622986), ('facinating', 0.6416683793067932), ('riveting',
0.6324825286865234), ('instructive', 0.6210989356040955), ('endlessly_fascinating',
0.6188612580299377), ('revelatory', 0.6170244216918945), ('engrossing',
0.6126049160957336)]
Words similar to 'field': [('fields', 0.5582526326179504), ('fi_eld', 0.5188260078430176),
('Keith_Toogood', 0.49749255180358887), ('Mackenzie_Hoambrecker',
0.49514278769493103), ('Josh_Arauco_kicked', 0.48817265033721924), ('Nick_Cattoi',
0.4863145053386688), ('Armando_Cuko', 0.4853871166706085), ('Jon_Striefsky',
0.48322004079818726), ('kicker_Nico_Grasu', 0.47572532296180725),
('Chris_Manfredini_kicked', 0.47327715158462524)]
'.' is not in the pre-trained Word2Vec model.
Words similar to 'word': [('phrase', 0.6777030825614929), ('words', 0.5864380598068237),
('verb', 0.5517287254333496), ('Word', 0.54575115442276), ('adjective',
0.5290762186050415), ('cuss_word', 0.5272089242935181), ('colloquialism',
0.5160348415374756), ('noun', 0.5129537582397461), ('astrology_#/##/##',
0.5039082765579224), ('synonym', 0.49379870295524597)]
'embeddings' is not in the pre-trained Word2Vec model.
Words similar to 'capture': [('capturing', 0.7563897371292114), ('captured',
0.7155306935310364), ('captures', 0.6099075078964233), ('Capturing',
0.6023245453834534), ('recapture', 0.5498639941215515), ('Capture', 0.5493018627166748),
('nab', 0.4941576421260834), ('Captured', 0.45745959877967834), ('apprehend',
0.4357919692993164), ('seize', 0.4338296055793762)]
Words similar to 'semantic': [('semantics', 0.6644964814186096), ('Semantic',
0.6464474201202393), ('contextual', 0.5909127593040466), ('meta', 0.5905876755714417),
('ontology', 0.5880525708198547), ('Semantic_Web', 0.5612248778343201), ('semantically',
0.5600483417510986), ('microformat', 0.5582399368286133), ('inferencing',
0.5541478991508484), ('terminological', 0.5533202290534973)]
Words similar to 'meanings': [('grammatical_constructions', 0.594986081123352), ('idioms',
0.5938195586204529), ('connotations', 0.5836683511734009), ('symbolic_meanings',
0.5806494951248169), ('meaning', 0.5785343647003174), ('literal_meanings',
0.5743482112884521), ('denotative', 0.5730364918708801), ('phrasal_verbs',
0.5697917342185974), ('contexts', 0.5609514713287354), ('adjectives_adverbs',
0.5569407343864441)]
'of' is not in the pre-trained Word2Vec model.
Words similar to 'words': [('phrases', 0.7100036144256592), ('phrase', 0.6408688426017761),
('Words', 0.6160537600517273), ('word', 0.5864380598068237), ('adjectives',
0.5812757015228271), ('uttered', 0.5724518299102783), ('plate_umpire_Tony_Randozzo',
0.5642045140266418), ('expletives', 0.5539036989212036), ('Mayor_Cirilo_Pena',
0.553884744644165), ('Tele_prompter', 0.5441114902496338)]
Words similar to 'in': [('inthe', 0.5891957879066467), ('where', 0.5662435293197632), ('the',
0.5429296493530273), ('In', 0.5415117144584656), ('during', 0.5188906192779541), ('iin',
0.48737412691116333), ('at', 0.484235554933548), ('from', 0.48268404603004456),
('outside', 0.47092658281326294), ('for', 0.4566476047039032)]
'a' is not in the pre-trained Word2Vec model.
Words similar to 'vector': [('vectors', 0.750322163105011), ('adeno_associated_viral_AAV',
0.5999537110328674), ('bitmap_graphics', 0.5428463220596313), ('Sindbis',
0.5353653430938721), ('bitmap_images', 0.5318013429641724), ('signal_analyzer_VSA',
0.5276671051979065), ('analyzer_VNA', 0.5184376239776611), ('vectorial',
0.5084835886955261), ('nonviral_gene_therapy', 0.5036363005638123), ('shellcode',
0.5015827417373657)]
Words similar to 'space': [('spaces', 0.6570690870285034), ('music_concept_ShockHound',
0.5850345492362976), ('Shuttle_docks', 0.5566749572753906), ('Space',
0.5478203296661377), ('Soviet_Union_Yuri_Gagarin', 0.5417766571044922),
('Shuttle_Discovery_blasts', 0.5352603197097778), ('Shuttle_Discovery_docks',
0.534925103187561), ('Shuttle_Endeavour_undocks', 0.532420814037323),
('Shuttle_Discovery_arrives', 0.5323426723480225), ('Shuttle_undocks',
0.523307740688324)]
'.' is not in the pre-trained Word2Vec model.



SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 4: CASE STUDY

AIM: Enhancing Customer Service with Semantic Analysis


Problem Statement:
A multinational e-commerce company, "E-Shop Inc.," is looking to improve its customer
service operations by leveraging advanced natural language processing techniques. They have
a vast repository of customer interactions, including emails, chat transcripts, and social media
messages. E-Shop Inc. aims to implement semantic analysis to better understand customer
queries and sentiment, ultimately enhancing the overall customer experience.
Objectives :
 Semantic Analysis: The primary objective is to perform semantic analysis on customer
queries to understand the underlying meaning and extract relevant information. By
identifying synonyms and related terms, the program aims to capture the semantic
nuances of the input text.
 Improving Customer Service: The program aims to enhance customer service
operations by providing insights into customer queries. By analyzing the semantics of
the queries, the program can help identify common issues, extract key information, and
facilitate more effective responses.
Dataset:
In the provided program, there isn't a specific dataset used for semantic analysis. Instead, the
program demonstrates a basic approach to perform semantic analysis on a set of example
customer queries. However, in a real-world scenario, the dataset used for semantic analysis
could consist of a collection of text data relevant to the domain of interest, such as customer
support tickets, product reviews, social media interactions, or any other type of textual data
where semantic analysis is applicable.
Approach:
 Tokenization: The program starts by tokenizing the input text into individual words or
tokens. Tokenization is a fundamental step in natural language processing (NLP) for
breaking down text into its constituent parts.
 Stopword Removal: Stopwords, such as "is", "the", "and", etc., are removed from the
tokens to filter out irrelevant words that do not carry much semantic meaning.
 Lemmatization: The program lemmatizes the remaining tokens to reduce them to their
base or dictionary form. Lemmatization helps in normalizing words and reducing
inflectional forms to a common base, improving the accuracy of semantic analysis.
 Synonym Generation: Using the WordNet database, the program retrieves synonyms
for each lemmatized token. WordNet is a lexical database of the English language that
provides semantic relationships between words, including synonyms, hypernyms,
hyponyms, etc.
 Output Generation: Finally, the program outputs the synonyms generated for each
customer query, providing insights into the semantic content of the queries.
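
As noted above, WordNet also provides relations beyond synonyms, such as hypernyms and
hyponyms, along with glosses. A small sketch of inspecting them for one word (the word
'refund' is an arbitrary example; the program below uses only synonyms):

# Sketch: other WordNet relations (definition, hypernyms, hyponyms) for one word
from nltk.corpus import wordnet

syn = wordnet.synsets('refund')[0]  # first sense of 'refund'
print("Definition:", syn.definition())
print("Hypernyms:", [h.name() for h in syn.hypernyms()])
print("Hyponyms:", [h.name() for h in syn.hyponyms()])
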
Program :
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize NLTK resources


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Function to perform semantic analysis


def semantic_analysis(text):
    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Synonyms generation
    synonyms = set()
    for token in lemmatized_tokens:
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())

    return list(synonyms)

# Example customer queries


customer_queries = [
"I received a damaged product. Can I get a refund?",
"I'm having trouble accessing my account.",
"How can I track my order status?",
"The item I received doesn't match the description.",
"Is there a discount available for bulk orders?"
]

# Semantic analysis for each query


for query in customer_queries:
    print("Customer Query:", query)
    synonyms = semantic_analysis(query)
    print("Semantic Analysis (Synonyms):", synonyms)
    print("\n")

Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Customer Query: I received a damaged product. Can I get a refund?
Semantic Analysis (Synonyms): ['refund', 'grow', 'baffle', 'pay_off', 'Cartesian_product',
'arrive', 'engender', 'standard', 'have', 'damaged', 'experience', 'develop', 'sustain', 'product',
'acquire', 'encounter', 'take_in', 'find', 'stupefy', 'bugger_off', 'draw', 'pose', 'aim', 'nonplus',
'induce', 'mother', 'stimulate', 'make', 'repayment', 'convey', 'cause', 'mathematical_product',
'get', 'damage', 'produce', 'set_out', 'merchandise', 'buzz_off', 'beat', 'meet', 'start', 'commence',
'return', 'pick_up', 'production', 'fix', 'stick', "get_under_one's_skin", 'go', 'mystify', 'take',
'perplex', 'welcome', 'vex', 'begin', 'come', 'fuck_off', 'bring', 'contract', 'capture', 'generate',
'give_back', 'incur', 'repay', 'let', 'become', 'start_out', 'gravel', 'scram', 'obtain', 'pay_back',
'amaze', 'catch', 'beget', 'get_down', 'set_about', 'invite', 'bring_forth', 'drive', 'sire', 'intersection',
'discredited', 'suffer', 'received', 'ware', 'dumbfound', 'fetch', 'father', 'arrest', 'flummox', 'puzzle',
'bewilder', 'receive']

Customer Query: I'm having trouble accessing my account.


Semantic Analysis (Synonyms): ['write_up', 'report', 'invoice', 'trouble', 'describe', 'answer_for',
'access', 'pain', 'get_at', 'distract', 'disorder', 'unhinge', 'account_statement', 'calculate',
'disoblige', 'fuss', 'bill', 'disquiet', 'inconvenience', 'incommode', 'news_report', 'disturb',
'explanation', 'problem', 'perturb', 'cark', 'account', 'accounting', 'business_relationship', 'score',
'bother', 'history', 'story', 'difficulty', 'worry', 'inconvenience_oneself', 'hassle', 'chronicle',
'discommode', 'ail', 'put_out', 'upset', 'trouble_oneself']
Customer Query: How can I track my order status?
Semantic Analysis (Synonyms): ['cover', 'rank', 'cart_track', 'purchase_order', 'pass_over',
'parliamentary_law', 'cartroad', 'status', 'get_across', 'traverse', 'prescribe', 'order', 'orderliness',
'rails', 'fiat', 'gild', 'regularise', 'runway', 'govern', 'ordination', 'give_chase', 'put', 'cut_across',
'chase', 'consecrate', 'cut', 'grade', 'social_club', 'order_of_magnitude', 'path', 'rules_of_order',
'caterpillar_tread', 'tail', 'Holy_Order', 'cross', 'data_track', 'monastic_order', 'rate', 'go_after',
'say', 'edict', 'regularize', 'Order', 'caterpillar_track', 'parliamentary_procedure', 'cut_through',
'rail', 'enjoin', 'course', 'racecourse', 'arrange', 'club', 'society', 'ordinate', 'set_up', 'rescript',
'chase_after', 'place', 'dictate', 'tell', 'range', 'decree', 'regulate', 'lodge', 'condition', 'track',
'raceway', 'ordain', 'racetrack', 'get_over', 'lead', 'guild', 'running', 'tag', 'ordering', 'position',
'trail', 'dog']

Customer Query: The item I received doesn't match the description.


Semantic Analysis (Synonyms): ['point', 'mate', 'catch', 'equalize', 'detail', 'touch', 'invite',
'welcome', 'get', 'standard', 'jibe', 'description', 'verbal_description', 'gibe', 'have', 'rival', 'equal',
'pit', 'experience', 'fit', 'check', 'item', 'mates', 'correspond', 'oppose', 'pair', 'lucifer', 'received',
'peer', 'cope_with', 'encounter', 'play_off', 'meet', 'take_in', 'friction_match', 'find', 'particular',
'equate', 'match', 'couple', 'equalise', 'pick_up', 'agree', 'incur', 'compeer', 'twin', 'tally', 'obtain',
'token', 'receive']

Customer Query: Is there a discount available for bulk orders?


Semantic Analysis (Synonyms): ['tell', 'put', 'rank', 'parliamentary_procedure',
'purchase_order', 'consecrate', 'majority', 'range', 'useable', 'decree', 'parliamentary_law',
'regulate', 'price_reduction', 'mass', 'dictate', 'social_club', 'lodge', 'usable', 'enjoin', 'discount',
'grade', 'order_of_magnitude', 'brush_off', 'rules_of_order', 'rebate', 'push_aside', 'prescribe',
'arrange', 'bulge', 'bank_discount', 'dismiss', 'club', 'ordain', 'order', 'society', 'bulk', 'ordinate',
'Holy_Order', 'brush_aside', 'orderliness', 'guild', 'set_up', 'fiat', 'gild', 'regularise',
'uncommitted', 'ordering', 'available', 'monastic_order', 'deduction', 'ordination', 'govern', 'rate',
'discount_rate', 'say', 'edict', 'regularize', 'rescript', 'disregard', 'ignore', 'volume', 'Order', 'place']

Result:
The program performs semantic analysis on the customer queries, expanding each query into
related WordNet terms and giving customer service operations additional insight into the
possible meanings behind incoming requests.
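
Note that the synonym lists above mix WordNet senses from every part of speech, which is why a query such as "track my order status" pulls in dozens of unrelated entries for "order". A possible refinement, shown here only as a sketch and not part of the lab program, is to POS-tag each query first and restrict the WordNet lookup to the matching part of speech; the helper names below are illustrative.

import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

def to_wordnet_pos(tag):
    # Map Penn Treebank tags (JJ*, VB*, RB*, NN*) to WordNet POS constants
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

def pos_filtered_synonyms(text):
    # Only collect lemmas from synsets whose POS matches the tagged word
    synonyms = set()
    for word, tag in pos_tag(word_tokenize(text)):
        for syn in wordnet.synsets(word, pos=to_wordnet_pos(tag)):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())
    return sorted(synonyms)

print(pos_filtered_synonyms("How can I track my order status?"))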
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB – 1 : CASE STUDY

AIM: To Enhance Customer Feedback Analysis through NLP-based Text Processing

PROBLEM STATEMENT:

A company receives a large volume of customer feedback across various channels such as
emails, social media, and surveys. Understanding and categorizing this feedback manually is
time-consuming and inefficient. The goal is to develop an NLP-based program to automatically
process and analyze customer feedback to extract valuable insights.

OBJECTIVE:

Utilize spaCy and NLP techniques to process customer feedback text, extract tokens, perform
lemmatization, and conduct dependency parsing to uncover underlying relationships between
words.

APPROACH:

Data Collection:
Gather a dataset containing customer feedback from different sources, including emails, social
media comments, and survey responses.

Text Processing with spaCy:


Utilize spaCy to process the customer feedback text. Extract tokens to identify individual words
and perform lemmatization to obtain their base forms.


Dependency Parsing Analysis:
Use spaCy's dependency parsing feature to identify the syntactic relationships between words.
Analyze the dependency tree to understand how different parts of the feedback sentences are
connected.

Insight Generation:
Categorize the feedback based on sentiment, identify frequently occurring topics, or extract
key phrases related to specific issues or praises mentioned by customers.

Implementation:

Use Python and spaCy to develop the program for text processing and analysis.
Incorporate visualization techniques (e.g., graphs, word clouds) to represent the findings and
insights derived from the processed feedback.
Evaluation:

Evaluate the accuracy and efficiency of tokenization, lemmatization, and dependency parsing
in handling different types of customer feedback.
Measure the program's ability to extract meaningful insights and categorize feedback
accurately.
PROGRAM:
import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors


nlp = spacy.load("en_core_web_sm")

# Sample customer feedback data
customer_feedback = [
    "The product is amazing! I love the quality.",
    "The customer service was terrible, very disappointed.",
    "Great experience overall, highly recommended.",
    "The delivery was late, very frustrating."
]

def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)

        # Extract tokens and lemmas
        tokens = [token.text for token in doc]
        lemmas = [token.lemma_ for token in doc]
        print("Tokens:", tokens)
        print("Lemmas:", lemmas)

        # Dependency parsing
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])

if __name__ == "__main__":
    analyze_feedback(customer_feedback)

OUTPUT:
Analyzing Feedback 1: 'The product is amazing! I love the quality.'
Tokens: ['The', 'product', 'is', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']
Lemmas: ['the', 'product', 'be', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']

Dependency Parsing:
The det product NOUN []
product nsubj is AUX [The]
is ROOT is AUX [product, amazing, !]
amazing acomp is AUX []
! punct is AUX []
I nsubj love VERB []
love ROOT love VERB [I, quality, .]
the det quality NOUN []
quality dobj love VERB [the]
. punct love VERB []

Analyzing Feedback 2: 'The customer service was terrible, very disappointed.'


Tokens: ['The', 'customer', 'service', 'was', 'terrible', ',', 'very', 'disappointed', '.']
Lemmas: ['the', 'customer', 'service', 'be', 'terrible', ',', 'very', 'disappointed', '.']

Dependency Parsing:
The det service NOUN []
customer compound service NOUN []
service nsubj was AUX [The, customer]
was ROOT was AUX [service, disappointed, .]
terrible amod disappointed ADJ []
, punct disappointed ADJ []
very advmod disappointed ADJ []
disappointed acomp was AUX [terrible, ,, very]
. punct was AUX []

Analyzing Feedback 3: 'Great experience overall, highly recommended.'


Tokens: ['Great', 'experience', 'overall', ',', 'highly', 'recommended', '.']
Lemmas: ['great', 'experience', 'overall', ',', 'highly', 'recommend', '.']

Dependency Parsing:
Great amod experience NOUN []
experience nsubj recommended VERB [Great]
overall advmod recommended VERB []
, punct recommended VERB []
highly advmod recommended VERB []
recommended ROOT recommended VERB [experience, overall, ,, highly, .]
. punct recommended VERB []

Analyzing Feedback 4: 'The delivery was late, very frustrating.'


Tokens: ['The', 'delivery', 'was', 'late', ',', 'very', 'frustrating', '.']
Lemmas: ['the', 'delivery', 'be', 'late', ',', 'very', 'frustrating', '.']

Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []

CONCLUSION:
The developed NLP-based program utilizing spaCy proves to be an efficient solution for
processing and analyzing customer feedback. Its capability to extract tokens, perform
lemmatization, and conduct dependency parsing aids in understanding the sentiment,
identifying key topics, and establishing relationships within the feedback data. This enables
companies to derive actionable insights, prioritize issues, and enhance customer satisfaction
based on the analysis of their feedback.

RESULT:
This case study demonstrates the practical application of the provided code snippet using
spaCy in a business context, specifically for customer feedback analysis, showcasing how
NLP techniques can be employed to extract valuable insights from unstructured text data.
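
As a possible extension of the insight-generation step outlined in the approach (a sketch only, not part of the graded program), spaCy's noun_chunks can be tallied across the feedback to surface recurring topics; the variable names below are illustrative.

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

feedback = [
    "The product is amazing! I love the quality.",
    "The customer service was terrible, very disappointed.",
    "Great experience overall, highly recommended.",
    "The delivery was late, very frustrating.",
]

# Count the head lemma of each noun chunk ("product", "service", "delivery", ...)
topic_counts = Counter()
for text in feedback:
    doc = nlp(text)
    topic_counts.update(chunk.root.lemma_.lower() for chunk in doc.noun_chunks)

print(topic_counts.most_common(5))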


4/23/24, 2:20 PM nlp.ipynb - Colab

#1
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample text for analysis
text = "Natural Language Processing is a fascinating field of study."
# Process the text with spaCy
doc = nlp(text)
# Extracting tokens and lemmatization
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']
Lemmas: ['Natural', 'Language', 'Processing', 'be', 'a', 'fascinating', 'field', 'of', 'study', '.']

Dependency Parsing:
Natural compound Language PROPN []
Language compound Processing PROPN [Natural]
Processing nsubj is AUX [Language]
is ROOT is AUX [Processing, field, .]
a det field NOUN []
fascinating amod field NOUN []
field attr is AUX [a, fascinating, of]
of prep field NOUN [study]
study pobj of ADP []
. punct is AUX []
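
For a quick visual check of the same parse, spaCy's built-in displacy visualizer can render the dependency arcs. This is an illustrative addition, not part of the original cell; in a Colab/Jupyter cell pass jupyter=True, in a plain script use displacy.serve instead.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is a fascinating field of study.")

# Draws the dependency arcs inline in a notebook cell
displacy.render(doc, style="dep", jupyter=True)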

#1 case study
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
"The product is amazing! I love the quality.",
"The customer service was terrible, very disappointed.",
"Great experience overall, highly recommended.",
"The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)
        tokens = [token.text for token in doc]
        lemmas = [token.lemma_ for token in doc]
        print("Tokens:", tokens)
        print("Lemmas:", lemmas)
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])

if __name__ == "__main__":
    analyze_feedback(customer_feedback)

Analyzing Feedback 1: 'The product is amazing! I love the quality.'

Analyzing Feedback 2: 'The customer service was terrible, very disappointed.'

Analyzing Feedback 3: 'Great experience overall, highly recommended.'

Analyzing Feedback 4: 'The delivery was late, very frustrating.'


Tokens: ['The', 'delivery', 'was', 'late', ',', 'very', 'frustrating', '.']
Lemmas: ['the', 'delivery', 'be', 'late', ',', 'very', 'frustrating', '.']

Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []

#2
import nltk
import random

nltk.download('punkt')
nltk.download('gutenberg')

words = nltk.corpus.gutenberg.words()

bigrams = list(nltk.bigrams(words))

starting_word = "the"
generated_text = [starting_word]

for _ in range(20):
    possible_words = [word2 for (word1, word2) in bigrams if word1.lower() == generated_text[-1].lower()]
    next_word = random.choice(possible_words)
    generated_text.append(next_word)

print(' '.join(generated_text))

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data] Unzipping corpora/gutenberg.zip.
the mast head and the son , " If you can afford it doesn ' s eye spare them more step
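
The loop above rescans the full bigram list for every generated word, which is slow on a corpus this size. A faster variant, shown here as a sketch under the same corpus assumptions and not part of the original cell, builds an nltk.ConditionalFreqDist once and samples each follower in proportion to its observed count.

import random
import nltk

nltk.download('punkt')
nltk.download('gutenberg')

words = nltk.corpus.gutenberg.words()

# Condition on the lower-cased previous word; count every word observed after it
cfd = nltk.ConditionalFreqDist((w1.lower(), w2) for w1, w2 in nltk.bigrams(words))

generated = ["the"]
for _ in range(20):
    followers = cfd[generated[-1].lower()]
    # Weighted sampling by bigram frequency
    next_word = random.choices(list(followers.keys()), weights=list(followers.values()))[0]
    generated.append(next_word)

print(' '.join(generated))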

#2 Case study
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class EmailAutocompleteSystem:
    def __init__(self):
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context):
        input_text = f"{context} {user_input}"
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()
    email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"
    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")
        if user_input.lower() == 'exit':
            break
        suggestions = autocomplete_system.generate_suggestions(user_input, email_context)
        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarni
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public mo
warnings.warn(
tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 636B/s]

vocab.json: 100% 1.04M/1.04M [00:00<00:00, 6.15MB/s]

merges.txt: 100% 456k/456k [00:00<00:00, 2.20MB/s]

tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 8.90MB/s]

config.json: 100% 665/665 [00:00<00:00, 12.9kB/s]

model.safetensors: 100% 548M/548M [00:09<00:00, 41.4MB/s]

generation_config.json: 100% 124/124 [00:00<00:00, 588B/s]

#3
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# Split the data into training and testing sets
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier
model = make_pipeline(
    TfidfVectorizer(),
    LinearSVC()
)
# Train the model
model.fit(X_train, y_train)
# Predict labels for the test set
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))

Accuracy: 0.9504823151125402

Classification Report:
precision recall f1-score support

0 0.89 0.97 0.93 389


1 0.96 0.91 0.94 396
2 0.98 0.94 0.96 394
3 0.98 0.98 0.98 376

accuracy 0.95 1555


macro avg 0.95 0.95 0.95 1555
weighted avg 0.95 0.95 0.95 1555
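
Once fitted, the pipeline can label unseen text directly. The snippet below is an illustrative addition that reuses the model and newsgroups_train objects from the cell above and maps each numeric prediction back to its category name; the example documents are made up.

# Classify new, unseen documents with the fitted pipeline from above
new_docs = [
    "The shuttle launch was delayed because of bad weather.",
    "My graphics card driver crashes when rendering 3D textures.",
]
for doc, label in zip(new_docs, model.predict(new_docs)):
    print(f"{newsgroups_train.target_names[label]}: {doc}")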

#3 Case study
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                                                          'rec.autos', 'rec.motorcycles', 'sci.electronics'])
# Prepare data and target labels
X = newsgroups.data
y = newsgroups.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)
# Predict labels for the test set
predictions = classifier.predict(X_test)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))

Accuracy: 0.9389623601220752

Classification Report:
precision recall f1-score support

comp.sys.ibm.pc.hardware 0.92 0.91 0.91 212


comp.sys.mac.hardware 0.94 0.93 0.94 198
rec.autos 0.97 0.93 0.95 179

rec.motorcycles 0.96 0.99 0.97 205
sci.electronics 0.92 0.93 0.92 189

accuracy 0.94 983


macro avg 0.94 0.94 0.94 983
weighted avg 0.94 0.94 0.94 983

#4
# Install necessary libraries
!pip install gensim
!pip install nltk
# Import required libraries
import gensim.downloader as api
from nltk.tokenize import word_tokenize
# Download pre-trained word vectors (Word2Vec)
word_vectors = api.load("word2vec-google-news-300")
# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Perform semantic analysis using pre-trained word vectors
for tokenized_sentence in tokenized_sentences:
    for word in tokenized_sentence:
        if word in word_vectors:
            similar_words = word_vectors.most_similar(word)
            print(f"Words similar to '{word}': {similar_words}")
        else:
            print(f"'{word}' is not in the pre-trained Word2Vec model.")

Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)


Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages (from gensim) (1.25.2)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from gensim) (1.11.4)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim) (6.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
Words similar to 'natural': [('Splittorff_lacked', 0.636509358882904), ('Natural', 0.58078932762146), ('Mike_Taugher_covers', 0.5772
Words similar to 'language': [('langauge', 0.7476695775985718), ('Language', 0.6695356369018555), ('languages', 0.6341332197189331),
Words similar to 'processing': [('Processing', 0.7285515666007996), ('processed', 0.6519132852554321), ('processor', 0.6367604136466
Words similar to 'is': [('was', 0.6549733281135559), ("isn'ta", 0.6439523100852966), ('seems', 0.634029746055603), ('Is', 0.60859686
'a' is not in the pre-trained Word2Vec model.
Words similar to 'challenging': [('difficult', 0.6388775110244751), ('challenge', 0.5953003764152527), ('daunting', 0.56980061531066
Words similar to 'but': [('although', 0.8104525804519653), ('though', 0.7285684943199158), ('because', 0.7225914597511292), ('so', 0
Words similar to 'fascinating': [('interesting', 0.7623067498207092), ('intriguing', 0.7245113253593445), ('enlightening', 0.6644250
Words similar to 'field': [('fields', 0.5582526326179504), ('fi_eld', 0.5188260078430176), ('Keith_Toogood', 0.49749255180358887),
'.' is not in the pre-trained Word2Vec model.
Words similar to 'word': [('phrase', 0.6777030825614929), ('words', 0.5864380598068237), ('verb', 0.5517287254333496), ('Word', 0.54
'embeddings' is not in the pre-trained Word2Vec model.
Words similar to 'capture': [('capturing', 0.7563897371292114), ('captured', 0.7155306935310364), ('captures', 0.6099075078964233),
Words similar to 'semantic': [('semantics', 0.6644964814186096), ('Semantic', 0.6464474201202393), ('contextual', 0.5909127593040466
Words similar to 'meanings': [('grammatical_constructions', 0.594986081123352), ('idioms', 0.5938195586204529), ('connotations', 0.5
'of' is not in the pre-trained Word2Vec model.
Words similar to 'words': [('phrases', 0.7100036144256592), ('phrase', 0.6408688426017761), ('Words', 0.6160537600517273), ('word',
Words similar to 'in': [('inthe', 0.5891957879066467), ('where', 0.5662435293197632), ('the', 0.5429296493530273), ('In', 0.54151171
'a' is not in the pre-trained Word2Vec model.
Words similar to 'vector': [('vectors', 0.750322163105011), ('adeno_associated_viral_AAV', 0.5999537110328674), ('bitmap_graphics',
Words similar to 'space': [('spaces', 0.6570690870285034), ('music_concept_ShockHound', 0.5850345492362976), ('Shuttle_docks', 0.556
'.' is not in the pre-trained Word2Vec model.
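
Beyond nearest-neighbour lookups, the same pre-trained vectors can give a rough sentence-level similarity by averaging the vectors of in-vocabulary words. This is an illustrative sketch that reuses the word_vectors model loaded above; the helper names are made up.

import nltk
import numpy as np
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def sentence_vector(sentence, kv):
    # Average the vectors of the words the model knows; ignore out-of-vocabulary tokens
    vecs = [kv[w] for w in word_tokenize(sentence.lower()) if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = "Natural language processing is a challenging but fascinating field."
s2 = "Word embeddings capture semantic meanings of words in a vector space."
print("Sentence similarity:", cosine(sentence_vector(s1, word_vectors), sentence_vector(s2, word_vectors)))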

#4 case study
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Initialize NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Function to perform semantic analysis
def semantic_analysis(text):
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    synonyms = set()
    for token in lemmatized_tokens:
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())
    return list(synonyms)
# Example customer queries
customer_queries = [
"I received a damaged product. Can I get a refund?",
"I'm having trouble accessing my account.",
"How can I track my order status?",
"The item I received doesn't match the description.",
"Is there a discount available for bulk orders?"
]
# Semantic analysis for each query
for query in customer_queries:
    print("Customer Query:", query)
    synonyms = semantic_analysis(query)
    print("Semantic Analysis (Synonyms):", synonyms)
    print("\n")

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Customer Query: I received a damaged product. Can I get a refund?
Semantic Analysis (Synonyms): ['generate', 'fuck_off', 'damaged', 'bewilder', 'get', 'stick', 'draw', 'pick_up', 'take_in', 'develop

Customer Query: I'm having trouble accessing my account.


Semantic Analysis (Synonyms): ['disturb', 'news_report', 'calculate', 'answer_for', 'accounting', 'invoice', 'account_statement', 'b

Customer Query: How can I track my order status?


Semantic Analysis (Synonyms): ['racetrack', 'position', 'parliamentary_procedure', 'society', 'gild', 'tell', 'order_of_magnitude',

Customer Query: The item I received doesn't match the description.


Semantic Analysis (Synonyms): ['oppose', 'invite', 'received', 'couple', 'agree', 'check', 'receive', 'get', 'friction_match', 'meet

Customer Query: Is there a discount available for bulk orders?


Semantic Analysis (Synonyms): ['orderliness', 'dictate', 'ordination', 'edict', 'bank_discount', 'parliamentary_procedure', 'set_up

#5
# Install necessary libraries
!pip install scikit-learn
!pip install nltk
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import movie_reviews # Sample dataset from NLTK
# Download NLTK resources (run only once if not downloaded)
import nltk
nltk.download('movie_reviews')
# Load the movie_reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
# Convert data to DataFrame
df = pd.DataFrame(documents, columns=['text', 'sentiment'])
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.2,
                                                    random_state=42)
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train.apply(' '.join))
# Initialize SVM classifier
svm_classifier = SVC(kernel='linear')
# Train the classifier
svm_classifier.fit(X_train_tfidf, y_train)
# Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(X_test.apply(' '.join))
# Predict on the test data
y_pred = svm_classifier.predict(X_test_tfidf)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Display classification report
print(classification_report(y_test, y_pred))

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)


Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.25.2)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data] Unzipping corpora/movie_reviews.zip.
Accuracy: 0.84
precision recall f1-score support

neg 0.83 0.85 0.84 199


pos 0.85 0.82 0.84 201

accuracy 0.84 400


macro avg 0.84 0.84 0.84 400
weighted avg 0.84 0.84 0.84 400
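
With the vectorizer and classifier fitted, scoring a fresh review is just a transform followed by predict. The cell below is an illustrative addition reusing tfidf_vectorizer and svm_classifier from above; the review texts are made up.

# Score previously unseen reviews with the fitted TF-IDF vectorizer + SVM from above
new_reviews = [
    "a moving story with brilliant performances and a satisfying ending",
    "a dull, predictable plot and wooden acting throughout",
]
new_tfidf = tfidf_vectorizer.transform(new_reviews)
for review, label in zip(new_reviews, svm_classifier.predict(new_tfidf)):
    print(f"{label}: {review}")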

#5 case study
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Download NLTK resources (only required once)
nltk.download('vader_lexicon')
# Sample reviews
reviews = [
"This product is amazing! I love it.",
"The product was good, but the packaging was damaged.",
"Very disappointing experience. Would not recommend.",
"Neutral feedback on the product.",
]
# Initialize Sentiment Intensity Analyzer
sid = SentimentIntensityAnalyzer()
# Analyze sentiment for each review
for review in reviews:
    print("Review:", review)
    scores = sid.polarity_scores(review)
    print("Sentiment:", end=' ')
    if scores['compound'] > 0.05:
        print("Positive")
    elif scores['compound'] < -0.05:
        print("Negative")
    else:
        print("Neutral")
    print()

Review: This product is amazing! I love it.


Sentiment: Positive
Review: The product was good, but the packaging was damaged.
Sentiment: Negative
Review: Very disappointing experience. Would not recommend.
Sentiment: Negative
Review: Neutral feedback on the product.
Sentiment: Neutral

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
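
The ±0.05 cut-offs act on VADER's compound score; printing the raw score dictionary makes the thresholds concrete. This is a small illustrative addition that reuses the sid analyzer from above.

# Inspect the raw VADER scores behind the threshold logic above
example = "The product was good, but the packaging was damaged."
print(sid.polarity_scores(example))
# Prints neg/neu/pos proportions plus a 'compound' value in the range [-1, 1]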

#6
# Install NLTK (if not already installed)
!pip install nltk
# Import necessary libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text for POS tagging
text = "Parts of speech tagging helps to understand the function of each word in a sentence."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
# Display the POS tags
print("POS tags:", pos_tags)

Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)


Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
POS tags: [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('tagging', 'VBG'), ('helps', 'NNS'), ('to', 'TO'), ('understand', 'VB
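
POS tags like these are commonly used as a filter for later processing steps. The short sketch below is an illustrative addition that reuses the pos_tags list from above and keeps only nouns and verbs.

# Keep only nouns (NN*) and verbs (VB*) from the tagged tokens above
content_words = [(word, tag) for word, tag in pos_tags if tag.startswith(('NN', 'VB'))]
print(content_words)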

#6 Case study
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def pos_tagging(text):
    sentences = sent_tokenize(text)
    tagged_tokens = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens.extend(nltk.pos_tag(tokens))
    return tagged_tokens

def main():
    article_text = """Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.
"""
    tagged_tokens = pos_tagging(article_text)
    print("Original Article Text:\n", article_text)
    print("\nParts of Speech Tagging:")
    for token, pos_tag in tagged_tokens:
        print(f"{token}: {pos_tag}")

if __name__ == "__main__":
    main()

Original Article Text:


Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.

Parts of Speech Tagging:


Manchester: NNP
United: NNP
secured: VBD
a: DT
3-1: JJ
victory: NN
over: IN
Chelsea: NNP
in: IN
yesterday: NN
's: POS
match: NN
.: .
Goals: NNS
from: IN
Rashford: NNP
,: ,
Greenwood: NNP
,: ,
and: CC
Fernandes: NNP
sealed: VBD
the: DT
win: NN
for: IN
United: NNP
.: .
Chelsea: NN
's: POS
only: JJ
goal: NN
came: VBD
from: IN
Pulisic: NNP
in: IN
the: DT
first: JJ
half: NN
.: .
The: DT
victory: NN
boosts: VBZ
United: NNP
's: POS
chances: NNS
in: IN
the: DT
Premier: NNP
League: NNP
title: NN
race: NN
.: .
#7
!pip install nltk
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# Download NLTK resources (run only once if not downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# POS tagging
tagged = pos_tag(tokens)
# Define a chunk grammar using regular expressions
# NP (noun phrase) chunking: "NP: {<DT>?<JJ>*<NN>}"
# This grammar captures optional determiner (DT), adjectives (JJ), and nouns (NN) as a noun phrase
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""
# Create a chunk parser with the defined grammar
chunk_parser = RegexpParser(chunk_grammar)
# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)
# Display the chunks
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print(subtree)

Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)


Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
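
To use the chunks downstream (for example as candidate key phrases), each NP subtree can be flattened back into plain text. This is a small illustrative follow-on that reuses the chunks tree from above.

# Collect the text of each NP chunk from the parsed tree above
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in chunks.subtrees()
    if subtree.label() == 'NP'
]
print(noun_phrases)  # e.g. ['The quick brown', 'fox', ...]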

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 9/9
4/23/24, 2:20 PM nlp.ipynb - Colab

#1
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample text for analysis
text = "Natural Language Processing is a fascinating field of study."
# Process the text with spaCy
doc = nlp(text)
# Extracting tokens and lemmatization
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])

Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']
Lemmas: ['Natural', 'Language', 'Processing', 'be', 'a', 'fascinating', 'field', 'of', 'study', '.']

Dependency Parsing:
Natural compound Language PROPN []
Language compound Processing PROPN [Natural]
Processing nsubj is AUX [Language]
is ROOT is AUX [Processing, field, .]
a det field NOUN []
fascinating amod field NOUN []
field attr is AUX [a, fascinating, of]
of prep field NOUN [study]
study pobj of ADP []
. punct is AUX []

#1 case study
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
"The product is amazing! I love the quality.",
"The customer service was terrible, very disappointed.",
"Great experience overall, highly recommended.",
"The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
for idx, text in enumerate(feedback, start=1):
print(f"\nAnalyzing Feedback {idx}: '{text}'")
doc = nlp(text)
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
print("\nDependency Parsing:")
for token in doc:
print(token.text, token.dep_, token.head.text, token.head.pos_,
[child for child in token.children])
if __name__ == "__main__":
analyze_feedback(customer_feedback)

Analyzing Feedback 1: 'The product is amazing! I love the quality.'

Analyzing Feedback 2: 'The customer service was terrible, very disappointed.'

Analyzing Feedback 3: 'Great experience overall, highly recommended.'

Analyzing Feedback 4: 'The delivery was late, very frustrating.'


Tokens: ['The', 'delivery', 'was', 'late', ',', 'very', 'frustrating', '.']
Lemmas: ['the', 'delivery', 'be', 'late', ',', 'very', 'frustrating', '.']

Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 1/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#2
import nltk
import random

nltk.download('punkt')
nltk.download('gutenberg')

words = nltk.corpus.gutenberg.words()

bigrams = list(nltk.bigrams(words))

starting_word = "the"
generated_text = [starting_word]

for _ in range(20):

possible_words = [word2 for (word1, word2) in bigrams if word1.lower() == generated_text[-1].lower()]

next_word = random.choice(possible_words)
generated_text.append(next_word)

print(' '.join(generated_text))

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data] Unzipping corpora/gutenberg.zip.
the mast head and the son , " If you can afford it doesn ' s eye spare them more step

#2 Case study
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class EmailAutocompleteSystem:
def __init__(self):
self.model_name = "gpt2"
self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
self.model = GPT2LMHeadModel.from_pretrained(self.model_name)
def generate_suggestions(self, user_input, context):
input_text = f"{context} {user_input}"
input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
with torch.no_grad():
output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,no_repeat_ngram_size=2)
generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
suggestions = generated_text.split()[len(user_input.split()):]
return suggestions

if __name__ == "__main__":
autocomplete_system = EmailAutocompleteSystem()
email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"
while True:
user_input = input("Enter your sentence (type 'exit' to end): ")
if user_input.lower() == 'exit':
break
suggestions = autocomplete_system.generate_suggestions(user_input, email_context)
if suggestions:
print("Autocomplete Suggestions:", suggestions)
else:
print("No suggestions available.")

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarni
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public mo
warnings.warn(
tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 636B/s]

vocab.json: 100% 1.04M/1.04M [00:00<00:00, 6.15MB/s]

merges.txt: 100% 456k/456k [00:00<00:00, 2.20MB/s]

tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 8.90MB/s]

config.json: 100% 665/665 [00:00<00:00, 12.9kB/s]

model.safetensors: 100% 548M/548M [00:09<00:00, 41.4MB/s]

generation_config.json: 100% 124/124 [00:00<00:00, 588B/s]

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 2/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#3
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# Split the data into training and testing sets
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier
model = make_pipeline(
TfidfVectorizer(),
LinearSVC()
)
# Train the model
model.fit(X_train, y_train)
# Predict labels for the test set
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))

Accuracy: 0.9504823151125402

Classification Report:
precision recall f1-score support

0 0.89 0.97 0.93 389


1 0.96 0.91 0.94 396
2 0.98 0.94 0.96 394
3 0.98 0.98 0.98 376

accuracy 0.95 1555


macro avg 0.95 0.95 0.95 1555
weighted avg 0.95 0.95 0.95 1555

#3 Case study
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcy
# Prepare data and target labels
X = newsgroups.data
y = newsgroups.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)
# Predict labels for the test set
predictions = classifier.predict(X_test)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))

Accuracy: 0.9389623601220752

Classification Report:
precision recall f1-score support

comp.sys.ibm.pc.hardware 0.92 0.91 0.91 212


comp.sys.mac.hardware 0.94 0.93 0.94 198
rec.autos 0.97 0.93 0.95 179

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 3/9
4/23/24, 2:20 PM nlp.ipynb - Colab
rec.motorcycles 0.96 0.99 0.97 205
sci.electronics 0.92 0.93 0.92 189

accuracy 0.94 983


macro avg 0.94 0.94 0.94 983
weighted avg 0.94 0.94 0.94 983

#4
# Install necessary libraries
!pip install gensim
!pip install nltk
# Import required libraries
import gensim.downloader as api
from nltk.tokenize import word_tokenize
# Download pre-trained word vectors (Word2Vec)
word_vectors = api.load("word2vec-google-news-300")
# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Perform semantic analysis using pre-trained word vectors
for tokenized_sentence in tokenized_sentences:
for word in tokenized_sentence:
if word in word_vectors:
similar_words = word_vectors.most_similar(word)
print(f"Words similar to '{word}': {similar_words}")
else:
print(f"'{word}' is not in the pre-trained Word2Vec model.")

Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)


Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages (from gensim) (1.25.2)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from gensim) (1.11.4)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim) (6.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
Words similar to 'natural': [('Splittorff_lacked', 0.636509358882904), ('Natural', 0.58078932762146), ('Mike_Taugher_covers', 0.5772
Words similar to 'language': [('langauge', 0.7476695775985718), ('Language', 0.6695356369018555), ('languages', 0.6341332197189331),
Words similar to 'processing': [('Processing', 0.7285515666007996), ('processed', 0.6519132852554321), ('processor', 0.6367604136466
Words similar to 'is': [('was', 0.6549733281135559), ("isn'ta", 0.6439523100852966), ('seems', 0.634029746055603), ('Is', 0.60859686
'a' is not in the pre-trained Word2Vec model.
Words similar to 'challenging': [('difficult', 0.6388775110244751), ('challenge', 0.5953003764152527), ('daunting', 0.56980061531066
Words similar to 'but': [('although', 0.8104525804519653), ('though', 0.7285684943199158), ('because', 0.7225914597511292), ('so', 0
Words similar to 'fascinating': [('interesting', 0.7623067498207092), ('intriguing', 0.7245113253593445), ('enlightening', 0.6644250
Words similar to 'field': [('fields', 0.5582526326179504), ('fi_eld', 0.5188260078430176), ('Keith_Toogood', 0.49749255180358887),
'.' is not in the pre-trained Word2Vec model.
Words similar to 'word': [('phrase', 0.6777030825614929), ('words', 0.5864380598068237), ('verb', 0.5517287254333496), ('Word', 0.54
'embeddings' is not in the pre-trained Word2Vec model.
Words similar to 'capture': [('capturing', 0.7563897371292114), ('captured', 0.7155306935310364), ('captures', 0.6099075078964233),
Words similar to 'semantic': [('semantics', 0.6644964814186096), ('Semantic', 0.6464474201202393), ('contextual', 0.5909127593040466
Words similar to 'meanings': [('grammatical_constructions', 0.594986081123352), ('idioms', 0.5938195586204529), ('connotations', 0.5
'of' is not in the pre-trained Word2Vec model.
Words similar to 'words': [('phrases', 0.7100036144256592), ('phrase', 0.6408688426017761), ('Words', 0.6160537600517273), ('word',
Words similar to 'in': [('inthe', 0.5891957879066467), ('where', 0.5662435293197632), ('the', 0.5429296493530273), ('In', 0.54151171
'a' is not in the pre-trained Word2Vec model.
Words similar to 'vector': [('vectors', 0.750322163105011), ('adeno_associated_viral_AAV', 0.5999537110328674), ('bitmap_graphics',
Words similar to 'space': [('spaces', 0.6570690870285034), ('music_concept_ShockHound', 0.5850345492362976), ('Shuttle_docks', 0.556
'.' is not in the pre-trained Word2Vec model.

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 4/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#4 case study
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Initialize NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Function to perform semantic analysis
def semantic_analysis(text):
tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
synonyms = set()
for token in lemmatized_tokens:
for syn in wordnet.synsets(token):
for lemma in syn.lemmas():
synonyms.add(lemma.name())
return list(synonyms)
# Example customer queries
customer_queries = [
"I received a damaged product. Can I get a refund?",
"I'm having trouble accessing my account.",
"How can I track my order status?",
"The item I received doesn't match the description.",
"Is there a discount available for bulk orders?"
]
# Semantic analysis for each query
for query in customer_queries:
print("Customer Query:", query)
synonyms = semantic_analysis(query)
print("Semantic Analysis (Synonyms):", synonyms)
print("\n")

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Customer Query: I received a damaged product. Can I get a refund?
Semantic Analysis (Synonyms): ['generate', 'fuck_off', 'damaged', 'bewilder', 'get', 'stick', 'draw', 'pick_up', 'take_in', 'develop

Customer Query: I'm having trouble accessing my account.


Semantic Analysis (Synonyms): ['disturb', 'news_report', 'calculate', 'answer_for', 'accounting', 'invoice', 'account_statement', 'b

Customer Query: How can I track my order status?


Semantic Analysis (Synonyms): ['racetrack', 'position', 'parliamentary_procedure', 'society', 'gild', 'tell', 'order_of_magnitude',

Customer Query: The item I received doesn't match the description.


Semantic Analysis (Synonyms): ['oppose', 'invite', 'received', 'couple', 'agree', 'check', 'receive', 'get', 'friction_match', 'meet

Customer Query: Is there a discount available for bulk orders?


Semantic Analysis (Synonyms): ['orderliness', 'dictate', 'ordination', 'edict', 'bank_discount', 'parliamentary_procedure', 'set_up

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 5/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#5
# Install necessary libraries
!pip install scikit-learn
!pip install nltk
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import movie_reviews # Sample dataset from NLTK
# Download NLTK resources (run only once if not downloaded)
import nltk
nltk.download('movie_reviews')
# Load the movie_reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]

# Convert data to DataFrame


df = pd.DataFrame(documents, columns=['text', 'sentiment'])
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.2,
random_state=42)
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train.apply(' '.join))
# Initialize SVM classifier
svm_classifier = SVC(kernel='linear')
# Train the classifier
svm_classifier.fit(X_train_tfidf, y_train)
# Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(X_test.apply(' '.join))
# Predict on the test data
y_pred = svm_classifier.predict(X_test_tfidf)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Display classification report
print(classification_report(y_test, y_pred))

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)


Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.25.2)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data] Unzipping corpora/movie_reviews.zip.
Accuracy: 0.84
precision recall f1-score support

neg 0.83 0.85 0.84 199


pos 0.85 0.82 0.84 201

accuracy 0.84 400


macro avg 0.84 0.84 0.84 400
weighted avg 0.84 0.84 0.84 400

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 6/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#5 case study
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Download NLTK resources (only required once)
nltk.download('vader_lexicon')
# Sample reviews
reviews = [
"This product is amazing! I love it.",
"The product was good, but the packaging was damaged.",
"Very disappointing experience. Would not recommend.",
"Neutral feedback on the product.",
]
# Initialize Sentiment Intensity Analyzer
sid = SentimentIntensityAnalyzer()
# Analyze sentiment for each review
for review in reviews:
print("Review:", review)
scores = sid.polarity_scores(review)
print("Sentiment:", end=' ')
if scores['compound'] > 0.05:
print("Positive")
elif scores['compound'] < -0.05:
print("Negative")
else:
print("Neutral")
print()

Review: This product is amazing! I love it.


Sentiment: Positive
Review: The product was good, but the packaging was damaged.
Sentiment: Negative
Review: Very disappointing experience. Would not recommend.
Sentiment: Negative
Review: Neutral feedback on the product.
Sentiment: Neutral

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...

#6
# Install NLTK (if not already installed)
!pip install nltk
# Import necessary libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text for POS tagging
text = "Parts of speech tagging helps to understand the function of each word in a sentence."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
# Display the POS tags
print("POS tags:", pos_tags)

Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)


Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
POS tags: [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('tagging', 'VBG'), ('helps', 'NNS'), ('to', 'TO'), ('understand', 'VB

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 7/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#6 Case study
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def pos_tagging(text):
sentences = sent_tokenize(text)
tagged_tokens = []
for sentence in sentences:
tokens = word_tokenize(sentence)
tagged_tokens.extend(nltk.pos_tag(tokens))
return tagged_tokens
def main():
article_text = """Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.
"""
tagged_tokens = pos_tagging(article_text)
print("Original Article Text:\n", article_text)
print("\nParts of Speech Tagging:")
for token, pos_tag in tagged_tokens:
print(f"{token}: {pos_tag}")
if __name__ == "__main__":
main()

Original Article Text:


Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.

Parts of Speech Tagging:


Manchester: NNP
United: NNP
secured: VBD
a: DT
3-1: JJ
victory: NN
over: IN
Chelsea: NNP
in: IN
yesterday: NN
's: POS
match: NN
.: .
Goals: NNS
from: IN
Rashford: NNP
,: ,
Greenwood: NNP
,: ,
and: CC
Fernandes: NNP
sealed: VBD
the: DT
win: NN
for: IN
United: NNP
.: .
Chelsea: NN
's: POS
only: JJ
goal: NN
came: VBD
from: IN
Pulisic: NNP
in: IN
the: DT
first: JJ
half: NN
.: .
The: DT
victory: NN
boosts: VBZ
United: NNP
's: POS
chances: NNS
in: IN
the: DT
Premier: NNP
League: NNP
title: NN

https://colab.research.google.com/drive/15A6dtAf88PKmxYgKa333vDGxQRI7pumd#scrollTo=ADhz1KBmxW3V 8/9
4/23/24, 2:20 PM nlp.ipynb - Colab
#7
!pip install nltk
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# Download NLTK resources (run only once if not downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# POS tagging
tagged = pos_tag(tokens)
# Define a chunk grammar using regular expressions
# NP (noun phrase) chunking: "NP: {<DT>?<JJ>*<NN>}"
# This grammar captures optional determiner (DT), adjectives (JJ), and nouns (NN) as a noun phrase
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""
# Create a chunk parser with the defined grammar
chunk_parser = RegexpParser(chunk_grammar)
# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)
# Display the chunks
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print(subtree)

Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)


Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
(NP the/DT lazy/JJ dog/NN)

SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB – 1 : CASE STUDY

AIM: To Enhance Customer Feedback Analysis through NLP-based Text Processing

PROBLEM STATEMENT:

A company receives a large volume of customer feedback across various channels such as
emails, social media, and surveys. Understanding and categorizing this feedback manually is
time-consuming and inefficient. The goal is to develop an NLP-based program to automatically
process and analyze customer feedback to extract valuable insights.

OBJECTIVE:

Utilize spaCy and NLP techniques to process customer feedback text, extract tokens, perform
lemmatization, and conduct dependency parsing to uncover underlying relationships between
words.

APPROACH:

Data Collection:
Gather a dataset containing customer feedback from different sources, including emails, social
media comments, and survey responses.

Text Processing with spaCy:


Utilize spaCy to process the customer feedback text. Extract tokens to identify individual words
and perform lemmatization to obtain their base forms.


Dependency Parsing Analysis:
Use spaCy's dependency parsing feature to identify the syntactic relationships between words.
Analyze the dependency tree to understand how different parts of the feedback sentences are
connected.

Insight Generation:
Categorize the feedback based on sentiment, identify frequently occurring topics, or extract
key phrases related to specific issues or praises mentioned by customers.
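As a rough illustration of the key-phrase part of this step, the following minimal sketch uses spaCy noun chunks as stand-in key phrases; the sample feedback strings are taken from the program later in this lab, and the choice of noun chunks as "key phrases" is an assumption for demonstration only.

# Minimal sketch (illustrative): noun chunks as a rough proxy for key phrases in feedback
import spacy

nlp = spacy.load("en_core_web_sm")
sample_feedback = [
    "The product is amazing! I love the quality.",
    "The delivery was late, very frustrating."
]
for text in sample_feedback:
    doc = nlp(text)
    # doc.noun_chunks yields the base noun phrases found by the dependency parse
    print(text, "->", [chunk.text for chunk in doc.noun_chunks])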

Implementation:

Use Python and spaCy to develop the program for text processing and analysis.
Incorporate visualization techniques (e.g., graphs, word clouds) to represent the findings and
insights derived from the processed feedback.
Evaluation:

Evaluate the accuracy and efficiency of tokenization, lemmatization, and dependency parsing
in handling different types of customer feedback.
Measure the program's ability to extract meaningful insights and categorize feedback
accurately.
PROGRAM:
import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors


nlp = spacy.load("en_core_web_sm")

# Sample customer feedback data


customer_feedback = [
    "The product is amazing! I love the quality.",
    "The customer service was terrible, very disappointed.",
    "Great experience overall, highly recommended.",
    "The delivery was late, very frustrating."
]

def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)

        # Extract tokens and lemmas
        tokens = [token.text for token in doc]
        lemmas = [token.lemma_ for token in doc]
        print("Tokens:", tokens)
        print("Lemmas:", lemmas)

        # Dependency parsing
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])

if __name__ == "__main__":
    analyze_feedback(customer_feedback)

OUTPUT:
Analyzing Feedback 1: 'The product is amazing! I love the quality.'
Tokens: ['The', 'product', 'is', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']
Lemmas: ['the', 'product', 'be', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']

Dependency Parsing:
The det product NOUN []
product nsubj is AUX [The]

is ROOT is AUX [product, amazing, !]
amazing acomp is AUX []
! punct is AUX []
I nsubj love VERB []
love ROOT love VERB [I, quality, .]
the det quality NOUN []
quality dobj love VERB [the]
. punct love VERB []

Analyzing Feedback 2: 'The customer service was terrible, very disappointed.'


Tokens: ['The', 'customer', 'service', 'was', 'terrible', ',', 'very', 'disappointed', '.']
Lemmas: ['the', 'customer', 'service', 'be', 'terrible', ',', 'very', 'disappointed', '.']

Dependency Parsing:
The det service NOUN []
customer compound service NOUN []
service nsubj was AUX [The, customer]
was ROOT was AUX [service, disappointed, .]
terrible amod disappointed ADJ []
, punct disappointed ADJ []
very advmod disappointed ADJ []
disappointed acomp was AUX [terrible, ,, very]
. punct was AUX []

Analyzing Feedback 3: 'Great experience overall, highly recommended.'


Tokens: ['Great', 'experience', 'overall', ',', 'highly', 'recommended', '.']
Lemmas: ['great', 'experience', 'overall', ',', 'highly', 'recommend', '.']

Dependency Parsing:
Great amod experience NOUN []
experience nsubj recommended VERB [Great]
overall advmod recommended VERB []
, punct recommended VERB []
highly advmod recommended VERB []
recommended ROOT recommended VERB [experience, overall, ,, highly, .]
. punct recommended VERB []

Analyzing Feedback 4: 'The delivery was late, very frustrating.'


Tokens: ['The', 'delivery', 'was', 'late', ',', 'very', 'frustrating', '.']
Lemmas: ['the', 'delivery', 'be', 'late', ',', 'very', 'frustrating', '.']

Dependency Parsing:
The det delivery NOUN [ ]
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ [ ]
, punct frustrating ADJ [ ]
very advmod frustrating ADJ [ ]
frustrating acomp was AUX [late, ,, very]
. punct was AUX [ ]

CONCLUSION:
The developed NLP-based program utilizing spaCy proves to be an efficient solution for
processing and analyzing customer feedback. Its capability to extract tokens, perform
lemmatization, and conduct dependency parsing aids in understanding the sentiment,
identifying key topics, and establishing relationships within the feedback data. This enables
companies to derive actionable insights, prioritize issues, and enhance customer satisfaction
based on the analysis of their feedback.

RESULT:
This case study demonstrates the practical application of the provided code snippet using spaCy in a business context, specifically for customer feedback analysis, showcasing how NLP techniques can be employed to extract valuable insights from unstructured text data.


SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 7: CASE STUDY

AIM: The aim of this case study is to demonstrate the extraction of noun phrases from a
given text using chunking, a technique in Natural Language Processing (NLP). We will
utilize Python's NLTK library to implement chunking and extract meaningful noun phrases
from the text.
Problem Statement:
Given a sample text, our goal is to identify and extract noun phrases, which are
sequences of words containing a noun and optionally other words like adjectives or
determiners. The problem involves implementing a program that tokenizes the text, performs
part-of-speech tagging, applies chunking to identify noun phrases, and finally outputs the
extracted noun phrases.
Objectives :
1. Tokenize the input text into words.
2. Perform part-of-speech tagging to assign grammatical tags to each word.
3. Define a chunk grammar to identify noun phrases.
4. Apply chunking to extract noun phrases from the text.
5. Display the extracted noun phrases.
Dataset:
For this case study, we will use a sample text: "The quick brown fox jumps over the lazy
dog."
Approach:
The approach involves several steps to extract noun phrases from the given text using
chunking in Natural Language Processing (NLP). Firstly, the input text is tokenized into
individual words to prepare it for further processing. Following tokenization, each word is
tagged with its part-of-speech using NLTK's pos_tag function, which assigns grammatical
tags to each word based on its context. Next, a chunk grammar is defined to specify the
patterns that identify noun phrases. This grammar is then utilized to apply chunking, which
groups consecutive words that match the defined patterns into noun phrases. Finally, the
extracted noun phrases are outputted, providing meaningful insights into the structure and
content of the text. This approach allows for the identification and extraction of important
linguistic units, facilitating various NLP tasks such as information extraction, text
summarization, and sentiment analysis.

Program :
import nltk
import os

# Set NLTK data path


nltk.data.path.append("/usr/local/share/nltk_data")

# Download the 'punkt' tokenizer model


nltk.download('punkt')

# Download the 'averaged_perceptron_tagger' model


nltk.download('averaged_perceptron_tagger')

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text into words


words = nltk.word_tokenize(text)

# Perform part-of-speech tagging


pos_tags = nltk.pos_tag(words)

# Define chunk grammar


chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>} # Chunk sequences of DT, JJ, NN
"""

# Create chunk parser


chunk_parser = nltk.RegexpParser(chunk_grammar)

# Apply chunking
chunked_text = chunk_parser.parse(pos_tags)

# Extract noun phrases


noun_phrases = []
for subtree in chunked_text.subtrees(filter=lambda t: t.label() == 'NP'):
    noun_phrases.append(' '.join(word for word, tag in subtree.leaves()))

# Output
print("Original Text:", text)
print("Noun Phrases:")
for phrase in noun_phrases:
    print("-", phrase)

Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Original Text: The quick brown fox jumps over the lazy dog.
Noun Phrases:
- The quick brown
- fox
- the lazy dog

Result:
Chunking is a valuable technique in NLP for identifying and extracting meaningful
phrases from text. In this case study, we successfully implemented chunking using Python's
NLTK library to extract noun phrases from a given text. By identifying and extracting noun
phrases, we gained insights into the structure and semantics of the text, which can be
beneficial for various NLP applications such as information extraction, sentiment analysis,
and text summarization.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 5: SENTIMENT ANALYSIS

AIM: To perform sentiment analysis program using an SVM classifier with TF-IDF
vectorization.

PROCEDURE:
Data Preparation: Downloading the dataset, converting it into a suitable format (words and
sentiments), and structuring it into a DataFrame.
Splitting Data: Dividing the dataset into training and testing sets to train the model on a
portion and evaluate it on another.
TF-IDF Vectorization: Converting text data into numerical vectors using TF-IDF (Term
Frequency-Inverse Document Frequency) representation.
SVM Initialization and Training: Setting up an SVM classifier and training it using the TF-
IDF vectors obtained from the training text data.
Prediction and Evaluation: Transforming test data into TF-IDF vectors, predicting sentiment
labels, and evaluating the model's performance by comparing predicted labels with actual
labels using accuracy and a classification report.
The following algorithm outlines the process of building a sentiment analysis model using an
SVM classifier with TF-IDF vectorization in Python. Adjustments can be made to use
different datasets, vectorization techniques, or machine learning models based on specific
requirements.
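For instance, a minimal sketch of one such adjustment is shown below; the capped vocabulary size and the RBF kernel are illustrative choices, not part of the original lab program.

# Minimal sketch: alternative vectorizer and classifier settings (illustrative values)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')  # smaller vocabulary
svm_classifier = SVC(kernel='rbf', C=1.0)  # non-linear kernel instead of 'linear'

The remaining training and evaluation steps stay the same as in the program that follows.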

ALGORITHM:
1. Library Installation and Import: Install required libraries (scikit-learn and nltk).
Import necessary modules from these libraries.
2. Download NLTK Resources: Download the movie_reviews dataset from NLTK.
3. Load and Prepare Dataset: Load the movie_reviews dataset.
Convert the dataset into a suitable format (list of words and corresponding sentiments)
and create a DataFrame.
4. Split Data into Train and Test Sets: Split the dataset into training and testing sets (e.g.,
80% training, 20% testing).


5. TF-IDF Vectorization: Initialize a TF-IDF vectorizer.
Fit and transform the training text data to convert it into numerical TF-IDF vectors.
6. Initialize and Train SVM Classifier: Initialize an SVM classifier (using a linear kernel
for this example).
Train the SVM classifier using the TF-IDF vectors and corresponding sentiment
labels.
7. Prediction and Evaluation: Transform the test text data into TF-IDF vectors using the
trained vectorizer.
Predict sentiment labels for the test data using the trained SVM classifier.
Calculate the accuracy score to evaluate the model's performance.
Generate a classification report showing precision, recall, and F1-score for each class.

PROGRAM:
# Install necessary libraries
!pip install scikit-learn
!pip install nltk

# Import required libraries


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import movie_reviews # Sample dataset from NLTK

# Download NLTK resources (run only once if not downloaded)


import nltk
nltk.download('movie_reviews')

# Load the movie_reviews dataset


documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]


# Convert data to DataFrame
df = pd.DataFrame(documents, columns=['text', 'sentiment'])

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.2,
random_state=42)

# Initialize TF-IDF vectorizer


tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the training data


X_train_tfidf = tfidf_vectorizer.fit_transform(X_train.apply(' '.join))

# Initialize SVM classifier


svm_classifier = SVC(kernel='linear')

# Train the classifier


svm_classifier.fit(X_train_tfidf, y_train)

# Transform the test data


X_test_tfidf = tfidf_vectorizer.transform(X_test.apply(' '.join))

# Predict on the test data


y_pred = svm_classifier.predict(X_test_tfidf)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Display classification report


print(classification_report(y_test, y_pred))


OUTPUT:
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages
(from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from
scikit-learn) (1.11.3)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from
scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-
packages (from scikit-learn) (3.2.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data] Unzipping corpora/movie_reviews.zip.
Accuracy: 0.84
precision recall f1-score support

neg 0.83 0.85 0.84 199


pos 0.85 0.82 0.84 201

accuracy 0.84 400


macro avg 0.84 0.84 0.84 400
weighted avg 0.84 0.84 0.84 400



SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 3: TEXT CLASSIFICATION

AIM: To perform Text classification using Python and scikit-learn

PROCEDURE:
This algorithm outlines the steps involved in the text classification task using
LinearSVC on the 20 Newsgroups dataset. It provides a structured approach to implementing
the program and understanding the workflow.
ALGORITHM:
Algorithm: Text Classification using LinearSVC

1. Load the 20 Newsgroups dataset with specified categories.


- Import the necessary libraries: fetch_20newsgroups from sklearn.datasets.
- Specify the categories of interest for classification.
- Use fetch_20newsgroups to load the dataset for both training and testing sets.

2. Split the dataset into training and testing sets.


- Import train_test_split from sklearn.model_selection.
- Split the dataset into X_train, X_test, y_train, and y_test.

3. Create a pipeline for text classification.


- Import make_pipeline from sklearn.pipeline.
- Create a pipeline with TF-IDF Vectorizer and LinearSVC classifier.

4. Train the model on the training data.


- Call the fit method on the pipeline with X_train and y_train as input.

5. Predict labels for the testing data.


- Use the trained model to predict labels for X_test.

6. Evaluate the model's performance.


- Calculate accuracy_score to measure the accuracy of the model.
- Print classification_report to see precision, recall, and F1-score for each class.

End Algorithm
PROGRAM:
# Install scikit-learn if not already installed
!pip install scikit-learn

# Import necessary libraries


import pandas as pd

from sklearn.datasets import fetch_20newsgroups


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the 20 Newsgroups dataset


categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

# Split the data into training and testing sets


X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target

# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier


model = make_pipeline(
TfidfVectorizer(),
LinearSVC()
)

# Train the model


model.fit(X_train, y_train)

# Predict labels for the test set


predictions = model.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))

OUTPUT:
Requirement already satisfied: scikit-learn in
/usr/local/lib/python3.10/dist-packages (1.2.2)
Requirement already satisfied: numpy>=1.17.3 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.2 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in
/usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.2.0)

Accuracy: 0.9504823151125402
Classification Report:
precision recall f1-score support

0 0.89 0.97 0.93 389


1 0.96 0.91 0.94 396
2 0.98 0.94 0.96 394
3 0.98 0.98 0.98 376

accuracy 0.95 1555


macro avg 0.95 0.95 0.95 1555
weighted avg 0.95 0.95 0.95 1555
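As a follow-up usage example (not part of the original lab), a minimal sketch of how the trained pipeline could label a new, unseen document; the sample sentence is an assumption chosen for illustration.

# Minimal sketch: classify a new document with the trained pipeline defined above
new_docs = ["NASA announced a new mission to study the outer planets."]  # illustrative input
predicted = model.predict(new_docs)
# Map the numeric label back to its newsgroup category name
print(newsgroups_train.target_names[predicted[0]])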
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 6: CASE STUDY

AIM: Parts of Speech Tagging


Problem Statement:
An online news aggregator wants to improve its recommendation system by analyzing the
content of news articles. To achieve this, they need to perform parts of speech tagging on the
article text to extract relevant information such as key topics, sentiments, and entities
mentioned.
Objectives :
1. Develop a parts of speech tagging system to analyze the content of news articles.
2. Extract key information such as nouns, verbs, adjectives, and other parts of speech to
understand the structure of the articles.
3. Enhance the recommendation system by incorporating the extracted information to
provide more accurate and personalized recommendations to users.

Dataset:
The dataset consists of a collection of news articles in text format. Each article is labeled with
its category (e.g., politics, sports, entertainment) and contains textual content for analysis.

Approach:
1. Preprocess the dataset by tokenizing the text into words and sentences.
2. Perform parts of speech tagging using a pre-trained model or a custom-trained model.
3. Extract relevant parts of speech such as nouns, verbs, adjectives, and adverbs from the
tagged text.
4. Analyze the distribution of different parts of speech across the articles to understand
their linguistic characteristics.
5. Integrate the extracted information into the recommendation system to improve the
relevance of recommended articles for users.
Program :
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download NLTK resources (if not already downloaded)


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def pos_tagging(text):
sentences = sent_tokenize(text)
tagged_tokens = []
for sentence in sentences:
tokens = word_tokenize(sentence)
tagged_tokens.extend(nltk.pos_tag(tokens))
return tagged_tokens

def main():
# Example news article
article_text = """
Manchester United secured a 3-1 victory over Chelsea in yesterday's
match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for
United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title
race.
"""

tagged_tokens = pos_tagging(article_text)
print("Original Article Text:\n", article_text)
print("\nParts of Speech Tagging:")
for token, pos_tag in tagged_tokens:
print(f"{token}: {pos_tag}")

if __name__ == "__main__":
main()

Output:
Original Article Text:

Manchester United secured a 3-1 victory over Chelsea in yesterday's match.


Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.

Parts of Speech Tagging:


Manchester: NNP
United: NNP
secured: VBD
a: DT
3-1: JJ
victory: NN
over: IN
Chelsea: NNP
in: IN
yesterday: NN
's: POS
match: NN
.: .
Goals: NNS
from: IN
Rashford: NNP
,: ,
Greenwood: NNP
,: ,
and: CC
Fernandes: NNP
sealed: VBD
the: DT
win: NN
for: IN
United: NNP
.: .
Chelsea: NN
's: POS
only: JJ
goal: NN
came: VBD
from: IN
Pulisic: NNP
in: IN
the: DT
first: JJ
half: NN
.: .
The: DT
victory: NN
boosts: VBZ
United: NNP
's: POS
chances: NNS
in: IN
the: DT
Premier: NNP
League: NNP
title: NN
race: NN
.: .
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
Result:
This program demonstrates the parts of speech tagging process on a sample news
article. Each word in the article is followed by its corresponding part of speech tag. This
information can be further utilized for analysis and decision-making in the recommendation
system.
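One simple way to take this analysis further, as suggested by step 4 of the approach, is to count the distribution of POS tags across an article. The minimal sketch below assumes the pos_tagging function from the program above; the sample article string is illustrative.

# Minimal sketch: distribution of POS tags in an article (assumes pos_tagging from above)
from collections import Counter

sample_article = "Manchester United secured a 3-1 victory over Chelsea in yesterday's match."
tag_counts = Counter(tag for _, tag in pos_tagging(sample_article))
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")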
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 7: CHUNKING

AIM: To perform Noun Phrase chunking

PROCEDURE:
In Natural Language Processing (NLP), chunking is the process of extracting short, meaningful
phrases (chunks) from a sentence based on specific patterns of parts of speech (POS). Python
provides tools like NLTK (Natural Language Toolkit) to perform chunking. This example
demonstrates a basic noun phrase (NP) and verb phrase (VP) chunking using NLTK. You can
adjust the chunk grammar patterns to capture different types of phrases or entities based on
your specific needs.
The chunk_grammar variable contains patterns defined using regular expressions for
identifying noun phrases and verb phrases. Adjusting these patterns can help extract different
types of chunks like prepositional phrases, named entities, etc.
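As one illustration of adjusting the grammar, the minimal sketch below defines a multi-stage grammar that also chunks prepositional phrases (PP) and verb phrases (VP); the specific patterns are illustrative assumptions rather than part of the original lab program.

# Minimal sketch: an extended, multi-stage chunk grammar (illustrative patterns)
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

extended_grammar = r"""
NP: {<DT>?<JJ>*<NN.*>+}   # noun phrase: optional determiner, adjectives, one or more nouns
PP: {<IN><NP>}            # prepositional phrase: preposition followed by a noun phrase
VP: {<VB.*><NP|PP>*}      # verb phrase: verb followed by noun or prepositional phrases
"""

parser = RegexpParser(extended_grammar)
tagged = pos_tag(word_tokenize("The quick brown fox jumps over the lazy dog"))
print(parser.parse(tagged))  # tree containing NP, PP, and VP chunks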
Tokenization: Breaking the sentence into individual tokens or words.
POS Tagging: Assigning part-of-speech tags to each token (identifying whether it's a noun,
verb, adjective, etc.).
Chunking: Grouping tokens into larger structures (noun phrases, verb phrases) based on
defined grammar rules.
Chunk Grammar: Regular expressions defining patterns for identifying specific chunk
structures (like noun phrases).
Chunk Parser: Utilizing the chunk grammar to parse and extract chunks based on the
provided POS-tagged tokens.
The following algorithm outlines the steps involved in the noun phrase chunking process
using NLTK in Python, highlighting the key processes and the role of chunk grammar in
identifying and extracting specific syntactic structures from text data.


ALGORITHM:

1. Import Necessary Libraries: Import required modules from NLTK for tokenization,
POS tagging, and chunking.

2. Download NLTK Resources (if needed): Ensure NLTK resources like tokenizers and
POS taggers are downloaded (nltk.download('punkt'),
nltk.download('averaged_perceptron_tagger')).

3. Define a Sample Sentence: Set a sample sentence that will be used for chunking.

4. Tokenization: Break the sentence into individual words or tokens using NLTK's
word_tokenize() function.

5. Part-of-Speech (POS) Tagging: Tag each token with its corresponding part-of-speech
using NLTK's pos_tag() function.

6. Chunk Grammar Definition: Define a chunk grammar using regular expressions to


identify noun phrases (NP). For example, NP: {<DT>?<JJ>*<NN>} captures
sequences with optional determiners (DT), adjectives (JJ), and nouns (NN) as noun
phrases.

7. Chunk Parser Creation: Create a chunk parser using RegexpParser() and provide the
defined chunk grammar.

8. Chunking: Parse the tagged sentence using the created chunk parser to extract chunks
based on the defined grammar.

9. Display Chunks: Iterate through the parsed chunks and print the subtrees labeled as
'NP', which represent the identified noun phrases.


PROGRAM:
!pip install nltk
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Download NLTK resources (run only once if not downloaded)


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence


tokens = word_tokenize(sentence)

# POS tagging
tagged = pos_tag(tokens)

# Define a chunk grammar using regular expressions


# NP (noun phrase) chunking: "NP: {<DT>?<JJ>*<NN>}"
# This grammar captures optional determiner (DT), adjectives (JJ), and nouns (NN) as a noun phrase
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""

# Create a chunk parser with the defined grammar


chunk_parser = RegexpParser(chunk_grammar)


# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)

# Display the chunks


for subtree in chunks.subtrees():
    if subtree.label() == 'NP':  # Print only noun phrases
        print(subtree)

OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)
(NP the/DT lazy/JJ dog/NN)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!


SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 6: PARTS OF SPEECH TAGGING

AIM: To perform Parts of Speech (POS) tagging using NLTK

PROCEDURE:
Library Installation and Import: Ensures the NLTK library is available for use and imports
the necessary modules for text processing.
Download NLTK Resources: Downloads essential resources (punkt for tokenization,
averaged_perceptron_tagger for POS tagging) required by NLTK.
Sample Text: Defines a piece of text to demonstrate POS tagging.
Tokenization: Divides the text into individual words or tokens, making it suitable for further
analysis.
POS Tagging: Assigns each word in the text its respective grammatical category or POS tag
using NLTK's POS tagging functionality.
Display POS Tags: Prints or displays the words along with their associated POS tags obtained
from the tagging process.
The following algorithm outlines the steps involved in performing Parts of Speech (POS)
tagging using NLTK in Python. It demonstrates how to tokenize a text and assign
grammatical categories to individual words, providing insight into the linguistic structure of
the text.

ALGORITHM:
1. Library Installation and Import: Install NLTK library if not already installed.
Import the necessary NLTK library for text processing and POS tagging.
2. Download NLTK Resources: Download NLTK resources required for tokenization
and POS tagging (punkt for tokenization, averaged_perceptron_tagger for POS
tagging).
3. Sample Text: Define a sample text for POS tagging.
4. Tokenization: Break down the provided text into individual words (tokens) using
NLTK's word_tokenize() method.


5. POS Tagging: Perform POS tagging on the tokens obtained from the text using
NLTK's pos_tag() method.
Assign POS tags to each word in the text based on its grammatical category (noun,
verb, adjective, etc.).
6. Display POS Tags: Print or display the words along with their respective POS tags
generated by the POS tagging process.

PROGRAM:
# Install NLTK (if not already installed)
!pip install nltk

# Import necessary libraries


import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample text for POS tagging


text = "Parts of speech tagging helps to understand the function of each word in a sentence."

# Tokenize the text into words


tokens = nltk.word_tokenize(text)

# Perform POS tagging


pos_tags = nltk.pos_tag(tokens)

# Display the POS tags


print("POS tags:", pos_tags)

OUTPUT:
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)


Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.2)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
POS tags: [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('tagging', 'VBG'), ('helps', 'NNS'), ('to',
'TO'), ('understand', 'VB'), ('the', 'DT'), ('function', 'NN'), ('of', 'IN'), ('each', 'DT'), ('word',
'NN'), ('in', 'IN'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]



SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 3: CASE STUDY

AIM: Customer Support Email Classification

Problem Statement:
A customer support company receives a large volume of incoming emails from customers
with various inquiries, complaints, and feedback. Manually categorizing and prioritizing
these emails is time-consuming and inefficient. The company wants to develop a text
classification system to automatically classify incoming emails into predefined categories,
allowing for faster response times and better customer service.

Objectives :
• The text classification system successfully categorizes incoming customer emails into
predefined categories.
• It improves the efficiency of the customer support team by automating email
classification and prioritization.
• The company can respond to customer inquiries and issues more promptly, leading to
higher customer satisfaction and retention.

Dataset:
The company has a dataset of past customer emails along with their corresponding
categories. Each email is labeled with one or more categories, indicating the type of inquiry
or issue raised by the customer. For demonstration purposes, we will use the
fetch_20newsgroups dataset from scikit-learn, which contains a collection of newsgroup
documents, spanning 20 different newsgroups. We'll simulate this dataset as if it were
customer support emails categorized into predefined categories.

Approach:
Data Preparation:
• Load the 20 Newsgroups dataset as a proxy for customer support emails.
• Select a subset of categories that represent different types of customer inquiries,
complaints, and feedback.
• Prepare the data and target labels from the dataset.
Data Preprocessing:
• Clean the email text data by removing unnecessary information such as email headers,
signatures, and HTML tags.
• Tokenize the text and convert it to lowercase.
• Remove stopwords and apply techniques like stemming or lemmatization to reduce
words to their base forms (a minimal sketch of this step appears just before the program below).
Feature Extraction:
Use TF-IDF Vectorizer to convert text data into numerical features, limiting the maximum
number of features to 10,000 and removing English stopwords.
Model Selection:
• Choose a suitable classification algorithm such as Linear Support Vector Classifier
(LinearSVC) for text classification.
• Train the chosen model on the training data.
Model Evaluation:
• Predict labels for the test set using the trained model.
• Evaluate the classifier's performance using accuracy and a classification report, which
includes precision, recall, and F1-score for each category.
Future Enhancements:
• Continuous monitoring and updating of the model to adapt to evolving customer
inquiries and language patterns.
• Integration of sentiment analysis to assess the sentiment of customer emails and
prioritize urgent or critical issues.
• Expansion of the model to handle multiclass classification and a wider range of
customer inquiry categories.
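Before the program, a minimal sketch of the preprocessing step described above; the helper name preprocess_email and the sample sentence are illustrative assumptions, and the actual lab program below relies on TfidfVectorizer's built-in stopword handling instead.

# Minimal sketch of the described preprocessing step (illustrative, not part of the lab program)
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_email(text):
    # Strip simple HTML tags and lowercase the text
    text = re.sub(r'<[^>]+>', ' ', text).lower()
    # Tokenize, drop stopwords and non-alphabetic tokens, then lemmatize
    tokens = word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(tok) for tok in tokens
                    if tok.isalpha() and tok not in stop_words)

print(preprocess_email("The <b>delivery</b> was delayed and the package arrived damaged."))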
Program :
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcycles', 'sci.electronics'])

# Prepare data and target labels


X = newsgroups.data
y = newsgroups.target

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create TF-IDF vectorizer


vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

# Train the LinearSVC classifier


classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Predict labels for the test set


predictions = classifier.predict(X_test)

# Evaluate the classifier


accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))
Output:
Accuracy: 0.9389623601220752

Classification Report:
precision recall f1-score support

comp.sys.ibm.pc.hardware 0.92 0.91 0.91 212


comp.sys.mac.hardware 0.94 0.93 0.94 198
rec.autos 0.97 0.93 0.95 179
rec.motorcycles 0.96 0.99 0.97 205
sci.electronics 0.92 0.93 0.92 189

accuracy 0.94 983


macro avg 0.94 0.94 0.94 983
weighted avg 0.94 0.94 0.94 983

Result:
This case study outlines the problem statement, dataset, approach, expected outcome,
and future enhancements for developing a text classification system for customer support
email classification. It demonstrates the application of machine learning techniques to
automate and improve customer service processes.
SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 4: SEMANTIC ANALYSIS

AIM: To perform Semantic Analysis using Gensim

PROCEDURE:
Semantic analysis is a broad area in NLP. This program demonstrates semantic analysis by
leveraging pre-trained word vectors using Word2Vec from Gensim. It utilizes word
embeddings to find words similar to each word in the provided sentences.
Library Installation: Ensure the necessary libraries (Gensim and NLTK) are installed.
Library Import: Import the required libraries (gensim for word vectors and nltk for
tokenization).
Pre-trained Word Vectors: Load pre-trained word vectors (Word2Vec) using Gensim's
api.load() method.
Sample Sentences: Define sample sentences for semantic analysis.
Tokenization: Break down the sentences into individual words using NLTK's word_tokenize()
method.
Semantic Analysis: Iterate through each word in the tokenized sentences and:
Check if the word exists in the pre-trained Word2Vec model.
If the word exists, find similar words using the most_similar() method from the
word vectors model.
Display or store the similar words for each word in the sentence.
If the word doesn't exist in the pre-trained model, indicate that it's not present.
The following algorithm outlines the steps involved in performing semantic analysis using pre-
trained word vectors (Word2Vec) in Python, demonstrating how to find similar words for each
word in the provided sentences based on the loaded word vectors.
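In addition to most_similar, the loaded vectors expose a direct similarity score between two words. A minimal sketch, assuming word_vectors has been loaded as in the program below; the word pair is illustrative.

# Minimal sketch: cosine similarity between two in-vocabulary words (illustrative word pair)
score = word_vectors.similarity("language", "speech")
print(f"similarity('language', 'speech') = {score:.3f}")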


ALGORITHM:
1. Install Necessary Libraries: Install Gensim and NLTK libraries (!pip install gensim,
!pip install nltk).
2. Import Libraries: Import required libraries: gensim for word vectors and nltk for
tokenization.
3. Download Pre-trained Word Vectors: Download pre-trained word vectors (Word2Vec)
using Gensim's api.load() method.
4. Define Sample Sentences: Create sample sentences for semantic analysis.
5. Tokenization: Tokenize the sentences into words using NLTK's word_tokenize()
method.
6. Semantic Analysis with Word Vectors: Iterate through each tokenized sentence.
For each word in the sentence:
Check if the word exists in the pre-trained Word2Vec model.
If the word exists:
Find words similar to the current word using word_vectors.most_similar(word).
Display or store the similar words.
If the word doesn't exist in the model:
Print a message indicating that the word is not in the pre-trained model.

PROGRAM:
# Install necessary libraries
!pip install gensim
!pip install nltk

# Import required libraries


import gensim.downloader as api
from nltk.tokenize import word_tokenize

# Download pre-trained word vectors (Word2Vec)


word_vectors = api.load("word2vec-google-news-300")

# Sample sentences
sentences = [
    "Natural language processing is a challenging but fascinating field.",
    "Word embeddings capture semantic meanings of words in a vector space."
]

# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Perform semantic analysis using pre-trained word vectors


for tokenized_sentence in tokenized_sentences:
    for word in tokenized_sentence:
        if word in word_vectors:
            similar_words = word_vectors.most_similar(word)
            print(f"Words similar to '{word}': {similar_words}")
        else:
            print(f"'{word}' is not in the pre-trained Word2Vec model.")

OUTPUT:
Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)
Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages
(from gensim) (1.23.5)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from
gensim) (1.11.3)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages
(from gensim) (6.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk)
(8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk)
(1.3.2)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages
(from nltk) (2023.6.3)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk)
(4.66.1)
[==================================================] 100.0%
1662.8/1662.8MB downloaded
Words similar to 'natural': [('Splittorff_lacked', 0.636509358882904), ('Natural',
0.58078932762146), ('Mike_Taugher_covers', 0.577259361743927), ('manmade',
0.5276211500167847), ('shell_salted_pistachios', 0.5084421634674072), ('unnatural',
0.5030758380889893), ('naturally', 0.49992606043815613), ('Intraparty_squabbles',
0.4988228678703308), ('Burt_Bees_®', 0.49539363384246826), ('causes_Buxeda',
0.4935200810432434)]
Words similar to 'language': [('langauge', 0.7476695775985718), ('Language',
0.6695356369018555), ('languages', 0.6341332197189331), ('English',
0.6120712757110596), ('CMPB_Spanish', 0.6083104610443115), ('nonnative_speakers',
0.6063109636306763), ('idiomatic_expressions', 0.5889801979064941), ('verb_tenses',
0.58415687084198), ('Kumeyaay_Diegueno', 0.5798824429512024), ('dialect',
0.5724600553512573)]
Words similar to 'processing': [('Processing', 0.7285515666007996), ('processed',
0.6519132852554321), ('processor', 0.636760413646698), ('warden_Dominick_DeRose',
0.6166526675224304), ('processors', 0.5953895449638367),
('Discoverer_Enterprise_resumed', 0.5376213192939758), ('LSI_Tarari',
0.520267903804779), ('processer', 0.5166687369346619), ('remittance_processing',
0.5144169926643372), ('Farmland_Foods_pork', 0.5071728825569153)]
Words similar to 'is': [('was', 0.6549733281135559), ("isn'ta", 0.6439523100852966), ('seems',
0.634029746055603), ('Is', 0.6085968613624573), ('becomes', 0.5841935276985168),
('appears', 0.5822900533676147), ('remains', 0.5796942114830017), ('іѕ',
0.5695518255233765), ('makes', 0.5567088723182678), ('isn_`_t', 0.5513144135475159)]
'a' is not in the pre-trained Word2Vec model.
Words similar to 'challenging': [('difficult', 0.6388775110244751), ('challenge',
0.5953003764152527), ('daunting', 0.569800615310669), ('tough', 0.5689979791641235),
('challenges', 0.5471934676170349), ('challenged', 0.5449535846710205), ('Challenging',
0.5242965817451477), ('tricky', 0.5236554741859436), ('toughest', 0.5169045329093933),
('diffi_cult', 0.5010539889335632)]
Words similar to 'but': [('although', 0.8104525804519653), ('though', 0.7285684943199158),
('because', 0.7225914597511292), ('so', 0.6865807771682739), ('But', 0.6826984882354736),
('Although', 0.6188263297080994), ('Though', 0.6153667569160461), ('Unfortunately',
0.6031029224395752), ('Of_course', 0.593142032623291), ('anyway', 0.5869061350822449)]
Words similar to 'fascinating': [('interesting', 0.7623067498207092), ('intriguing',
0.7245113253593445), ('enlightening', 0.6644250154495239), ('captivating',
0.6459898352622986), ('facinating', 0.6416683793067932), ('riveting',
0.6324825286865234), ('instructive', 0.6210989356040955), ('endlessly_fascinating',
0.6188612580299377), ('revelatory', 0.6170244216918945), ('engrossing',
0.6126049160957336)]
Words similar to 'field': [('fields', 0.5582526326179504), ('fi_eld', 0.5188260078430176),
('Keith_Toogood', 0.49749255180358887), ('Mackenzie_Hoambrecker',
0.49514278769493103), ('Josh_Arauco_kicked', 0.48817265033721924), ('Nick_Cattoi',
0.4863145053386688), ('Armando_Cuko', 0.4853871166706085), ('Jon_Striefsky',
0.48322004079818726), ('kicker_Nico_Grasu', 0.47572532296180725),
('Chris_Manfredini_kicked', 0.47327715158462524)]


'.' is not in the pre-trained Word2Vec model.
Words similar to 'word': [('phrase', 0.6777030825614929), ('words', 0.5864380598068237),
('verb', 0.5517287254333496), ('Word', 0.54575115442276), ('adjective',
0.5290762186050415), ('cuss_word', 0.5272089242935181), ('colloquialism',
0.5160348415374756), ('noun', 0.5129537582397461), ('astrology_#/##/##',
0.5039082765579224), ('synonym', 0.49379870295524597)]
'embeddings' is not in the pre-trained Word2Vec model.
Words similar to 'capture': [('capturing', 0.7563897371292114), ('captured',
0.7155306935310364), ('captures', 0.6099075078964233), ('Capturing',
0.6023245453834534), ('recapture', 0.5498639941215515), ('Capture', 0.5493018627166748),
('nab', 0.4941576421260834), ('Captured', 0.45745959877967834), ('apprehend',
0.4357919692993164), ('seize', 0.4338296055793762)]
Words similar to 'semantic': [('semantics', 0.6644964814186096), ('Semantic',
0.6464474201202393), ('contextual', 0.5909127593040466), ('meta', 0.5905876755714417),
('ontology', 0.5880525708198547), ('Semantic_Web', 0.5612248778343201), ('semantically',
0.5600483417510986), ('microformat', 0.5582399368286133), ('inferencing',
0.5541478991508484), ('terminological', 0.5533202290534973)]
Words similar to 'meanings': [('grammatical_constructions', 0.594986081123352), ('idioms',
0.5938195586204529), ('connotations', 0.5836683511734009), ('symbolic_meanings',
0.5806494951248169), ('meaning', 0.5785343647003174), ('literal_meanings',
0.5743482112884521), ('denotative', 0.5730364918708801), ('phrasal_verbs',
0.5697917342185974), ('contexts', 0.5609514713287354), ('adjectives_adverbs',
0.5569407343864441)]
'of' is not in the pre-trained Word2Vec model.
Words similar to 'words': [('phrases', 0.7100036144256592), ('phrase', 0.6408688426017761),
('Words', 0.6160537600517273), ('word', 0.5864380598068237), ('adjectives',
0.5812757015228271), ('uttered', 0.5724518299102783), ('plate_umpire_Tony_Randozzo',
0.5642045140266418), ('expletives', 0.5539036989212036), ('Mayor_Cirilo_Pena',
0.553884744644165), ('Tele_prompter', 0.5441114902496338)]
Words similar to 'in': [('inthe', 0.5891957879066467), ('where', 0.5662435293197632), ('the',
0.5429296493530273), ('In', 0.5415117144584656), ('during', 0.5188906192779541), ('iin',
0.48737412691116333), ('at', 0.484235554933548), ('from', 0.48268404603004456),
('outside', 0.47092658281326294), ('for', 0.4566476047039032)]
'a' is not in the pre-trained Word2Vec model.
Words similar to 'vector': [('vectors', 0.750322163105011), ('adeno_associated_viral_AAV',
0.5999537110328674), ('bitmap_graphics', 0.5428463220596313), ('Sindbis',
0.5353653430938721), ('bitmap_images', 0.5318013429641724), ('signal_analyzer_VSA',
0.5276671051979065), ('analyzer_VNA', 0.5184376239776611), ('vectorial',
0.5084835886955261), ('nonviral_gene_therapy', 0.5036363005638123), ('shellcode',
0.5015827417373657)]


Words similar to 'space': [('spaces', 0.6570690870285034), ('music_concept_ShockHound',
0.5850345492362976), ('Shuttle_docks', 0.5566749572753906), ('Space',
0.5478203296661377), ('Soviet_Union_Yuri_Gagarin', 0.5417766571044922),
('Shuttle_Discovery_blasts', 0.5352603197097778), ('Shuttle_Discovery_docks',
0.534925103187561), ('Shuttle_Endeavour_undocks', 0.532420814037323),
('Shuttle_Discovery_arrives', 0.5323426723480225), ('Shuttle_undocks',
0.523307740688324)]
'.' is not in the pre-trained Word2Vec model.



SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB 4: CASE STUDY

AIM: Enhancing Customer Service with Semantic Analysis


Problem Statement:
A multinational e-commerce company, "E-Shop Inc.," is looking to improve its customer
service operations by leveraging advanced natural language processing techniques. They have
a vast repository of customer interactions, including emails, chat transcripts, and social media
messages. E-Shop Inc. aims to implement semantic analysis to better understand customer
queries and sentiment, ultimately enhancing the overall customer experience.
Objectives :
• Semantic Analysis: The primary objective is to perform semantic analysis on customer
queries to understand the underlying meaning and extract relevant information. By
identifying synonyms and related terms, the program aims to capture the semantic
nuances of the input text.
• Improving Customer Service: The program aims to enhance customer service
operations by providing insights into customer queries. By analyzing the semantics of
the queries, the program can help identify common issues, extract key information, and
facilitate more effective responses.
Dataset:
In the provided program, there isn't a specific dataset used for semantic analysis. Instead, the
program demonstrates a basic approach to perform semantic analysis on a set of example
customer queries. However, in a real-world scenario, the dataset used for semantic analysis
could consist of a collection of text data relevant to the domain of interest, such as customer
support tickets, product reviews, social media interactions, or any other type of textual data
where semantic analysis is applicable.
Approach:
• Tokenization: The program starts by tokenizing the input text into individual words or
tokens. Tokenization is a fundamental step in natural language processing (NLP) for
breaking down text into its constituent parts.
• Stopword Removal: Stopwords, such as "is", "the", "and", etc., are removed from the
tokens to filter out irrelevant words that do not carry much semantic meaning.
• Lemmatization: The program lemmatizes the remaining tokens to reduce them to their
base or dictionary form. Lemmatization helps in normalizing words and reducing
inflectional forms to a common base, improving the accuracy of semantic analysis.
• Synonym Generation: Using the WordNet database, the program retrieves synonyms
for each lemmatized token. WordNet is a lexical database of the English language that
provides semantic relationships between words, including synonyms, hypernyms,
hyponyms, etc.
• Output Generation: Finally, the program outputs the synonyms generated for each
customer query, providing insights into the semantic content of the queries.
Program :
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Initialize NLTK resources


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Function to perform semantic analysis


def semantic_analysis(text):
    # Tokenize text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Synonyms generation
    synonyms = set()
    for token in lemmatized_tokens:
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())

    return list(synonyms)

# Example customer queries


customer_queries = [
"I received a damaged product. Can I get a refund?",
"I'm having trouble accessing my account.",
"How can I track my order status?",
"The item I received doesn't match the description.",
"Is there a discount available for bulk orders?"
]

# Semantic analysis for each query


for query in customer_queries:
    print("Customer Query:", query)
    synonyms = semantic_analysis(query)
    print("Semantic Analysis (Synonyms):", synonyms)
    print("\n")

Output:
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Customer Query: I received a damaged product. Can I get a refund?
Semantic Analysis (Synonyms): ['refund', 'grow', 'baffle', 'pay_off', 'Cartesian_product',
'arrive', 'engender', 'standard', 'have', 'damaged', 'experience', 'develop', 'sustain', 'product',
'acquire', 'encounter', 'take_in', 'find', 'stupefy', 'bugger_off', 'draw', 'pose', 'aim', 'nonplus',
'induce', 'mother', 'stimulate', 'make', 'repayment', 'convey', 'cause', 'mathematical_product',
'get', 'damage', 'produce', 'set_out', 'merchandise', 'buzz_off', 'beat', 'meet', 'start', 'commence',
'return', 'pick_up', 'production', 'fix', 'stick', "get_under_one's_skin", 'go', 'mystify', 'take',
'perplex', 'welcome', 'vex', 'begin', 'come', 'fuck_off', 'bring', 'contract', 'capture', 'generate',
'give_back', 'incur', 'repay', 'let', 'become', 'start_out', 'gravel', 'scram', 'obtain', 'pay_back',
'amaze', 'catch', 'beget', 'get_down', 'set_about', 'invite', 'bring_forth', 'drive', 'sire', 'intersection',
'discredited', 'suffer', 'received', 'ware', 'dumbfound', 'fetch', 'father', 'arrest', 'flummox', 'puzzle',
'bewilder', 'receive']

Customer Query: I'm having trouble accessing my account.


Semantic Analysis (Synonyms): ['write_up', 'report', 'invoice', 'trouble', 'describe', 'answer_for',
'access', 'pain', 'get_at', 'distract', 'disorder', 'unhinge', 'account_statement', 'calculate',
'disoblige', 'fuss', 'bill', 'disquiet', 'inconvenience', 'incommode', 'news_report', 'disturb',
'explanation', 'problem', 'perturb', 'cark', 'account', 'accounting', 'business_relationship', 'score',
'bother', 'history', 'story', 'difficulty', 'worry', 'inconvenience_oneself', 'hassle', 'chronicle',
'discommode', 'ail', 'put_out', 'upset', 'trouble_oneself']
Customer Query: How can I track my order status?
Semantic Analysis (Synonyms): ['cover', 'rank', 'cart_track', 'purchase_order', 'pass_over',
'parliamentary_law', 'cartroad', 'status', 'get_across', 'traverse', 'prescribe', 'order', 'orderliness',
'rails', 'fiat', 'gild', 'regularise', 'runway', 'govern', 'ordination', 'give_chase', 'put', 'cut_across',
'chase', 'consecrate', 'cut', 'grade', 'social_club', 'order_of_magnitude', 'path', 'rules_of_order',
'caterpillar_tread', 'tail', 'Holy_Order', 'cross', 'data_track', 'monastic_order', 'rate', 'go_after',
'say', 'edict', 'regularize', 'Order', 'caterpillar_track', 'parliamentary_procedure', 'cut_through',
'rail', 'enjoin', 'course', 'racecourse', 'arrange', 'club', 'society', 'ordinate', 'set_up', 'rescript',
'chase_after', 'place', 'dictate', 'tell', 'range', 'decree', 'regulate', 'lodge', 'condition', 'track',
'raceway', 'ordain', 'racetrack', 'get_over', 'lead', 'guild', 'running', 'tag', 'ordering', 'position',
'trail', 'dog']

Customer Query: The item I received doesn't match the description.


Semantic Analysis (Synonyms): ['point', 'mate', 'catch', 'equalize', 'detail', 'touch', 'invite',
'welcome', 'get', 'standard', 'jibe', 'description', 'verbal_description', 'gibe', 'have', 'rival', 'equal',
'pit', 'experience', 'fit', 'check', 'item', 'mates', 'correspond', 'oppose', 'pair', 'lucifer', 'received',
'peer', 'cope_with', 'encounter', 'play_off', 'meet', 'take_in', 'friction_match', 'find', 'particular',
'equate', 'match', 'couple', 'equalise', 'pick_up', 'agree', 'incur', 'compeer', 'twin', 'tally', 'obtain',
'token', 'receive']

Customer Query: Is there a discount available for bulk orders?


Semantic Analysis (Synonyms): ['tell', 'put', 'rank', 'parliamentary_procedure',
'purchase_order', 'consecrate', 'majority', 'range', 'useable', 'decree', 'parliamentary_law',
'regulate', 'price_reduction', 'mass', 'dictate', 'social_club', 'lodge', 'usable', 'enjoin', 'discount',
'grade', 'order_of_magnitude', 'brush_off', 'rules_of_order', 'rebate', 'push_aside', 'prescribe',
'arrange', 'bulge', 'bank_discount', 'dismiss', 'club', 'ordain', 'order', 'society', 'bulk', 'ordinate',
'Holy_Order', 'brush_aside', 'orderliness', 'guild', 'set_up', 'fiat', 'gild', 'regularise',
'uncommitted', 'ordering', 'available', 'monastic_order', 'deduction', 'ordination', 'govern', 'rate',
'discount_rate', 'say', 'edict', 'regularize', 'rescript', 'disregard', 'ignore', 'volume', 'Order', 'place']

Result:
The program performs semantic analysis on each customer query by generating WordNet synonyms for its key terms, providing insights into the semantic content of the queries that can support customer service operations.
Lab 2 – Case Study

Aim: Autocomplete System for Email Composition


Problem Statement:
A software company wants to develop an intelligent autocomplete system for email composition.
The goal is to assist users in generating coherent and contextually appropriate sentences while
composing emails. The system should predict the next word or phrase based on the user's input
and the context of the email.

Objectives:

The objectives of the provided program are to implement a simple email autocomplete system using the
GPT-2 language model. The program aims to facilitate user interaction by suggesting autocompletions
based on the context provided and the user's input. Key objectives include initializing and integrating
the GPT-2 model and tokenizer from the Hugging Face Transformers library, defining a class structure
(EmailAutocompleteSystem) to encapsulate the autocomplete system, and creating a method
(generate_suggestions) to generate context-aware suggestions. The program encourages user
engagement by incorporating a user input loop, allowing continuous interaction until the user chooses
to exit. The ultimate goal is to demonstrate the practical use of a pre-trained language model for
generating relevant suggestions in the context of email composition, showcasing the capabilities of the
GPT-2 model for natural language processing tasks.

Approach:
1. Data Collection:
 Collect a diverse dataset of emails, including different writing styles, topics, and
formality levels.
 Annotate the dataset with proper context information, such as sender, recipient,
subject, and the body of the email.
2. Data Preprocessing:
 Clean and tokenize the text data.
 Handle issues like punctuation, capitalization, and special characters.
 Split the dataset into training and testing sets.
3. Model Selection:
 Choose a suitable NLP model for word generation. Options may include recurrent
neural networks (RNNs), long short-term memory networks (LSTMs), or
transformer models like GPT-3.
 Fine-tune or train the model on the email dataset to understand the specific
language patterns used in emails.
4. Context Integration:
 Design a mechanism to incorporate contextual information from the email, such
as the subject, previous sentences, and the relationship between the sender and
recipient.
 Implement a way for the model to understand the context shift within the email
body.
5. User Interface:
 Develop a user-friendly interface that integrates with popular email clients or
standalone applications.
 Allow users to enable or disable the autocomplete feature as needed.
 Provide visual cues to indicate suggested words or phrases.
6. Model Evaluation:
 Evaluate the model's performance on the test dataset using metrics like perplexity,
accuracy, and precision (a hedged perplexity sketch is given after this list).
 Gather user feedback on the effectiveness and usability of the autocomplete
system.
7. Fine-Tuning and Iteration:
 Analyze user feedback and performance metrics to identify areas for
improvement.
 Consider refining the model based on user suggestions and addressing any
limitations.
8. Deployment:
 Deploy the trained model as a service that can be accessed by the email
application.
 Ensure scalability and reliability of the autocomplete system.
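
As a concrete illustration of step 6 (Model Evaluation), the sketch below estimates the perplexity of the pre-trained GPT-2 model on one sample email sentence. It is a minimal sketch under stated assumptions: the sample text is invented for illustration, and plain GPT-2 is used in place of a model fine-tuned on the email dataset.

# Minimal perplexity sketch (assumptions: plain pre-trained GPT-2, an invented sample email)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

sample_email = "Hi team, please find the updated project proposal attached."  # illustrative text
input_ids = tokenizer.encode(sample_email, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy loss over the tokens
    outputs = model(input_ids, labels=input_ids)

perplexity = torch.exp(outputs.loss)
print(f"Perplexity on sample email: {perplexity.item():.2f}")

Lower perplexity on held-out emails indicates that the model assigns higher probability to the text; comparing this figure before and after fine-tuning is one way to track progress.
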
Potential Challenges:
 Context Understanding: Ensuring the model effectively understands and incorporates
the context of the email.
 Ambiguity Handling: Dealing with ambiguous phrases and understanding the user's
intended meaning.
 Personalization: Tailoring the system to individual writing styles and preferences.
Success Criteria:
 Improved email composition efficiency and speed.
 Positive user feedback on the accuracy and relevance of autocomplete suggestions.
 Reduction in typing errors and improved overall user experience.
By successfully developing and implementing this word generation program, the company aims
to enhance the productivity and user experience of individuals engaged in email communication.

Program :
!pip install transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class EmailAutocompleteSystem:
    def __init__(self):
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context):
        input_text = f"{context} {user_input}"
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")

        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1,
                                         no_repeat_ngram_size=2)

        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)

        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

# Example usage
if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()

    # Assume user is composing an email with some context
    email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"

    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")

        if user_input.lower() == 'exit':
            break

        suggestions = autocomplete_system.generate_suggestions(user_input, email_context)

        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")

Output:
Enter your sentence (type 'exit' to end): hello, how are you ? How's
everything going on !
The attention mask and the pad token id were not set. As a consequence, you
may observe unexpected behavior. Please pass your input's `attention_mask` to
obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Autocomplete Suggestions: ["How's", 'everything', 'going', 'on!', "I'm", 'a',
'programmer', 'and', "I've", 'been', 'working', 'on', 'a', 'project', 'for',
'a', 'while', 'now.', 'I', 'have', 'a', 'lot', 'of', 'ideas', 'for', 'the']
Enter your sentence (type 'exit' to end): exit

Result:
The result demonstrates the integration of a powerful language model for enhancing user experience in
composing emails through intelligent autocomplete suggestions.

#1
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample text for analysis
text = "Natural Language Processing is a fascinating field of study."
# Process the text with spaCy
doc = nlp(text)
# Extracting tokens and lemmatization
tokens = [token.text for token in doc]
lemmas = [token.lemma_ for token in doc]
print("Tokens:", tokens)
print("Lemmas:", lemmas)
# Dependency parsing
print("\nDependency Parsing:")
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

Tokens: ['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']
Lemmas: ['Natural', 'Language', 'Processing', 'be', 'a', 'fascinating', 'field', 'of', 'study', '.']

Dependency Parsing:
Natural compound Language PROPN []
Language compound Processing PROPN [Natural]
Processing nsubj is AUX [Language]
is ROOT is AUX [Processing, field, .]
a det field NOUN []
fascinating amod field NOUN []
field attr is AUX [a, fascinating, of]
of prep field NOUN [study]
study pobj of ADP []
. punct is AUX []

#1 case study
import spacy
# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")
# Sample customer feedback data
customer_feedback = [
"The product is amazing! I love the quality.",
"The customer service was terrible, very disappointed.",
"Great experience overall, highly recommended.",
"The delivery was late, very frustrating."
]
def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)
        tokens = [token.text for token in doc]
        lemmas = [token.lemma_ for token in doc]
        print("Tokens:", tokens)
        print("Lemmas:", lemmas)
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])

if __name__ == "__main__":
    analyze_feedback(customer_feedback)

Analyzing Feedback 1: 'The product is amazing! I love the quality.'

Analyzing Feedback 2: 'The customer service was terrible, very disappointed.'

Analyzing Feedback 3: 'Great experience overall, highly recommended.'

Analyzing Feedback 4: 'The delivery was late, very frustrating.'


Tokens: ['The', 'delivery', 'was', 'late', ',', 'very', 'frustrating', '.']
Lemmas: ['the', 'delivery', 'be', 'late', ',', 'very', 'frustrating', '.']

Dependency Parsing:
The det delivery NOUN []
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ []
, punct frustrating ADJ []
very advmod frustrating ADJ []
frustrating acomp was AUX [late, ,, very]
. punct was AUX []

#2
import nltk
import random

nltk.download('punkt')
nltk.download('gutenberg')

words = nltk.corpus.gutenberg.words()

bigrams = list(nltk.bigrams(words))

starting_word = "the"
generated_text = [starting_word]

for _ in range(20):
    possible_words = [word2 for (word1, word2) in bigrams if word1.lower() == generated_text[-1].lower()]
    next_word = random.choice(possible_words)
    generated_text.append(next_word)

print(' '.join(generated_text))

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data] Unzipping corpora/gutenberg.zip.
the mast head and the son , " If you can afford it doesn ' s eye spare them more step

#2 Case study
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
class EmailAutocompleteSystem:
    def __init__(self):
        self.model_name = "gpt2"
        self.tokenizer = GPT2Tokenizer.from_pretrained(self.model_name)
        self.model = GPT2LMHeadModel.from_pretrained(self.model_name)

    def generate_suggestions(self, user_input, context):
        input_text = f"{context} {user_input}"
        input_ids = self.tokenizer.encode(input_text, return_tensors="pt")
        with torch.no_grad():
            output = self.model.generate(input_ids, max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)
        generated_text = self.tokenizer.decode(output[0], skip_special_tokens=True)
        suggestions = generated_text.split()[len(user_input.split()):]
        return suggestions

if __name__ == "__main__":
    autocomplete_system = EmailAutocompleteSystem()
    email_context = "Subject: Discussing Project Proposal\nHi [Recipient],"
    while True:
        user_input = input("Enter your sentence (type 'exit' to end): ")
        if user_input.lower() == 'exit':
            break
        suggestions = autocomplete_system.generate_suggestions(user_input, email_context)
        if suggestions:
            print("Autocomplete Suggestions:", suggestions)
        else:
            print("No suggestions available.")

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarni
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public mo
warnings.warn(
tokenizer_config.json: 100% 26.0/26.0 [00:00<00:00, 636B/s]

vocab.json: 100% 1.04M/1.04M [00:00<00:00, 6.15MB/s]

merges.txt: 100% 456k/456k [00:00<00:00, 2.20MB/s]

tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 8.90MB/s]

config.json: 100% 665/665 [00:00<00:00, 12.9kB/s]

model.safetensors: 100% 548M/548M [00:09<00:00, 41.4MB/s]

generation_config.json: 100% 124/124 [00:00<00:00, 588B/s]

#3
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset
categories = ['sci.med', 'sci.space', 'comp.graphics', 'talk.politics.mideast']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# Split the data into training and testing sets
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
# Create a pipeline with TF-IDF vectorizer and LinearSVC classifier
model = make_pipeline(
TfidfVectorizer(),
LinearSVC()
)
# Train the model
model.fit(X_train, y_train)
# Predict labels for the test set
predictions = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions))

Accuracy: 0.9504823151125402

Classification Report:
precision recall f1-score support

0 0.89 0.97 0.93 389


1 0.96 0.91 0.94 396
2 0.98 0.94 0.96 394
3 0.98 0.98 0.98 376

accuracy 0.95 1555


macro avg 0.95 0.95 0.95 1555
weighted avg 0.95 0.95 0.95 1555

#3 Case study
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
# Load the 20 Newsgroups dataset as a proxy for customer support emails
newsgroups = fetch_20newsgroups(subset='all', categories=['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'rec.autos', 'rec.motorcycles', 'sci.electronics'])
# Prepare data and target labels
X = newsgroups.data
y = newsgroups.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=10000)
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Train the LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)
# Predict labels for the test set
predictions = classifier.predict(X_test)
# Evaluate the classifier
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("\nClassification Report:")
print(classification_report(y_test, predictions, target_names=newsgroups.target_names))

Accuracy: 0.9389623601220752

Classification Report:
precision recall f1-score support

comp.sys.ibm.pc.hardware 0.92 0.91 0.91 212


comp.sys.mac.hardware 0.94 0.93 0.94 198
rec.autos 0.97 0.93 0.95 179

rec.motorcycles 0.96 0.99 0.97 205
sci.electronics 0.92 0.93 0.92 189

accuracy 0.94 983


macro avg 0.94 0.94 0.94 983
weighted avg 0.94 0.94 0.94 983

#4
# Install necessary libraries
!pip install gensim
!pip install nltk
# Import required libraries
import gensim.downloader as api
from nltk.tokenize import word_tokenize
# Download pre-trained word vectors (Word2Vec)
word_vectors = api.load("word2vec-google-news-300")
# Sample sentences
sentences = [
"Natural language processing is a challenging but fascinating field.",
"Word embeddings capture semantic meanings of words in a vector space."
]
# Tokenize sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
# Perform semantic analysis using pre-trained word vectors
for tokenized_sentence in tokenized_sentences:
    for word in tokenized_sentence:
        if word in word_vectors:
            similar_words = word_vectors.most_similar(word)
            print(f"Words similar to '{word}': {similar_words}")
        else:
            print(f"'{word}' is not in the pre-trained Word2Vec model.")

Requirement already satisfied: gensim in /usr/local/lib/python3.10/dist-packages (4.3.2)


Requirement already satisfied: numpy>=1.18.5 in /usr/local/lib/python3.10/dist-packages (from gensim) (1.25.2)
Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from gensim) (1.11.4)
Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.10/dist-packages (from gensim) (6.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
Words similar to 'natural': [('Splittorff_lacked', 0.636509358882904), ('Natural', 0.58078932762146), ('Mike_Taugher_covers', 0.5772
Words similar to 'language': [('langauge', 0.7476695775985718), ('Language', 0.6695356369018555), ('languages', 0.6341332197189331),
Words similar to 'processing': [('Processing', 0.7285515666007996), ('processed', 0.6519132852554321), ('processor', 0.6367604136466
Words similar to 'is': [('was', 0.6549733281135559), ("isn'ta", 0.6439523100852966), ('seems', 0.634029746055603), ('Is', 0.60859686
'a' is not in the pre-trained Word2Vec model.
Words similar to 'challenging': [('difficult', 0.6388775110244751), ('challenge', 0.5953003764152527), ('daunting', 0.56980061531066
Words similar to 'but': [('although', 0.8104525804519653), ('though', 0.7285684943199158), ('because', 0.7225914597511292), ('so', 0
Words similar to 'fascinating': [('interesting', 0.7623067498207092), ('intriguing', 0.7245113253593445), ('enlightening', 0.6644250
Words similar to 'field': [('fields', 0.5582526326179504), ('fi_eld', 0.5188260078430176), ('Keith_Toogood', 0.49749255180358887),
'.' is not in the pre-trained Word2Vec model.
Words similar to 'word': [('phrase', 0.6777030825614929), ('words', 0.5864380598068237), ('verb', 0.5517287254333496), ('Word', 0.54
'embeddings' is not in the pre-trained Word2Vec model.
Words similar to 'capture': [('capturing', 0.7563897371292114), ('captured', 0.7155306935310364), ('captures', 0.6099075078964233),
Words similar to 'semantic': [('semantics', 0.6644964814186096), ('Semantic', 0.6464474201202393), ('contextual', 0.5909127593040466
Words similar to 'meanings': [('grammatical_constructions', 0.594986081123352), ('idioms', 0.5938195586204529), ('connotations', 0.5
'of' is not in the pre-trained Word2Vec model.
Words similar to 'words': [('phrases', 0.7100036144256592), ('phrase', 0.6408688426017761), ('Words', 0.6160537600517273), ('word',
Words similar to 'in': [('inthe', 0.5891957879066467), ('where', 0.5662435293197632), ('the', 0.5429296493530273), ('In', 0.54151171
'a' is not in the pre-trained Word2Vec model.
Words similar to 'vector': [('vectors', 0.750322163105011), ('adeno_associated_viral_AAV', 0.5999537110328674), ('bitmap_graphics',
Words similar to 'space': [('spaces', 0.6570690870285034), ('music_concept_ShockHound', 0.5850345492362976), ('Shuttle_docks', 0.556
'.' is not in the pre-trained Word2Vec model.

#4 case study
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Initialize NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# Function to perform semantic analysis
def semantic_analysis(text):
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    synonyms = set()
    for token in lemmatized_tokens:
        for syn in wordnet.synsets(token):
            for lemma in syn.lemmas():
                synonyms.add(lemma.name())
    return list(synonyms)
# Example customer queries
customer_queries = [
"I received a damaged product. Can I get a refund?",
"I'm having trouble accessing my account.",
"How can I track my order status?",
"The item I received doesn't match the description.",
"Is there a discount available for bulk orders?"
]
# Semantic analysis for each query
for query in customer_queries:
    print("Customer Query:", query)
    synonyms = semantic_analysis(query)
    print("Semantic Analysis (Synonyms):", synonyms)
    print("\n")

[nltk_data] Downloading package punkt to /root/nltk_data...


[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
Customer Query: I received a damaged product. Can I get a refund?
Semantic Analysis (Synonyms): ['generate', 'fuck_off', 'damaged', 'bewilder', 'get', 'stick', 'draw', 'pick_up', 'take_in', 'develop

Customer Query: I'm having trouble accessing my account.


Semantic Analysis (Synonyms): ['disturb', 'news_report', 'calculate', 'answer_for', 'accounting', 'invoice', 'account_statement', 'b

Customer Query: How can I track my order status?


Semantic Analysis (Synonyms): ['racetrack', 'position', 'parliamentary_procedure', 'society', 'gild', 'tell', 'order_of_magnitude',

Customer Query: The item I received doesn't match the description.


Semantic Analysis (Synonyms): ['oppose', 'invite', 'received', 'couple', 'agree', 'check', 'receive', 'get', 'friction_match', 'meet

Customer Query: Is there a discount available for bulk orders?


Semantic Analysis (Synonyms): ['orderliness', 'dictate', 'ordination', 'edict', 'bank_discount', 'parliamentary_procedure', 'set_up

#5
# Install necessary libraries
!pip install scikit-learn
!pip install nltk
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from nltk.corpus import movie_reviews # Sample dataset from NLTK
# Download NLTK resources (run only once if not downloaded)
import nltk
nltk.download('movie_reviews')
# Load the movie_reviews dataset
documents = [(list(movie_reviews.words(fileid)), category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)]

# Convert data to DataFrame


df = pd.DataFrame(documents, columns=['text', 'sentiment'])
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['sentiment'], test_size=0.2,
random_state=42)
# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()
# Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train.apply(' '.join))
# Initialize SVM classifier
svm_classifier = SVC(kernel='linear')
# Train the classifier
svm_classifier.fit(X_train_tfidf, y_train)
# Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(X_test.apply(' '.join))
# Predict on the test data
y_pred = svm_classifier.predict(X_test_tfidf)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
# Display classification report
print(classification_report(y_test, y_pred))

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)


Requirement already satisfied: numpy>=1.17.3 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.25.2)
Requirement already satisfied: scipy>=1.3.2 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)
Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.4.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.4.0)
Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)
Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data] Unzipping corpora/movie_reviews.zip.
Accuracy: 0.84
precision recall f1-score support

neg 0.83 0.85 0.84 199


pos 0.85 0.82 0.84 201

accuracy 0.84 400


macro avg 0.84 0.84 0.84 400
weighted avg 0.84 0.84 0.84 400

#5 case study
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Download NLTK resources (only required once)
nltk.download('vader_lexicon')
# Sample reviews
reviews = [
"This product is amazing! I love it.",
"The product was good, but the packaging was damaged.",
"Very disappointing experience. Would not recommend.",
"Neutral feedback on the product.",
]
# Initialize Sentiment Intensity Analyzer
sid = SentimentIntensityAnalyzer()
# Analyze sentiment for each review
for review in reviews:
    print("Review:", review)
    scores = sid.polarity_scores(review)
    print("Sentiment:", end=' ')
    if scores['compound'] > 0.05:
        print("Positive")
    elif scores['compound'] < -0.05:
        print("Negative")
    else:
        print("Neutral")
    print()

Review: This product is amazing! I love it.


Sentiment: Positive
Review: The product was good, but the packaging was damaged.
Sentiment: Negative
Review: Very disappointing experience. Would not recommend.
Sentiment: Negative
Review: Neutral feedback on the product.
Sentiment: Neutral

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...

#6
# Install NLTK (if not already installed)
!pip install nltk
# Import necessary libraries
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text for POS tagging
text = "Parts of speech tagging helps to understand the function of each word in a sentence."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform POS tagging
pos_tags = nltk.pos_tag(tokens)
# Display the POS tags
print("POS tags:", pos_tags)

Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)


Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
POS tags: [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('tagging', 'VBG'), ('helps', 'NNS'), ('to', 'TO'), ('understand', 'VB

#6 Case study
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download NLTK resources (if not already downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
def pos_tagging(text):
    sentences = sent_tokenize(text)
    tagged_tokens = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged_tokens.extend(nltk.pos_tag(tokens))
    return tagged_tokens

def main():
    article_text = """Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.
"""
    tagged_tokens = pos_tagging(article_text)
    print("Original Article Text:\n", article_text)
    print("\nParts of Speech Tagging:")
    for token, pos_tag in tagged_tokens:
        print(f"{token}: {pos_tag}")

if __name__ == "__main__":
    main()

Original Article Text:


Manchester United secured a 3-1 victory over Chelsea in yesterday's match.
Goals from Rashford, Greenwood, and Fernandes sealed the win for United.
Chelsea's only goal came from Pulisic in the first half.
The victory boosts United's chances in the Premier League title race.

Parts of Speech Tagging:


Manchester: NNP
United: NNP
secured: VBD
a: DT
3-1: JJ
victory: NN
over: IN
Chelsea: NNP
in: IN
yesterday: NN
's: POS
match: NN
.: .
Goals: NNS
from: IN
Rashford: NNP
,: ,
Greenwood: NNP
,: ,
and: CC
Fernandes: NNP
sealed: VBD
the: DT
win: NN
for: IN
United: NNP
.: .
Chelsea: NN
's: POS
only: JJ
goal: NN
came: VBD
from: IN
Pulisic: NNP
in: IN
the: DT
first: JJ
half: NN
.: .
The: DT
victory: NN
boosts: VBZ
United: NNP
's: POS
chances: NNS
in: IN
the: DT
Premier: NNP
League: NNP
title: NN

#7
!pip install nltk
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
# Download NLTK resources (run only once if not downloaded)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# POS tagging
tagged = pos_tag(tokens)
# Define a chunk grammar using regular expressions
# NP (noun phrase) chunking: "NP: {<DT>?<JJ>*<NN>}"
# This grammar captures optional determiner (DT), adjectives (JJ), and nouns (NN) as a noun phrase
chunk_grammar = r"""
NP: {<DT>?<JJ>*<NN>}
"""
# Create a chunk parser with the defined grammar
chunk_parser = RegexpParser(chunk_grammar)
# Parse the tagged sentence to extract chunks
chunks = chunk_parser.parse(tagged)
# Display the chunks
for subtree in chunks.subtrees():
    if subtree.label() == 'NP':
        print(subtree)

Requirement already satisfied: nltk in /usr/local/lib/python3.10/dist-packages (3.8.1)


Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from nltk) (8.1.7)
Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from nltk) (1.4.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.10/dist-packages (from nltk) (2023.12.25)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from nltk) (4.66.2)
(NP The/DT quick/JJ brown/NN)
(NP fox/NN)

SATHYABAMA INSTITUTE OF SCIENCE & TECHNOLOGY
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SCSA 2604 NATURAL LANGUAGE PROCESSING LAB

LAB – 1 : CASE STUDY

AIM: To Enhance Customer Feedback Analysis through NLP-based Text Processing

PROBLEM STATEMENT:

A company receives a large volume of customer feedback across various channels such as
emails, social media, and surveys. Understanding and categorizing this feedback manually is
time-consuming and inefficient. The goal is to develop an NLP-based program to automatically
process and analyze customer feedback to extract valuable insights.

OBJECTIVE:

Utilize spaCy and NLP techniques to process customer feedback text, extract tokens, perform
lemmatization, and conduct dependency parsing to uncover underlying relationships between
words.

APPROACH:

Data Collection:
Gather a dataset containing customer feedback from different sources, including emails, social
media comments, and survey responses.

Text Processing with spaCy:


Utilize spaCy to process the customer feedback text. Extract tokens to identify individual words
and perform lemmatization to obtain their base forms.


Dependency Parsing Analysis:
Use spaCy's dependency parsing feature to identify the syntactic relationships between words.
Analyze the dependency tree to understand how different parts of the feedback sentences are
connected.

Insight Generation:
Categorize the feedback based on sentiment, identify frequently occurring topics, or extract
key phrases related to specific issues or praises mentioned by customers.
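
One possible way to support this step is sketched below: it counts the most frequent noun-chunk heads across the feedback set as rough "topics". This is a minimal sketch, assuming spaCy's noun chunks are an acceptable proxy for topics; the sample texts are taken from the feedback list used in the program below.

# Minimal sketch (assumption: noun-chunk head lemmas serve as rough topics)
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

feedback = [
    "The product is amazing! I love the quality.",
    "The customer service was terrible, very disappointed.",
    "The delivery was late, very frustrating.",
]

topic_counts = Counter()
for doc in nlp.pipe(feedback):
    # The head (root) lemma of each noun chunk, e.g. "product", "service", "delivery"
    topic_counts.update(chunk.root.lemma_.lower() for chunk in doc.noun_chunks)

print(topic_counts.most_common(5))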

Implementation:

Use Python and spaCy to develop the program for text processing and analysis.
Incorporate visualization techniques (e.g., graphs, word clouds) to represent the findings and
insights derived from the processed feedback.
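
For the word-cloud idea mentioned above, a minimal sketch is shown below. It assumes the third-party wordcloud and matplotlib packages (not used elsewhere in this lab) and illustrative feedback text.

# Minimal sketch (assumption: wordcloud and matplotlib are installed)
from wordcloud import WordCloud
import matplotlib.pyplot as plt

feedback_text = " ".join([
    "The product is amazing! I love the quality.",
    "The delivery was late, very frustrating.",
])

# Build and display a word cloud of the combined feedback text
cloud = WordCloud(width=600, height=400, background_color="white").generate(feedback_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
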
Evaluation:

Evaluate the accuracy and efficiency of tokenization, lemmatization, and dependency parsing
in handling different types of customer feedback.
Measure the program's ability to extract meaningful insights and categorize feedback
accurately.
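
A rough efficiency check can time spaCy's pipeline over a batch of feedback texts; the sketch below is illustrative only and repeats one sample sentence to form the batch.

# Minimal sketch (assumption: a repeated sample sentence stands in for a real feedback batch)
import time
import spacy

nlp = spacy.load("en_core_web_sm")
sample_feedback = ["The product is amazing! I love the quality."] * 100

start = time.perf_counter()
docs = list(nlp.pipe(sample_feedback))
elapsed = time.perf_counter() - start

print(f"Processed {len(docs)} feedback items in {elapsed:.3f} s "
      f"({len(docs) / elapsed:.1f} items/s)")
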
PROGRAM:
import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors


nlp = spacy.load("en_core_web_sm")

# Sample customer feedback data


customer_feedback = [
    "The product is amazing! I love the quality.",
    "The customer service was terrible, very disappointed.",
    "Great experience overall, highly recommended.",
    "The delivery was late, very frustrating."
]

def analyze_feedback(feedback):
    for idx, text in enumerate(feedback, start=1):
        print(f"\nAnalyzing Feedback {idx}: '{text}'")
        doc = nlp(text)

        # Extract tokens and lemmatization
        tokens = [token.text for token in doc]
        lemmas = [token.lemma_ for token in doc]
        print("Tokens:", tokens)
        print("Lemmas:", lemmas)

        # Dependency parsing
        print("\nDependency Parsing:")
        for token in doc:
            print(token.text, token.dep_, token.head.text, token.head.pos_,
                  [child for child in token.children])

if __name__ == "__main__":
    analyze_feedback(customer_feedback)

OUTPUT:
Analyzing Feedback 1: 'The product is amazing! I love the quality.'
Tokens: ['The', 'product', 'is', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']
Lemmas: ['the', 'product', 'be', 'amazing', '!', 'I', 'love', 'the', 'quality', '.']

Dependency Parsing:
The det product NOUN []
product nsubj is AUX [The]
is ROOT is AUX [product, amazing, !]
amazing acomp is AUX []
! punct is AUX []
I nsubj love VERB []
love ROOT love VERB [I, quality, .]
the det quality NOUN []
quality dobj love VERB [the]
. punct love VERB []

Analyzing Feedback 2: 'The customer service was terrible, very disappointed.'


Tokens: ['The', 'customer', 'service', 'was', 'terrible', ',', 'very', 'disappointed', '.']
Lemmas: ['the', 'customer', 'service', 'be', 'terrible', ',', 'very', 'disappointed', '.']

Dependency Parsing:
The det service NOUN []
customer compound service NOUN []
service nsubj was AUX [The, customer]
was ROOT was AUX [service, disappointed, .]
terrible amod disappointed ADJ []
, punct disappointed ADJ []
very advmod disappointed ADJ []
disappointed acomp was AUX [terrible, ,, very]
. punct was AUX []

Analyzing Feedback 3: 'Great experience overall, highly recommended.'


Tokens: ['Great', 'experience', 'overall', ',', 'highly', 'recommended', '.']
Lemmas: ['great', 'experience', 'overall', ',', 'highly', 'recommend', '.']

Dependency Parsing:
Great amod experience NOUN []
experience nsubj recommended VERB [Great]
overall advmod recommended VERB []
, punct recommended VERB []
highly advmod recommended VERB []
recommended ROOT recommended VERB [experience, overall, ,, highly, .]
. punct recommended VERB []

Analyzing Feedback 4: 'The delivery was late, very frustrating.'


Tokens: ['The', 'delivery', 'was', 'late', ',', 'very', 'frustrating', '.']
Lemmas: ['the', 'delivery', 'be', 'late', ',', 'very', 'frustrating', '.']

Dependency Parsing:
The det delivery NOUN [ ]
delivery nsubj was AUX [The]
was ROOT was AUX [delivery, frustrating, .]
late advmod frustrating ADJ [ ]
, punct frustrating ADJ [ ]
very advmod frustrating ADJ [ ]
frustrating acomp was AUX [late, ,, very]
. punct was AUX [ ]

CONCLUSION:
The developed NLP-based program utilizing spaCy proves to be an efficient solution for
processing and analyzing customer feedback. Its capability to extract tokens, perform
lemmatization, and conduct dependency parsing aids in understanding the sentiment,
identifying key topics, and establishing relationships within the feedback data. This enables
companies to derive actionable insights, prioritize issues, and enhance customer satisfaction
based on the analysis of their feedback.

RESULT:
This case study demonstrates the practical application of the provided code snippet using
spaCy in a business context, specifically for customer feedback analysis, showcasing how
NLP techniques can be employed to extract valuable insights from unstructured text data.
