
6_Text Vectorization-CSC688-SP22

Pauline Maouad, PhD


April 26, 2022

CSC688/498 - Natural Language Processing

Lebanese American University - SP22

(c) Pauline Maouad, PhD


[ ]: import nltk
# from nltk.book import *
from nltk.stem.porter import *
from nltk.stem import *
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import RegexpTokenizer #regular expression tokenizer
from nltk import FreqDist

from nltk.corpus import gutenberg as g


from nltk.corpus import brown
from nltk.corpus import nps_chat

import re # regular expression

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer


from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt

import random
import string # to process standard python strings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer #for Bag of Words
from sklearn.feature_extraction.text import TfidfVectorizer # for Term Frequency-Inverse Document Frequency

[ ]: %pwd  # print the current working directory


[ ]: from bs4 import BeautifulSoup


import urllib.request

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different, and different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers.
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences won't matter much: one parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document. If the document is not perfectly formed, however, different parsers will give different results.
There are many libraries for making HTTP requests in Python, such as httplib, urllib, httplib2, and treq, but requests is one of the best, with a clean API and useful features.
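As a minimal sketch of that last point about parsers (assuming lxml is installed alongside Python's built-in html.parser), the same malformed fragment is repaired differently depending on the parser:

[ ]: from bs4 import BeautifulSoup

broken_html = "<a></p>"  # malformed: an unclosed <a> followed by a stray </p>

# The built-in parser keeps only what it can make sense of.
print(BeautifulSoup(broken_html, 'html.parser'))  # e.g. <a></a>

# lxml drops the stray </p> and wraps the result in <html>/<body> tags.
print(BeautifulSoup(broken_html, 'lxml'))         # e.g. <html><body><a></a></body></html>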

[ ]: import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text

soup = BeautifulSoup(website_url, 'lxml') # requires that you have lxml installed

# soup

If you carefully inspect the HTML, the table contents, i.e. the names of the countries we intend to extract, sit under the class wikitable sortable.
First we'll find the table with class 'wikitable sortable' in the HTML.
Inside that table, each link carries a country name as its title attribute.

[ ]: my_table = soup.find('table',{'class':'wikitable sortable'})


# my_table

[ ]: all_links = my_table.find_all('a')
# all_links

Extract from the links the titles, which are the countries' names.
To that end, create a list countries, extract each country name, and append it to the list.

[ ]: countries = []

for l in all_links:
    countries.append(l.get('title'))

print(countries)

[ ]: countries_df = pd.DataFrame(countries)
countries_df.head()

0.0.1 Retrieving another text from Wikipedia

[ ]: raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')

raw_html = raw_html.read()

soup_wiki = BeautifulSoup(raw_html, 'html.parser')

# The prettify() function in BeautifulSoup enables you to view how the tags are nested in the document.

# print(soup_wiki.prettify())

[ ]: article_paragraphs = soup_wiki.find_all('p')

[ ]: article_text = ''

for p in article_paragraphs:
    article_text += p.text

# print(article_text)

[ ]: wiki_text = soup_wiki.get_text()
# print(wiki_text)

0.1 Loading Data

[ ]: # load the dataset with pandas
data = pd.read_csv('train.tsv', sep='\t')
data

This data has 5 sentiment labels:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
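As a quick sanity check (a minimal sketch; the label column is assumed here to be named Sentiment, which does not appear in the notebook itself), the distribution of labels can be inspected directly:

[ ]: # 'Sentiment' is an assumed column name; adjust it to match the actual file.
label_names = {0: 'negative', 1: 'somewhat negative', 2: 'neutral',
               3: 'somewhat positive', 4: 'positive'}

# Count how many phrases carry each sentiment label.
print(data['Sentiment'].map(label_names).value_counts())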
[ ]: data.head()

0.1.1 Text Feature Generation using Bag of Words (BoW)

Generate a document-term matrix using scikit-learn's CountVectorizer.

The document-term matrix can be built from single words (similar to what we saw in class; check the PowerPoint file). It can also use combinations of two or more consecutive words, called bigrams or trigrams; the general approach is called the n-gram model.
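As a small illustration of what the n-gram model produces (a sketch added here, not part of the original notebook; the sentence is made up), the analyzer that CountVectorizer builds internally can be applied to a single string:

[ ]: demo_sentence = "this movie was surprisingly good"

# build_analyzer() returns the callable CountVectorizer uses to split a document
# into n-grams; here we ask for unigrams through trigrams.
analyzer = CountVectorizer(ngram_range=(1, 3)).build_analyzer()
print(analyzer(demo_sentence))
# e.g. ['this', 'movie', ..., 'this movie', 'movie was', ..., 'this movie was', ...]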
[ ]: from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

# tokenizer that keeps only alphanumeric tokens, removing symbols and punctuation
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

# bigrams and trigrams
cv = CountVectorizer(max_features=1000, stop_words='english', ngram_range=(2, 3), tokenizer=token.tokenize)

text_counts = cv.fit_transform(data['Phrase'])
cols_names = cv.get_feature_names() # in newer scikit-learn versions, use get_feature_names_out()

[ ]: # print(text_counts)
# pd.DataFrame(cv.get_feature_names())

cols_names

[ ]: # display sparse matrix as a pandas dataframe


pd.DataFrame.sparse.from_spmatrix(text_counts[50:150])
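The slice above uses integer column indices, so the n-grams themselves are not visible. As a small follow-up (a sketch using the cols_names list computed earlier), the feature names can be attached as column labels:

[ ]: pd.DataFrame.sparse.from_spmatrix(text_counts[50:150], columns=cols_names)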

0.1.2 Text Feature Generation using TF-IDF

[ ]: from sklearn.feature_extraction.text import CountVectorizer # Converts a collection of text documents to a matrix of token counts

corpus = [
    "you were born with potential",
    "you were born with goodness and trust",
    "you were born with ideals and dreams",
    "you were born with greatness",
    "you were born with wings",
    "you are not meant for crawling, so don't",
    "you have wings",
    "learn to use them and fly",
    "-- Rumi"
]
vectorizer = CountVectorizer(ngram_range=(1, 1)) # unigrams only

X = vectorizer.fit(corpus) # Learn the vocabulary dictionary (fit alone does not return the document-term matrix; fit_transform does).

print(vectorizer.get_feature_names()) # feature names

print(vectorizer.vocabulary_) # A mapping of terms to feature indices.

[ ]: vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 3)) # bigrams and trigrams

X2 = vectorizer2.fit_transform(corpus) # Learn the vocabulary dictionary and return the document-term matrix.

print(vectorizer2.get_feature_names()) # Array mapping from feature integer indices to feature names.

wds = vectorizer2.get_feature_names()

[ ]: vect = TfidfVectorizer(ngram_range=(3, 3)) # trigrams only

tf_idf = vect.fit_transform(corpus)

terms = vect.get_feature_names()

[ ]: # print(tf_idf)
print("pandas sparse matrix display: ")
pd.DataFrame.sparse.from_spmatrix(tf_idf, index=corpus, columns=terms)
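cosine_similarity is imported at the top of the notebook but never used; as a minimal sketch of one way the TF-IDF vectors can be put to work (an illustration, not part of the original exercise), pairwise document similarity can be computed directly on the matrix:

[ ]: # Pairwise cosine similarity between the TF-IDF vectors of the corpus lines;
# lines sharing trigrams such as "you were born with" end up with nonzero similarity.
sim = cosine_similarity(tf_idf)
pd.DataFrame(sim, index=corpus, columns=corpus).round(2)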
