
6_Text Vectorization-CSC688-SP22

Pauline Maouad, PhD


April 26, 2022

CSC688/498 - Natural Language Processing

Lebanese American University - SP22

(c) Pauline Maouad, PhD


[ ]: import nltk
# from nltk.book import *
from nltk.stem.porter import *
from nltk.stem import *
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import RegexpTokenizer #regular expression tokenizer
from nltk import FreqDist

from nltk.corpus import gutenberg as g


from nltk.corpus import brown
from nltk.corpus import nps_chat

import re # regular expression

import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer


from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt

import random
import string # to process standard python strings
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer #for Bag of Words
from sklearn.feature_extraction.text import TfidfVectorizer # for Term Frequency-Inverse Document Frequency

[ ]: %pwd  # print the current working directory


[ ]: from bs4 import BeautifulSoup


import urllib.request

Beautiful Soup presents the same interface to a number of different parsers, but each parser is different, and different parsers will create different parse trees from the same document. The biggest differences are between the HTML parsers and the XML parsers.
There are also differences between HTML parsers. If you give Beautiful Soup a perfectly formed HTML document, these differences won't matter much: one parser will be faster than another, but they'll all give you a data structure that looks exactly like the original HTML document. If the document is not perfectly formed, however, different parsers will give different results.
There are many libraries for making HTTP requests in Python, such as httplib, urllib, httplib2, and treq, but requests is one of the best, with a clean API and useful features.
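As a minimal sketch of that last point about parsers (assuming lxml is installed alongside Python's built-in html.parser), the same malformed fragment is repaired differently depending on the parser:

[ ]: from bs4 import BeautifulSoup

broken_html = "<a></p>"  # malformed: an unclosed <a> followed by a stray </p>

# The built-in parser keeps only what it can make sense of.
print(BeautifulSoup(broken_html, 'html.parser'))  # e.g. <a></a>

# lxml drops the stray </p> and wraps the result in <html>/<body> tags.
print(BeautifulSoup(broken_html, 'lxml'))         # e.g. <html><body><a></a></body></html>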

[ ]: import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text

soup = BeautifulSoup(website_url, 'lxml') # requires that you have lxml installed

# soup

If you carefully inspect the HTML, the table contents, i.e. the names of the countries we intend to extract, sit under the class wikitable sortable.
First we'll find the table with class 'wikitable sortable' in the HTML.
Inside that table, each link carries a country name as its title attribute.

[ ]: my_table = soup.find('table',{'class':'wikitable sortable'})


# my_table

[ ]: all_links = my_table.find_all('a')
# all_links

Extract from the links the titles, which are the countries' names.
To that end, create a list countries, extract each country name, and append it to the list.

[ ]: countries = []

for l in all_links:
    countries.append(l.get('title'))

print(countries)

[ ]: countries_df = pd.DataFrame(countries)
countries_df.head()

0.0.1 Retrieving another text from Wikipedia

[ ]: raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')

raw_html = raw_html.read()

soup_wiki = BeautifulSoup(raw_html, 'html.parser')

# The prettify() function in BeautifulSoup enables you to view how the tags are nested in the document.

# print(soup_wiki.prettify())

[ ]: article_paragraphs = soup_wiki.find_all('p')

[ ]: article_text = ''

for p in article_paragraphs:
    article_text += p.text

# print(article_text)

[ ]: wiki_text = soup_wiki.get_text()
# print(wiki_text)

0.1 Loading Data

[ ]: # load the dataset with pandas
data = pd.read_csv('train.tsv', sep='\t')
data

This data has 5 sentiment labels:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive
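As a quick sanity check (a minimal sketch; the label column is assumed here to be named Sentiment, which does not appear in the notebook itself), the distribution of labels can be inspected directly:

[ ]: # 'Sentiment' is an assumed column name; adjust it to match the actual file.
label_names = {0: 'negative', 1: 'somewhat negative', 2: 'neutral',
               3: 'somewhat positive', 4: 'positive'}

# Count how many phrases carry each sentiment label.
print(data['Sentiment'].map(label_names).value_counts())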
[ ]: data.head()

0.1.1 Text Feature Generation using Bag of Words (BoW)

Generate a document-term matrix using scikit-learn's CountVectorizer.

The document-term matrix can be built from single words (similar to what we saw in class; check the PowerPoint file). It can also use combinations of two or more consecutive words, called bigrams or trigrams; the general approach is called the n-gram model.
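As a small illustration of what the n-gram model produces (a sketch added here, not part of the original notebook; the sentence is made up), the analyzer that CountVectorizer builds internally can be applied to a single string:

[ ]: demo_sentence = "this movie was surprisingly good"

# build_analyzer() returns the callable CountVectorizer uses to split a document
# into n-grams; here we ask for unigrams through trigrams.
analyzer = CountVectorizer(ngram_range=(1, 3)).build_analyzer()
print(analyzer(demo_sentence))
# e.g. ['this', 'movie', ..., 'this movie', 'movie was', ..., 'this movie was', ...]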
[ ]: from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer

# tokenizer that keeps only alphanumeric tokens, removing symbols and punctuation
token = RegexpTokenizer(r'[a-zA-Z0-9]+')

# bigrams and trigrams
cv = CountVectorizer(max_features=1000, stop_words='english', ngram_range=(2, 3), tokenizer=token.tokenize)

text_counts = cv.fit_transform(data['Phrase'])
cols_names = cv.get_feature_names() # in newer scikit-learn versions, use get_feature_names_out()

[ ]: # print(text_counts)
# pd.DataFrame(cv.get_feature_names())

cols_names

[ ]: # display sparse matrix as a pandas dataframe


pd.DataFrame.sparse.from_spmatrix(text_counts[50:150])
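The slice above uses integer column indices, so the n-grams themselves are not visible. As a small follow-up (a sketch using the cols_names list computed earlier), the feature names can be attached as column labels:

[ ]: pd.DataFrame.sparse.from_spmatrix(text_counts[50:150], columns=cols_names)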

0.1.2 Text Feature Generation using TF-IDF

[ ]: from sklearn.feature_extraction.text import CountVectorizer # Converts a collection of text documents to a matrix of token counts

corpus = [
    "you were born with potential",
    "you were born with goodness and trust",
    "you were born with ideals and dreams",
    "you were born with greatness",
    "you were born with wings",
    "you are not meant for crawling, so don't",
    "you have wings",
    "learn to use them and fly",
    "-- Rumi"
]
vectorizer = CountVectorizer(ngram_range=(1, 1)) # unigrams only

X = vectorizer.fit(corpus) # Learn the vocabulary dictionary (fit alone does not return the document-term matrix; fit_transform does).

print(vectorizer.get_feature_names()) # feature names

print(vectorizer.vocabulary_) # A mapping of terms to feature indices.

[ ]: vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 3)) # bigrams and trigrams

X2 = vectorizer2.fit_transform(corpus) # Learn the vocabulary dictionary and return the document-term matrix.

print(vectorizer2.get_feature_names()) # Array mapping from feature integer indices to feature names.

wds = vectorizer2.get_feature_names()

[ ]: vect = TfidfVectorizer(ngram_range=(3, 3)) # trigrams only

tf_idf = vect.fit_transform(corpus)

terms = vect.get_feature_names()

[ ]: # print(tf_idf)
print("pandas sparse matrix display: ")
pd.DataFrame.sparse.from_spmatrix(tf_idf, index=corpus, columns=terms)
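cosine_similarity is imported at the top of the notebook but never used; as a minimal sketch of one way the TF-IDF vectors can be put to work (an illustration, not part of the original exercise), pairwise document similarity can be computed directly on the matrix:

[ ]: # Pairwise cosine similarity between the TF-IDF vectors of the corpus lines;
# lines sharing trigrams such as "you were born with" end up with nonzero similarity.
sim = cosine_similarity(tf_idf)
pd.DataFrame(sim, index=corpus, columns=corpus).round(2)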
