6 - Text Vectorization-CSC688-SP22
import numpy as np
import random
import string # to process standard python strings
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import scipy.sparse
from sklearn.feature_extraction.text import CountVectorizer #for Bag of Words
from sklearn.feature_extraction.text import TfidfVectorizer # for Term Frequency Inverse Document Frequency
[ ]: pwd()
[ ]:
Beautiful Soup presents the same interface to a number of different parsers, but each parser is
different. Different parsers will create different parse trees from the same document. The biggest
differences are between the HTML parsers and the XML parsers.
There are also differences between HTML parsers. If you give BeautifulSoup a perfectly-formed
HTML document, these differences won’t matter. One parser will be faster than another, but
they’ll all give you a data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will give different results.
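As a quick illustration (a sketch only, assuming the lxml parser is installed alongside Python's built-in html.parser), the same malformed snippet comes back differently depending on the parser:
[ ]: from bs4 import BeautifulSoup
broken = "<a></p>"  # not perfectly-formed HTML
print(BeautifulSoup(broken, "html.parser"))  # <a></a>
print(BeautifulSoup(broken, "lxml"))         # <html><body><a></a></body></html>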
There are many libraries for making HTTP requests in Python, such as httplib, urllib, httplib2, and treq, but requests is one of the best, with a clean API and useful features.
[ ]: import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area').text
# soup
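Next, parse the downloaded HTML into a soup object; a minimal sketch, assuming the lxml parser:
[ ]: from bs4 import BeautifulSoup
# Parse the page text into a navigable tree (parser choice assumed)
soup = BeautifulSoup(website_url, 'lxml')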
If you carefully inspect the HTML source, all of the table contents, i.e. the names of the countries we intend to extract, are under the class 'wikitable sortable'.
First we'll find the class 'wikitable sortable' in the HTML (a sketch follows below).
Under the table with class 'wikitable sortable' we have links whose titles are the country names.
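A minimal sketch of that lookup, assuming the soup object built above:
[ ]: # Grab the first table whose class is 'wikitable sortable'
my_table = soup.find('table', {'class': 'wikitable sortable'})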
[ ]: all_links = my_table.find_all('a')
# all_links
Extract from the links the titles, which are the countries' names.
To that end, create a list countries, extract the names of the countries, and append them to the list.
[ ]: countries = []
for l in all_links:
    countries.append(l.get('title'))
print(countries)
[ ]: countries_df = pd.DataFrame(countries)
countries_df.head()
[ ]: import urllib.request
raw_html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
raw_html = raw_html.read()
# Build a BeautifulSoup object from the raw HTML (parser choice assumed)
soup_wiki = BeautifulSoup(raw_html, 'lxml')
# The prettify() function in BeautifulSoup lets you view how the tags are nested in the document.
# print(soup_wiki.prettify())
[ ]: article_paragraphs = soup_wiki.find_all('p')
[ ]: article_text = ''
for p in article_paragraphs:
    article_text += p.text
# print(article_text)
[ ]: wiki_text = soup_wiki.get_text()
# print(wiki_text)
The sentiment labels in the dataset are: 0 - negative, 1 - somewhat negative, 2 - neutral, 3 - somewhat positive, 4 - positive.
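A minimal sketch of loading such a dataset, assuming a tab-separated file with Phrase and Sentiment columns (the file name train.tsv here is hypothetical, not taken from the notebook):
[ ]: # Hypothetical file name -- the actual data source is not shown above
data = pd.read_csv('train.tsv', sep='\t')  # expects at least 'Phrase' and 'Sentiment' columns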
[ ]: data.head()
The document-term matrix can use a single word (similar to what we saw in class; check the PowerPoint file). It can also use a combination of two or more words, which is called a bigram or trigram model; the general approach is called the n-gram model (a quick illustration follows below).
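As a sketch only, here is what the bigrams of one short sentence look like:
[ ]: from sklearn.feature_extraction.text import CountVectorizer
# Illustrative only: the bigrams CountVectorizer extracts from one short sentence
bigram_cv = CountVectorizer(ngram_range=(2, 2))
bigram_cv.fit(["you were born with wings"])
bigram_cv.get_feature_names()
# ['born with', 'were born', 'with wings', 'you were']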
[ ]: from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
#tokenizer to remove unwanted elements from data like symbols and numbers
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
# bigrams and trigrams
cv = CountVectorizer(max_features=1000, stop_words='english', ngram_range=(2, 3), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data['Phrase'])
cols_names = cv.get_feature_names()
[ ]: # print(text_counts)
# pd.DataFrame(cv.get_feature_names())
cols_names
corpus = [
"you were born with potential",
"you were born with goodness and trust",
"you were born with ideals and dreams",
"you were born with greatness",
"you were born with wings",
"you are not meant for crawling, so don't",
"you have wings",
"learn to use them and fly",
"-- Rumi"
]
vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorizer.fit(corpus)  # fit so the vocabulary (feature names) is available
wds = vectorizer.get_feature_names()
vect = TfidfVectorizer()  # TF-IDF vectorizer, imported at the top of the notebook
tf_idf = vect.fit_transform(corpus)
terms = vect.get_feature_names()
[ ]: # print(text_tf)
print("pandas sparse matrix display: ")
pd.DataFrame.sparse.from_spmatrix(tf_idf, index=corpus, columns=terms)
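For reference, with its default settings (smooth_idf=True and L2 normalization of each row), scikit-learn's TfidfVectorizer computes tf-idf(t, d) = tf(t, d) * idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1, n is the number of documents in the corpus, and df(t) is the number of documents containing term t. The values displayed above are these products after normalization.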