Text Representation

By Ivan Wong
Feature Extraction in ML
• Feature extraction is an important step for any machine learning
problem.
• No matter how good a modeling algorithm you use, if you feed in
poor features, you will get poor results.
• How do we go about feature engineering for text data?
• How do we transform a given text into numerical form so that it
can be fed into NLP and ML algorithms?
What Computers See
Text Representation
• Text representation has been an active area of research for decades, especially in the last one.
• These approaches can be classified into four categories:
• Basic vectorization approaches
• Distributed representations
• Universal language representation
• Handcrafted features
Sentiment Analysis
• To correctly predict the sentiment of a sentence, the model needs
to understand the meaning of the sentence.
• Break the sentence into lexical units such as lexemes, words, and
phrases
• Derive the meaning for each of the lexical units
• Understand the syntactic (grammatical) structure of the sentence
• Understand the context in which the sentence appears
• Any good text representation scheme must facilitate the
extraction of those data points in the best possible way to reflect
the linguistic properties of the text.
Vector Space Models
• We’ll represent text units (characters, phonemes, words, phrases,
sentences, paragraphs, and documents) with vectors of numbers.
• VSM is fundamental to many information-retrieval operations,
from scoring documents on a query to document classification
and document clustering
• It’s a mathematical model that represents text units as vectors.
Basic Vectorization Approaches
• Basic Idea: Map each word in the vocabulary (V) of the text corpus to a unique ID (integer value), then represent each sentence or document in the corpus as a |V|-dimensional vector.
D1 Dog bites man.
D2 Man bites dog.
D3 Dog eats meat.
D4 Man eats food.
One-Hot Encoding
• In one-hot encoding, each word w in the corpus vocabulary is
given a unique integer ID wid that is between 1 and |V|, where V is
the set of the corpus vocabulary.
• Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.
One-Hot Encoding
Word ID One-hot Encoding
dog 1 [1 0 0 0 0 0]
bites 2 [0 1 0 0 0 0]
man 3 [0 0 1 0 0 0]
meat 4 [0 0 0 1 0 0]
food 5 ?
eats 6 ?
• D1 (Dog bites man.): [ [1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]].
• D4 (Man eats food.): [ [0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]].
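As a concrete illustration (not from the original slides), here is a minimal Python sketch that builds these one-hot vectors by hand, using the word-to-ID mapping from the table above:

```python
# A minimal sketch of one-hot encoding for the toy corpus on this slide.
# The word-to-ID mapping follows the table above (IDs 1..|V|).

vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot_word(word, vocab):
    """Return a |V|-dimensional binary vector for a single word."""
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1          # IDs start at 1, list indices at 0
    return vec

def one_hot_document(doc, vocab):
    """Represent a document as a list of one-hot vectors, one per word."""
    words = doc.lower().replace(".", "").split()
    return [one_hot_word(w, vocab) for w in words]

print(one_hot_document("Dog bites man.", vocab))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
```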
Pros and Cons
• Pros:
• One-hot encoding is intuitive to understand and straightforward to
implement.
• Cons:
• The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies.
• This representation does not give a fixed-length representation for text: the number of vectors depends on the number of words in the sentence.
• It treats words as atomic units and has no notion of (dis)similarity between words.
• The out-of-vocabulary (OOV) problem: if a word outside the training vocabulary appears at test time, there is no way to represent it in our model.
Bag of Words
• Bag of words (BoW) is a classical text representation technique
that has been used commonly in NLP, especially in text
classification problems.
• The basic intuition behind it is that:
• It assumes that the text belonging to a given class in the dataset is
characterized by a unique set of words.
• If two text pieces have nearly the same words, then they belong to the
same bag (class).
Bag of Words
• Each document in the corpus is then converted into a vector of |V|
dimensions:
• The ith component of the vector, i = wid, is simply the number of times the
word w occurs in the document.
Word-to-ID mapping: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6
• D1 (Dog bites man.): [1 1 1 0 0 0]
• D4 (Man eats food.): [0 0 1 0 1 1]
Bag of Words
• Sometimes, we don’t care about the frequency of occurrence of
words in text and we only want to represent whether a word exists
in the text or not.
• Researchers have shown that such a representation, which ignores frequency, can be useful for sentiment analysis.
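As a sketch of how this looks in practice, scikit-learn's CountVectorizer can build both the count-based and the binary (presence/absence) bag-of-words representation; note that scikit-learn orders the columns by its own alphabetical vocabulary rather than the IDs used on these slides:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

# Count-based bag of words.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())   # vocabulary learned from the corpus
print(bow.toarray())                       # one |V|-dimensional count vector per document

# Binary variant: record only whether a word occurs, not how often.
binary_vec = CountVectorizer(binary=True)
print(binary_vec.fit_transform(corpus).toarray())
```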
Pros and Cons
• Pros:
• BoW is fairly simple to understand and implement.
• Documents having the same words will have their vector representations closer to each other in Euclidean space. So if two documents have similar vocabulary, they'll be closer to each other in the vector space, and vice versa.
• We have a fixed-length encoding for any sentence of arbitrary length.
• Cons:
• The size of the vector increases with the size of the vocabulary.
• It does not capture the similarity between different words that mean the same thing.
• The out-of-vocabulary problem remains.
• Word order information is lost in this representation.
Bag of N-Grams
• In the representations so far, there is no notion of phrases or word ordering.
• The bag-of-n-grams (BoN) approach tries to remedy this.
• It does so by breaking text into chunks of n contiguous words (or
tokens).
• Each chunk is called an n-gram.
• The corpus vocabulary, V, is then nothing but a collection of all
unique n-grams across the text corpus.
Bag of N-Grams
• Corpus: D1 Dog bites man. D2 Man bites dog. D3 Dog eats meat. D4 Man eats food.
• Let's construct a 2-gram (a.k.a. bigram) model for it.
• The set of all bigrams in the corpus is as follows:
• {dog bites, bites man, man bites, bites dog, dog eats, eats meat, man eats, eats food}.
• The bigram representation for the first two documents is as follows: D1: [1,1,0,0,0,0,0,0], D2: [0,0,1,1,0,0,0,0].
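The same CountVectorizer can produce a bag of bigrams through its ngram_range parameter; a minimal sketch (again, feature ordering follows scikit-learn's internal alphabetical sorting rather than the listing above):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

# Bag of bigrams: every feature is a pair of contiguous words.
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bon = bigram_vec.fit_transform(corpus)
print(bigram_vec.get_feature_names_out())
# e.g. ['bites dog' 'bites man' 'dog bites' 'dog eats' 'eats food' 'eats meat' 'man bites' 'man eats']
print(bon.toarray()[:2])   # bigram vectors for D1 and D2
```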
Pros and Cons
• Pros:
• It captures some context and word-order information, so the resulting vector space is able to capture some semantic similarity: documents having the same n-grams will have their vectors closer to each other in Euclidean space than documents with completely different n-grams.
• Cons:
• As n increases, dimensionality (and therefore sparsity) increases rapidly. (What's the best n?)
• It still provides no way to address the OOV problem.
TF-IDF
• In all three approaches we've seen so far, all the words in the text are treated as equally important: there's no notion of some words in the document being more important than others.
• TF-IDF, or term frequency–inverse document frequency, addresses this issue.
• If a word w appears many times in a document di but does not occur
much in the rest of the documents dj in the corpus, then the word w
must be of great importance to the document di.
• The importance of w should increase in proportion to its frequency in
di, but at the same time, its importance should decrease in
proportion to the word’s frequency in other documents dj.
• Mathematically, this is captured using two quantities: TF and IDF. The
two are then combined to arrive at the TF-IDF score.
TF-IDF
• TF (term frequency) measures how often a term or word occurs in a given document:
TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)
• IDF weighs down the terms that are very common across a corpus and weighs up the rare terms. The IDF of a term t is calculated as follows:
IDF(t) = log2(total number of documents in the corpus / number of documents containing term t)
• The TF-IDF score is the product of the two: TF-IDF(t, d) = TF(t, d) × IDF(t).
TF-IDF
• Corpus: D1 Dog bites man. D2 Man bites dog. D3 Dog eats meat. D4 Man eats food.

Word    TF score        IDF score              TF-IDF score
dog     1/3 = 0.33      log2(4/3) = 0.4114     0.4114 * 0.33 = 0.136
bites   1/6 = 0.17      log2(4/2) = 1          1 * 0.17 = 0.17
man     0.33            log2(4/3) = 0.4114     0.4114 * 0.33 = 0.136
eats    0.17            log2(4/2) = 1          1 * 0.17 = 0.17
meat    1/12 = 0.083    log2(4/1) = 2          2 * 0.083 = 0.17
food    0.083           log2(4/1) = 2          2 * 0.083 = 0.17

• TF-IDF vector for D1 (Dog bites man.), with columns ordered [dog, bites, man, eats, meat, food]:
D1: [0.136, 0.17, 0.136, 0, 0, 0]
TF-IDF
• There are several variations of the basic TF-IDF formula that are
used in practice.
• Notice that the TF-IDF scores that we calculated for our corpus might not
match the TF-IDF scores given by scikit-learn.
• This is because scikit-learn uses a slightly modified version of the IDF
formula.
• This stems from provisions to account for possible zero divisions and to
not entirely ignore terms that appear in all documents.
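For reference, a minimal sketch with scikit-learn's TfidfVectorizer; with its default settings it uses the smoothed IDF mentioned above, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2-normalizes each document vector, which is why its scores differ from the hand-computed table:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())   # vocabulary (alphabetical)
print(tfidf.idf_)                      # per-term smoothed IDF weights
print(vectors.toarray()[0])            # L2-normalized TF-IDF vector for D1
```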
Pros and Cons
• Pros:
• We can use the TF-IDF vectors to calculate the similarity between two texts using a similarity measure like Euclidean distance or cosine similarity.
• TF-IDF is a commonly used representation in application scenarios such as information retrieval and text classification.
• Cons:
• It still suffers from the curse of high dimensionality.

Even today, TF-IDF continues to be a popular representation scheme for many NLP tasks, especially in the initial versions of a solution.
Distributed Representations
• Methods that use neural network architectures to create dense,
low-dimensional representations of words and texts.
• Distributional similarity
• This is the idea that the meaning of a word can be understood from the
context in which the word appears. (e.g., NLP rocks)
• Distributional hypothesis
• This hypothesizes that words that occur in similar contexts have similar
meanings. (e.g., Learning Python is easy., Learning Java is easy.)
• If two words often occur in similar contexts, then their corresponding representation vectors must also be close to each other.
Distributed Representations
• Distributional representation
• Mathematically, distributional representation schemes use high-
dimensional vectors to represent words.
• Distributed representation
• Distributed representation schemes significantly compress the
dimensionality.
• This results in vectors that are compact (i.e., low dimensional) and dense
(i.e., hardly any zeros).
Distributed Representations
• Embedding
• For the set of words in a corpus, an embedding is a mapping from the vector space of the distributional representation to the vector space of the distributed representation.
• Vector semantics
• This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a large
corpus.
Word Embeddings
• What does it mean when we say a text representation should
capture “distributional similarities between words”?
• If we’re given the word “USA,” distributionally similar words could be other
countries (e.g., Canada, Germany, India, etc.) or cities in the USA.
• If we’re given the word “beautiful,” words that share some relationship
with this word (e.g., synonyms, antonyms) could be considered
distributionally similar words.
• The neural network–based word representation model known as
“Word2vec,” based on “distributional similarity,” can capture word
analogy relationships such as:
King – Man + Woman ≈ Queen
Word Embeddings

https://informatics.ed.ac.uk/news-events/news/news-archive/king-man-woman-queen-the-hidden-algebraic-struct
Pre-trained word embeddings
• Training your own word embeddings is a pretty expensive process
(in terms of both time and computing).
• it’s not necessary to train your own embeddings, and using pre-
trained word embeddings often suffices.
• Such embeddings can be thought of as a large collection of key-
value pairs, where keys are the words in the vocabulary and values
are their corresponding word vectors.
• Some of the most popular pre-trained embeddings are Word2vec by Google [8], GloVe by Stanford [9], and fastText embeddings by Facebook [10], to name a few.
• Further, they’re available for various dimensions like d = 25, 50, 100, 200,
300, 600.
Pre-trained word embeddings
• You can download a pre-trained word embedding model:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQm
M/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
• Load pre-trained Word2vec embeddings and look for the most
similar words (ranked by cosine similarity) to a given word.
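A minimal sketch of that workflow with gensim; the file name below is the Google News model from the link above and is assumed to have been downloaded and extracted locally:

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News Word2vec vectors (path is illustrative).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Most similar words to a given word, ranked by cosine similarity.
print(w2v.most_similar("beautiful", topn=5))

# The word-analogy example from the earlier slide: King - Man + Woman ≈ Queen.
print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```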
Training our own embeddings
• Two architectural variants were proposed in the original Word2vec approach:
• Continuous bag of words (CBOW)
• SkipGram
• The two are similar in many respects.
CBOW
• In CBOW, the primary task is to build a language model that
correctly predicts the center word given the context words in
which the center word appears.
• What is a language model?
• It is a (statistical) model that tries to give a probability distribution over
sequences of words.
• Given a sentence of, say, m words, it assigns a probability Pr(w1, w2, …, wm) to the whole sentence.
• The objective of a language model is to assign probabilities in such a way
that it gives high probability to “good” sentences and low probabilities to
“bad” sentences.
CBOW
• By good, we mean sentences that are semantically and
syntactically correct. By bad, we mean sentences that are
incorrect—semantically or syntactically or both.
• “The cat jumped over the dog,” it will try to assign a probability close to
1.0, whereas for a sentence like “jumped over the the cat dog,” it tries to
assign a probability close to 0.0.
• CBOW tries to learn a language model that tries to predict the
“center” word from the words in its context.
CBOW Training
SkipGram
• In SkipGram, the task is to predict the context words from the
center word.
Using off-the-shelf implementations of W2V
• There are several available implementations that abstract the
mathematical details for us.
• One of the most commonly used implementations is gensim.
• Despite the availability of several off-the-shelf implementations,
we still have to make decisions on several hyperparameters:
• Dimensionality of the word vectors
• Context window
• CBOW or SkipGram
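A minimal training sketch with gensim (version 4.x argument names), showing where each of these hyperparameters goes; the toy corpus is just the four-document example used earlier:

```python
from gensim.models import Word2Vec

# gensim expects a list of tokenized sentences.
sentences = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# The hyperparameters listed above map to gensim arguments:
#   vector_size -> dimensionality of the word vectors
#   window      -> context window size
#   sg          -> 0 for CBOW, 1 for SkipGram
model = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1, epochs=100)

print(model.wv["dog"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("dog"))   # neighbours in this tiny toy space
```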
Going Beyond Words
• In most NLP applications, we seldom deal with atomic units like
words—we deal with sentences, paragraphs, or even full texts.
• So, we need a way to represent larger units of text.
• A simple approach is to break the text into constituent words, take
the embeddings for individual words, and combine them to form
the representation for the text.
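A minimal sketch of this averaging approach, using a toy Word2vec model for illustration; the helper function name is ours, not part of any library:

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "eats", "food"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

def average_embedding(tokens, wv):
    """Represent a text as the mean of its word vectors, skipping OOV tokens."""
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:                      # every token was out of vocabulary
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

print(average_embedding(["man", "eats", "food"], model.wv).shape)   # (50,)
```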
OOV Problem
• A simple approach that often works is to exclude those words
from the feature extraction process so we don’t have to worry
about how to get their representations.
• If we’re using a model trained on a large corpus, we shouldn’t see
too many OOV words anyway.
• However, if a large fraction of the words from our production data
isn’t present in the word embedding’s vocabulary, we’re unlikely to
see good performance.
• This vocabulary overlap is a great heuristic to gauge the
performance of an NLP model.
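A small sketch of that vocabulary-overlap heuristic; the function name, file path, and example tokens are illustrative:

```python
from gensim.models import KeyedVectors

def vocabulary_overlap(tokens, wv):
    """Fraction of unique tokens that are present in the embedding vocabulary."""
    vocab = set(tokens)
    covered = sum(1 for word in vocab if word in wv)
    return covered / len(vocab)

# Illustrative usage; the path and tokenization are placeholders.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
tokens = "the new model performs well on our production data".split()
print(f"{vocabulary_overlap(tokens, w2v):.0%} of unique tokens are in-vocabulary")
```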
Distributed Representations Beyond Words
and Characters
• There are also other approaches that handle the OOV problem by
modifying the training process by bringing in characters and other
subword-level linguistic components.
• The key idea is that one can potentially handle the OOV problem
by using subword information, such as morphological properties
(e.g., prefixes, suffixes, word endings, etc.), or by using character
representations. fastText, from Facebook AI research, is one of the
popular algorithms that follows this approach.
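A minimal sketch of training subword-aware embeddings with gensim's FastText implementation on the toy corpus; the hyperparameter values are illustrative:

```python
from gensim.models import FastText

sentences = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# fastText represents each word as a bag of character n-grams (min_n..max_n),
# so it can compose a vector even for a word it never saw during training.
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

print(model.wv["dogs"][:5])   # "dogs" is OOV, but its character n-grams overlap with "dog"
```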
Visualizing Embeddings
• Visual exploration is a very important aspect of any data-related
problem.
• Is there a way to visually inspect word vectors? Even though embeddings
are low-dimensional vectors, even 100 or 300 dimensions are too high to
visualize.
Visualizing Embeddings
• t-SNE [30], or t-distributed Stochastic Neighbor Embedding, is a technique for visualizing high-dimensional data like embeddings by reducing it to two- or three-dimensional data.
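A minimal sketch of such a visualization with scikit-learn's t-SNE, projecting toy word vectors down to two dimensions; the model and plot settings are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "bites", "dog"],
             ["dog", "eats", "meat"], ["man", "eats", "food"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

words = list(model.wv.index_to_key)
vectors = model.wv[words]                      # (|V|, 50) matrix of word embeddings

# Reduce the 50-dimensional vectors to 2-D for plotting.
# Perplexity must be smaller than the number of points for this tiny vocabulary.
coords = TSNE(n_components=2, perplexity=3, random_state=42).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```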
