Lect04
By Ivan Wong
Feature Extraction in ML
• Feature extraction is an important step for any machine learning
problem.
• No matter how good a modeling algorithm you use, if you feed in
poor features, you will get poor results.
• How do we go about doing feature engineering for text data?
• How do we transform a given text into numerical form so that it
can be fed into NLP and ML algorithms?
Text Representation
What Computers See
Text Representation
• Text representation has been an active area of research over the past
few decades, especially the last one.
• These approaches are classified into four categories:
• Basic vectorization approaches
• Distributed representations
• Universal language representation
• Handcrafted features
Sentiment Analysis
• To correctly predict the sentiment of a sentence, the model needs
to understand the meaning of the sentence.
• Break the sentence into lexical units such as lexemes, words, and
phrases
• Derive the meaning for each of the lexical units
• Understand the syntactic (grammatical) structure of the sentence
• Understand the context in which the sentence appears
• Any good text representation scheme must facilitate the
extraction of those data points in the best possible way to reflect
the linguistic properties of the text.
Vector Space Models
• We’ll represent text units (characters, phonemes, words, phrases,
sentences, paragraphs, and documents) with vectors of numbers.
• VSM is fundamental to many information-retrieval operations,
from scoring documents on a query to document classification
and document clustering
• It’s a mathematical model that represents text units as vectors.
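• As a small illustration of the vector space idea, here is a minimal Python sketch that scores two documents with cosine similarity; the two six-dimensional vectors are made up for illustration only.
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc_a = np.array([1, 1, 1, 0, 0, 0])  # toy vector for one document
doc_b = np.array([0, 0, 1, 0, 1, 1])  # toy vector for another document
print(cosine_similarity(doc_a, doc_b))  # ~0.33: the documents overlap in one dimension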
Basic Vectorization Approaches
• Basic Idea: Map each word in the vocabulary (V) of the text corpus
to a unique ID (integer value), then represent each sentence or
document in the corpus as a |V|-dimensional vector.
D1 Dog bites man.
D2 Man bites dog.
D3 Dog eats meat.
D4 Man eats food.
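• A quick Python sketch (illustrative only) of the mapping step on this toy corpus; the resulting IDs depend on the order in which words are first seen, so they may not match the table on the next slide exactly.
corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

def tokenize(doc):
    # Lowercase, drop the period, and split on whitespace.
    return doc.lower().replace(".", "").split()

vocab = {}
for doc in corpus:
    for word in tokenize(doc):
        vocab.setdefault(word, len(vocab) + 1)  # assign the next free ID, starting at 1

print(vocab)  # e.g. {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}
print([vocab[w] for w in tokenize("Man bites dog.")])  # [3, 2, 1]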
One-Hot Encoding
• In one-hot encoding, each word w in the corpus vocabulary is
given a unique integer ID wid that is between 1 and |V|, where V is
the set of the corpus vocabulary.
• Each word is then represented by a V-dimensional binary vector of
0s and 1s.
One-Hot Encoding
Word ID One-hot Encoding
dog 1 [1 0 0 0 0 0]
bites 2 [0 1 0 0 0 0]
man 3 [0 0 1 0 0 0]
meat 4 [0 0 0 1 0 0]
food 5 ?
eats 6 ?
• D1 (Dog bites man.): [ [1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0] ].
• D4 (Man eats food.): [ [0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0] ].
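• A minimal Python sketch of one-hot encoding, assuming the word-ID assignment from the table above.
vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1  # IDs start at 1, list indices at 0
    return vec

def encode(sentence):
    return [one_hot(w) for w in sentence.lower().replace(".", "").split()]

print(encode("Dog bites man."))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]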
Pros and Cons
• Pros:
• One-hot encoding is intuitive to understand and straightforward to
implement.
• Cons:
• The size of a one-hot vector is directly proportional to the size of the
vocabulary, and most real-world corpora have large vocabularies.
• This representation does not give a fixed-length representation for text.
• It treats words as atomic units and has no notion of (dis)similarity
between words.
• The out-of-vocabulary (OOV) problem: if a new word appears at test time,
there is no way to represent it in our model.
Bag of Words
• Bag of words (BoW) is a classical text representation technique
that has been used commonly in NLP, especially in text
classification problems.
• The basic intuition behind it is that:
• It assumes that the text belonging to a given class in the dataset is
characterized by a unique set of words.
• If two text pieces have nearly the same words, then they belong to the
same bag (class).
Bag of Words
• Each document in the corpus is then converted into a vector of |V|
dimensions:
• The ith component of the vector, i = wid, is simply the number of times the
word w occurs in the document.
Word ID
dog 1
bites 2
man 3
meat 4
food 5
eats 6
D1 (Dog bites man.): [1 1 1 0 0 0]
D4 (Man eats food.): [0 0 1 0 1 1]
Bag of Words
• Sometimes, we don’t care about the frequency of occurrence of
words in text and we only want to represent whether a word exists
in the text or not.
• Researchers have shown that such a representation, without
considering frequency, is useful for sentiment analysis.
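• A minimal sketch of bag of words with scikit-learn’s CountVectorizer on the toy corpus; CountVectorizer orders its columns alphabetically, so the column order differs from the word-ID table above, and binary=True gives the presence/absence variant mentioned in the last bullet.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

count_vect = CountVectorizer()
bow = count_vect.fit_transform(corpus)
print(count_vect.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(bow.toarray()[0])                    # D1 "Dog bites man." -> [1 1 0 0 1 0]

# Presence/absence only, often sufficient for sentiment analysis.
binary_vect = CountVectorizer(binary=True)
print(binary_vect.fit_transform(corpus).toarray()[3])  # D4 "Man eats food." -> [0 0 1 1 1 0]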
Pros and Cons
• Pros:
• BoW is fairly simple to understand and implement.
• Documents having the same words will have their vector representations
closer to each other in Euclidean space. So if two documents have similar
vocabulary, they’ll be closer to each other in the vector space and vice versa.
• We have a fixed-length encoding for any sentence of arbitrary length.
• Cons:
• The size of the vector increases with the size of the vocabulary.
• It does not capture the similarity between different words that mean the
same thing.
• Out-of-vocabulary problem.
• Word order information is lost in this representation.
Bag of N-Grams
• In the representations seen so far, there is no notion of phrases or word ordering.
• The bag-of-n-grams (BoN) approach tries to remedy this.
• It does so by breaking text into chunks of n contiguous words (or
tokens).
• Each chunk is called an n-gram.
• The corpus vocabulary, V, is then nothing but a collection of all
unique n-grams across the text corpus.
D1 Dog bites man.
D2 Man bites dog.
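• A minimal sketch of bag of n-grams on these two documents, using CountVectorizer with ngram_range=(1, 2) so the vocabulary contains both unigrams and bigrams.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Dog bites man.", "Man bites dog."]

ngram_vect = CountVectorizer(ngram_range=(1, 2))
bon = ngram_vect.fit_transform(docs)
print(ngram_vect.get_feature_names_out())
# ['bites' 'bites dog' 'bites man' 'dog' 'dog bites' 'man' 'man bites']
print(bon.toarray())
# D1 and D2 now get different vectors, because the bigrams capture word order.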
TF-IDF
• TF-IDF (term frequency–inverse document frequency) scores a term by how
often it appears in a document (TF) and how rare it is across the corpus (IDF).
• IDF weighs down the terms that are very common across a corpus
and weighs up the rare terms. IDF of a term t is calculated as
follows:
• IDF(t) = log_e(N / n_t), where N is the total number of documents in the
corpus and n_t is the number of documents in which t appears.
D1 Dog bites man.
D2 Man bites dog.
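• A minimal sketch of TF-IDF with scikit-learn’s TfidfVectorizer, run on the full four-document toy corpus so that rare terms stand out; note that scikit-learn uses a smoothed variant of the IDF formula above.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())  # ['bites' 'dog' 'eats' 'food' 'man' 'meat']
print(tfidf.idf_)                     # rarer terms ('food', 'meat') get the highest IDF
print(matrix.toarray()[0])            # TF-IDF vector for D1 "Dog bites man."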
Word Embeddings
• Word embeddings represent words as dense vectors that capture semantic
relationships between them, such as the well-known king − man + woman ≈ queen example:
https://informatics.ed.ac.uk/news-events/news/news-archive/king-man-woman-queen-the-hidden-algebraic-struct
Pre-trained word embeddings
• Training your own word embeddings is a pretty expensive process
(in terms of both time and computing).
• For many applications, it’s not necessary to train your own embeddings;
using pre-trained word embeddings often suffices.
• Such embeddings can be thought of as a large collection of key-
value pairs, where keys are the words in the vocabulary and values
are their corresponding word vectors.
• Some of the most popular pre-trained embeddings are Word2vec by
Google [8], GloVe by Stanford [9], and fastText embeddings by Facebook
[10], to name a few.
• Further, they’re available for various dimensions like d = 25, 50, 100, 200,
300, 600.
Pre-trained word embeddings
• You can download a pre-trained word embedding model:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQm
M/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
• Load pre-trained Word2vec embeddings and look for the most
similar words (ranked by cosine similarity) to a given word.
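• A minimal sketch with gensim, assuming the downloaded file is the GoogleNews-vectors-negative300.bin archive (adjust the path to wherever you saved it).
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

print(w2v.most_similar("beautiful", topn=5))        # nearest words by cosine similarity
print(w2v.most_similar(positive=["king", "woman"],  # the king - man + woman analogy
                       negative=["man"], topn=1))   # from the link above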
Training our own embeddings
• Two architectural variants were proposed in the original Word2vec
approach: Continuous Bag of Words (CBOW) and SkipGram.
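• A minimal sketch of training both variants with gensim (4.x) on the toy corpus; sg=0 selects CBOW and sg=1 selects SkipGram. A real model would need a much larger corpus.
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "bites", "dog"],
             ["dog", "eats", "meat"], ["man", "eats", "food"]]

cbow = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)  # SkipGram

print(cbow.wv["dog"])                          # 10-dimensional vector for "dog"
print(skipgram.wv.most_similar("dog", topn=2)) # nearest neighbours in the tiny corpus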