Text Representation

By Ivan Wong
Feature Extraction in ML
• Feature extraction is an important step for any machine learning
problem.
• No matter how good a modeling algorithm you use, if you feed in
poor features, you will get poor results.
• How do we go about feature engineering for text data?
• How do we transform a given text into numerical form so that it
can be fed into NLP and ML algorithms?
What Computers See
Text Representation
• Text representation has been an active area of research for decades, especially in the last one.
• These approaches can be classified into four categories:
• Basic vectorization approaches
• Distributed representations
• Universal language representation
• Handcrafted features
Sentiment Analysis
• To correctly predict the sentiment of a sentence, the model needs
to understand the meaning of the sentence.
• Break the sentence into lexical units such as lexemes, words, and
phrases
• Derive the meaning for each of the lexical units
• Understand the syntactic (grammatical) structure of the sentence
• Understand the context in which the sentence appears
• Any good text representation scheme must facilitate the
extraction of those data points in the best possible way to reflect
the linguistic properties of the text.
Vector Space Models
• We’ll represent text units (characters, phonemes, words, phrases,
sentences, paragraphs, and documents) with vectors of numbers.
• VSM is fundamental to many information-retrieval operations,
from scoring documents on a query to document classification
and document clustering
• It’s a mathematical model that represents text units as vectors.
Basic Vectorization Approaches
• Basic Idea: Map each word in the vocabulary (V) of the text corpus to a unique ID (integer value), then represent each sentence or document in the corpus as a |V|-dimensional vector.
D1 Dog bites man.
D2 Man bites dog.
D3 Dog eats meat.
D4 Man eats food.
One-Hot Encoding
• In one-hot encoding, each word w in the corpus vocabulary is
given a unique integer ID wid that is between 1 and |V|, where V is
the set of the corpus vocabulary.
• Each word is then represented by a |V|-dimensional binary vector of 0s and 1s.
One-Hot Encoding
Word ID One-hot Encoding
dog 1 [1 0 0 0 0 0]
bites 2 [0 1 0 0 0 0]
man 3 [0 0 1 0 0 0]
meat 4 [0 0 0 1 0 0]
food 5 ?
eats 6 ?
• D1 (Dog bites man.): [ [1 0 0 0 0 0] [0 1 0 0 0 0] [0 0 1 0 0 0]].
• D4 (Man eats food.): [ [0 0 1 0 0 0] [0 0 0 0 0 1] [0 0 0 0 1 0]].
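As a concrete illustration (not from the original slides), here is a minimal Python sketch that builds these one-hot vectors by hand, using the word-to-ID mapping from the table above:

```python
# A minimal sketch of one-hot encoding for the toy corpus on this slide.
# The word-to-ID mapping follows the table above (IDs 1..|V|).

vocab = {"dog": 1, "bites": 2, "man": 3, "meat": 4, "food": 5, "eats": 6}

def one_hot_word(word, vocab):
    """Return a |V|-dimensional binary vector for a single word."""
    vec = [0] * len(vocab)
    vec[vocab[word] - 1] = 1          # IDs start at 1, list indices at 0
    return vec

def one_hot_document(doc, vocab):
    """Represent a document as a list of one-hot vectors, one per word."""
    words = doc.lower().replace(".", "").split()
    return [one_hot_word(w, vocab) for w in words]

print(one_hot_document("Dog bites man.", vocab))
# [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
```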
Pros and Cons
• Pros:
• One-hot encoding is intuitive to understand and straightforward to
implement.
• Cons:
• The size of a one-hot vector is directly proportional to the size of the vocabulary, and most real-world corpora have large vocabularies.
• This representation does not give a fixed-length representation for text: the number of vectors depends on the number of words in the sentence.
• It treats words as atomic units and has no notion of (dis)similarity between words.
• The out-of-vocabulary (OOV) problem: if a word outside the training vocabulary appears at test time, there is no way to represent it in our model.
Bag of Words
• Bag of words (BoW) is a classical text representation technique
that has been used commonly in NLP, especially in text
classification problems.
• The basic intuition behind it is that:
• It assumes that the text belonging to a given class in the dataset is
characterized by a unique set of words.
• If two text pieces have nearly the same words, then they belong to the
same bag (class).
Bag of Words
• Each document in the corpus is then converted into a vector of |V|
dimensions:
• The ith component of the vector, i = wid, is simply the number of times the
word w occurs in the document.
Word-to-ID mapping: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6
• D1 (Dog bites man.): [1 1 1 0 0 0]
• D4 (Man eats food.): [0 0 1 0 1 1]
Bag of Words
• Sometimes, we don’t care about the frequency of occurrence of
words in text and we only want to represent whether a word exists
in the text or not.
• Researchers have shown that such a representation, which ignores frequency, can be useful for sentiment analysis.
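As a sketch of how this looks in practice, scikit-learn's CountVectorizer can build both the count-based and the binary (presence/absence) bag-of-words representation; note that scikit-learn orders the columns by its own alphabetical vocabulary rather than the IDs used on these slides:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

# Count-based bag of words.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())   # vocabulary learned from the corpus
print(bow.toarray())                       # one |V|-dimensional count vector per document

# Binary variant: record only whether a word occurs, not how often.
binary_vec = CountVectorizer(binary=True)
print(binary_vec.fit_transform(corpus).toarray())
```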
Pros and Cons
• Pros:
• BoW is fairly simple to understand and implement.
• Documents having the same words will have their vector representations closer to each other in Euclidean space. So if two documents have similar vocabulary, they'll be closer to each other in the vector space, and vice versa.
• We have a fixed-length encoding for any sentence of arbitrary length.
• Cons:
• The size of the vector increases with the size of the vocabulary.
• It does not capture the similarity between different words that mean the same thing.
• The out-of-vocabulary problem remains.
• Word order information is lost in this representation.
Bag of N-Grams
• In the representations so far, there is no notion of phrases or word ordering.
• The bag-of-n-grams (BoN) approach tries to remedy this.
• It does so by breaking text into chunks of n contiguous words (or
tokens).
• Each chunk is called an n-gram.
• The corpus vocabulary, V, is then nothing but a collection of all
unique n-grams across the text corpus.
Bag of N-Grams
• Corpus: D1 Dog bites man. D2 Man bites dog. D3 Dog eats meat. D4 Man eats food.
• Let's construct a 2-gram (a.k.a. bigram) model for it.
• The set of all bigrams in the corpus is as follows:
• {dog bites, bites man, man bites, bites dog, dog eats, eats meat, man eats, eats food}.
• The bigram representation for the first two documents is as follows: D1: [1,1,0,0,0,0,0,0], D2: [0,0,1,1,0,0,0,0].
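The same CountVectorizer can produce a bag of bigrams through its ngram_range parameter; a minimal sketch (again, feature ordering follows scikit-learn's internal alphabetical sorting rather than the listing above):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

# Bag of bigrams: every feature is a pair of contiguous words.
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bon = bigram_vec.fit_transform(corpus)
print(bigram_vec.get_feature_names_out())
# e.g. ['bites dog' 'bites man' 'dog bites' 'dog eats' 'eats food' 'eats meat' 'man bites' 'man eats']
print(bon.toarray()[:2])   # bigram vectors for D1 and D2
```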
Pros and Cons
• Pros:
• It captures some context and word-order information, so the resulting vector space is able to capture some semantic similarity: documents having the same n-grams will have their vectors closer to each other in Euclidean space than documents with completely different n-grams.
• Cons:
• As n increases, dimensionality (and therefore sparsity) increases rapidly. (What's the best n?)
• It still provides no way to address the OOV problem.
TF-IDF
• In all three approaches we've seen so far, all the words in the text are treated as equally important: there's no notion of some words in the document being more important than others.
• TF-IDF, or term frequency–inverse document frequency, addresses this issue.
• If a word w appears many times in a document di but does not occur
much in the rest of the documents dj in the corpus, then the word w
must be of great importance to the document di.
• The importance of w should increase in proportion to its frequency in
di, but at the same time, its importance should decrease in
proportion to the word’s frequency in other documents dj.
• Mathematically, this is captured using two quantities: TF and IDF. The
two are then combined to arrive at the TF-IDF score.
TF-IDF
• TF (term frequency) measures how often a term or word occurs in a given document:
TF(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)
• IDF weighs down the terms that are very common across a corpus and weighs up the rare terms. The IDF of a term t is calculated as follows:
IDF(t) = log2(total number of documents in the corpus / number of documents containing term t)
• The TF-IDF score is the product of the two: TF-IDF(t, d) = TF(t, d) × IDF(t).
TF-IDF
• Corpus: D1 Dog bites man. D2 Man bites dog. D3 Dog eats meat. D4 Man eats food.

Word    TF score        IDF score              TF-IDF score
dog     1/3 = 0.33      log2(4/3) = 0.4114     0.4114 * 0.33 = 0.136
bites   1/6 = 0.17      log2(4/2) = 1          1 * 0.17 = 0.17
man     0.33            log2(4/3) = 0.4114     0.4114 * 0.33 = 0.136
eats    0.17            log2(4/2) = 1          1 * 0.17 = 0.17
meat    1/12 = 0.083    log2(4/1) = 2          2 * 0.083 = 0.17
food    0.083           log2(4/1) = 2          2 * 0.083 = 0.17

• TF-IDF vector for D1 (Dog bites man.), with columns ordered [dog, bites, man, eats, meat, food]:
D1: [0.136, 0.17, 0.136, 0, 0, 0]
TF-IDF
• There are several variations of the basic TF-IDF formula that are
used in practice.
• Notice that the TF-IDF scores that we calculated for our corpus might not
match the TF-IDF scores given by scikit-learn.
• This is because scikit-learn uses a slightly modified version of the IDF
formula.
• This stems from provisions to account for possible zero divisions and to
not entirely ignore terms that appear in all documents.
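For reference, a minimal sketch with scikit-learn's TfidfVectorizer; with its default settings it uses the smoothed IDF mentioned above, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2-normalizes each document vector, which is why its scores differ from the hand-computed table:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())   # vocabulary (alphabetical)
print(tfidf.idf_)                      # per-term smoothed IDF weights
print(vectors.toarray()[0])            # L2-normalized TF-IDF vector for D1
```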
Pros and Cons
• Pros:
• We can use the TF-IDF vectors to calculate the similarity between two texts using a similarity measure like Euclidean distance or cosine similarity.
• TF-IDF is a commonly used representation in application scenarios such as information retrieval and text classification.
• Cons:
• It still suffers from the curse of high dimensionality.

Even today, TF-IDF continues to be a popular representation scheme for many NLP tasks, especially in the initial versions of a solution.
Distributed Representations
• Methods that use neural network architectures to create dense,
low-dimensional representations of words and texts.
• Distributional similarity
• This is the idea that the meaning of a word can be understood from the
context in which the word appears. (e.g., NLP rocks)
• Distributional hypothesis
• This hypothesizes that words that occur in similar contexts have similar
meanings. (e.g., Learning Python is easy., Learning Java is easy.)
• If two words often occur in similar contexts, then their corresponding representation vectors must also be close to each other.
Distributed Representations
• Distributional representation
• Mathematically, distributional representation schemes use high-
dimensional vectors to represent words.
• Distributed representation
• Distributed representation schemes significantly compress the
dimensionality.
• This results in vectors that are compact (i.e., low dimensional) and dense
(i.e., hardly any zeros).
Distributed Representations
• Embedding
• For the set of words in a corpus, an embedding is a mapping from the vector space of the distributional representation to the vector space of the distributed representation.
• Vector semantics
• This refers to the set of NLP methods that aim to learn the word
representations based on distributional properties of words in a large
corpus.
Word Embeddings
• What does it mean when we say a text representation should
capture “distributional similarities between words”?
• If we’re given the word “USA,” distributionally similar words could be other
countries (e.g., Canada, Germany, India, etc.) or cities in the USA.
• If we’re given the word “beautiful,” words that share some relationship
with this word (e.g., synonyms, antonyms) could be considered
distributionally similar words.
• The neural network–based word representation model known as
“Word2vec,” based on “distributional similarity,” can capture word
analogy relationships such as:
King – Man + Woman ≈ Queen
Word Embeddings

https://informatics.ed.ac.uk/news-events/news/news-archive/king-man-woman-queen-the-hidden-algebraic-struct
Pre-trained word embeddings
• Training your own word embeddings is a pretty expensive process
(in terms of both time and computing).
• it’s not necessary to train your own embeddings, and using pre-
trained word embeddings often suffices.
• Such embeddings can be thought of as a large collection of key-
value pairs, where keys are the words in the vocabulary and values
are their corresponding word vectors.
• Some of the most popular pre-trained embeddings are Word2vec by Google [8], GloVe by Stanford [9], and fastText embeddings by Facebook [10], to name a few.
• Further, they’re available for various dimensions like d = 25, 50, 100, 200,
300, 600.
Pre-trained word embeddings
• You can download a pre-trained word embedding model:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQm
M/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g
• Load pre-trained Word2vec embeddings and look for the most
similar words (ranked by cosine similarity) to a given word.
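A minimal sketch of that workflow with gensim; the file name below is the Google News model from the link above and is assumed to have been downloaded and extracted locally:

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News Word2vec vectors (path is illustrative).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Most similar words to a given word, ranked by cosine similarity.
print(w2v.most_similar("beautiful", topn=5))

# The word-analogy example from the earlier slide: King - Man + Woman ≈ Queen.
print(w2v.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```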
Training our own embeddings
• Two architectural variants were proposed in the original Word2vec approach:
• Continuous bag of words (CBOW)
• SkipGram
• The two are similar in many respects.
CBOW
• In CBOW, the primary task is to build a language model that
correctly predicts the center word given the context words in
which the center word appears.
• What is a language model?
• It is a (statistical) model that tries to give a probability distribution over
sequences of words.
• Given a sentence of, say, m words, it assigns a probability Pr(w1, w2, …, wm) to the whole sentence.
• The objective of a language model is to assign probabilities in such a way
that it gives high probability to “good” sentences and low probabilities to
“bad” sentences.
CBOW
• By good, we mean sentences that are semantically and
syntactically correct. By bad, we mean sentences that are
incorrect—semantically or syntactically or both.
• “The cat jumped over the dog,” it will try to assign a probability close to
1.0, whereas for a sentence like “jumped over the the cat dog,” it tries to
assign a probability close to 0.0.
• CBOW tries to learn a language model that tries to predict the
“center” word from the words in its context.
CBOW Training
SkipGram
• In SkipGram, the task is to predict the context words from the
center word.
Using off-the-shelf implementations of W2V
• There are several available implementations that abstract the
mathematical details for us.
• One of the most commonly used implementations is gensim.
• Despite the availability of several off-the-shelf implementations,
we still have to make decisions on several hyperparameters:
• Dimensionality of the word vectors
• Context window
• CBOW or SkipGram
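A minimal training sketch with gensim (version 4.x argument names), showing where each of these hyperparameters goes; the toy corpus is just the four-document example used earlier:

```python
from gensim.models import Word2Vec

# gensim expects a list of tokenized sentences.
sentences = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# The hyperparameters listed above map to gensim arguments:
#   vector_size -> dimensionality of the word vectors
#   window      -> context window size
#   sg          -> 0 for CBOW, 1 for SkipGram
model = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1, epochs=100)

print(model.wv["dog"][:5])            # first few dimensions of the learned vector
print(model.wv.most_similar("dog"))   # neighbours in this tiny toy space
```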
Going Beyond Words
• In most NLP applications, we seldom deal with atomic units like
words—we deal with sentences, paragraphs, or even full texts.
• So, we need a way to represent larger units of text.
• A simple approach is to break the text into constituent words, take
the embeddings for individual words, and combine them to form
the representation for the text.
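A minimal sketch of this averaging approach, using a toy Word2vec model for illustration; the helper function name is ours, not part of any library:

```python
import numpy as np
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "eats", "food"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

def average_embedding(tokens, wv):
    """Represent a text as the mean of its word vectors, skipping OOV tokens."""
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:                      # every token was out of vocabulary
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

print(average_embedding(["man", "eats", "food"], model.wv).shape)   # (50,)
```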
OOV Problem
• A simple approach that often works is to exclude those words
from the feature extraction process so we don’t have to worry
about how to get their representations.
• If we’re using a model trained on a large corpus, we shouldn’t see
too many OOV words anyway.
• However, if a large fraction of the words from our production data
isn’t present in the word embedding’s vocabulary, we’re unlikely to
see good performance.
• This vocabulary overlap is a great heuristic to gauge the
performance of an NLP model.
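A small sketch of that vocabulary-overlap heuristic; the function name, file path, and example tokens are illustrative:

```python
from gensim.models import KeyedVectors

def vocabulary_overlap(tokens, wv):
    """Fraction of unique tokens that are present in the embedding vocabulary."""
    vocab = set(tokens)
    covered = sum(1 for word in vocab if word in wv)
    return covered / len(vocab)

# Illustrative usage; the path and tokenization are placeholders.
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
tokens = "the new model performs well on our production data".split()
print(f"{vocabulary_overlap(tokens, w2v):.0%} of unique tokens are in-vocabulary")
```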
Distributed Representations Beyond Words
and Characters
• There are also other approaches that handle the OOV problem by
modifying the training process by bringing in characters and other
subword-level linguistic components.
• The key idea is that one can potentially handle the OOV problem
by using subword information, such as morphological properties
(e.g., prefixes, suffixes, word endings, etc.), or by using character
representations. fastText, from Facebook AI research, is one of the
popular algorithms that follows this approach.
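A minimal sketch of training subword-aware embeddings with gensim's FastText implementation on the toy corpus; the hyperparameter values are illustrative:

```python
from gensim.models import FastText

sentences = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# fastText represents each word as a bag of character n-grams (min_n..max_n),
# so it can compose a vector even for a word it never saw during training.
model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

print(model.wv["dogs"][:5])   # "dogs" is OOV, but its character n-grams overlap with "dog"
```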
Visualizing Embeddings
• Visual exploration is a very important aspect of any data-related
problem.
• Is there a way to visually inspect word vectors? Even though embeddings
are low-dimensional vectors, even 100 or 300 dimensions are too high to
visualize.
Visualizing Embeddings
• t-SNE [30], or t-distributed Stochastic Neighbor Embedding, is a technique for visualizing high-dimensional data like embeddings by reducing it to two- or three-dimensional data.
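A minimal sketch of such a visualization with scikit-learn's t-SNE, projecting toy word vectors down to two dimensions; the model and plot settings are illustrative:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from gensim.models import Word2Vec

sentences = [["dog", "bites", "man"], ["man", "bites", "dog"],
             ["dog", "eats", "meat"], ["man", "eats", "food"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

words = list(model.wv.index_to_key)
vectors = model.wv[words]                      # (|V|, 50) matrix of word embeddings

# Reduce the 50-dimensional vectors to 2-D for plotting.
# Perplexity must be smaller than the number of points for this tiny vocabulary.
coords = TSNE(n_components=2, perplexity=3, random_state=42).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```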
