Learning Representations That Convey Semantic and Syntactic Information
1.3. Word2Vec
What is an iteration-based model?
A model that is able to learn one iteration at a time and eventually be able to encode the probability of
a word given its context.
What is Word2Vec?
A model whose parameters are the word vectors. The model is trained on a certain objective: at every
iteration, we run the model, evaluate the errors, and backpropagate the gradients to update the parameters.
What are the initial embeddings of Word2Vec model?
The embedding matrix is initialized randomly using a Normal or uniform distribution. Then, the
embedding of word i in the vocabulary is the row i of the embedding matrix.
What are the two algorithms used by Word2Vec? Explain how they work.
Continuous bag-of-words (CBOW): predicts a center word from the surrounding context words.
Skip-gram: predicts the surrounding context words given a center word.
What are the two training methods used?
Hierarchical softmax
Negative sampling
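As an illustration, here is a minimal sketch of training skip-gram with negative sampling on a toy corpus, assuming the gensim library (version 4.x); the corpus and hyperparameter values are only illustrative.

from gensim.models import Word2Vec

# toy corpus: a list of tokenized sentences (illustrative data)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# sg=1 selects skip-gram (sg=0 would be CBOW); negative=5 enables negative sampling,
# while hs=1 would switch to hierarchical softmax instead
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5,
                 min_count=1, epochs=50)

print(model.wv.most_similar("cat", topn=3))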
What is the advantage of Word2Vec over SVD-based methods?
Much faster to compute, and it captures complex linguistic patterns beyond word similarity.
What is the limitation of Word2Vec?
Fails to make use of global co-occurrence statistics. It only relies on local statistics (words in the
neighborhood of word i).
E.g.: 'The cat sat on the mat.' Word2Vec doesn't capture whether 'the' is a special word in the context of 'cat' or just a stop word.
What is the intuition behind GloVe?
Ratios of co-occurrence probabilities can encode meaning. GloVe looks for a function F such that
F(w_i, w_j, \tilde{w}_k) = \frac{p_{co}(\tilde{w}_k \mid w_i)}{p_{co}(\tilde{w}_k \mid w_j)}
F is designed to be a function of the linear difference between the two word vectors w_i and w_j, and it turns out to be an exponential function.
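For reference, solving these constraints (with F = exp) leads to the weighted least-squares objective of the original GloVe paper, where X_{ij} is the co-occurrence count and f is a weighting function that caps the influence of very frequent pairs:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2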
What are the pros of GloVe?
The GloVe model efficiently leverages global statistical information by training only on non-zero
elements in a word-word co-occurrence matrix, and produces a vector space with meaningful
substructure.
What is window classification and why is it important?
Natural languages tend to use the same word for very different meanings, and we typically need to know the context of the word usage to discriminate between meanings.
E.g.: 'to sanction' means either 'to permit' or 'to punish' depending on the context.
A sequence is a central word vector preceded and succeeded by context word vectors. The number
of words in the context is also known as the context window size and varies depending on the
problem being solved.
How does window size relate to performance?
Generally, narrower windows lead to better performance on syntactic tests, while wider windows lead to better performance on semantic tests.
Language models compute the probability of occurrence of a number of words in a particular sequence.
h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_t)
\hat{y}_t = \mathrm{softmax}(W^{(S)} h_t)
After this operation, the hidden state h_t is multiplied by a weight matrix and run through a softmax over the entire vocabulary, which outputs the probabilities of the next word. The word with the highest probability is picked.
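As a rough sketch of these two equations in plain numpy (the dimensions are illustrative and not tied to any particular framework):

import numpy as np

def rnn_lm_step(x_t, h_prev, W_hh, W_hx, W_S):
    """One step of the vanilla RNN language model defined above."""
    # h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_t), with a sigmoid non-linearity
    h_t = 1.0 / (1.0 + np.exp(-(W_hh @ h_prev + W_hx @ x_t)))
    # y_hat_t = softmax(W^(S) h_t): a distribution over the entire vocabulary
    scores = W_S @ h_t
    scores -= scores.max()                          # numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()
    return h_t, y_hat

# illustrative sizes: embedding size 4, hidden size 8, vocabulary size 10
rng = np.random.default_rng(0)
h, y = rnn_lm_step(rng.normal(size=4), np.zeros(8),
                   rng.normal(size=(8, 8)), rng.normal(size=(8, 4)),
                   rng.normal(size=(10, 8)))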
How does a RNN solve the curse of dimensionality problem incurred by n-gram language
models?
It is solved since the weight matrices are applied at every step of the network. Hence the model
parameters don't grow proportionally to the input sequence size. The number of parameters is
independent of the sequence length.
What is the loss function of a RNN?
Cross-entropy summed over a corpus of size T and a vocabulary of size V.
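Written out explicitly (a standard formulation, where y_{t,j} is the one-hot target and \hat{y}_{t,j} the predicted probability of word j at time step t):

J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{V} y_{t,j} \log \hat{y}_{t,j}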
What is the perplexity of a RNN and what does it mean to have a low perplexity?
The perplexity of a RNN is 2 raised to the power of the cross-entropy loss, i.e. Perplexity = 2^J.
Perplexity is a measure of confusion, where lower values imply more confidence in predicting the next word in the sequence.
A perplexity of 247 for a language model means that the model is as confused/perplexed as if it had to choose uniformly and independently among 247 possibilities for each word.
The same weights are applied at every time step of the input, so there is symmetry in how
inputs are processed
Give an example of the vanishing gradient problem in RNN and explain it.
S1: 'Jane walked into the room. John walked in too. Jane said hi to ?'.
S2: 'Jane walked into the room. John walked in too. It was late in the day, and everyone was walking
home after a long day at work. Jane said hi to ?'.
In both cases, the RNN should predict John as an answer. However, in practice, it turns out the RNN
is more likely to predict John in sentence 1 than in sentence 2. Indeed, during backpropagation, the
contribution of gradient values gradually vanishes as they propagate to earlier timesteps. Thus
for long sentences, the RNN is less likely to recall information introduced in the earliest part of a
sentence.
How to solve vanishing gradient problem?
Technique 1: Instead of initializing W randomly, start off from an identity matrix initialization.
Technique 2: Use ReLU as the activation function, since its derivative is either 0 or 1. This way, gradients flow through the neurons whose derivative is 1 without getting attenuated while propagating back through time steps.
What are exploding gradients and give a technique on how to solve them?
The explosion occurs through exponential growth by repeatedly multiplying gradients through the
network layers that have values larger than 1.0.
A technique to solve exploding gradients is gradient clipping, a simple heuristic introduced by Tomas Mikolov. Whenever the gradient norm exceeds a certain threshold, the gradient is scaled back down (clipped) so that its norm stays below that threshold.
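A minimal PyTorch sketch of gradient clipping (the model, the dummy loss, and the threshold value of 5.0 are purely illustrative):

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 20, 8)            # (batch, seq_len, features)
out, _ = model(x)
loss = out.pow(2).mean()             # dummy loss, just to produce gradients

optimizer.zero_grad()
loss.backward()
# rescale the gradients so that their global norm is at most 5.0 before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()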
Give the equations of a bidirectional RNN.
\overrightarrow{h}_t = f(\overrightarrow{W} x_t + \overrightarrow{V} \overrightarrow{h}_{t-1} + \overrightarrow{b})
\overleftarrow{h}_t = f(\overleftarrow{W} x_t + \overleftarrow{V} \overleftarrow{h}_{t+1} + \overleftarrow{b})
\hat{y}_t = g(U[\overrightarrow{h}_t ; \overleftarrow{h}_t] + c)
What is a limitation of deep bidirectional RNN?
Exploding and vanishing gradients.
What is the reset gate in a GRU?
The reset gate determines how much of the previous hidden state h(t-1) should be used when computing the new memory.
r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
The new-memory stage is the one that knows the recipe for combining a newly observed word with the past hidden state h(t-1), summarizing this new word in light of the contextual past:
\tilde{h}_t = \tanh(r_t \circ U h_{t-1} + W x_t)
What is the update gate in a GRU?
The update gate is responsible for how much of h(t-1) should be carried forward to the next state.
z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
Give the equation of the new hidden state in a GRU cell.
h_t = (1 - z_t) \circ \tilde{h}_t + z_t \circ h_{t-1}
where z_t is the update gate, \tilde{h}_t the new memory, and h(t-1) the previous hidden state.
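A minimal numpy sketch of one GRU step implementing the four equations above (parameter shapes and sizes are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU step: reset gate, update gate, new memory, new hidden state."""
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)           # reset gate
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)           # update gate
    h_tilde = np.tanh(r_t * (U @ h_prev) + W @ x_t)   # new memory
    h_t = (1.0 - z_t) * h_tilde + z_t * h_prev        # new hidden state
    return h_t

# illustrative sizes: input dim 4, hidden dim 8
rng = np.random.default_rng(0)
d_x, d_h = 4, 8
params = [rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)),
          rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h)),
          rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))]
h_next = gru_step(rng.normal(size=d_x), np.zeros(d_h), *params)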
Seq2Seq models can generate arbitrary output sequences after seeing the entire input. They can
even focus in on specific parts of the input automatically to help generate a useful translation.
E.g.: 1 encoder and 1 decoder (both LSTM or bi-LSTM)
What does an encoder do in a Seq2Seq model?
It reads the input sequence and generates a fixed-dimensional context vector C for the sequence.
The encoder stacks multiple RNNs on top of each other to effectively compress an arbitrary-length
sequence into a fixed-size vector. The final layer's LSTM hidden state will be used as a context
vector.
What does a decoder do in a Seq2Seq model?
It uses the context vector as a 'seed' from which to generate an output sequence.
In which order does an encoder read a sentence?
Reverse
How does a decoder work in a Seq2Seq model?
We initialize the hidden state of our first layer with the context vector. We then pass an <EOS> token appended to the end of the input. We then run the three-layer stacked RNN, following up with a softmax on the final layer's output to generate the first word.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V
Computing the dot product between Q and K gives similarity scores between rows (word embedding vectors), so each row represents the similarity between one token and the other tokens.
Applying a softmax normalizes each row of the matrix.
V still represents the sentence with all the embedding vectors stacked. Hence, multiplying the attention matrix with V acts as a weighted sum that gives more importance to some vectors than others.
The attention matrix acts as a filtering matrix, which makes the value matrix pay more attention to the important words.
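A short numpy sketch of scaled dot-product attention as written above (Q, K, V each stack one row per token; the sizes are illustrative):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # token-to-token similarity scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax: the attention matrix
    return weights @ V                                 # weighted sum of the value vectors

# illustrative: 5 tokens of dimension 16, self-attention so Q = K = V
X = np.random.randn(5, 16)
out = scaled_dot_product_attention(X, X, X)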
How does attention work?
We compute an attention vector. It is a vector of weights that are used to compute the context vector
as a weighted average of the hidden states generated by the encoder at time steps 1, ..., n.
The decoder network is provided with a look at the entire input sequence at every decoding step
since it is fed with the context vector. The decoder can then decide what input words are important.
What is global attention?
Instead of giving the decoder's LSTM cells a single context vector that is a weighted average of all the encoder hidden states (standard attention), the decoder's LSTM cells are fed with the concatenation of the encoder hidden state at time i and the context vector.
Give 4 algorithms that can be used by decoders to search translations.
Exhaustive search, ancestral sampling, greedy search, beam search
How does beam search work?
We maintain K candidates at each time step. To do so, we compute H(t+1) by expanding H(t) and
keeping the best K candidates (the ones with the highest probability).
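A simplified sketch of beam search over a hypothetical step_fn that returns (token, probability) pairs for the next token given a prefix; the function and the start/end tokens are placeholders, not part of any specific library:

import math

def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """Keep the beam_size highest log-probability candidates at every time step."""
    beams = [([start_token], 0.0)]            # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, prob in step_fn(seq):
                candidates.append((seq + [token], score + math.log(prob)))
        # expand H(t) and keep only the best K candidates for H(t+1)
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])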
How does byte pair encoding (BPE) work?
1. Represent each word in the corpus as a combination of its characters along with the special end-of-word token </w>
2. Iteratively count character pairs in all tokens of the vocabulary
3. Merge every occurrence of the most frequent pair, add the new character n-gram to the
vocabulary
4. Repeat step 3 until the desired number of merge operations are completed or the desired
vocabulary size is achieved
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w es t </w>': 6, 'w i d es t </w>': 3}
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w est </w>': 6, 'w i d est </w>': 3}
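The merge loop below (adapted from the reference implementation in the original BPE paper by Sennrich et al.) reproduces the successive vocabularies shown above:

import re
import collections

def get_stats(vocab):
    """Count the frequency of each adjacent symbol pair across the vocabulary."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given pair into a single symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(2):                     # number of merge operations
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)   # most frequent pair, e.g. ('e', 's') then ('es', 't')
    vocab = merge_vocab(best, vocab)
print(vocab)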
How does word segmentation work?
Start with a vocabulary of characters and keep extending the vocabulary with most frequent n-gram
pairs in the data set. This process is repeated until all n-gram pairs are selected or vocabulary size
reaches some threshold.
What is the difference between WordPiece and SentencePiece?
WordPiece is an algorithm that creates smaller units than words (sub-word level).
SentencePiece - created by Google - is an algorithm that combines sub-word level tokenization (BPE)
as well as unigram tokenization.
The system translates mostly at word-level and consults the character components for rare words.
Give the structure of a hybrid NMT.
A word-based translation as a backbone
A source character-based representation
A target character-level generation
What is the purpose of a word-based translation as a backbone in a hybrid NMT?
This is the core of a hybrid NMT (an LSTM encoder-decoder) that translates at the word level. We use <unk> to represent OOV words.
What is the purpose of a source character-based representation model in a hybrid NMT?
A deep LSTM model that learns over the characters of rare words and uses the final hidden state of the LSTM as the representation for the rare word.
What is the purpose of a target character-level generation model?
The goal is to create a coherent representation that handles unlimited output vocabulary. To do so,
we have a separate deep LSTM that 'translates' at the character-level given the current word-level
state.
5.1. ELMo
Forward pass:
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})
Backward pass:
p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{i+1}, \ldots, x_n)
The model is trained to minimize the negative log-likelihood in both directions:
L = -\sum_{i=1}^{n} \left( \log p(x_i \mid x_1, \ldots, x_{i-1}; \Theta_e, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(x_i \mid x_{i+1}, \ldots, x_n; \Theta_e, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)
How does ELMo learn task-specific representations?
On top of a L-layer biLM, ELMo stacks all the hidden states across layers together by learning a task-
specific linear combination.
The weights, s_task, in the linear combination are learned for each end task and normalized by
softmax. The scaling factor γtask is used to correct the misalignment between the distribution of biLM
hidden states and the distribution of task specific representations.
v_i = f(R_i; \Theta^{task}) = \gamma^{task} \sum_{\ell=0}^{L} s_\ell^{task} h_{i,\ell}
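A tiny numpy sketch of this task-specific combination for a single token i (the layer count, hidden size, and weight values are illustrative):

import numpy as np

L = 2                                      # number of biLM layers (layer 0 is the token embedding)
h = np.random.randn(L + 1, 1024)           # h_{i,l}: one hidden state per layer for token i
s = np.random.randn(L + 1)                 # raw task-specific weights s^task
gamma = 1.0                                # task-specific scaling factor gamma^task

weights = np.exp(s) / np.exp(s).sum()      # softmax-normalized layer weights
v_i = gamma * (weights[:, None] * h).sum(axis=0)   # weighted sum across layers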
To which tasks correspond which layers?
The comparison study indicates that syntactic information is better represented at lower layers while
semantic information is captured by higher layers. Because different layers tend to carry different types of information, stacking them together helps.
5.2. ULMFiT
What is innovative with ULMFiT?
ULMFiT is the first model to introduce the idea of a generatively pre-trained language model that is fine-tuned for a specific task.
What are the three steps to achieve good transfer learning results?
1. General LM pre-training (already done by fast.ai researchers) on Wikipedia
2. Target task LM fine-tuning: finetuning LM on a specific vocabulary
3. Train a target task classifier with 2 fully-connected layers
What are the two techniques used when fine-tuning LM?
1. Discriminative fine-tuning: tune each layer with different learning rates
2. Slanted triangular learning rates: triangular learning rate schedule
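A minimal PyTorch sketch of technique 1, discriminative fine-tuning; the toy model is a stand-in for the pretrained LM, and the 2.6 decay factor is the value suggested in the ULMFiT paper:

import torch
import torch.nn as nn

# toy stack of layers standing in for the pretrained LM
model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 10), nn.Linear(10, 2))

base_lr = 1e-3
# the last layer gets base_lr, and each earlier layer gets the previous layer's lr divided by 2.6
param_groups = [
    {"params": layer.parameters(), "lr": base_lr / (2.6 ** depth)}
    for depth, layer in enumerate(reversed(list(model)))
]
optimizer = torch.optim.Adam(param_groups)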
What are the two techniques used when fine-tuning the classifier?
1. Concat pooling extracts max-pooling and mean-pooling over the history of hidden states and concatenates them with the final hidden state.
2. Gradual unfreezing: rather than fine-tuning all layers at once, which risks catastrophic forgetting, layers are unfrozen gradually, starting from the last layer and unfreezing one additional layer per epoch.
5.3. Transformers
5.3.1. OpenAI GPT
What are the two major differences between ELMo and OpenAI GPT?
1. The model architectures are different: ELMo uses a shallow concatenation of independently
trained left-to-right and right-to-left multi-layer LSTMs, while GPT is a multi-layer transformer
decoder.
2. ELMo feeds embeddings into models customized for specific tasks as additional features,
while GPT fine-tunes the same base model for all end tasks.
What is the major upgrade brought by OpenAI GPT?
To get rid of the task-specific model and use the pre-trained language model directly.
Hence we don't need a new design for specific tasks. We just need to modify the input sequence
by adding custom tags. At the first stage, generative pre-training on a language model can absorb as
much free text as possible. Then at the second stage, the model is fine-tuned on specific tasks with a
small labeled dataset and a minimal set of new parameters to learn.
What is the loss of OpenAI GPT?
The loss is the negative log-likelihood for true labels to which we add the LM loss as an auxiliary loss.
L_cls = \sum_{(x,y) \in D} \log P(y \mid x_1, \ldots, x_n) = \sum_{(x,y) \in D} \log \mathrm{softmax}(h_L^{(n)}(x) W_y)
L_LM = -\sum_i \log p(x_i \mid x_{i-k}, \ldots, x_{i-1})
L = L_cls + \lambda L_LM
What is a limitation of OpenAI GPT?
Its unidirectional nature.
5.3.2. BERT
What is the biggest difference between OpenAI GPT and BERT? Which limitation does it
solve?
OpenAI GPT is unidirectional. BERT is bidirectional.
What is the base component of BERT?
The model architecture of BERT is a multi-layer bidirectional Transformer encoder. It is composed of
Multi-headed self attention, feed-forward layers, layer norm and residuals and positional embeddings.
What are the tasks on which BERT is trained?
1. Masked language model (MLM): randomly mask 15% of the tokens in each sequence (see the masking sketch after this list).
BERT employed several heuristic tricks:
(a) with 80% probability, replace the chosen words with [MASK];
(b) with 10% probability, replace with a random word;
(c) with 10% probability, keep it the same.
2. Next sentence prediction: tell whether one sentence is the next sentence of the other. It is
used to learn relationships between sentences.
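A rough Python sketch of the MLM masking heuristic described in task 1 (the function name and the toy vocabulary are made up for illustration):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Choose ~15% of tokens; of those, 80% -> [MASK], 10% -> random word, 10% -> unchanged."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)                        # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(random.choice(vocab))   # replace with a random word
            else:
                masked.append(tok)                    # keep the original word
        else:
            labels.append(None)                       # position is not predicted
            masked.append(tok)
    return masked, labels

masked, labels = mask_tokens(["the", "cat", "sat", "on", "the", "mat"],
                             vocab=["dog", "tree", "blue", "runs"])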
Give the 3 components that constitute BERT's input embedding.
1. WordPiece embeddings (cf. previously)
2. Segment embeddings
3. Position embeddings (are learned)
What is the structure of BERT base?
12 layers, 768-dimensional output hidden state, 12 heads
What is the structure of BERT large?
24 layers, 1024-dimensional output hidden state, 16 heads
What is the [SEP] token used for?
[SEP] token is used when building a sequence from multiple sequences
E.g.: two sequences for sequence classification, or a text and a question for question answering
What is the [PAD] token used for?
The token used for padding, for example when batching sequences of different lengths
What is the [CLS] token used for?
The classifier token which is used when doing sequence classification. It is the first token of the
sequence when built with special tokens.
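A usage sketch of these special tokens, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available (downloading the tokenizer requires network access):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Who wrote Hamlet?",                     # first sequence
                    "Hamlet was written by Shakespeare.",    # second sequence
                    padding="max_length", max_length=20)

# the token list starts with [CLS], has [SEP] between and after the two sequences,
# and is padded with [PAD] up to max_length
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))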
5.3.3. RoBERTa
What does RoBERTa stand for?
Robustly optimized BERT approach
5.3.4. ALBERT
5.3.5. ELECTRA
5.3.6. DistilBERT
How does distillation work?
Train "Teacher": Use SOTA pre-training + fine-tuning technique to train model with maximum
accuracy
Label a large amount of unlabeled input examples with Teacher
Train "Student": much smaller model which is trained to mimic Teacher output
Student objective is typically MSE or cross-entropy
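A small PyTorch sketch of a Student objective that mimics the Teacher's output distribution; the temperature and the T^2 scaling follow Hinton et al.'s distillation recipe, which is an assumption here (MSE on logits would be the other option mentioned above):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the Teacher's softened distribution and the Student's prediction."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)          # Teacher's softened outputs
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return -(soft_targets * student_log_probs).sum(dim=-1).mean() * (t * t)

student_logits = torch.randn(8, 5)    # (batch, num_classes), illustrative
teacher_logits = torch.randn(8, 5)
loss = distillation_loss(student_logits, teacher_logits)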
5.3.7. XLNet
What are two innovations of XLNet?
1. Relative position embeddings
2. Permutation language modelling
6. Miscellaneous
Explain what is self-supervised learning.
Given a task and enough labels, supervised learning can solve it really well. Good performance
usually requires a decent amount of labels, but collecting manual labels is expensive (e.g. ImageNet) and hard to scale up. Considering that the amount of unlabelled data (e.g. free text, all the images on the Internet) is substantially larger than the limited number of human-curated labelled datasets, it would be rather wasteful not to use it. However, unsupervised learning is not easy and usually works much less efficiently than supervised learning.
Self-supervised learning consists in getting labels for free from unlabelled data and training on that unlabelled data in a supervised manner.
What is a problem induced by ReLU activations?
Exploding gradients, dying ReLU, and the fact that the mean and variance of the activations are not 0 and 1.
How to solve exploding gradients?
Gradient clipping
What is dying ReLU?
If a neuron's activation is 0, its gradient is also 0, so the neuron stops learning.
How to solve dying ReLU?
Parametric ReLU
What is the difference between learning latent features using SVD and getting embedding
vectors from a neural network?
SVD uses a linear combination of the inputs, while a neural network uses a non-linear combination.
Give the time complexity of LSTM.
seq_length × hidden_size^2
Give the time complexity of a transformer.
seq_length^2 × hidden_size
How is AdamW different from Adam?
AdamW is Adam with decoupled weight decay: the weight-decay term is applied directly to the weights rather than being added to the gradient as plain L2 regularization, and models with smaller weights generalise better.
What is the difference between LayerNorm and BatchNorm?
BatchNorm: normalizes each feature using the mean and variance computed across the minibatch.
LayerNorm: normalizes every single sample using the mean and variance computed across its features, independently of the other samples, at each layer.
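A two-line numpy illustration of which axis the statistics are computed over (a batch of 8 samples with 16 features, purely illustrative):

import numpy as np

x = np.random.randn(8, 16)   # (batch, features)

# BatchNorm: statistics per feature, computed across the minibatch (axis 0)
batch_norm = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)

# LayerNorm: statistics per sample, computed across its features (axis 1)
layer_norm = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-5)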
What changes would you make to your deep learning code if you knew there are errors in your
training data?
Use label smoothing, where the smoothing value is based on the estimated % of label errors.
Use class weights to modify the loss.
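A minimal PyTorch sketch combining both ideas; the smoothing value and class weights are illustrative, and the label_smoothing argument requires PyTorch 1.10 or newer:

import torch
import torch.nn as nn

# e.g. assume roughly 10% of labels are wrong -> smooth by about that amount
class_weights = torch.tensor([1.0, 2.0, 0.5])          # up/down-weight classes in the loss
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.1)

logits = torch.randn(4, 3)                             # (batch, num_classes)
targets = torch.tensor([0, 2, 1, 1])
loss = criterion(logits, targets)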