Assignment-7-Solution
QUESTION 1: [1 mark]
Which of the following best describes how ELMo’s architecture captures different linguistic
properties?
Correct Answer: b
Solution: ELMo uses a multi-layer bidirectional LSTM architecture, where different layers
capture different aspects of language. Empirical evidence shows that lower layers focus
more on syntactic information while higher layers capture more semantic nuances.
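As a rough illustration of how the layer outputs are combined, below is a minimal numpy sketch of the task-specific weighted sum described in the ELMo paper (ELMo_t = gamma * sum_j s_j * h_{t,j}); the shapes and numbers are made up for illustration, not taken from the actual model.
import numpy as np

# Minimal sketch of ELMo's layer mixing: a softmax-normalized scalar weight per
# layer, applied to that layer's representation of one token, then scaled by a
# task-specific gamma. All values below are illustrative only.
L, d = 2, 4                                   # 2 biLSTM layers + token layer, hidden size 4
rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(L + 1, d))   # h_{t,0}, h_{t,1}, h_{t,2} for one token t
s_raw = np.array([0.2, 1.0, 0.5])             # learned scalar scores (one per layer)
gamma = 1.0                                   # learned task-specific scale

s = np.exp(s_raw) / np.exp(s_raw).sum()       # softmax over layers
elmo_t = gamma * (s[:, None] * layer_outputs).sum(axis=0)
print(elmo_t.shape)                           # (4,)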
_________________________________________________________________________
QUESTION 2: [1 mark]
BERT and BART models differ in their architectures. While BERT is an (i) model, BART
is an (ii) one. Select the correct choices for (i) and (ii).
Correct Answer: c
Solution: BERT is an encoder-only model, whereas BART is an encoder-decoder
(sequence-to-sequence) model that combines a bidirectional encoder with an
autoregressive decoder.
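A quick way to see this difference programmatically is sketched below; it assumes the Hugging Face transformers package, whose configuration objects expose an is_encoder_decoder flag.
# Sketch only: inspect default configs from the transformers library.
# BERT's config describes an encoder-only model; BART's describes an
# encoder-decoder (sequence-to-sequence) model.
from transformers import BartConfig, BertConfig

print("BERT is encoder-decoder?", BertConfig().is_encoder_decoder)   # expected: False
print("BART is encoder-decoder?", BartConfig().is_encoder_decoder)   # expected: True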
_________________________________________________________________________
QUESTION 3: [1 mark]
Correct Answer: c
Solution: T5 is trained using a span corruption objective, which requires the model to
reconstruct masked spans of text.
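For intuition, the sketch below shows the input/target format of span corruption using the example sentence from the T5 paper; the sentinel token names follow the Hugging Face convention (<extra_id_0>, <extra_id_1>, ...), and the spans here are chosen by hand rather than sampled as in the actual preprocessing.
# Illustrative only: spans are dropped from the input and replaced by sentinel
# tokens; the target reproduces the dropped spans, each prefixed by its sentinel.
original        = "Thank you for inviting me to your party last week ."
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
target          = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"
print("Input :", corrupted_input)
print("Target:", target)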
________________________________________________________________________
QUESTION 4: [1 mark]
Correct Answer: d
Solution: T5 was pretrained on the “C4” (Colossal Clean Crawled Corpus) dataset.
_________________________________________________________________________
QUESTION 5: [1 mark]
Which of the following special tokens are introduced in BERT to handle sentence pairs?
a) [MASK] and [CLS]
b) [SEP] and [CLS]
c) [CLS] and [NEXT]
d) [SEP] and [MASK]
Correct Answer: b
Solution: BERT introduces the [CLS] token at the start for classification or overall sequence
representation and the [SEP] token to separate sentences. Thus, the special tokens are
“[SEP]” and “[CLS]”.
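The placement of these tokens can be seen directly with a tokenizer; the sketch below assumes the Hugging Face transformers package and the bert-base-uncased checkpoint.
# Sketch: encode a sentence pair and look at where [CLS] and [SEP] are inserted.
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("The movie was great.", "I would watch it again.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'great', '.', '[SEP]', 'i', 'would', ..., '[SEP]']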
_________________________________________________________________________
QUESTION 6: [2 marks]
ELMo and BERT represent two different pre-training strategies for language models. Which
of the following statement(s) about these approaches is/are true?
a) ELMo uses a bi-directional LSTM to pre-train word representations, while BERT uses
a transformer encoder with masked language modeling.
b) ELMo provides context-independent word representations, whereas BERT provides
context-dependent representations.
c) Pre-training of both ELMo and BERT involves next-token prediction.
d) Both ELMo and BERT produce word embeddings that can be fine-tuned for
downstream tasks.
Correct Answer: a, d
Solution: ELMo uses bidirectional LSTMs with a language modeling objective, while BERT
uses a transformer encoder and masked language modeling. Both can produce embeddings
that are fine-tuned for downstream tasks. Hence, the correct answers are (a) and (d).
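To make the difference in objectives concrete, the sketch below uses BERT's masked-language-modeling head to fill in a [MASK] token from context on both sides (it assumes the Hugging Face transformers package and downloads bert-base-uncased); ELMo, by contrast, is trained with forward and backward next-token language models.
# Sketch: BERT's pre-training objective predicts masked tokens using
# bidirectional context, not the next token in a left-to-right sweep.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill_mask("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))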
_________________________________________________________________________
QUESTION 7: [1 mark]
a) P(y | x), where x is the input sequence and y is the gold output sequence
b) P(x | y), where x is the input sequence and y is the gold output sequence
c) P(w_t | w_{1:t-1}), where w_t represents the token at position t, and w_{1:t-1} is the
sequence of tokens from position 1 to t-1
d) P(w_t | w_{1:t+1}), where w_t represents the token at position t, and w_{1:t+1} is the
sequence of tokens from position 1 to t+1
Correct Answer: c
Solution: Option (c) is the standard next-token (autoregressive) language-modeling
objective: each token is predicted from the tokens that precede it, i.e. P(w_t | w_{1:t-1}).
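As a toy illustration of option (c), the sketch below scores a sequence under a hand-made first-order (bigram) table, a special case of P(w_t | w_{1:t-1}); all probabilities are invented for illustration.
import numpy as np

# Toy sketch: an autoregressive language model scores a sequence as the
# product (here, the sum of logs) of next-token probabilities.
vocab = ["<s>", "the", "cat", "sat"]
P = np.array([            # P[i, j] = P(next token = vocab[j] | previous token = vocab[i])
    [0.0, 0.8, 0.1, 0.1],
    [0.0, 0.1, 0.7, 0.2],
    [0.0, 0.1, 0.1, 0.8],
    [0.1, 0.3, 0.3, 0.3],
])
sentence = ["<s>", "the", "cat", "sat"]
ids = [vocab.index(w) for w in sentence]
log_prob = sum(np.log(P[i, j]) for i, j in zip(ids[:-1], ids[1:]))
print(f"log P(sentence) = {log_prob:.3f}")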
_________________________________________________________________________
In the previous week, we saw the einsum function in numpy as a generalized operation
for performing tensor multiplications. Now, consider two matrices A = [[1, 5], [3, 7]]
and B = [[2, -1], [4, 2]]. What is the output of the following numpy operation?
numpy.einsum('ij,ij->', A, B)
Correct Answer: 23
Solution: The subscripts 'ij,ij->' multiply A and B elementwise and sum every product:
1·2 + 5·(-1) + 3·4 + 7·2 = 2 - 5 + 12 + 14 = 23.
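The answer can be checked with the short numpy snippet below.
import numpy as np

# 'ij,ij->' multiplies A and B elementwise and sums every product.
A = np.array([[1, 5],
              [3, 7]])
B = np.array([[2, -1],
              [4, 2]])
print(np.einsum('ij,ij->', A, B))   # 23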