Week 9
d) To adjust the weights of the neural network during training
Answer: b) To transform the dot product into a probability distribution
Solution: The softmax function is used in the skip-gram method to transform the dot products between the target word's embedding and every output-word embedding into a probability distribution. This distribution represents the likelihood of seeing each context word given the target word, and the model is trained by minimizing the cross-entropy loss between the predicted and actual distributions.
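As a rough illustration of this step, here is a minimal NumPy sketch (toy vocabulary size, random vectors; names such as v_target and U_out are illustrative, not from any particular implementation) that turns the dot products between a target-word embedding and all output embeddings into a probability distribution with softmax:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, dim = 10, 4                     # toy vocabulary and embedding size

    v_target = rng.normal(size=dim)             # embedding of the target (input) word
    U_out = rng.normal(size=(vocab_size, dim))  # one output embedding per vocabulary word

    scores = U_out @ v_target                   # dot product of the target with every word
    scores -= scores.max()                      # shift for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

    print(probs.sum())  # 1.0 -- a valid probability distribution over context words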
6. Suppose we are learning word representations using GloVe. If we observe that the cosine similarity between the two representations v_i and v_j for words 'i' and 'j' is very high, which of the following statements is true? (Take the parameters b_i = 0.02 and b_j = 0.05.)
a) X_ij = 0.03
b) X_ij = 0.8
c) X_ij = 0.35
d) X_ij = 0
Answer: b)
Solution: Since the word representations are similar, we know v_i^T v_j is high. In this formulation, v_i^T v_j = X_ij − b_i − b_j, so X_ij must also be high, and the only high value among the options is X_ij = 0.8.
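As a quick numeric check of this reasoning (a small Python sketch that uses the relation exactly as stated in the solution, together with the values given in the question):

    # Relation used in the solution: v_i^T v_j = X_ij - b_i - b_j
    b_i, b_j = 0.02, 0.05
    for X_ij in [0.03, 0.8, 0.35, 0.0]:      # the four options
        dot = X_ij - b_i - b_j               # implied value of v_i^T v_j
        print(f"X_ij = {X_ij}: v_i^T v_j = {dot:.2f}")
    # Only X_ij = 0.8 yields a comparatively large v_i^T v_j,
    # consistent with the two representations being very similar.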
7. We add incorrect pairs to our corpus so that, during training, we maximize the probability of words that occur in the same context and minimize the probability of words that occur in different contexts. This technique is called:
a) Hierarchical softmax
b) Contrastive estimation
c) Negative sampling
d) GloVe representations
Answer: c)
Solution: The process of adding incorrect pairs to the training set is called negative sampling.
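To make the idea concrete, here is a minimal NumPy sketch of the skip-gram objective with negative sampling (toy dimensions, random vectors; the names are illustrative): the true (target, context) pair is pushed toward high probability while the sampled incorrect pairs are pushed toward low probability.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    dim, num_neg = 4, 3

    v_target = rng.normal(size=dim)                # embedding of the target word
    u_context = rng.normal(size=dim)               # embedding of the true context word
    u_negatives = rng.normal(size=(num_neg, dim))  # embeddings of sampled incorrect words

    # Loss is low when the true pair scores high and the incorrect pairs score low.
    loss = -np.log(sigmoid(u_context @ v_target)) \
           - np.log(sigmoid(-(u_negatives @ v_target))).sum()
    print(loss)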
8. What is the computational complexity of computing the softmax function in the output layer
of a neural network?
a) O(n)
b) O(n^2)
c) O(n log n)
d) O(log n)
Answer: a)
Explanation: Computing the softmax requires exponentiating every output score and normalizing by their sum, so the computational complexity is O(n), where n is the number of output classes.
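The linear cost is easy to see in a direct implementation (a minimal sketch, not tied to any particular framework): every one of the n output scores is touched a constant number of times.

    import numpy as np

    def softmax(scores):
        # Each step (shift, exponentiate, sum, divide) touches all n entries once,
        # so the total cost is O(n) in the number of output classes.
        shifted = scores - scores.max()      # shift for numerical stability
        exps = np.exp(shifted)
        return exps / exps.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))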
9. How does Hierarchical Softmax reduce the computational complexity of computing the
softmax function?
a) It replaces the softmax function with a linear function
b) It uses a binary tree to approximate the softmax function
c) It uses a heuristic to compute the softmax function faster
d) It does not reduce the computational complexity of computing the softmax function
Answer: b)
Explanation: Hierarchical Softmax uses a binary tree to approximate the softmax function.
This reduces the computational complexity of computing the softmax function from O(n) to
O(log n).
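A minimal sketch of the idea (hypothetical tree path and random vectors, not a full implementation): each word is a leaf of a binary tree, and its probability is the product of sigmoid branching decisions along the root-to-leaf path, so only about log2(n) node vectors are touched instead of all n outputs.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    dim = 4
    h = rng.normal(size=dim)                  # hidden representation of the context

    # Root-to-leaf path for one word: (internal-node vector, branch left?)
    path = [(rng.normal(size=dim), True),
            (rng.normal(size=dim), False),
            (rng.normal(size=dim), True)]     # path length is about log2(vocab size)

    prob = 1.0
    for node_vec, go_left in path:
        p_left = sigmoid(node_vec @ h)        # probability of branching left at this node
        prob *= p_left if go_left else (1.0 - p_left)

    print(prob)  # P(word | context), using only O(log n) dot products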
10. What is the disadvantage of using Hierarchical Softmax?
a) It requires more memory to store the binary tree
b) It is slower than computing the softmax function directly
c) It is less accurate than computing the softmax function directly
d) It is more prone to overfitting than computing the softmax function directly
Answer: a)
Explanation: The disadvantage of using Hierarchical Softmax is that it requires more
memory to store the binary tree. This can be a problem when dealing with large datasets or
models with a large number of output classes.