This document is a quiz-style study guide on deep learning fundamentals, covering topics such as hyperparameters, neural network architecture, activation functions, and training techniques. Forty multiple-choice questions are followed by an answer key with brief explanations, giving insight into best practices and common challenges in training deep neural networks.
1. Which of the following is NOT typically considered a hyperparameter in training a deep neural network?
   a. Number of epochs
   b. Learning rate
   c. Weight parameters of the network after training
   d. Batch size

2. In a fully connected feedforward neural network, what best describes the term “layer”?
   a. A collection of nodes receiving input directly from the outside world
   b. A set of neurons where each unit takes input from the previous set of neurons and passes output to the next
   c. A single neuron with trainable weights
   d. The output prediction vector

3. When using the ReLU activation function, what is the output for a negative input value x < 0?
   a. x
   b. |x|
   c. 0
   d. A small negative constant

4. Suppose you have a binary classification problem. Which of the following is the most appropriate output activation function on the last layer?
   a. ReLU
   b. Sigmoid
   c. Tanh
   d. Softmax with equal probabilities

5. If the input dimension to a layer is 128 and the layer has 64 neurons, how many weight parameters (excluding bias) connect these inputs to the next layer?
   a. 64
   b. 128
   c. 8192
   d. 192

6. In the context of gradient descent, what is one common reason for using mini-batch gradient descent instead of full-batch or purely stochastic gradient descent?
   a. It always finds the global minimum
   b. It reduces variance in gradient estimates and improves computational efficiency
   c. It guarantees a better learning rate
   d. It prevents any form of overfitting

7. During forward propagation in a neural network, what is computed at each neuron (ignoring activation functions)?
   a. The gradient of the error with respect to the weights
   b. The weighted sum of inputs plus a bias
   c. The weight decay term
   d. The maximum value of all the inputs

8. Which of the following best describes the purpose of a validation set?
   a. To provide data for updating model parameters during training
   b. To tune hyperparameters and prevent overfitting
   c. To never be used after training
   d. To provide additional data to increase the training set size

9. One hallmark of deep learning is:
   a. Shallow architectures with a single layer
   b. Feature engineering done entirely by humans
   c. Multiple layers of learned feature representations
   d. Guaranteed perfect generalization

10. Which of the following is a common symptom of vanishing gradients during training?
    a. Model weights become extremely large
    b. Learning slows down dramatically or stops altogether
    c. Training loss increases rapidly
    d. The network refuses to compile

11. Weight initialization with small random values close to zero is typically done to:
    a. Break symmetry and ensure each neuron learns different features
    b. Guarantee immediate convergence
    c. Ensure that all neurons learn identical parameters
    d. Prevent the network from using activation functions

12. Given a training set loss and a validation set loss, if you notice your training loss keeps decreasing but your validation loss starts increasing, what phenomenon is most likely occurring?
    a. Underfitting
    b. Overfitting
    c. Proper generalization
    d. Stable convergence

13. Suppose you have a dataset with features on vastly different scales. Which method might you apply before training a deep neural network?
    a. Dropout
    b. Feature scaling (e.g., standardization or normalization)
    c. Early stopping
    d. Gradient clipping

14. The backpropagation algorithm primarily uses which rule to compute gradients?
    a. Hebbian learning rule
    b. Forward-mode differentiation
    c. Chain rule of calculus
    d. Laplacian smoothing

15. Which of the following techniques can help reduce overfitting in deep networks?
    a. Increasing the learning rate indefinitely
    b. Dropout regularization
    c. Using infinitely many layers
    d. Replacing ReLU with linear activation
16. A softmax output layer is commonly used for:
    a. Multi-class classification tasks
    b. Regression tasks with continuous outputs
    c. Binary classification tasks only
    d. Unsupervised dimension reduction

17. Momentum in optimization helps primarily by:
    a. Keeping the network weights unchanged
    b. Accelerating convergence by dampening oscillations in the gradient direction
    c. Discarding previously computed gradients
    d. Making the gradient updates random

18. In a neural network, bias terms are used to:
    a. Reduce overfitting by removing flexibility
    b. Shift the activation function and improve representational power
    c. Ensure the weights remain constant
    d. Increase training time without changing performance

19. Batch normalization is used to:
    a. Eliminate the need for activation functions
    b. Stabilize the distribution of layer inputs and speed up training
    c. Make training data unnecessary
    d. Ensure weights never decay

20. If a network’s parameters are initialized too large, what is a likely outcome when training begins?
    a. Gradients will vanish
    b. Gradients might explode, causing unstable updates
    c. All weights will freeze at zero
    d. Perfect generalization from the first epoch

21. Given a network with L hidden layers, which layer's activations serve as input to the (L+1)-th layer?
    a. The first hidden layer’s input
    b. The previous layer’s activations
    c. The final output layer
    d. The raw input features

22. The term “epoch” in neural network training refers to:
    a. A single update of weights after one batch
    b. One complete pass through the entire training dataset
    c. The time it takes to initialize the network
    d. The final stage of model evaluation

23. L2 regularization encourages weights to be:
    a. Larger in magnitude
    b. Closer to zero
    c. Completely sparse (many zeros)
    d. Randomly shuffled after each update

24. Assume a neural network outputs a probability distribution over classes. Which loss function is most commonly used for multi-class classification tasks?
    a. Mean Squared Error (MSE)
    b. Binary Cross-Entropy
    c. Categorical Cross-Entropy (Softmax Cross-Entropy)
    d. Hinge Loss

25. Early stopping is a form of:
    a. Weight initialization technique
    b. Data augmentation strategy
    c. Regularization that halts training before overfitting
    d. Optimization algorithm

26. The main goal of the activation function in a neuron is to:
    a. Keep the output linear
    b. Introduce non-linearity and allow complex decision boundaries
    c. Scale all outputs to a fixed range without non-linearity
    d. Control the learning rate

27. A deep network that consistently predicts the average of training outputs (e.g., a constant value) for all inputs is likely:
    a. Underfitting the data
    b. Overfitting the data
    c. Perfectly trained
    d. Experiencing exploding gradients

28. The primary difference between the forward pass and backward pass in training a neural network is:
    a. The forward pass computes outputs from inputs, while the backward pass computes gradients from outputs to inputs
    b. The backward pass updates inputs, while the forward pass updates weights
    c. Both passes compute gradients
    d. The backward pass uses no activation functions

29. In practice, to prevent numerical instability when computing the softmax function, one might:
    a. Use very large numbers in the exponent
    b. Subtract the maximum input value from each input before exponentiation
    c. Add a small constant to inputs
    d. Use random values for initial exponentiation
30. “Glorot” (Xavier) initialization is designed to:
    a. Initialize biases to large negative values
    b. Keep the variance of outputs at each layer roughly the same, preventing vanishing/exploding gradients
    c. Guarantee no overfitting
    d. Make gradient calculations unnecessary

31. Dropout works by:
    a. Setting a subset of activations to zero at random during training
    b. Eliminating entire layers permanently
    c. Adding Gaussian noise to the inputs
    d. Forcing all weights to be identical

32. The choice of optimizer (e.g., SGD vs. Adam) primarily affects:
    a. How the architecture of the neural network is designed
    b. How gradients are used to update the parameters, potentially impacting training speed and stability
    c. The training dataset’s size
    d. The number of layers needed in the network

33. Suppose you apply L1 regularization to your neural network. This encourages:
    a. Weights to become sparser (more zeros)
    b. No change in weight magnitude
    c. Weights to grow without bound
    d. Weights to stay strictly positive

34. A potential advantage of deep networks over shallow models is:
    a. They never require regularization
    b. They can learn hierarchical representations of data
    c. They train faster with no tuning
    d. They always have fewer parameters

35. Which is NOT a benefit of using vectorized operations (as opposed to explicit loops) for training neural networks?
    a. Faster computations due to optimized linear algebra libraries
    b. Easier to implement automatic differentiation
    c. Reduction in code complexity
    d. Necessarily higher accuracy on the validation set

36. If you have a large training set and notice your model is still overfitting, which strategy might help?
    a. Increase the model size further
    b. Apply stronger regularization (e.g., dropout, L2)
    c. Train for more epochs
    d. Stop using batch normalization

37. Activation functions like sigmoid or tanh can suffer from saturation, which leads to:
    a. Exploding gradients
    b. Zero gradients in saturated regions, slowing or stopping learning
    c. Infinite gradients
    d. Constant updates to the weights

38. The main idea behind using a validation set separate from the test set is to:
    a. Use it for final model performance reporting
    b. Adjust hyperparameters without contaminating the test performance estimate
    c. Make the model generalize instantly
    d. Skip the need for a training set

39. Suppose a neural network uses sigmoid activation in the output layer for binary classification. The predicted output is 0.8. This output can be interpreted as:
    a. The predicted probability of the positive class is 0.8
    b. The raw class score for the positive class is 0.8 units
    c. The network is certain of the positive class
    d. The margin for a linear classifier

40. When dealing with very high-dimensional inputs (e.g., images), deep networks help by:
    a. Relying solely on handcrafted features
    b. Automatically learning complex features hierarchically from raw data
    c. Ignoring lower-level patterns
    d. Reducing the input dimension to a single scalar without learning

Answer Key

1. c (Hyperparameters are set before training; the weights learned during training are not hyperparameters.)
2. b (A layer is a set of neurons that transform inputs to outputs for the next layer.)
3. c (ReLU(x) = max(0, x), so negative inputs produce 0.)
4. b (For binary classification, a sigmoid output is standard.)
5. c (Number of weights = 128 inputs × 64 neurons = 8192; the layer sketch after the answer key spells this out.)
6. b (Mini-batch gradient descent balances variance and computation.)
7. b (A neuron sums weighted inputs plus a bias, then applies an activation.)
8. b (Validation sets help tune hyperparameters and prevent overfitting.)
9. c (Deep learning involves multiple levels of representation.)
10. b (Vanishing gradients slow or halt learning.)
11. a (Small random initialization breaks symmetry.)
12. b (Training loss ↓ while validation loss ↑ typically indicates overfitting.)
13. b (Feature scaling ensures numeric stability and efficient training; a standardization sketch appears after the answer key.)
14. c (Backpropagation uses the chain rule of calculus.)
15. b (Dropout is a common method to reduce overfitting.)
16. a (Softmax is used for multi-class outputs.)
17. b (Momentum helps accelerate convergence and reduce oscillations; see the momentum sketch below.)
18. b (Bias shifts the activation function and increases representational flexibility.)
19. b (Batch normalization stabilizes input distributions of intermediate layers.)
20. b (Excessively large initial weights can lead to exploding gradients.)
21. b (Each layer receives input from the previous layer’s outputs.)
22. b (An epoch is one full pass over the training set.)
23. b (L2 regularization pushes weights towards zero; compare with L1 in the regularization sketch below.)
24. c (Categorical cross-entropy is standard for multi-class classification.)
25. c (Early stopping prevents overfitting by halting training.)
26. b (Activation functions add non-linearity.)
27. a (Predicting a constant average often means underfitting.)
28. a (Forward pass: inputs → outputs; backward pass: gradients from outputs → inputs.)
29. b (Subtracting the maximum input value reduces potential overflow in the exponent; a stable-softmax sketch follows the answer key.)
30. b (Glorot initialization maintains variance across layers; see the Glorot sketch below.)
31. a (Dropout randomly “drops” certain neurons’ outputs during training; see the dropout sketch below.)
32. b (Optimizers affect how parameters are updated.)
33. a (L1 regularization encourages sparse weights.)
34. b (Deep networks learn hierarchical feature representations.)
35. d (Vectorization does not guarantee higher accuracy; it just makes computation faster and cleaner.)
36. b (If the model is still overfitting, stronger regularization can help.)
37. b (Sigmoid/tanh saturation means gradients approach zero.)
38. b (Validation sets guide hyperparameter tuning without touching the final test set.)
39. a (Sigmoid outputs represent probabilities of the positive class.)
40. b (Deep nets learn features automatically from raw data.)
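To make answers 5 and 7 concrete, here is a minimal NumPy sketch of a fully connected layer with 128 inputs and 64 neurons (the variable names and the small initialization scale are illustrative assumptions, not part of the quiz). The forward pass computes a weighted sum of the inputs plus a bias, ReLU zeroes the negative pre-activations, and the weight matrix holds 128 × 64 = 8192 parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fully connected layer: 128 inputs -> 64 neurons.
n_in, n_out = 128, 64
W = rng.normal(scale=0.01, size=(n_in, n_out))  # small random init breaks symmetry (question 11)
b = np.zeros(n_out)                             # bias shifts the activation (question 18)

x = rng.normal(size=n_in)        # one input example
z = x @ W + b                    # weighted sum of inputs plus bias (question 7)
a = np.maximum(0.0, z)           # ReLU: negative pre-activations become 0 (question 3)

print(W.size)                    # 8192 weight parameters, excluding bias (question 5)
```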
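For answer 13, a quick sketch of standardization, one common form of feature scaling; the toy data matrix is made up for illustration, and in practice the mean and standard deviation would be computed on the training set only and reused for validation and test data.

```python
import numpy as np

X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 4000.0]])                  # two features on very different scales

# Standardization: zero mean, unit variance per feature.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))                   # ~0 for each feature
print(X_scaled.std(axis=0))                    # 1 for each feature
```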
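Answer 17 can be illustrated with classical momentum on a one-dimensional quadratic. The learning rate, momentum coefficient, and toy objective are arbitrary choices for this sketch; the point is that the accumulated velocity smooths successive gradients and damps oscillations.

```python
def sgd_momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """One classical-momentum update: accumulate a velocity, then move along it."""
    velocity = beta * velocity - lr * grad   # past gradients damp oscillations
    return w + velocity, velocity

# Minimize f(w) = w**2 (gradient 2 * w), starting from w = 5.0.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2 * w, v)
print(w)                                     # converges toward the minimum at w = 0
```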
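Answers 23 and 33 contrast the two standard penalties. The sketch below only shows the penalty terms and their gradients (the weight vector and the strength lam are arbitrary examples): the L2 gradient shrinks each weight in proportion to its size, while the constant-magnitude L1 gradient tends to drive small weights exactly to zero, which is why L1 produces sparsity.

```python
import numpy as np

w = np.array([-0.8, 0.0, 0.3, 1.5])
lam = 0.01                                     # regularization strength (arbitrary example)

l2_penalty = lam * np.sum(w ** 2)              # added to the loss; gradient 2 * lam * w
l1_penalty = lam * np.sum(np.abs(w))           # added to the loss; gradient lam * sign(w)

# L2 shrinks every weight proportionally to its size (toward, but rarely exactly, zero);
# L1 applies a constant-magnitude pull, so small weights get driven exactly to zero.
print(l2_penalty, l1_penalty)
```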
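Answer 29 mentions subtracting the maximum before exponentiation. Below is a sketch of that standard stabilization in NumPy (the function name and the example logits are mine): shifting all logits by the same constant cancels in the normalization, so the probabilities are unchanged while exp() no longer overflows.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: shift by the max so exp() never overflows."""
    shifted = logits - np.max(logits)      # largest exponent becomes 0
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(logits))                     # well-behaved probabilities, no overflow
# A naive np.exp(logits) / np.exp(logits).sum() would overflow to inf/nan here.
```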
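For answer 30, a rough sketch of Glorot (Xavier) uniform initialization, assuming the commonly used limit sqrt(6 / (fan_in + fan_out)); with that limit the weight variance is 2 / (fan_in + fan_out), which helps keep activation variance roughly constant from layer to layer.

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Glorot/Xavier uniform init: limit = sqrt(6 / (fan_in + fan_out))."""
    if rng is None:
        rng = np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(128, 64)
# Variance of U(-limit, limit) is limit**2 / 3 = 2 / (fan_in + fan_out).
print(W.var(), 2.0 / (128 + 64))
```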
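Answer 31 describes dropout as zeroing a random subset of activations during training. The sketch below uses the common "inverted dropout" variant (the rescaling by the keep probability is my assumption, not something stated in the quiz) so that no scaling is needed at test time.

```python
import numpy as np

def dropout(activations, drop_prob=0.5, training=True, rng=None):
    """Randomly zero a fraction of activations during training (inverted dropout)."""
    if not training or drop_prob == 0.0:
        return activations                     # no dropout at inference time
    if rng is None:
        rng = np.random.default_rng(0)
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected activation is unchanged.
    return activations * mask / keep_prob

a = np.ones(10)
print(dropout(a, drop_prob=0.5))               # roughly half the entries are zeroed
print(dropout(a, training=False))              # unchanged at test time
```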