10. Improving Deep Neural Networks: Hyperparameter Tuning, Regularization

The document contains a series of multiple-choice questions on deep learning concepts, covering topics such as hyperparameters, neural network architecture, activation functions, and training techniques. An answer key with brief explanations follows the questions, providing insights into best practices and common challenges in training deep neural networks. The content serves as a quiz or study guide for individuals looking to strengthen their understanding of deep learning fundamentals.


1. Which of the following is NOT typically considered a hyperparameter in training a deep
neural network?
a. Number of epochs
b. Learning rate
c. Weight parameters of the network after training
d. Batch size
2. In a fully connected feedforward neural network, what best describes the term “layer”?
a. A collection of nodes receiving input directly from the outside world
b. A set of neurons where each unit takes input from the previous set of neurons and
passes output to the next
c. A single neuron with trainable weights
d. The output prediction vector
3. When using the ReLU activation function, what is the output for a negative input value x
< 0?
a. x
b. |x|
c. 0
d. A small negative constant
4. Suppose you have a binary classification problem. Which of the following is the most
appropriate output activation function on the last layer?
a. ReLU
b. Sigmoid
c. Tanh
d. Softmax with equal probabilities
5. If the input dimension to a layer is 128 and the layer has 64 neurons, how many weight
parameters (excluding biases) connect the inputs to this layer's neurons?
a. 64
b. 128
c. 8192
d. 192
6. In the context of gradient descent, what is one common reason for using mini-batch
gradient descent instead of full-batch or purely stochastic gradient descent?
a. It always finds the global minimum
b. It reduces variance in gradient estimates and improves computational efficiency
c. It guarantees a better learning rate
d. It prevents any form of overfitting
7. During forward propagation in a neural network, what is computed at each neuron
(ignoring activation functions)?
a. The gradient of the error with respect to the weights
b. The weighted sum of inputs plus a bias
c. The weight decay term
d. The maximum value of all the inputs
8. Which of the following best describes the purpose of a validation set?
a. To provide data for updating model parameters during training
b. To tune hyperparameters and prevent overfitting
c. To never be used after training
d. To provide additional data to increase the training set size
9. One hallmark of deep learning is:
a. Shallow architectures with a single layer
b. Feature engineering done entirely by humans
c. Multiple layers of learned feature representations
d. Guaranteed perfect generalization
10. Which of the following is a common symptom of vanishing gradients during training?
a. Model weights become extremely large
b. Learning slows down dramatically or stops altogether
c. Training loss increases rapidly
d. The network refuses to compile
11. Weight initialization with small random values close to zero is typically done to:
a. Break symmetry and ensure each neuron learns different features
b. Guarantee immediate convergence
c. Ensure that all neurons learn identical parameters
d. Prevent the network from using activation functions
12. Given a training set loss and a validation set loss, if you notice your training loss keeps
decreasing but your validation loss starts increasing, what phenomenon is most likely
occurring?
a. Underfitting
b. Overfitting
c. Proper generalization
d. Stable convergence
13. Suppose you have a dataset with features on vastly different scales. Which method might
you apply before training a deep neural network?
a. Dropout
b. Feature scaling (e.g., standardization or normalization)
c. Early stopping
d. Gradient clipping
14. The backpropagation algorithm primarily uses which rule to compute gradients?
a. Hebbian learning rule
b. Forward-mode differentiation
c. Chain rule of calculus
d. Laplacian smoothing
15. Which of the following techniques can help reduce overfitting in deep networks?
a. Increasing the learning rate indefinitely
b. Dropout regularization
c. Using infinitely many layers
d. Replacing ReLU with linear activation
16. A softmax output layer is commonly used for:
a. Multi-class classification tasks
b. Regression tasks with continuous outputs
c. Binary classification tasks only
d. Unsupervised dimension reduction
17. Momentum in optimization helps primarily by:
a. Keeping the network weights unchanged
b. Accelerating convergence by dampening oscillations in the gradient direction
c. Discarding previously computed gradients
d. Making the gradient updates random
18. In a neural network, bias terms are used to:
a. Reduce overfitting by removing flexibility
b. Shift the activation function and improve representational power
c. Ensure the weights remain constant
d. Increase training time without changing performance
19. Batch normalization is used to:
a. Eliminate the need for activation functions
b. Stabilize the distribution of layer inputs and speed up training
c. Make training data unnecessary
d. Ensure weights never decay
20. If a network’s parameters are initialized too large, what is a likely outcome when training
begins?
a. Gradients will vanish
b. Gradients might explode, causing unstable updates
c. All weights will freeze at zero
d. Perfect generalization from the first epoch
21. Given a network with L hidden layers, which layer's activations serve as input to the
(L+1)-th layer?
a. The first hidden layer’s input
b. The previous layer’s activations
c. The final output layer
d. The raw input features
22. The term “epoch” in neural network training refers to:
a. A single update of weights after one batch
b. One complete pass through the entire training dataset
c. The time it takes to initialize the network
d. The final stage of model evaluation
23. L2 regularization encourages weights to be:
a. Larger in magnitude
b. Closer to zero
c. Completely sparse (many zeros)
d. Randomly shuffled after each update
24. Assume a neural network outputs a probability distribution over classes. Which loss
function is most commonly used for multi-class classification tasks?
a. Mean Squared Error (MSE)
b. Binary Cross-Entropy
c. Categorical Cross-Entropy (Softmax Cross-Entropy)
d. Hinge Loss
25. Early stopping is a form of:
a. Weight initialization technique
b. Data augmentation strategy
c. Regularization that halts training before overfitting
d. Optimization algorithm
26. The main goal of the activation function in a neuron is to:
a. Keep the output linear
b. Introduce non-linearity and allow complex decision boundaries
c. Scale all outputs to a fixed range without non-linearity
d. Control the learning rate
27. A deep network that consistently predicts the average of training outputs (e.g., a constant
value) for all inputs is likely:
a. Underfitting the data
b. Overfitting the data
c. Perfectly trained
d. Experiencing exploding gradients
28. The primary difference between the forward pass and backward pass in training a neural
network is:
a. The forward pass computes outputs from inputs, while the backward pass computes
gradients from outputs to inputs
b. The backward pass updates inputs, while the forward pass updates weights
c. Both passes compute gradients
d. The backward pass uses no activation functions
29. In practice, to prevent numerical instability when computing the softmax function, one
might:
a. Use very large numbers in the exponent
b. Subtract the maximum input value from each input before exponentiation
c. Add a small constant to inputs
d. Use random values for initial exponentiation
30. “Glorot” (Xavier) initialization is designed to:
a. Initialize biases to large negative values
b. Keep the variance of outputs at each layer roughly the same, preventing
vanishing/exploding gradients
c. Guarantee no overfitting
d. Make gradient calculations unnecessary
31. Dropout works by:
a. Setting a subset of activations to zero at random during training
b. Eliminating entire layers permanently
c. Adding Gaussian noise to the inputs
d. Forcing all weights to be identical
32. The choice of optimizer (e.g., SGD vs. Adam) primarily affects:
a. How the architecture of the neural network is designed
b. How gradients are used to update the parameters, potentially impacting training speed
and stability
c. The training dataset’s size
d. The number of layers needed in the network
33. Suppose you apply L1 regularization to your neural network. This encourages:
a. Weights to become sparser (more zeros)
b. No change in weight magnitude
c. Weights to grow without bound
d. Weights to stay strictly positive
34. A potential advantage of deep networks over shallow models is:
a. They never require regularization
b. They can learn hierarchical representations of data
c. They train faster with no tuning
d. They always have fewer parameters
35. Which is NOT a benefit of using vectorized operations (as opposed to explicit loops) for
training neural networks?
a. Faster computations due to optimized linear algebra libraries
b. Easier to implement automatic differentiation
c. Reduction in code complexity
d. Necessarily higher accuracy on the validation set
36. If you have a large training set and notice your model is still overfitting, which strategy
might help?
a. Increase the model size further
b. Apply stronger regularization (e.g., dropout, L2)
c. Train for more epochs
d. Stop using batch normalization
37. Activation functions like sigmoid or tanh can suffer from saturation, which leads to:
a. Exploding gradients
b. Zero gradients in saturated regions, slowing or stopping learning
c. Infinite gradients
d. Constant updates to the weights
38. The main idea behind using a validation set separate from the test set is to:
a. Use it for final model performance reporting
b. Adjust hyperparameters without contaminating the test performance estimate
c. Make the model generalize instantly
d. Skip the need for a training set
39. Suppose a neural network uses sigmoid activation in the output layer for binary
classification. The predicted output is 0.8. This output can be interpreted as:
a. The predicted probability of the positive class is 0.8
b. The raw class score for the positive class is 0.8 units
c. The network is certain of the positive class
d. The margin for a linear classifier
40. When dealing with very high-dimensional inputs (e.g., images), deep networks help by:
a. Relying solely on handcrafted features
b. Automatically learning complex features hierarchically from raw data
c. Ignoring lower-level patterns
d. Reducing the input dimension to a single scalar without learning
Answers

1. c (Hyperparameters are set before training; the weights learned during training are not
hyperparameters.)
2. b (A layer is a set of neurons that transform inputs to outputs for the next layer.)
3. c (ReLU(x) = max(0, x), so negative inputs produce 0.)
4. b (For binary classification, a sigmoid output is standard.)
5. c (Number of weights = 128 inputs * 64 neurons = 8192.)
6. b (Mini-batch gradient descent balances variance and computation.)
7. b (A neuron computes a weighted sum of its inputs plus a bias, then applies the activation; see the sketch after this answer key.)
8. b (Validation sets help tune hyperparameters and prevent overfitting.)
9. c (Deep learning involves multiple levels of representation.)
10. b (Vanishing gradients slow or halt learning.)
11. a (Small random initialization breaks symmetry.)
12. b (Training loss ↓ while validation loss ↑ typically indicates overfitting.)
13. b (Feature scaling puts features on comparable scales, helping numeric stability and efficient training; sketch below.)
14. c (Backpropagation uses the chain rule of calculus; worked example below.)
15. b (Dropout is a common method to reduce overfitting.)
16. a (Softmax is used for multi-class outputs.)
17. b (Momentum helps accelerate convergence and reduce oscillations; update rule sketched below.)
18. b (Bias shifts the activation function and increases representational flexibility.)
19. b (Batch normalization stabilizes the input distributions of intermediate layers; sketch below.)
20. b (Excessively large initial weights can lead to exploding gradients.)
21. b (Each layer receives input from the previous layer’s outputs.)
22. b (An epoch is one full pass over the training set.)
23. b (L2 regularization pushes weights towards zero; sketch below.)
24. c (Categorical cross-entropy is standard for multi-class classification; sketch below.)
25. c (Early stopping prevents overfitting by halting training.)
26. b (Activation functions add non-linearity.)
27. a (Predicting a constant average often means underfitting.)
28. a (Forward pass: inputs → outputs; Backward pass: gradients from outputs →
inputs.)
29. b (Subtracting the maximum input value reduces potential overflow in the exponent; sketch below.)
30. b (Glorot initialization maintains variance across layers; sketch below.)
31. a (Dropout randomly “drops” certain neurons’ outputs during training; sketch below.)
32. b (Optimizers affect how parameters are updated.)
33. a (L1 regularization encourages sparse weights.)
34. b (Deep networks learn hierarchical feature representations.)
35. d (Vectorization does not guarantee higher accuracy; it just makes computation faster
and the code cleaner.)
36. b (If still overfitting, stronger regularization can help.)
37. b (Sigmoid/tanh saturation means gradients approach zero.)
38. b (Validation sets guide hyperparameter tuning without touching the final test
set.)
39. a (Sigmoid outputs represent probabilities of the positive class.)
40. b (Deep nets learn features automatically from raw data.)
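
For answer 7, here is a minimal NumPy sketch of what a single fully connected layer computes during the forward pass; the shapes (128 inputs, 64 neurons) reuse the numbers from question 5 and are otherwise illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=128)          # input vector with 128 features
    W = rng.normal(size=(64, 128))    # 64 neurons x 128 inputs = 8192 weights (cf. answer 5)
    b = np.zeros(64)                  # one bias per neuron

    z = W @ x + b                     # weighted sum of inputs plus bias (answer 7)
    a = np.maximum(0.0, z)            # ReLU: negative pre-activations become 0 (answer 3)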
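
For answer 13, a minimal sketch of standardization (zero mean, unit variance per feature); the tiny data matrix and the epsilon guard against division by zero are assumptions for illustration:

    import numpy as np

    X = np.array([[1.0, 2000.0],
                  [2.0, 3000.0],
                  [3.0, 1000.0]])           # two features on very different scales

    mu = X.mean(axis=0)                     # per-feature mean
    sigma = X.std(axis=0)                   # per-feature standard deviation
    X_scaled = (X - mu) / (sigma + 1e-8)    # standardized features: roughly mean 0, std 1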
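
For answer 14, a tiny worked example of the chain rule on one sigmoid neuron with a squared-error loss; all numbers are illustrative:

    import numpy as np

    x, t = 2.0, 1.0                  # input and target
    w, b = 0.5, 0.1                  # current parameters

    z = w * x + b                    # pre-activation
    y = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation
    L = (y - t) ** 2                 # squared-error loss

    # Chain rule: dL/dw = dL/dy * dy/dz * dz/dw
    dL_dy = 2.0 * (y - t)
    dy_dz = y * (1.0 - y)            # derivative of the sigmoid
    dz_dw = x
    dL_dw = dL_dy * dy_dz * dz_dw    # the gradient backpropagation would produce for w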
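
For answer 17, a minimal sketch of the classical momentum update on a toy quadratic loss; the momentum coefficient 0.9 and learning rate 0.1 are typical but arbitrary choices:

    def grad(w):
        return w                     # gradient of the toy loss 0.5 * w**2

    w, v = 5.0, 0.0                  # parameter and velocity
    lr, beta = 0.1, 0.9

    for _ in range(50):
        v = beta * v + grad(w)       # velocity: exponentially weighted sum of past gradients
        w = w - lr * v               # damps oscillations, accelerates along consistent directions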
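
For answer 19, a minimal sketch of the normalization step inside batch normalization at training time; gamma and beta are the usual learnable scale and shift, and the epsilon is a numerical guard:

    import numpy as np

    rng = np.random.default_rng(1)
    z = rng.normal(loc=5.0, scale=3.0, size=(32, 64))   # a mini-batch of pre-activations

    mu = z.mean(axis=0)                       # per-feature batch mean
    var = z.var(axis=0)                       # per-feature batch variance
    z_hat = (z - mu) / np.sqrt(var + 1e-5)    # normalized to roughly mean 0, variance 1

    gamma, beta = np.ones(64), np.zeros(64)   # learnable scale and shift
    out = gamma * z_hat + beta                # stabilized inputs passed to the next layer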
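
For answer 23, a minimal sketch of how an L2 penalty enters the loss and the gradient; lam is an assumed regularization strength:

    import numpy as np

    def l2_loss(data_loss, W, lam=1e-3):
        # total loss = data loss + (lam / 2) * sum of squared weights
        return data_loss + 0.5 * lam * np.sum(W ** 2)

    # The penalty adds lam * W to the gradient of W, so each update shrinks
    # the weights toward zero in proportion to their current size.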
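
For answer 24, a minimal sketch of categorical cross-entropy for a single example with a one-hot target; the probabilities are illustrative and the small constant avoids log(0):

    import numpy as np

    p = np.array([0.7, 0.2, 0.1])            # predicted class probabilities (e.g. from softmax)
    y = np.array([1.0, 0.0, 0.0])            # one-hot true label

    loss = -np.sum(y * np.log(p + 1e-12))    # = -log(0.7) here, since only the true class counts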
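
For answer 29, a minimal sketch of a numerically stable softmax; subtracting the maximum leaves the result mathematically unchanged but keeps the exponentials from overflowing:

    import numpy as np

    def stable_softmax(logits):
        shifted = logits - np.max(logits)    # subtract the maximum input value (answer 29)
        exps = np.exp(shifted)               # largest exponent is now exp(0) = 1
        return exps / np.sum(exps)

    print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))   # no overflow; sums to 1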
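
For answer 30, a minimal sketch of Glorot (Xavier) uniform initialization; the limit sqrt(6 / (fan_in + fan_out)) is the standard choice that keeps the variance of layer outputs roughly constant across layers:

    import numpy as np

    def glorot_uniform(fan_in, fan_out, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_out, fan_in))

    W = glorot_uniform(128, 64)              # weights for a 128 -> 64 layer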
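
For answer 31, a minimal sketch of inverted dropout at training time; keep_prob is an assumed hyperparameter, and dividing by it keeps the expected activation unchanged so nothing needs rescaling at test time:

    import numpy as np

    def dropout(activations, keep_prob=0.8, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        mask = rng.random(activations.shape) < keep_prob    # keep each unit with probability keep_prob
        return activations * mask / keep_prob               # zero a random subset, rescale the rest

    a = np.ones((4, 5))
    print(dropout(a))    # roughly 20% of entries become 0; the kept ones become 1.25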
