## Hyperparameter Tuning, Regularization and Optimization Quiz Answers

Q1. If you have 10,000,000 examples, how would you split the train/dev/test set?

- 60% train, 20% dev, 20% test
- **98% train, 1% dev, 1% test**
- 33% train, 33% dev, 33% test

Q2. The dev and test set should:

- **Come from the same distribution**
- Have the same number of examples
- Be identical to each other (same (x,y) pairs)
- Come from different distributions

Q3. If your neural network model seems to have high variance, which of the following would be promising things to try?

- Make the Neural Network deeper
- Increase the number of units in each hidden layer
- **Get more training data**
- Get more test data
- **Add regularization**

Q4. You are working on an automated check-out kiosk for a supermarket, and are building a classifier for apples, bananas and oranges. Suppose your classifier obtains a training set error of 0.5%, and a dev set error of 7%. Which of the following are promising things to try to improve your classifier? (Check all that apply.)

- **Increase the regularization parameter lambda**
- Decrease the regularization parameter lambda
- **Get more training data**
- Use a bigger neural network

Q5. What is weight decay?

- A technique to avoid vanishing gradient by imposing a ceiling on the values of the weights.
- The process of gradually decreasing the learning rate during training.
- **A regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.**
- Gradual corruption of the weights in the neural network if it is trained on noisy data.

Q6. What happens when you increase the regularization hyperparameter lambda?

- Weights are pushed toward becoming bigger (further from 0)
- **Weights are pushed toward becoming smaller (closer to 0)**
- Doubling lambda should roughly result in doubling the weights
- Gradient descent taking bigger steps with each iteration (proportional to lambda)
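
To see why L2 regularization is called weight decay (Q5) and why a larger lambda pushes weights toward 0 (Q6), here is a minimal sketch of one gradient descent update on a single weight. The function name `l2_update` and the numbers are illustrative assumptions; the penalty is assumed to be written as (lambda / (2m)) * w^2.

```python
def l2_update(w, grad, lr, lam, m):
    """One gradient descent step with an assumed L2 penalty (lam / (2 * m)) * w**2.

    The penalty adds (lam / m) * w to the gradient, which is the same as
    multiplying w by (1 - lr * lam / m) before the usual gradient step.
    """
    decay = 1.0 - lr * lam / m
    return decay * w - lr * grad


# With a zero data gradient, the weight still shrinks toward 0,
# and a larger lambda shrinks it faster.
w_small_lam = l2_update(w=1.0, grad=0.0, lr=0.1, lam=1.0, m=10)
w_large_lam = l2_update(w=1.0, grad=0.0, lr=0.1, lam=5.0, m=10)
print(w_small_lam, w_large_lam)
```

The multiplicative factor `decay` is slightly below 1, which is where the name "weight decay" comes from.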

Q7. With the inverted dropout technique, at test time:

- You apply dropout (randomly eliminating units) and do not keep the 1/keep_prob factor in the calculations used in training
- You apply dropout (randomly eliminating units) but keep the 1/keep_prob factor in the calculations used in training.
- **You do not apply dropout (do not randomly eliminate units) and do not keep the 1/keep_prob factor in the calculations used in training.**
- You do not apply dropout (do not randomly eliminate units), but keep the 1/keep_prob factor in the calculations used in training.
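
The Q7 answer can be illustrated with a small NumPy sketch of inverted dropout. The function name `dropout_forward` and the `training` flag are assumptions made for this example, not part of the quiz.

```python
import numpy as np


def dropout_forward(a, keep_prob, training, rng):
    """Inverted dropout applied to an activation array `a` (hypothetical helper)."""
    if not training:
        # Test time: no units are dropped and no 1/keep_prob scaling is applied.
        return a
    mask = rng.random(a.shape) < keep_prob  # keep each unit with probability keep_prob
    return a * mask / keep_prob             # rescale so the expected activation is unchanged


rng = np.random.default_rng(0)
a = np.ones((4, 1000))
train_out = dropout_forward(a, keep_prob=0.8, training=True, rng=rng)
test_out = dropout_forward(a, keep_prob=0.8, training=False, rng=rng)
print(train_out.mean(), test_out.mean())
```

Because the scaling by 1/keep_prob already happens during training, the expected activations match at test time without any extra factor.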

Q8. Increasing the parameter keep_prob from (say) 0.5 to 0.6 will likely cause the following: (Check the two that apply)

- Increasing the regularization effect
- **Reducing the regularization effect**
- Causing the neural network to end up with a higher training set error
- **Causing the neural network to end up with a lower training set error**

Q9. Which of these techniques are useful for reducing variance (reducing overfitting)? (Check all that apply.)

- Exploding gradient
- **L2 regularization**
- Vanishing gradient
- Xavier initialization
- **Data augmentation**
- Gradient checking
- **Dropout**

Q10. Why do we normalize the inputs x?

- It makes it easier to visualize the data
- It makes the parameter initialization faster
- Normalization is another word for regularization: it helps to reduce variance
- **It makes the cost function faster to optimize**

## Hyperparameter Tuning, Regularization and Optimization Week 02 Quiz Answers

Q1. Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th minibatch?

- $a^{[3]\{7\}(8)}$
- $a^{[8]\{3\}(7)}$
- $a^{[8]\{7\}(3)}$
- **$a^{[3]\{8\}(7)}$**

Q2. Which of these statements about mini-batch gradient descent do you agree with?

- Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
- **One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.**
- You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).

Q3. Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

- If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
- If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
- **If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.**
- **If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.**
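
As a rough sketch of the trade-off in Q2 and Q3, the generator below splits a training set into shuffled mini-batches. The name `mini_batches` and the column-per-example layout are assumptions for this example.

```python
import numpy as np


def mini_batches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; examples are stored as columns."""
    m = X.shape[1]
    perm = rng.permutation(m)
    X, Y = X[:, perm], Y[:, perm]
    for start in range(0, m, batch_size):
        yield X[:, start:start + batch_size], Y[:, start:start + batch_size]


rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(2, 10)  # 2 features, m = 10 examples
Y = np.arange(10.0).reshape(1, 10)
sizes = [xb.shape[1] for xb, _ in mini_batches(X, Y, batch_size=4, rng=rng)]
print(sizes)  # [4, 4, 2]
```

Setting `batch_size` to m yields a single batch per epoch (batch gradient descent), while setting it to 1 yields m batches of one example each, so nothing is vectorized across examples.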

Q4. Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:

[Figure: cost J plotted against the number of iterations]

Which of the following do you agree with?

- Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
- Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
- If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
- **If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.**

Q5. Suppose the temperatures in Casablanca over the first two days of January are the same:

Jan 1st: θ_1 = 10

Jan 2nd: θ_2 = 10

(We used Fahrenheit in lecture, so will use Celsius here in honor of the metric world.)

Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value you compute with bias correction, what are these values? (You might be able to do this without a calculator, but you don’t actually need one. Remember what bias correction is doing.)

- **$v_2 = 7.5$, $v_2^{corrected} = 10$**
- $v_2 = 7.5$, $v_2^{corrected} = 7.5$
- $v_2 = 10$, $v_2^{corrected} = 7.5$
- $v_2 = 10$, $v_2^{corrected} = 10$
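
The arithmetic for Q5 can be checked with a short sketch of the recurrence from the question. The function name `ewa` is an assumption; the bias correction divides by $1 - \beta^t$.

```python
def ewa(thetas, beta):
    """Return (v_t, bias-corrected v_t) after processing all observations.

    Assumes `thetas` is non-empty, so 1 - beta**t is nonzero.
    """
    v = 0.0  # v_0 = 0
    for theta in thetas:
        v = beta * v + (1.0 - beta) * theta
    t = len(thetas)
    return v, v / (1.0 - beta ** t)


v2, v2_corrected = ewa([10.0, 10.0], beta=0.5)
print(v2, v2_corrected)  # 7.5 10.0
```

Step by step: $v_1 = 0.5 \cdot 0 + 0.5 \cdot 10 = 5$, $v_2 = 0.5 \cdot 5 + 0.5 \cdot 10 = 7.5$, and $v_2^{corrected} = 7.5 / (1 - 0.5^2) = 10$. Bias correction undoes the pull toward the initial value $v_0 = 0$.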

Q6. Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.

- $\alpha = 0.95^t \alpha_0$
- **$\alpha = e^t \alpha_0$**
- $\alpha = \frac{1}{1+2t} \alpha_0$
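
To compare the schedules in Q6 numerically, the sketch below evaluates each one for the first few epochs. The starting value `alpha0 = 0.1` is an arbitrary assumption for illustration.

```python
import math

alpha0 = 0.1  # assumed initial learning rate

schedules = {
    "0.95^t * alpha0": lambda t: 0.95 ** t * alpha0,
    "alpha0 / (1 + 2t)": lambda t: alpha0 / (1 + 2 * t),
    "e^t * alpha0": lambda t: math.exp(t) * alpha0,
}

for name, schedule in schedules.items():
    values = [schedule(t) for t in range(4)]
    print(name, [round(v, 4) for v in values])
```

The first two schedules shrink the learning rate as t grows, while $e^t \alpha_0$ grows exponentially, which is why it is not a decay scheme at all.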

Q7. You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The red line below was computed using $\beta = 0.9$. What would happen to your red curve as you vary $\beta$? (Check the two that apply)

- Decreasing β will shift the red line slightly to the right.
- **Increasing β will shift the red line slightly to the right.**
- **Decreasing β will create more oscillation within the red line.**
- Increasing β will create more oscillations within the red line.

Q8. Consider this figure:

These plots were generated with gradient descent, gradient descent with momentum (β = 0.5), and gradient descent with momentum (β = 0.9). Which curve corresponds to which algorithm?

- **(1) is gradient descent. (2) is gradient descent with momentum (small β). (3) is gradient descent with momentum (large β)**
- (1) is gradient descent with momentum (small β), (2) is gradient descent with momentum (small β), (3) is gradient descent
- (1) is gradient descent. (2) is gradient descent with momentum (large β) . (3) is gradient descent with momentum (small β)
- (1) is gradient descent with momentum (small β). (2) is gradient descent. (3) is gradient descent with momentum (large β)

Q9. Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $\mathcal{J}(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for $\mathcal{J}$? (Check all that apply)

- **Try mini-batch gradient descent**
- **Try tuning the learning rate α**
- Try initializing all the weights to zero
- **Try using Adam**
- **Try better random initialization for the weights**

Q10. Which of the following statements about Adam is False?

- We usually use “default” values for the hyperparameters $\beta_1$, $\beta_2$ and $\varepsilon$ in Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$)
- **Adam should be used with batch gradient computations, not with mini-batches.**
- The learning rate hyperparameter α in Adam usually needs to be tuned.
- Adam combines the advantages of RMSProp and momentum
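
For reference, here is a minimal scalar sketch of one Adam update using the default hyperparameters quoted in Q10. The function name `adam_step` and the toy objective f(w) = w^2 are assumptions for this example; the gradient passed in could come from a full batch or a mini-batch.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # RMSProp-style average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Toy problem (assumed for illustration): minimize f(w) = w**2, whose gradient is 2w.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
print(w)
```

The learning rate `lr` is the value that usually needs tuning; `beta1`, `beta2` and `eps` are typically left at their defaults.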

## Hyperparameter Tuning, Regularization and Optimization Week 03 Quiz Answers

Q1. If searching among a large number of hyperparameters, you should try values in a grid rather than random values, so that you can carry out the search more systematically and not rely on chance. True or False?

- True
- **False**

Q2. Every hyperparameter, if set poorly, can have a huge negative impact on training, and so all hyperparameters are about equally important to tune well. True or False?

- True
- **False**

Q3. During hyperparameter search, whether you try to babysit one model (“Panda” strategy) or train a lot of models in parallel (“Caviar”) is largely determined by:

- The presence of local minima (and saddle points) in your neural network
- The number of hyperparameters you have to tune
- Whether you use batch or mini-batch optimization
- **The amount of computational power you can access**

Q4. If you think β (hyperparameter for momentum) is between 0.9 and 0.99, which of the following is the recommended way to sample a value for beta?

- `r = np.random.rand(); beta = 1 - 10**(-r + 1)`
- `r = np.random.rand(); beta = r*0.09 + 0.9`
- **`r = np.random.rand(); beta = 1 - 10**(-r - 1)`**
- `r = np.random.rand(); beta = r*0.9 + 0.09`
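
The correct option in Q4 samples 1 - beta uniformly on a log scale. A small sketch that draws many samples this way (the sample count and seed are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.random(10_000)      # uniform in [0, 1)
beta = 1 - 10 ** (-r - 1)   # 1 - beta is log-uniform in (0.01, 0.1]

print(beta.min(), beta.max())
# About half of the samples land below 1 - 10**-1.5 (roughly 0.968),
# the midpoint of the range on a log scale.
print((beta < 1 - 10 ** -1.5).mean())
```

Sampling on a log scale spends as many trials between 0.9 and about 0.968 as between 0.968 and 0.99, which matters because the behavior of momentum is far more sensitive to beta near 1.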

Q5. Finding good hyperparameter values is very time-consuming. So typically you should do it once at the start of the project, and try to find very good hyperparameters so that you don’t ever have to revisit tuning them again. True or false?

- True
- **False**

Q6. In batch normalization as presented in the videos, if you apply it on the *l*th layer of your neural network, what are you normalizing?

- $b^{[l]}$
- **$z^{[l]}$**
- $W^{[l]}$
- $a^{[l]}$

Q7. In the normalization formula $z_{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \varepsilon}}$, why do we use epsilon?

- To speed up convergence
- To have a more accurate normalization
- In case μ is too small
- **To avoid division by zero**
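
The role of epsilon in Q7 is easiest to see with a unit whose values are identical across the mini-batch. This is a minimal NumPy sketch; the function name `batch_norm` and the units-by-examples layout are assumptions.

```python
import numpy as np


def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize z (units x examples) per unit, then scale by gamma and shift by beta."""
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mu) / np.sqrt(var + eps)  # eps keeps the denominator nonzero
    return gamma * z_norm + beta


z = np.array([[1.0, 2.0, 3.0],
              [5.0, 5.0, 5.0]])  # the second unit has zero variance
out = batch_norm(z, gamma=1.0, beta=0.0)
print(out)
```

Without `eps`, the second row would compute 0 / 0. With it, the row is normalized to zeros and the output stays finite.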

Q8. Which of the following statements about γ and β in Batch Norm are true?

- **They set the mean and variance of the linear variable $z^{[l]}$ of a given layer.**
- There is one global value of γ ∈ ℝ and one global value of β ∈ ℝ for each layer, which applies to all the hidden units in that layer.
- β and γ are hyperparameters of the algorithm, which we tune via random sampling.
- **They can be learned using Adam, gradient descent with momentum, or RMSprop, not just with gradient descent.**
- The optimal values are $\gamma = \sqrt{\sigma^2 + \varepsilon}$ and $\beta = \mu$.

Q9. After training a neural network with Batch Norm, at test time, to evaluate the neural network on a new example you should:

- Skip the step where you normalize using μ and σ² since a single test example cannot be normalized.
- Use the most recent mini-batch’s value of μ and σ² to perform the needed normalizations.
- **Perform the needed normalizations, use μ and σ² estimated using an exponentially weighted average across mini-batches seen during training.**
- If you implemented Batch Norm on mini-batches of (say) 256 examples, then to evaluate on one test example, duplicate that example 256 times so that you’re working with a mini-batch the same size as during training.
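
A rough sketch of the test-time procedure in Q9: running estimates of μ and σ² are maintained during training and reused for a single test example. The decay rate `momentum = 0.9` and the synthetic data (mean 3, standard deviation 2) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9  # assumed decay rate for the running averages
eps = 1e-8

running_mu = np.zeros((1, 1))
running_var = np.ones((1, 1))

# Training: update the running statistics from each mini-batch.
for _ in range(200):
    z = rng.normal(loc=3.0, scale=2.0, size=(1, 64))  # one unit, 64 examples
    mu = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

# Test time: normalize a single example with the stored estimates.
z_test = np.array([[3.0]])
z_test_norm = (z_test - running_mu) / np.sqrt(running_var + eps)
print(running_mu, running_var, z_test_norm)
```

A single example has no meaningful mean or variance of its own, so the stored estimates stand in for the mini-batch statistics.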

Q10. Which of these statements about deep learning programming frameworks are true? (Check all that apply)

- **A programming framework allows you to code up deep learning algorithms with typically fewer lines of code than a lower-level language such as Python.**
- Deep learning programming frameworks require cloud-based machines to run.
- **Even if a project is currently open source, good governance of the project helps ensure that it remains open even in the long term, rather than become closed or modified to benefit only one company.**
