### All Weeks Sequence Models Coursera Quiz Answers

In the fifth course of the Deep Learning Specialization, you will become familiar with sequence models and their exciting applications such as speech recognition, music synthesis, chatbots, machine translation, natural language processing (NLP), and more.

The Deep Learning Specialization is a foundational program that will help you understand the capabilities, challenges, and consequences of deep learning and prepare you to participate in the development of leading-edge AI technology. It provides a pathway for you to take the definitive step in the world of AI by helping you gain the knowledge and skills to level up your career.

#### Sequence Models Week 1 Coursera Quiz Answers

Q1. Suppose your training examples are sentences (sequences of words). Which of the following refers to the $j^{th}$ word in the $i^{th}$ training example?

- **$x^{(i)<j>}$**
- $x^{<i>(j)}$
- $x^{(j)<i>}$
- $x^{<j>(i)}$

Q2. Consider this RNN:

This specific type of architecture is appropriate when:

- **$T_x = T_y$**
- $T_x < T_y$
- $T_x > T_y$
- $T_x = 1$

Q3. To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply).

- Speech recognition (input an audio clip and output a transcript)
- **Sentiment classification (input a piece of text and output a 0/1 to denote positive or negative sentiment)**
- Image classification (input an image and output a label)
- **Gender recognition from speech (input an audio clip and output a label indicating the speaker’s gender)**

Q4. You are training this RNN language model.

At the $t^{th}$ time step, what is the RNN doing? Choose the best answer.

- Estimating $P(y^{<1>}, y^{<2>}, \dots, y^{<t-1>})$
- Estimating $P(y^{<t>})$
- **Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, \dots, y^{<t-1>})$**
- Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, \dots, y^{<t>})$

Q5. You have finished training a language model RNN and are using it to sample random sentences, as follows:

What are you doing at each time step $t$?

- (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $\hat{y}^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time-step.
- (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $\hat{y}^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time-step.
- (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $\hat{y}^{<t>}$. (ii) Then pass this selected word to the next time-step.
- **(i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $\hat{y}^{<t>}$. (ii) Then pass this selected word to the next time-step.**
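The sampling loop described in Q5 can be sketched in a few lines. This is only an illustration: `TOY_MODEL` is a hypothetical lookup table standing in for a trained RNN, and the vocabulary and probabilities are made up.

```python
import random

# Hypothetical toy "language model": maps the previous word to a distribution
# over the next word. A trained RNN would produce these probabilities instead.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.7, "<eos>": 0.3},
    "dog": {"sleeps": 0.7, "<eos>": 0.3},
    "sleeps": {"<eos>": 1.0},
}

def sample_sentence(model, rng, max_len=10):
    """Sample one sentence, feeding each sampled word back in as the next input."""
    word, sentence = "<start>", []
    for _ in range(max_len):
        probs = model[word]
        # (i) randomly sample the next word from the predicted distribution
        word = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if word == "<eos>":
            break
        # (ii) the sampled word (not a ground-truth word) becomes the next input
        sentence.append(word)
    return sentence

print(sample_sentence(TOY_MODEL, random.Random(0)))
```

Picking the highest-probability word instead of sampling would always produce the same sentence, which is why sampling is used to generate varied text.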

Q6. You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?

- Vanishing gradient problem.
- **Exploding gradient problem.**
- ReLU activation function $g(\cdot)$ used to compute $g(z)$, where $z$ is too large.
- Sigmoid activation function $g(\cdot)$ used to compute $g(z)$, where $z$ is too large.
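Gradient clipping is a common remedy for exploding gradients. The sketch below rescales a gradient vector when its norm exceeds a threshold; the function name and the threshold value are assumptions chosen for illustration, not part of the quiz.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient with norm 500 is scaled down to norm 5; its direction is unchanged.
clipped = clip_by_global_norm([300.0, -400.0], max_norm=5.0)
print(clipped)
```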

Q7. Suppose you are training an LSTM. You have a 10000 word vocabulary, and are using an LSTM with 100-dimensional activations $a^{<t>}$. What is the dimension of $\Gamma_u$ at each time step?

- 1
- **100**
- 300
- 10000

Q8. Here are the update equations for the GRU.

Alice proposes to simplify the GRU by always removing $\Gamma_u$, i.e., setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing $\Gamma_r$, i.e., setting $\Gamma_r = 1$ always. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?

- Alice’s model (removing $\Gamma_u$), because if $\Gamma_r \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
- Alice’s model (removing $\Gamma_u$), because if $\Gamma_r \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
- **Betty’s model (removing $\Gamma_r$), because if $\Gamma_u \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.**
- Betty’s model (removing $\Gamma_r$), because if $\Gamma_u \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
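The role of the update gate in Q8 can be illustrated with the course-style memory update $c^{<t>} = \Gamma_u \tilde{c}^{<t>} + (1 - \Gamma_u) c^{<t-1>}$. The sketch below is a deliberate simplification: gates are treated as fixed scalars, and the dependence of the candidate $\tilde{c}^{<t>}$ on $c^{<t-1>}$ is ignored.

```python
def gru_memory_step(c_prev, c_candidate, gamma_u):
    """Course-style GRU cell update: c<t> = Gamma_u * c~<t> + (1 - Gamma_u) * c<t-1>."""
    return gamma_u * c_candidate + (1.0 - gamma_u) * c_prev

def carry_over(gamma_u, steps):
    """Factor by which c<0> still contributes to c<steps> when Gamma_u is held fixed."""
    return (1.0 - gamma_u) ** steps

print(carry_over(gamma_u=0.001, steps=100))  # close to 1: memory mostly preserved
print(carry_over(gamma_u=0.5, steps=100))    # essentially 0: memory overwritten
```

With $\Gamma_u \approx 0$ the old memory value passes through almost unchanged, so the gradient with respect to it decays slowly over many timesteps.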

Q9. Here are the equations for the GRU and the LSTM:

From these, we can see that the Update Gate and Forget Gate in the LSTM play a role similar to _______ and ______ in the GRU. What should go in the blanks?

- **$\Gamma_u$ and $1 - \Gamma_u$**
- $\Gamma_u$ and $\Gamma_r$
- $1 - \Gamma_u$ and $\Gamma_u$
- $\Gamma_r$ and $\Gamma_u$

Q10. You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence $x^{<1>}, \dots, x^{<365>}$. You’ve also collected data on your dog’s mood, which you represent as $y^{<1>}, \dots, y^{<365>}$. You’d like to build a model to map from $x \rightarrow y$. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?

- Bidirectional RNN, because this allows the prediction of mood on day $t$ to take into account more information.
- Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
- **Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<1>}, \dots, x^{<t>}$, but not on $x^{<t+1>}, \dots, x^{<365>}$.**
- Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<t>}$, and not other days’ weather.

#### Sequence Models Week 2 Coursera Quiz Answers

Q1. Suppose you learn a word embedding for a vocabulary of 10000 words. Then the embedding vectors should be 10000 dimensional, so as to capture the full range of variation and meaning in those words.

- True
- **False**

Q2. What is t-SNE?

- An open-source sequence modeling library
- A supervised learning algorithm for learning word embeddings
- **A non-linear dimensionality reduction technique**
- A linear transformation that allows us to solve analogies on word vectors

Q3. Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing if someone is happy from a short snippet of text, using a small training set.

| x (input text) | y (happy?) |
| --- | --- |
| I’m feeling wonderful today! | 1 |
| I’m bummed my cat is ill. | 0 |
| Really enjoying this! | 1 |

Then even if the word “ecstatic” does not appear in your small training set, your RNN might reasonably be expected to recognize “I’m ecstatic” as deserving a label $y = 1$.

- **True**
- False

Q4. Which of these equations do you think should hold for a good word embedding? (Check all that apply)

- **$e_{boy} - e_{girl} \approx e_{brother} - e_{sister}$**
- **$e_{boy} - e_{brother} \approx e_{girl} - e_{sister}$**

Q5. Let $E$ be an embedding matrix, and let $o_{1234}$ be a one-hot vector corresponding to word 1234. Then to get the embedding of word 1234, why don’t we call $E * o_{1234}$ in Python?

- None of the above: calling the Python snippet as described above is fine.
- **It is computationally wasteful.**
- The correct formula is $E^T * o_{1234}$.
- This doesn’t handle unknown words (`<UNK>`).
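To make the cost difference in Q5 concrete, here is a small sketch comparing the one-hot product with a direct column lookup. The toy matrix, its sizes, and the function names are assumptions for illustration only.

```python
def embed_by_matmul(E, one_hot):
    """Compute E @ o the slow way: every column is multiplied, mostly by zero."""
    rows, cols = len(E), len(E[0])
    return [sum(E[r][c] * one_hot[c] for c in range(cols)) for r in range(rows)]

def embed_by_lookup(E, index):
    """Read column `index` of E directly; no multiplications needed."""
    return [row[index] for row in E]

# Toy embedding matrix: 3-dimensional embeddings for a 5-word vocabulary,
# stored as (embedding_dim x vocab_size) so each column is one word vector.
E = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [-1.0, 0.0, 1.0, 0.0, -1.0],
]
word_index = 3
one_hot = [1.0 if i == word_index else 0.0 for i in range(5)]

print(embed_by_matmul(E, one_hot))
print(embed_by_lookup(E, word_index))
```

Both calls return the same vector, but the product touches every entry of the matrix while the lookup reads a single column, which matters when the vocabulary has 10000 words.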

Q6. When learning word embeddings, we create an artificial task of estimating $P(target \mid context)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.

- **True**
- False

Q7. In the word2vec algorithm, you estimate $P(t \mid c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer.

- $c$ is the one word that comes immediately before $t$.
- $c$ is the sequence of all the words in the sentence before $t$.
- **$c$ and $t$ are chosen to be nearby words.**
- $c$ is a sequence of several words immediately before $t$.

Q8. Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:

$$P(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^T e_c}}$$

Which of these statements are correct? Check all that apply.

- After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.
- $\theta_t$ and $e_c$ are both 10000 dimensional vectors.
- **$\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.**
- **$\theta_t$ and $e_c$ are both 500 dimensional vectors.**
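The word2vec softmax over the vocabulary can be sketched as follows. The sizes here (4 words, 2 dimensions) and all parameter values are made up to keep the example readable; the quiz setting would use 10000 words and 500 dimensions.

```python
import math

def softmax_over_vocab(theta, e_c):
    """P(t | c) for every target t: softmax of the scores theta_t . e_c."""
    scores = [sum(a * b for a, b in zip(theta_t, e_c)) for theta_t in theta]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

# Toy sizes (assumed for illustration): 4-word vocabulary, 2-dimensional vectors.
theta = [[0.5, 0.1], [0.0, 0.3], [-0.2, 0.8], [1.0, -1.0]]  # one theta_t per word
e_c = [0.4, 0.6]                                            # embedding of the context word

probs = softmax_over_vocab(theta, e_c)
print(probs, sum(probs))
```

Note that each $\theta_t$ has the same dimension as $e_c$, since the score is their dot product, and that the denominator sums over the whole vocabulary.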

Q9. Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective:

$$\min \sum_{i=1}^{10,000} \sum_{j=1}^{10,000} f(X_{ij}) \left( \theta_i^T e_j + b_i + b_j' - \log X_{ij} \right)^2$$

Which of these statements are correct? Check all that apply.

- **The weighting function $f(\cdot)$ must satisfy $f(0) = 0$.**
- **$X_{ij}$ is the number of times word j appears in the context of word i.**
- $\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
- **$\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.**
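One summand of the GloVe objective can be sketched as below, which also shows why $f(0) = 0$ matters: it lets pairs with $X_{ij} = 0$ be skipped so that $\log 0$ is never evaluated. The particular weighting function (with `x_max=100` and `alpha=0.75`) is an assumption, chosen as a commonly cited form; the quiz only requires $f(0) = 0$.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij); f(0) = 0 so pairs that never co-occur drop out."""
    if x <= 0:
        return 0.0
    return min(1.0, (x / x_max) ** alpha)

def glove_term(theta_i, e_j, b_i, b_j, x_ij):
    """One summand f(X_ij) * (theta_i . e_j + b_i + b_j' - log X_ij)^2."""
    weight = glove_weight(x_ij)
    if weight == 0.0:
        return 0.0  # skip log(0), using the convention 0 * log(0) = 0
    dot = sum(a * b for a, b in zip(theta_i, e_j))
    return weight * (dot + b_i + b_j - math.log(x_ij)) ** 2

print(glove_term([0.1, 0.2], [0.3, 0.4], 0.0, 0.0, x_ij=0))
print(glove_term([0.1, 0.2], [0.3, 0.4], 0.0, 0.0, x_ij=10))
```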

Q10. You have trained word embeddings using a text dataset of $m_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $m_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?

- **$m_1 \gg m_2$**
- $m_1 \ll m_2$

#### Sequence Models Week 3 Coursera Quiz Answers

Q1. Consider using this encoder-decoder model for machine translation.

This model is a “conditional language model” in the sense that the encoder portion (shown in green) is modeling the probability of the input sentence $x$.

- **False**
- True

Q2. In beam search, if you increase the beam width $B$, which of the following would you expect to be true? Check all that apply.

- Beam search will converge after fewer steps.
- **Beam search will generally find better solutions (i.e. do a better job maximizing $P(y \mid x)$).**
- **Beam search will use up more memory.**
- **Beam search will run more slowly.**

Q3. In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.

- **True**
- False
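The bias toward short outputs in Q3 comes from multiplying many probabilities below 1, so every extra token lowers the score. A small sketch, with made-up token probabilities and an assumed normalization exponent `alpha=0.7`:

```python
import math

def log_prob(token_probs):
    """Unnormalized score: sum of log P(y<t> | x, y<1..t-1>)."""
    return sum(math.log(p) for p in token_probs)

def normalized_log_prob(token_probs, alpha=0.7):
    """Length-normalized score: divide by Ty ** alpha (alpha is a tunable assumption)."""
    return log_prob(token_probs) / (len(token_probs) ** alpha)

short = [0.5, 0.5]            # 2-token candidate
long = [0.6, 0.6, 0.6, 0.6]   # 4-token candidate with higher per-token confidence

print(log_prob(short), log_prob(long))
print(normalized_log_prob(short), normalized_log_prob(long))
```

Without normalization the short candidate scores higher even though each of its tokens is less likely; with normalization the longer candidate wins.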

Q4. Suppose you are building a speech recognition system, which uses an RNN model to map from audio clip $x$ to a text transcript $y$. Your algorithm uses beam search to try to find the value of $y$ that maximizes $P(y \mid x)$. Here $\hat{y}$ is the transcript output by the algorithm and $y^*$ is the human-provided transcript.

$P(\hat{y} \mid x) = 1.09 \times 10^{-7}$

$P(y^* \mid x) = 7.21 \times 10^{-8}$

Would you expect increasing the beam width $B$ to help correct this example?

- **No, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.**
- No, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
- Yes, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
- Yes, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.

Q5. Continuing the example from Q4, suppose you work on your algorithm for a few more weeks, and now find that for the vast majority of examples on which your algorithm makes a mistake, $P(y^* \mid x) > P(\hat{y} \mid x)$. This suggests you should focus your attention on improving the search algorithm.

- **True**
- False
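The error analysis behind Q4 and Q5 amounts to a simple comparison per mistaken example. In the sketch below the first pair uses the probabilities from Q4; the other two pairs are made-up values for illustration.

```python
def attribute_error(p_ystar, p_yhat):
    """Blame beam search if the model prefers y*, otherwise blame the RNN."""
    return "beam search" if p_ystar > p_yhat else "RNN"

# (P(y* | x), P(y_hat | x)) pairs; the first pair is the example from Q4,
# the other two are made-up values for illustration.
dev_errors = [(7.21e-8, 1.09e-7), (3.0e-6, 1.0e-6), (5.0e-9, 2.0e-9)]

counts = {"beam search": 0, "RNN": 0}
for p_ystar, p_yhat in dev_errors:
    counts[attribute_error(p_ystar, p_yhat)] += 1
print(counts)
```

Whichever component collects the larger share of errors is the one worth improving first.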

Q6. Consider the attention model for machine translation.

Further, here is the formula for $\alpha^{<t,t'>}$:

$$\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x} \exp(e^{<t,t'>})}$$

Which of the following statements about $\alpha^{<t,t'>}$ are true? Check all that apply.

- **We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t'>}$ that are highly relevant to the value the network should output for $y^{<t>}$. (Note the indices in the superscripts.)**
- **$\sum_{t'} \alpha^{<t,t'>} = 1$ (Note the summation is over $t'$.)**
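The attention weights are a softmax over the input positions $t'$ for a fixed output step $t$, which is why they sum to 1. A minimal sketch with made-up scores and encoder activations:

```python
import math

def attention_weights(scores):
    """Softmax over t' of the scores e<t,t'> for one fixed output step t."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def context_vector(weights, activations):
    """Weighted sum over t' of alpha<t,t'> * a<t'>."""
    dim = len(activations[0])
    return [sum(w * a[d] for w, a in zip(weights, activations)) for d in range(dim)]

e_scores = [0.1, 2.0, -1.0]               # made-up e<t,t'> for t' = 1..3
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up encoder activations a<t'>

alpha = attention_weights(e_scores)
print(alpha, sum(alpha))
print(context_vector(alpha, a))
```

The position with the largest score receives the largest weight and so contributes most to the context vector.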

Q7. The network learns where to “pay attention” by learning the values $e^{<t,t'>}$, which are computed using a small neural network:

We can’t replace $s^{<t-1>}$ with $s^{<t>}$ as an input to this neural network. This is because $s^{<t>}$ depends on $\alpha^{<t,t'>}$ which in turn depends on $e^{<t,t'>}$; so at the time we need to evaluate this network, we haven’t computed $s^{<t>}$ yet.

- False
- **True**

Q8. Compared to the encoder-decoder model shown in Question 1 of this quiz (which does not use an attention mechanism), we expect the attention model to have the greatest advantage when:

- **The input sequence length $T_x$ is large.**
- The input sequence length $T_x$ is small.

Q9. Under the CTC model, identical repeated characters not separated by the “blank” character (_) are collapsed. Under the CTC model, what does the following string collapse to?

`_c_oo_o_kk___b_ooooo__oo__kkk`

- cokbok
- coookkboooooookkk
- **cookbook**
- cook book
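The collapsing rule from Q9 can be written directly: merge runs of identical characters, then remove the blanks. This is a sketch of the collapsing step only, not of CTC training or decoding.

```python
from itertools import groupby

def ctc_collapse(s, blank="_"):
    """Merge runs of identical characters, then drop the blank symbol."""
    merged = (ch for ch, _ in groupby(s))
    return "".join(ch for ch in merged if ch != blank)

print(ctc_collapse("_c_oo_o_kk___b_ooooo__oo__kkk"))  # cookbook
```

The blank between `oo` and `o` is what keeps the double "o" in the output; without it the repeated characters would merge into one.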

Q10. In trigger word detection, $x^{<t>}$ is:

- Whether someone has just finished saying the trigger word at time $t$.
- The $t$-th input word, represented as either a one-hot vector or a word embedding.
- Whether the trigger word is being said at time $t$.
- **Features of the audio (such as spectrogram features) at time $t$.**

#### Sequence Models Week 4 Coursera Quiz Answers

Q1. A Transformer Network, like its predecessors RNNs, GRUs and LSTMs, can process information one word at a time. (Sequential architecture).

- **False**
- True

Q2. Transformer Network methodology is taken from: (Check all that apply)

- **Attention mechanism.**
- Convolutional Neural Network style of architecture.
- None of these.
- **Convolutional Neural Network style of processing.**

Q3. The concept of *Self-Attention* is that:

- Given a word, its neighbouring words are used to compute its context by selecting the highest of those word values to map the Attention related to that given word.
- Given a word, its neighbouring words are used to compute its context by selecting the lowest of those word values to map the Attention related to that given word.
- Given a word, its neighbouring words are used to compute its context by taking the average of those word values to map the Attention related to that given word.
- **Given a word, its neighbouring words are used to compute its context by summing up the word values to map the Attention related to that given word.**

Q4. Which of the following correctly represents *Attention*?

- **$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$**
- $Attention(Q, K, V) = softmax\left(\frac{QV^T}{\sqrt{d_k}}\right)K$
- $Attention(Q, K, V) = min\left(\frac{QV^T}{\sqrt{d_k}}\right)K$
- $Attention(Q, K, V) = min\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
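Scaled dot-product attention, $softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, can be sketched in plain Python. The matrices below are made-up toy values given as lists of rows; a real implementation would use a tensor library and batched inputs.

```python
import math

def softmax(row):
    """Numerically stable softmax of one list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with matrices given as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted sum of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in scaled_dot_product_attention(Q, K, V):
    print(row)
```

Each output row is a weighted sum of the rows of $V$, with weights that sum to 1, so the keys decide how much of each value is mixed in.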

Q5. Are the following statements true regarding Query (Q), Key (K) and Value (V) ?

Q = interesting questions about the words in a sentence

K = specific representations of words given a Q

V = qualities of words given a Q

- **False**
- True

Q6. $i$ here represents the computed attention weight matrix associated with the $i^{th}$ “word” in a sentence.

- **False**
- True

Q7. Following is the architecture within a Transformer Network. **(without displaying positional encoding and output layer(s))**

What information does the *Decoder* take from the *Encoder* for its second block of *Multi-Head Attention*? (Marked $X$, pointed by the independent arrow)

(Check all that apply)

- **K**
- Q
- **V**

Q8. Following is the architecture within a Transformer Network. **(without displaying positional encoding and output layer(s))**

What is the output layer(s) of the *Decoder*? (Marked $Y$, pointed by the independent arrow)

- Softmax layer
- Linear layer
- Softmax layer followed by a linear layer.
- **Linear layer followed by a softmax layer.**

Q9. Why is positional encoding important in the translation process? (Check all that apply)

- **Position and word order are essential in sentence construction of any language.**
- It helps to locate every word within a sentence.
- It is used in CNN and works well there.
- **Providing extra information to our model.**

Q10. Which of these are good criteria for a good positional encoding algorithm?

- **It should output a unique encoding for each time-step (word’s position in a sentence).**
- **Distance between any two time-steps should be consistent for all sentence lengths.**
- **The algorithm should be able to generalize to longer sentences.**
- None of these.
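One encoding that meets these criteria is the sinusoidal scheme commonly used with Transformers. The sketch below assumes that scheme (sine on even dimensions, cosine on odd dimensions, base 10000); the quiz itself does not specify a formula, and `d_model=8` is an arbitrary toy size.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    encoding = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the same frequency
        angle = position / (10000 ** (2 * (k // 2) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding

pe0 = positional_encoding(0, d_model=8)
pe1 = positional_encoding(1, d_model=8)
print(pe0)
print(pe1)
```

Because the formula depends only on the position and the dimension index, it can be evaluated for positions longer than any sentence seen in training.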
