Welcome to your comprehensive guide for Sequence Models quiz answers! Whether you’re working through practice quizzes to enhance your understanding or preparing for graded quizzes to test your knowledge, this guide is here to help.
Covering all course modules, this resource will deepen your grasp of sequence modeling concepts, including recurrent neural networks (RNNs), long short-term memory (LSTM), gated recurrent units (GRUs), and their applications in natural language processing and time-series data.
Sequence Models Coursera Quiz Answers – Practice & Graded Quizzes for All Modules
Sequence Models Module 01 Quiz Answers
Q1: Suppose your training examples are sentences (sequences of words). Which of the following refers to the $j^{th}$ word in the $i^{th}$ training example?
Answer: $x^{(i)}_{<j>}$
Explanation: The notation $x^{(i)}_{<j>}$ specifies the $j^{th}$ word (index $<j>$) in the $i^{th}$ training example (index $(i)$).
Q2: Consider this RNN. This specific type of architecture is appropriate when:
Answer: $T_x = T_y$
Explanation: This RNN architecture works for sequences where the input length $T_x$ matches the output length $T_y$, as in tasks where every input time step produces a corresponding output, such as per-step labeling of a time series.
Q3: To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply)
Answer:
- Sentiment classification (input a piece of text and output a 0/1 to denote positive or negative sentiment)
- Gender recognition from speech (input an audio clip and output a label indicating the speaker’s gender)
Explanation: A many-to-one RNN architecture maps a sequence of inputs to a single output, which is suitable for tasks like sentiment analysis and gender classification.
Q4: At the $t^{th}$ time step, what is the RNN doing? Choose the best answer:
Answer: Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, \dots, y^{<t-1>})$
Explanation: At each time step $t$, the RNN computes the probability of the current word $y^{<t>}$ given all the previous words $y^{<1>}, \dots, y^{<t-1>}$.
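For intuition, multiplying these per-step conditionals together gives the probability the language model assigns to an entire sentence:

$$P(y^{<1>}, \dots, y^{<T_y>}) = \prod_{t=1}^{T_y} P\left(y^{<t>} \mid y^{<1>}, \dots, y^{<t-1>}\right)$$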
Q5: What are you doing at each time step $t$ when sampling random sentences from a trained RNN?
Answer:
(i) Use the probabilities output by the RNN to randomly sample a chosen word for that time step as $\hat{y}^{<t>}$.
(ii) Then pass this selected word to the next time step.
Explanation: During sampling, you use the RNN’s output probabilities to randomly sample the next word and feed it as input to the next time step.
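To make this concrete, here is a minimal sampling-loop sketch in NumPy. It assumes a hypothetical `rnn_step(x_prev, a_prev)` function (not from the course assignments) that returns the softmax distribution over the vocabulary and the next hidden state:

```python
import numpy as np

def sample_sequence(rnn_step, vocab_size, eos_index, hidden_size=100, max_len=50):
    """Sample a word sequence from a trained RNN language model."""
    a_prev = np.zeros((hidden_size, 1))   # initial hidden state
    x_prev = np.zeros((vocab_size, 1))    # zero vector stands in for a start token
    sampled = []
    for _ in range(max_len):
        y_probs, a_prev = rnn_step(x_prev, a_prev)              # softmax over the vocabulary
        idx = np.random.choice(vocab_size, p=y_probs.ravel())   # random draw, not argmax
        sampled.append(idx)
        if idx == eos_index:
            break
        x_prev = np.zeros((vocab_size, 1))                      # feed the sampled word back in
        x_prev[idx] = 1
    return sampled
```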
Q6: You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
Answer: Exploding gradient problem.
Explanation: Exploding gradients occur when gradients grow excessively large during backpropagation, leading to numerical instability and NaN values.
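A standard remedy discussed in the course is gradient clipping; here is a minimal sketch (the threshold of 5 is an arbitrary illustrative choice):

```python
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Clip every gradient array element-wise to [-max_value, max_value]
    so a single large step cannot blow up the weights."""
    return {name: np.clip(grad, -max_value, max_value) for name, grad in gradients.items()}

# Usage: gradients = clip_gradients(gradients) right before the parameter update.
```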
Q7: Suppose you are training an LSTM. You have a 10000-word vocabulary and are using an LSTM with 100-dimensional activations $a^{<t>}$. What is the dimension of $\Gamma_u$ at each time step?
Answer: 100
Explanation: $\Gamma_u$ (the update gate) is computed for each LSTM unit, so its dimension matches the size of the activations, which is 100.
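As a quick shape check, here is a sketch of the update-gate computation $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$, using randomly initialized parameters purely for illustration:

```python
import numpy as np

n_a, vocab_size = 100, 10000                     # 100 units, 10000-word vocabulary
W_u = np.random.randn(n_a, n_a + vocab_size) * 0.01
b_u = np.zeros((n_a, 1))

a_prev = np.zeros((n_a, 1))                      # previous activation
x_t = np.zeros((vocab_size, 1)); x_t[1234] = 1   # one-hot input word

concat = np.vstack([a_prev, x_t])                # stack [a^{<t-1>}, x^{<t>}]
gamma_u = 1.0 / (1.0 + np.exp(-(W_u @ concat + b_u)))   # sigmoid gate
print(gamma_u.shape)                             # (100, 1): one gate value per hidden unit
```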
Q8: Alice proposes to simplify the GRU by always removing $\Gamma_u$. Betty proposes to simplify the GRU by removing $\Gamma_r$. Which model is more likely to work without vanishing gradient problems?
Answer: Betty’s model (removing $\Gamma_r$), because if $\Gamma_u \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
Explanation: $\Gamma_u$ (the update gate) controls how much of the previous memory cell is retained: when $\Gamma_u \approx 0$, $c^{<t>} \approx c^{<t-1>}$, so gradients can flow back through many timesteps without decaying. Removing $\Gamma_r$ (the reset gate) leaves this mechanism intact, whereas removing $\Gamma_u$ destroys it.
Q9: From the GRU and LSTM equations, what are the roles of the Update Gate and Forget Gate in the GRU?
Answer: $\Gamma_u$ and $1 - \Gamma_u$
Explanation: In the GRU, $\Gamma_u$ acts as the update gate, and $1 - \Gamma_u$ plays a role similar to the forget gate in the LSTM.
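A minimal sketch of the simplified GRU memory update makes the two roles visible:

```python
import numpy as np

def gru_memory_update(c_prev, c_tilde, gamma_u):
    """Simplified GRU cell update: gamma_u acts as the update gate,
    (1 - gamma_u) acts as the forget gate."""
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev

# With gamma_u close to 0, the old memory passes through almost unchanged,
# which is what lets gradients survive across many timesteps.
c_prev  = np.array([0.5, -1.0])
c_tilde = np.array([2.0,  3.0])
print(gru_memory_update(c_prev, c_tilde, gamma_u=0.01))   # ~[ 0.515, -0.96 ]
```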
Q10: Should you use a Unidirectional RNN or Bidirectional RNN for predicting your dog’s mood based on weather data?
Answer: Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<1>}, \dots, x^{<t>}$, but not on $x^{<t+1>}, \dots, x^{<365>}$.
Explanation: The prediction $y^{<t>}$ (the mood on day $t$) depends only on weather data up to and including day $t$, making a unidirectional RNN the correct choice.
Sequence Models Module 02 Quiz Answers
Q1: Suppose you learn a word embedding for a vocabulary of 10000 words. Then the embedding vectors should be 10000 dimensional, so as to capture the full range of variation and meaning in those words.
Answer: False
Explanation: Word embeddings reduce the dimensionality of the vocabulary by representing words in a lower-dimensional space (e.g., 300 or 500 dimensions). The 10000 refers to the vocabulary size, not the embedding dimension.
Q2: What is t-SNE?
Answer: A non-linear dimensionality reduction technique
Explanation: t-SNE (t-Distributed Stochastic Neighbor Embedding) is a technique for visualizing high-dimensional data by projecting it into a lower-dimensional space, often 2D or 3D.
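For example, a typical way to project 300-dimensional word vectors down to 2D with scikit-learn’s `TSNE` (a sketch using random stand-in data; the parameters are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

embeddings = np.random.randn(1000, 300)   # stand-in for 1000 word embedding vectors
coords_2d = TSNE(n_components=2, perplexity=30, init="random").fit_transform(embeddings)
print(coords_2d.shape)                    # (1000, 2): one 2D point per word, ready to plot
```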
Q3: Suppose you download a pre-trained word embedding… Then even if the word “ecstatic” does not appear in your small training set, your RNN might reasonably be expected to recognize “I’m ecstatic” as deserving a label.
Answer: True
Explanation: Pre-trained word embeddings capture semantic relationships. Since “ecstatic” is semantically similar to “wonderful” or “enjoying,” the embedding would help the RNN generalize to unseen words.
Q4: Which of these equations should hold for a good word embedding? (Check all that apply)
Answer:
- $e_{boy} - e_{girl} \approx e_{brother} - e_{sister}$
- $e_{boy} - e_{brother} \approx e_{girl} - e_{sister}$
Explanation: Word embeddings encode semantic relationships. These equations represent analogous relationships like gender (boy/girl) or family roles (brother/sister).
Q5: Let $E$ be an embedding matrix, and $o_{1234}$ be a one-hot vector corresponding to word 1234. Why don’t we call $E * o_{1234}$ in Python?
Answer: It is computationally wasteful.
Explanation: Multiplying the embedding matrix $E$ by a one-hot vector does retrieve the corresponding embedding, but performing the full matrix multiplication is computationally wasteful because almost all of the multiplications are by zero; in practice we simply look up the relevant column (or row) of $E$.
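A sketch of the difference (the 300-dimensional embedding size is chosen just for illustration):

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
E = np.random.randn(emb_dim, vocab_size)        # embedding matrix: one column per word

# Wasteful: build a one-hot vector and do the full matrix product
o_1234 = np.zeros((vocab_size, 1)); o_1234[1234] = 1
e_slow = E @ o_1234                             # touches every column, mostly zeros

# Efficient: slice out column 1234 directly (what embedding layers do internally)
e_fast = E[:, [1234]]

print(np.allclose(e_slow, e_fast))              # True -- same vector, far less work
```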
Q6: When learning word embeddings, we create an artificial task of estimating $P(\text{target} \mid \text{context})$. It is okay if we do poorly on this artificial task; the more important by-product of this task is that we learn a useful set of word embeddings.
Answer: True
Explanation: The goal of training word embeddings is not the accuracy of the artificial task but rather obtaining embeddings that capture semantic relationships.
Q7: In the word2vec algorithm, you estimate $P(t \mid c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set?
Answer: $c$ and $t$ are chosen to be nearby words.
Explanation: In word2vec, context and target words are selected as nearby words within a fixed window size in a sentence.
Q8: Suppose you have a 10000-word vocabulary and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:
$$P(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^T e_c}}$$
Which of the following are true?
Answer:
- $\theta_t$ and $e_c$ are both 500-dimensional vectors.
- $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
Explanation: Both $\theta_t$ (the target-word parameter vector) and $e_c$ (the context-word embedding) have the same dimension (500), and both are updated during training by the optimization algorithm.
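A direct NumPy translation of that softmax, with randomly initialized parameters standing in for the learned ones:

```python
import numpy as np

vocab_size, emb_dim = 10000, 500
theta = np.random.randn(vocab_size, emb_dim) * 0.01   # theta_t for every target word t
E = np.random.randn(vocab_size, emb_dim) * 0.01       # e_c for every context word c

def p_target_given_context(t, c):
    logits = theta @ E[c]                   # theta_{t'}^T e_c for all 10000 candidate targets
    logits -= logits.max()                  # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]

print(p_target_given_context(t=42, c=7))    # roughly 1/10000 with untrained parameters
```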
Q9: Suppose you have a 10000-word vocabulary and are learning 500-dimensional word embeddings. The GloVe model minimizes the objective:
$$\min \sum_{i=1}^{10000} \sum_{j=1}^{10000} f(X_{ij})\left(\theta_i^T e_j + b_i + b_j' - \log X_{ij}\right)^2$$
Which of these statements are correct?
Answer:
- The weighting function $f(\cdot)$ must satisfy $f(0) = 0$.
- $X_{ij}$ is the number of times word $j$ appears in the context of word $i$.
- $\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
Explanation:
- The weighting function $f(\cdot)$ ensures that pairs with $X_{ij} = 0$ contribute nothing to the objective (see the sketch after this list).
- $X_{ij}$ represents co-occurrence counts.
- Embedding vectors are initialized randomly so the optimization algorithm can adjust them.
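As a concrete example, the weighting function proposed in the GloVe paper satisfies $f(0) = 0$ and caps the influence of very frequent pairs (a sketch; $x_{\max} = 100$ and $\alpha = 3/4$ are the paper’s choices):

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(X_ij): 0 for zero counts, rising with the count, capped at 1."""
    if x == 0:
        return 0.0                        # zero co-occurrences contribute nothing
    return min((x / x_max) ** alpha, 1.0)

print(glove_weight(0))    # 0.0
print(glove_weight(10))   # ~0.18
print(glove_weight(500))  # 1.0 (capped for very frequent pairs)
```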
Q10: Under which circumstances would you expect word embeddings to be helpful?
Answer: $m_1 \gg m_2$
Explanation: Word embeddings are a form of transfer learning. If the pre-training corpus ($m_1$) is much larger than the labeled dataset for your task ($m_2$), the embeddings can significantly improve performance on the task.
Sequence Models Module 03 Quiz Answers
Q1: Consider using this encoder-decoder model for machine translation. This model is a “conditional language model” in the sense that the encoder portion (shown in green) is modeling the probability of the input sentence $x$.
Answer: False
Explanation: The encoder-decoder model is a “conditional language model” because it models $P(y \mid x)$, the probability of the output sentence $y$ given the input sentence $x$. The encoder does not model $P(x)$; its task is to compute a representation of $x$ that conditions the decoder.
Q2: In beam search, if you increase the beam width $B$, which of the following would you expect to be true?
Answer:
- Beam search will generally find better solutions (i.e., do a better job maximizing $P(y \mid x)$).
- Beam search will use up more memory.
- Beam search will run more slowly.
Explanation:
- A larger beam width increases the number of candidate sequences explored, improving the chances of finding a better solution.
- More candidates require more memory.
- Increasing the beam width also increases computational complexity, making the algorithm slower.
Q3: In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.
Answer: True
Explanation: Without sentence (length) normalization, beam search favors shorter translations: each additional word multiplies in another probability less than 1, so longer candidates always end up with lower total probability.
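A sketch of a length-normalized score, which divides the sum of log-probabilities by $T_y^{\alpha}$ (with $\alpha$ commonly around 0.7) so that longer hypotheses are no longer penalized just for being longer:

```python
import numpy as np

def normalized_score(token_log_probs, alpha=0.7):
    """Length-normalized beam-search score: (1 / T_y^alpha) * sum of log P."""
    T_y = len(token_log_probs)
    return sum(token_log_probs) / (T_y ** alpha)

short = [np.log(0.5)] * 3    # 3-word hypothesis, raw log prob ~ -2.08
long_ = [np.log(0.5)] * 9    # 9-word hypothesis, raw log prob ~ -6.24
print(normalized_score(short), normalized_score(long_))   # the gap shrinks after normalization
```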
Q4: Would you expect increasing the beam width $B$ to help correct this example?
Answer: No, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
Explanation: If the RNN itself assigns at least as high a probability to the incorrect sequence $\hat{y}$ as to the human translation $y^*$, increasing the beam width will not fix the mistake, because the problem lies in the RNN’s predictions, not in the search algorithm.
Q5: For most mistakes, $P(y^* \mid x) > P(\hat{y} \mid x)$. This suggests you should focus on improving the search algorithm.
Answer: True
Explanation: If $P(y^* \mid x) > P(\hat{y} \mid x)$, beam search chose $\hat{y}$ even though the model itself prefers $y^*$, so the search algorithm is failing to find the most probable output. Since this happens for most mistakes, improving the search (for example, by increasing the beam width) is the right priority.
Q6: Consider the attention model for machine translation. Which of the following statements about $\alpha^{<t,t'>}$ are true?
Answer:
- We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t'>}$ that are highly relevant to the value the network should output for $y^{<t>}$.
- $\sum_{t'} \alpha^{<t,t'>} = 1$ (note the summation is over $t'$).
Explanation:
- The attention weights $\alpha^{<t,t'>}$ represent how much focus is given to each part of the input sequence when generating the output at time $t$.
- The weights sum to 1 because they are generated using a softmax function (see the sketch below).
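Here is a small sketch of how those weights are produced and used for one output step $t$, assuming the alignment scores $e^{<t,t'>}$ have already been computed by the small attention network:

```python
import numpy as np

def attention_context(e_scores, a_encoder):
    """e_scores: alignment scores e^{<t,t'>}, shape (Tx,).
    a_encoder: encoder activations a^{<t'>}, shape (Tx, n_a).
    Returns the attention weights alpha^{<t,t'>} and the context vector."""
    alphas = np.exp(e_scores - e_scores.max())
    alphas /= alphas.sum()            # softmax over t', so the weights sum to 1
    context = alphas @ a_encoder      # weighted sum of the encoder activations
    return alphas, context

alphas, context = attention_context(np.array([2.0, 0.5, -1.0]), np.random.randn(3, 16))
print(alphas.sum())                   # 1.0
```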
Q7: We can’t replace $s^{<t-1>}$ with $s^{<t>}$ as an input to the attention network. True/False?
Answer: True
Explanation: At time $t$, $s^{<t>}$ depends on $\alpha^{<t,t'>}$, which in turn depends on $e^{<t,t'>}$. Since $s^{<t>}$ has not been computed yet when the network evaluates $e^{<t,t'>}$, we must use $s^{<t-1>}$ instead.
Q8: Compared to the encoder-decoder model shown in Question 1 (which does not use attention), we expect the attention model to have the greatest advantage when:
Answer: The input sequence length $T_x$ is large.
Explanation: Attention mechanisms allow the model to focus on relevant parts of the input sequence, which is especially beneficial for long input sequences where information might otherwise be diluted or forgotten.
Q9: Under the CTC model, identical repeated characters not separated by the “blank” character (_) are collapsed. What does the following string collapse to?
_c_oo_o_kk___b_ooooo__oo__kkk
Answer: “cookbook”
Explanation: In CTC, repeated characters separated by a blank are merged, and blanks are removed. This collapses the input sequence to “cookbook.”
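The collapse rule is simple to implement; a minimal sketch that first merges adjacent repeats and then drops blanks:

```python
def ctc_collapse(output, blank="_"):
    """Collapse a CTC output string: merge adjacent repeated characters, then remove blanks."""
    merged = []
    prev = None
    for ch in output:
        if ch != prev:        # keep a character only when it differs from the previous one
            merged.append(ch)
        prev = ch
    return "".join(ch for ch in merged if ch != blank)

print(ctc_collapse("_c_oo_o_kk___b_ooooo__oo__kkk"))   # -> "cookbook"
```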
Q10: In trigger word detection, $x^{<t>}$ is:
Answer: Features of the audio (such as spectrogram features) at time $t$.
Explanation: In trigger word detection, $x^{<t>}$ represents the features extracted from the audio (e.g., spectrogram or MFCC features) at time step $t$, which are used to detect whether the trigger word is being spoken.
Sequence Models Module 04 Quiz Answers
Q1: A Transformer Network, like its predecessors RNNs, GRUs, and LSTMs, can process information one word at a time. (Sequential architecture).
Answer: False
Explanation: Unlike RNNs, GRUs, and LSTMs, which process information sequentially, a Transformer network uses a parallel architecture and processes all words in a sentence at the same time, thanks to its attention mechanism. It does not rely on sequential processing.
Q2: Transformer Network methodology is taken from: (Check all that apply)
Answer:
- Attention mechanism.
Explanation:
The Transformer architecture is based on the attention mechanism and does not rely on Convolutional Neural Networks (CNNs) for processing. CNNs are not directly involved in the Transformer architecture.
Q3: The concept of Self-Attention is that:
Answer: Given a word, its neighbouring words are used to compute its context by summing up those word values to map the Attention related to that given word.
Explanation: Self-attention computes the context of a word from its relationships with the other words in the sentence: the attention scores are turned into weights, and the context is the weighted sum of the word values, so the words that most strongly influence the context receive the largest weights.
Q4: Which of the following correctly represents Attention?
Answer: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Explanation: This is the standard formula for the attention mechanism, where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively. The softmax operation ensures that the attention weights sum to 1.
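A single-head NumPy sketch of that formula (no masking), mainly to make the shapes explicit:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax, rows sum to 1
    return weights @ V                                    # (n_queries, d_v)

Q = np.random.randn(4, 64)    # 4 queries with d_k = 64
K = np.random.randn(6, 64)    # 6 keys
V = np.random.randn(6, 64)    # 6 values
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 64)
```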
Q5: Are the following statements true regarding Query (Q), Key (K), and Value (V)?
Answer: True
Explanation:
- $Q$ represents the queries, which are interesting questions about the words in a sentence.
- $K$ represents the keys, which are the specific representations of words given a query.
- $V$ represents the values, which are the qualities or features of words given a query.
Q6: $i$ here represents the computed attention weight matrix associated with the $i^{th}$ word in a sentence.
Answer: True
Explanation: The attention weight matrix is computed for each word in the sentence, and the result indicates how much focus each word should receive when computing the context for the $i^{th}$ word.
Q7: What information does the Decoder take from the Encoder for its second block of Multi-Head Attention? (Marked $X$, pointed by the independent arrow)
Answer:
- $K$
- $V$
Explanation:
The Decoder in a Transformer takes both the keys ($K$) and values ($V$) from the Encoder’s output to compute the attention mechanism in the second block. The queries $Q$ are generated from the previous decoder layer.
Q8: What is the output layer(s) of the Decoder? (Marked $Y$, pointed by the independent arrow)
Answer: Linear layer followed by a softmax layer.
Explanation: The output of the decoder is passed through a linear layer, which produces logits, and then a softmax layer is applied to generate probabilities for the predicted tokens.
Q9: Why is positional encoding important in the translation process? (Check all that apply)
Answer:
- Position and word order are essential in sentence construction of any language.
- It helps to locate every word within a sentence.
- Providing extra information to our model.
Explanation:
Positional encoding is used to inject information about the position of each word in the sequence since Transformers do not have any inherent understanding of word order.
Q10: Which of these is a good criterion for a good positional encoding algorithm?
Answer:
- It should output a unique encoding for each time-step (word’s position in a sentence).
- The algorithm should be able to generalize to longer sentences.
Explanation:
A good positional encoding algorithm should generate distinct encodings for each position in the sequence and should generalize to sequences longer than those seen during training, ensuring that the model can handle varying sentence lengths.
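The sinusoidal encoding from the original Transformer paper meets both criteria: every position gets a distinct pattern, and the formula extends to positions never seen in training. A sketch (assumes an even `d_model`):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos,
    with wavelengths that grow geometrically with the dimension index."""
    positions = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = positions / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even indices
    pe[:, 1::2] = np.cos(angles)                           # odd indices
    return pe

print(positional_encoding(max_len=50, d_model=16).shape)   # (50, 16): one row per position
```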
Frequently Asked Questions (FAQ)
Are the Sequence Models Coursera quiz answers accurate?
Yes, these answers are thoroughly verified to align with the latest course material on sequence models and their applications in deep learning.
Can I use these answers for both practice and graded quizzes?
Absolutely! These answers are applicable to both practice quizzes and graded assessments, ensuring comprehensive preparation for all evaluations.
Does this guide cover all modules in the course?
Yes, this guide provides answers for all modules, ensuring complete coverage of the course.
Will this guide help me understand sequence models better?
Yes, this guide reinforces key concepts such as RNN architectures, LSTMs, GRUs, sequence-to-sequence models, and attention mechanisms, helping you build a strong foundation in sequence modeling.
Conclusion
We hope this guide to Sequence Models Quiz Answers helps you master the concepts of sequential data processing and succeed in your course. Bookmark this page for quick reference and share it with your classmates. Ready to explore the power of sequence models and ace your quizzes? Let’s get started!
Get all Course Quiz Answers of Deep Learning Specialization
Course 01: Neural Networks and Deep Learning Coursera Quiz Answers
Course 03: Structuring Machine Learning Projects Coursera Quiz Answers
Course 04: Convolutional Neural Networks Coursera Quiz Answers
Course 05: Sequence Models Coursera Quiz Answers