## Get All Weeks Sample-based Learning Methods Coursera Quiz Answers

In this course, you will learn about several algorithms that can learn near-optimal policies based on trial and error interaction with the environment—learning from the agent’s own experience. Learning from actual experience is striking because it requires no prior knowledge of the environment’s dynamics, yet can still attain optimal behavior.

We will cover intuitively simple but powerful Monte Carlo methods, and temporal difference learning methods including Q-learning. We will wrap up this course by investigating how we can get the best of both worlds: algorithms that combine model-based planning (similar to dynamic programming) with temporal difference updates to radically accelerate learning.

### Week 01 Quiz Answers

Q1. Which approach ensures continual (never-ending) exploration? (**Select all that apply**)

- Exploring starts
- On-policy learning with a **deterministic** policy
- On-policy learning with an *ϵ*-soft policy
- Off-policy learning with an *ϵ*-soft behavior policy and a **deterministic** target policy
- Off-policy learning with an *ϵ*-soft target policy and a **deterministic** behavior policy

Q2. When can Monte Carlo methods, as defined in the course, be applied? (Select all that apply)

- When the problem is **continuing** and given a batch of data containing sequences of states, actions, and rewards
- When the problem is **continuing** and there is a model that produces samples of the next state and reward
- When the problem is **episodic** and given a batch of data containing sample episodes (sequences of states, actions, and rewards)
- When the problem is **episodic** and there is a model that produces samples of the next state and reward

Q3. Which of the following learning settings are examples of off-policy learning? (Select all that apply)

- Learning the optimal policy while continuing to explore
- Learning from data generated by a human expert

Q4. If a trajectory starts at time *t* and ends at time *T*, what is its relative probability under the target policy *π* and the behavior policy *b*?

Hint: pay attention to the time subscripts of *A* and *S* in the answers below.

Hint: Sums and products are not the same things!

- $\prod_{k=t}^{T-1}\frac{\pi(A_k\mid S_k)}{b(A_k\mid S_k)}$
- $\sum_{k=t}^{T-1}\frac{\pi(A_k\mid S_k)}{b(A_k\mid S_k)}$
- $\frac{\pi(A_{T-1}\mid S_{T-1})}{b(A_{T-1}\mid S_{T-1})}$
- $\frac{\pi(A_{t}\mid S_{t})}{b(A_{t}\mid S_{t})}$
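The product form can be sanity-checked with a few lines of code. This is a minimal sketch; the two-state policies and the trajectory below are made-up illustrations, not from the quiz.

```python
# Importance sampling ratio for a trajectory from time t to T-1:
# rho = product over k of pi(A_k | S_k) / b(A_k | S_k).

def importance_sampling_ratio(traj, pi, b):
    """traj: list of (state, action) pairs for k = t, ..., T-1."""
    rho = 1.0
    for s, a in traj:
        rho *= pi[s][a] / b[s][a]  # a product over steps, not a sum
    return rho

# Hypothetical deterministic-ish target policy pi and uniform behavior policy b.
pi = {"s0": {"left": 1.0, "right": 0.0}, "s1": {"left": 0.5, "right": 0.5}}
b  = {"s0": {"left": 0.5, "right": 0.5}, "s1": {"left": 0.5, "right": 0.5}}
traj = [("s0", "left"), ("s1", "right")]

print(importance_sampling_ratio(traj, pi, b))  # (1.0/0.5) * (0.5/0.5) = 2.0
```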

Q5. When is it possible to determine a policy that is greedy with respect to the value functions $v_\pi$, $q_\pi$ for the policy *π*? (Select all that apply)

- When state values $v_\pi$ and a model are available
- When state values $v_\pi$ are available but no model is available
- When action values $q_\pi$ and a model are available
- When action values $q_\pi$ are available but no model is available

Q6. Monte Carlo methods in Reinforcement Learning work by…

Hint: recall we used the term *sweep* in dynamic programming to discuss updating all the states systematically. This is **not** the same as visiting a state.

- Performing **sweeps** through the state set
- Averaging sample returns
- Averaging sample rewards
- **Planning** with a model of the environment

Q7. Suppose the state *s* has been visited three times, with corresponding returns 8, 4, and 3. What is the current Monte Carlo estimate for the value of *s*?

- 3
- 15
- 5
- 3.5
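The Monte Carlo estimate is just the sample average of the observed returns, which a one-liner confirms:

```python
# Monte Carlo estimate of V(s): the average of the returns observed from s.
returns = [8.0, 4.0, 3.0]  # the three returns from the question
v_estimate = sum(returns) / len(returns)
print(v_estimate)  # 5.0
```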

Q8. When does Monte Carlo prediction perform its first update?

- After the first time step
- After every state is visited at least once
- At the end of the first episode

Q9. In Monte Carlo prediction of state-values, **memory** requirements depend on: (Select all that apply)

Hint: think of the two data structures used in the algorithm

- The number of states
- The number of possible actions in each state
- The length of episodes

Q10. In an *ϵ*-greedy policy over $\mathcal{A}$ actions, what is the probability of the highest valued action if there are no other actions with the same value?

- $1-\epsilon$
- $\epsilon$
- $1-\epsilon+\frac{\epsilon}{\mathcal{A}}$
- $\frac{\epsilon}{\mathcal{A}}$
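A quick sketch makes the split clear: every action starts with the exploration share ϵ/|𝒜|, and the greedy action additionally receives the remaining 1 − ϵ. The Q-values below are illustrative.

```python
# Probabilities under an epsilon-greedy policy with |A| actions.
# Each action gets eps/|A|; the greedy action gets an extra 1 - eps on top.

def epsilon_greedy_probs(q_values, eps):
    n = len(q_values)
    greedy = max(range(n), key=lambda a: q_values[a])  # highest-valued action
    probs = [eps / n] * n
    probs[greedy] += 1 - eps
    return probs

probs = epsilon_greedy_probs([0.1, 0.9, 0.3], eps=0.3)
print(probs)  # greedy action gets about 0.8, the others about 0.1 each
```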

### Week 02 Quiz Answers

Q1. TD(0) is a solution method for:

- Control
- Prediction

Q2. Which of the following methods use bootstrapping? (Select all that apply)

- Dynamic Programming
- Monte Carlo
- TD(0)

Q3. Which of the following is the correct characterization of Dynamic Programming (DP) and Temporal Difference (TD) methods?

- Both TD methods and DP methods require a model: the dynamics function p.
- Neither TD methods nor DP methods require a model: the dynamics function p.
- TD methods require a model, the dynamics function p, but Monte-Carlo methods do not.
- DP methods require a model, the dynamics function p, but TD methods do not.

Q4. Match the algorithm name to its correct update (**select all that apply**)

- TD(0): $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$
- Monte Carlo: $V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$
- TD(0): $V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
- Monte Carlo: $V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
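Written as code, the two updates differ only in their target: TD(0) bootstraps from the next state's estimated value, while Monte Carlo uses the full sample return. The numbers below are illustrative.

```python
# The two updates from Q4 as one-line functions.
# alpha is the step size, gamma the discount factor.

def td0_update(v_s, r_next, v_s_next, alpha, gamma):
    # TD(0): bootstrap from the estimated value of the next state.
    return v_s + alpha * (r_next + gamma * v_s_next - v_s)

def mc_update(v_s, g_t, alpha):
    # Monte Carlo: move toward the full sample return G_t.
    return v_s + alpha * (g_t - v_s)

print(td0_update(v_s=0.5, r_next=1.0, v_s_next=0.0, alpha=0.1, gamma=1.0))
print(mc_update(v_s=0.5, g_t=1.0, alpha=0.1))  # both move V toward 1.0
```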

Q5. Which of the following correctly describe Temporal Difference (TD) and Monte Carlo (MC) methods? (Select all that apply)

- TD methods can be used in *continuing* tasks.
- MC methods can be used in *continuing* tasks.
- TD methods can be used in *episodic* tasks.
- MC methods can be used in *episodic* tasks.

Q6. In an episodic setting, we might have different updates depending on whether the next state is terminal or non-terminal. Which of the following TD error calculations are correct?

- $S_{t+1}$ is non-terminal: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
- $S_{t+1}$ is non-terminal: $\delta_t = R_{t+1} - V(S_t)$
- $S_{t+1}$ is terminal: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ with $V(S_{t+1}) = 0$
- $S_{t+1}$ is terminal: $\delta_t = R_{t+1} - V(S_t)$

Q7. Suppose we have current estimates for the value of two states: V(A) = 1.0, V(B) = 1.0 in an episodic setting. We observe the following trajectory: A, 0, B, 1, B, 0, T, where T is a terminal state. Apply TD(0) with step size $\alpha = 1$ and discount factor $\gamma = 0.5$. What are the value estimates for state A and state B at the end of the episode?

- (1.0, 1.0)
- (0.5, 0)
- (0, 1.5)
- (1, 0)
- (0, 0)
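The episode can be worked through mechanically by applying the TD(0) update to each transition in order, with the terminal state's value fixed at 0:

```python
# Walk through the trajectory A, 0, B, 1, B, 0, T from Q7 with TD(0),
# alpha = 1 and gamma = 0.5, starting from V(A) = V(B) = 1.0.
V = {"A": 1.0, "B": 1.0, "T": 0.0}  # terminal state T has value 0
alpha, gamma = 1.0, 0.5
transitions = [("A", 0.0, "B"), ("B", 1.0, "B"), ("B", 0.0, "T")]

for s, r, s_next in transitions:
    V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update

print(V["A"], V["B"])  # -> 0.5 0.0
```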

Q8. Which of the following pairs is the correct characterization of the targets used in TD(0) and Monte Carlo?

- TD(0): High Variance Target, Monte Carlo: High Variance Target
- TD(0): High Variance Target, Monte Carlo: Low Variance Target
- TD(0): Low Variance Target, Monte Carlo: High Variance Target
- TD(0): Low Variance Target, Monte Carlo: Low Variance Target

Q9. Suppose you observe the following episodes of the form (State, Reward, …) from a Markov Decision Process with states A and B:

| Episodes |
|---|
| A, 0, B, 0 |
| B, 1 |
| B, 1 |
| B, 1 |
| B, 0 |
| B, 0 |
| B, 1 |
| B, 0 |

What would batch Monte Carlo methods give for the estimates V(A) and V(B)? What would batch TD(0) give for the estimates V(A) and V(B)? Use a discount factor $\gamma = 1$.

For batch MC: compute the average returns observed from each state. For batch TD: you can start with state B. What is its expected return? Then figure out V(A) using the temporal difference equation: $V(S_t) = \mathbb{E}[R_{t+1} + \gamma V(S_{t+1})]$.

Answers are provided in the following format:

- $V^\text{batch-MC}(A)$ is the value of state *A* under Monte Carlo learning
- $V^\text{batch-MC}(B)$ is the value of state *B* under Monte Carlo learning
- $V^\text{batch-TD}(A)$ is the value of state *A* under TD learning
- $V^\text{batch-TD}(B)$ is the value of state *B* under TD learning

Hint: review example 6.3 in Sutton and Barto; this question is the same, just with different numbers.

- $V^\text{batch-MC}(A)=0$, $V^\text{batch-MC}(B)=0.5$, $V^\text{batch-TD}(A)=0.5$, $V^\text{batch-TD}(B)=0.5$
- $V^\text{batch-MC}(A)=0$, $V^\text{batch-MC}(B)=0.5$, $V^\text{batch-TD}(A)=0$, $V^\text{batch-TD}(B)=0.5$
- $V^\text{batch-MC}(A)=0$, $V^\text{batch-MC}(B)=0.5$, $V^\text{batch-TD}(A)=0$, $V^\text{batch-TD}(B)=0$
- $V^\text{batch-MC}(A)=0$, $V^\text{batch-MC}(B)=0.5$, $V^\text{batch-TD}(A)=1.5$, $V^\text{batch-TD}(B)=0.5$
- $V^\text{batch-MC}(A)=0.5$, $V^\text{batch-MC}(B)=0.5$, $V^\text{batch-TD}(A)=0.5$, $V^\text{batch-TD}(B)=0.5$
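The batch MC side of this question can be checked directly: with γ = 1, just accumulate each episode's rewards backwards and average the returns seen from each state.

```python
# Batch Monte Carlo estimates for Q9: average the (gamma = 1) returns
# observed from each state across the eight episodes in the table.
episodes = [[("A", 0.0), ("B", 0.0)],
            [("B", 1.0)], [("B", 1.0)], [("B", 1.0)],
            [("B", 0.0)], [("B", 0.0)], [("B", 1.0)], [("B", 0.0)]]

returns = {"A": [], "B": []}
for ep in episodes:
    g = 0.0
    for state, reward in reversed(ep):  # accumulate the return backwards
        g += reward                     # gamma = 1, so no discounting
        returns[state].append(g)

v_mc = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(v_mc)  # {'A': 0.0, 'B': 0.5}
```

Batch TD(0), by contrast, exploits the Markov structure: A always transitions to B with reward 0, so it assigns $V(A) = 0 + 1 \cdot V(B) = 0.5$ even though A's only observed return was 0.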

Q10. True or False: “Both TD(0) and Monte-Carlo (MC) methods converge to the true value function asymptotically, given that the environment is Markovian.”

- True
- False

Q11. Which of the following pairs is the correct characterization of the TD(0) and Monte-Carlo (MC) methods?

- Both TD(0) and MC are offline methods.
- Both TD(0) and MC are online methods.
- TD(0) is an online method while MC is an offline method.
- MC is an online method while TD(0) is an offline method.

### Week 03 Quiz Answers

Q1. What is the target policy in Q-learning?

- *ϵ*-greedy with respect to the current action-value estimates
- Greedy with respect to the current action-value estimates

Q2. Which Bellman equation is the basis for the Q-learning update?

- Bellman equation for state values
- Bellman equation for action values
- Bellman optimality equation for state values
- Bellman optimality equation for action values

Q3. Which Bellman equation is the basis for the Sarsa update?

- Bellman equation for state values
- Bellman equation for action values
- Bellman optimality equation for state values
- Bellman optimality equation for action values

Q4. Which Bellman equation is the basis for the Expected Sarsa update?

- Bellman equation for state values
- Bellman equation for action values
- Bellman optimality equation for state values
- Bellman optimality equation for action values
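The answers to Q2 through Q4 are easiest to see by writing the three TD targets side by side: Sarsa uses the sampled next action, Q-learning maximizes (its Bellman optimality basis), and Expected Sarsa takes an expectation under the policy. The action values and probabilities below are made-up illustrations.

```python
# The three one-step TD targets, side by side. q_next holds action values
# for the next state; pi_next gives action probabilities for Expected Sarsa.

def sarsa_target(r, gamma, q_next, a_next):
    return r + gamma * q_next[a_next]        # uses the sampled next action

def q_learning_target(r, gamma, q_next):
    return r + gamma * max(q_next.values())  # greedy (optimality) target

def expected_sarsa_target(r, gamma, q_next, pi_next):
    return r + gamma * sum(pi_next[a] * q_next[a] for a in q_next)  # expectation

q_next = {"left": 1.0, "right": 3.0}
pi_next = {"left": 0.25, "right": 0.75}  # hypothetical epsilon-greedy probs

print(sarsa_target(0.0, 1.0, q_next, "left"))            # 1.0
print(q_learning_target(0.0, 1.0, q_next))               # 3.0
print(expected_sarsa_target(0.0, 1.0, q_next, pi_next))  # 0.25*1 + 0.75*3 = 2.5
```

This also illustrates Q5 and Q6: Expected Sarsa sums over all actions (more computation per step), while Sarsa's target depends on which single action happened to be sampled (higher variance).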

Q5. Which algorithm’s update requires more computation per step?

- Expected Sarsa
- Sarsa

Q6. Which algorithm has a higher variance target?

- Expected Sarsa
- Sarsa

Q7. Q-learning does not learn about the outcomes of exploratory actions.

- True
- False

Q8. Sarsa, Q-learning, and Expected Sarsa have similar targets on a transition to a terminal state.

- True
- False

Q9. Sarsa needs to wait until the end of an episode before performing its update.

- True
- False

### Week 04 Quiz Answers

Q1. Which of the following are the most accurate characterizations of sample models and distribution models? (Select all that apply)

- Both sample models and distribution models can be used to obtain a possible next state and reward, given the current state and action.
- A distribution model can be used as a sample model.
- A sample model can be used to compute the probability of all possible trajectories in an episodic task based on the current state and action.
- A sample model can be used to obtain a possible next state and reward given the current state and action, whereas a distribution model can only be used to compute the probability of this next state and reward given the current state and action.
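The "a distribution model can be used as a sample model" option can be illustrated in a few lines: given the full next-state distribution, you recover a sample model by drawing from it. The transition probabilities below are made up.

```python
# A distribution model maps (state, action) to a full distribution over
# (next_state, reward) outcomes; sampling from it yields a sample model.
import random

dist_model = {  # (state, action) -> list of ((next_state, reward), probability)
    ("s0", "up"): [(("s1", 1.0), 0.7), (("s0", 0.0), 0.3)],
}

def sample(state, action, rng):
    """Use the distribution model as a sample model: draw one outcome."""
    outcomes, probs = zip(*dist_model[(state, action)])
    return rng.choices(outcomes, weights=probs, k=1)[0]

rng = random.Random(0)
print(sample("s0", "up", rng))  # one possible (next_state, reward)
```

The reverse direction does not work: a sample model alone cannot return the probability of an outcome, only draws from it.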

Q2. Which of the following statements are TRUE for Dyna architecture? (Select all that apply)

- Real experience can be used to improve the value function and policy
- Simulated experience can be used to improve the model
- Real experience can be used to improve the model
- Simulated experience can be used to improve the value function and policy

Q3. Mark all the statements that are TRUE for the tabular Dyna-Q algorithm. (Select all that apply)

- The memory requirements for the model in case of a deterministic environment are quadratic in the number of states
- The environment is assumed to be deterministic.
- The algorithm **cannot** be extended to stochastic environments.
- For a given state-action pair, the model predicts the next state and reward
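The planning loop these questions describe can be sketched in a few lines, assuming (as in Q3) a deterministic environment whose model stores the single observed (reward, next state) for each visited state-action pair. The tiny example world is hypothetical.

```python
# A minimal tabular Dyna-Q planning step (indirect RL): replay n_steps
# simulated transitions from a learned deterministic model.
import random

def dyna_q_planning(Q, model, n_steps, alpha, gamma, rng):
    for _ in range(n_steps):
        s, a = rng.choice(list(model))  # a previously experienced (s, a) pair
        r, s_next = model[(s, a)]       # deterministic model lookup
        best_next = max(Q[s_next].values()) if s_next in Q else 0.0
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])  # Q-learning update

# Hypothetical example: one real transition s0 --up--> s1 recorded in the model.
Q = {"s0": {"up": 0.0}, "s1": {"up": 1.0}}
model = {("s0", "up"): (0.0, "s1")}
dyna_q_planning(Q, model, n_steps=5, alpha=0.5, gamma=1.0, rng=random.Random(0))
print(Q["s0"]["up"])  # value propagated backward from simulated experience
```

With N such planning steps per environment interaction, planning dominates the per-step computation, which is the point of Q5.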

Q4. Which of the following statements are TRUE? (Select all that apply)

- Model-based methods often suffer more from bias than model-free methods, because of inaccuracies in the model.
- Model-based methods like Dyna typically require more memory than model-free methods like Q-learning.
- When compared with model-free methods, model-based methods are relatively more sample efficient. They can achieve a comparable performance with comparatively fewer environmental interactions.
- The amount of computation per interaction with the environment is larger in the Dyna-Q algorithm (with non-zero planning steps) as compared to the Q-learning algorithm.

Q5. Which of the following is generally the most computationally expensive step of the Dyna-Q algorithm? Assume N>1 planning steps are being performed (e.g., N=20).

- Model learning (step e)
- Direct RL (step d)
- Action selection (step b)
- Planning (Indirect RL; step f)

Q6. What are some possible reasons for a learned model to be inaccurate? (Select all that apply)

- The agent’s policy has changed significantly from the beginning of training.
- There is too much exploration (e.g., epsilon in epsilon-greedy exploration is set to a high value of 0.5)
- The environment has changed.
- The transition dynamics of the environment are stochastic, and only a few transitions have been experienced.

Q7. In search control, which of the following methods is likely to make a Dyna agent perform better in problems with a large number of states (like the rod maneuvering problem in Chapter 8 of the textbook)? Recall that search control is the process that selects the starting states and actions in planning. Also recall the navigation example in the video lectures in which a large number of wasteful updates were being made because of the basic search control procedure in the Dyna-Q algorithm. (Select the best option)

- Select state-action pairs uniformly at random from all previously experienced pairs.
- Start backwards from state-action pairs that have had a non-zero update (e.g., from the state right beside a goal state). This avoids the otherwise wasteful computations from state-action pairs which have had no updates.
- Start with state-action pairs enumerated in a fixed order (e.g., in a gridworld, states top-left to bottom-right, actions up, down, left, right)
- All of these are equally good/bad.

Q8. In the lectures, we saw how the Dyna-Q+ agent found the newly-opened shortcut in the shortcut maze, whereas the Dyna-Q agent didn’t. Which of the following implications drawn from the figure are TRUE? (Select all that apply)

- The Dyna-Q+ agent performs better than the Dyna-Q agent even in the first half of the experiment because of the increased exploration.
- The Dyna-Q agent can never discover shortcuts (i.e., when the environment changes to become better than it was before).
- The difference between Dyna-Q+ and Dyna-Q narrowed slightly over the first part of the experiment. This is because the Dyna-Q+ agent keeps exploring even when the environment isn’t changing.
- None of the above are true.

Q9. Consider the gridworld depicted in the diagram below. There are four actions corresponding to up, down, right, and left movements. Marked is the path taken by an agent in a single episode, ending at a location of high reward, marked by the G. In this example the values were all zero at the start of the episode, and all rewards were zero during the episode except for a positive reward at G.

Now, which of the following figures best depicts the action values that would have increased by the end of the episode using *one*-step Sarsa and *500*-step-planning Dyna-Q? (Select the best option)

Q10. Which of the following are planning methods? (Select all that apply)

- Dyna-Q
- Expected Sarsa
- Value Iteration
- Q-learning

##### Sample-based Learning Methods Course Review

In our experience, we suggest you enroll in the Sample-based Learning Methods course and gain some new skills from professionals completely free, and we assure you it will be worth it.

If you are stuck anywhere between a quiz or a graded assessment in the Sample-based Learning Methods course, just visit Networking Funda to get the Sample-based Learning Methods Coursera Quiz Answers.

##### Conclusion:

I hope these Sample-based Learning Methods Coursera Quiz Answers are useful for you to learn something new from this course. If they helped you, then don't forget to bookmark our site for more Quiz Answers.

This course is intended for audiences of all experiences who are interested in learning about new skills in a business context; there are no prerequisite courses.

Keep Learning!

##### Get All Course Quiz Answers of Reinforcement Learning Specialization

Fundamentals of Reinforcement Learning Quiz Answers

Sample-based Learning Methods Coursera Quiz Answers

Prediction and Control with Function Approximation Quiz Answers