
# Fundamentals of Reinforcement Learning Quiz Answers

## Get All Weeks Fundamentals of Reinforcement Learning Quiz Answers

### Fundamentals of Reinforcement Learning Week 01 Quiz Answers

#### Quiz 1: Sequential Decision-Making

Q1. What is the incremental rule (sample average) for action values?

[expand title=View Answer] Q_{n+1} = Q_n + \frac{1}{n} [R_n - Q_n] [/expand]
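As a quick check, the incremental rule can be run next to the ordinary batch mean. This is a minimal sketch: the function name `incremental_mean` and the reward list are illustrative assumptions, not part of the course material.

```python
def incremental_mean(rewards):
    """Return the running sample-average estimates after each reward."""
    q = 0.0  # initial estimate; the first update (step size 1/1) overwrites it
    estimates = []
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n  # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
        estimates.append(q)
    return estimates

rewards = [1.0, 0.0, 2.0, 5.0]  # hypothetical rewards for one action
estimates = incremental_mean(rewards)
print(estimates[-1], sum(rewards) / len(rewards))  # both are 2.0
```

The update only needs the previous estimate and the count n, so no reward history has to be stored.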

Q2. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.

q_{n+1}=q_n+\alpha_n[R_n -q_n]

Given the estimated update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?

[expand title=View Answer] 1/2 [/expand]

Q3. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.

q_{n+1}=q_n+\alpha_n[R_n -q_n]

Given the estimated update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?

[expand title=View Answer] 1/8 [/expand]

Q4. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.

q_{n+1}=q_n+\alpha_n[R_n -q_n]

Given the estimated update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?

[expand title=View Answer]1.0 [/expand]
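To get a feel for how a constant step size shapes the red curve, the sketch below applies the update to a fixed target. It assumes a constant, noise-free target of 1.0 and an initial estimate of 0.0. The quiz figures are not reproduced here, so these numbers are illustrative only, and `track` is a hypothetical helper.

```python
def track(target, alpha, steps, q0=0.0):
    """Apply q <- q + alpha * (target - q) repeatedly to a fixed target."""
    q = q0
    history = [q]
    for _ in range(steps):
        q += alpha * (target - q)  # close a fraction alpha of the remaining gap
        history.append(q)
    return history

for alpha in (1.0, 0.5, 0.125):
    print(alpha, track(target=1.0, alpha=alpha, steps=4))
```

With a step size of 1.0 the estimate jumps to the target in one step, with 1/2 it halves the gap each step, and with 1/8 it approaches the target slowly and smoothly.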

Q5. Equation 2.5 (from the SB textbook, 2nd edition) is a key update rule we will use throughout the Specialization. We discussed this equation extensively in video. This exercise will give you a better hands-on feel for how it works. The blue line is the target that we might estimate with equation 2.5. The red line is our estimate plotted over time.

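q_{n+1}=q_n+\alpha_n[R_n -q_n]

Given the estimated update in red, what do you think was the value of the step size parameter we used to update the estimate on each time step?

[expand title=View Answer] 1 / (t - 1) [/expand]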

Q6. What is the exploration/exploitation tradeoff?

[expand title=View Answer]The agent wants to explore to get more accurate estimates of its values. The agent also wants to exploit to get more rewards. The agent cannot, however, choose to do both simultaneously.[/expand]

Q7. Why did an epsilon of 0.1 perform better over 1000 steps than an epsilon of 0.01?

[expand title=View Answer] The 0.01 agent did not explore enough. Thus it ended up selecting a suboptimal arm for longer. [/expand]

Q8. If exploration is so great why did an epsilon of 0.0 (a greedy agent) perform better than an epsilon of 0.4?

[expand title=View Answer] An epsilon of 0.4 explores so often that the agent takes many sub-optimal actions, causing it to do worse over the long term. [/expand]
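The effect of epsilon can be explored with a small simulation. The following is a rough sketch only: the arm means, the unit-variance Gaussian rewards, and the function names are assumptions made for illustration. A single 1000-step run is noisy, so comparisons like those in Q7 and Q8 would normally be averaged over many independent runs.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random arm with probability epsilon, otherwise a greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties between equally valued arms at random.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])

def run_bandit(true_means, epsilon, steps, seed=0):
    """Run one sample-average agent on a Gaussian bandit; return its average reward."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)
    counts = [0] * len(true_means)
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        r = rng.gauss(true_means[a], 1.0)  # assumed reward noise: unit variance
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]  # incremental sample-average update
        total += r
    return total / steps

true_means = [0.2, -0.5, 1.0, 0.1]  # hypothetical arm means
for eps in (0.0, 0.01, 0.1, 0.4):
    print(eps, round(run_bandit(true_means, eps, steps=1000), 3))
```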

### Fundamentals of Reinforcement Learning Week 02 Quiz Answers

#### Quiz 1: MDPs Quiz Answers

Q1. The learner and decision maker is the _____.

[expand title=View Answer] Agent[/expand]

Q2. At each time step the agent takes an _____.

[expand title=View Answer] Action [/expand]

Q3. Imagine the agent is learning in an episodic problem. Which of the following is true?

[expand title=View Answer] The number of steps in an episode is stochastic: each episode can have a different number of steps. [/expand]

Q4. If the reward is always +1, what is the sum of the discounted infinite return when \gamma < 1?

G_t=\sum_{k=0}^{\infty} \gamma^{k}R_{t+k+1}

[expand title=View Answer] G_t = \frac{1}{1-\gamma} [/expand]
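The closed form follows from the geometric series. A short derivation, assuming R_{t+k+1} = 1 for all k and 0 \le \gamma < 1:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \cdot 1
    = 1 + \gamma \sum_{k=0}^{\infty} \gamma^{k}
    = 1 + \gamma G_t
\quad\Longrightarrow\quad
G_t = \frac{1}{1-\gamma}
```

For example, \gamma = 0.9 gives a return of 10 even though the sum has infinitely many terms.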

Q5. How does the magnitude of the discount factor (gamma, \gamma) affect learning?

[expand title=View Answer] With a larger discount factor the agent is more far-sighted and considers rewards farther into the future. [/expand]

Q6. Suppose \gamma = 0.8 and we observe the following sequence of rewards: R_1 = -3, R_2 = 5, R_3 = 2, R_4 = 7, and R_5 = 1, with T = 5. What is G_0? Hint: Work backwards and recall that G_t = R_{t+1} + \gamma G_{t+1}.

[expand title=View Answer] 6.2736[/expand]
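The hint's backward recursion can be checked in a few lines. This is a minimal sketch in which `discounted_return` is a hypothetical helper name. It starts from G_T = 0 and applies G_t = R_{t+1} + \gamma G_{t+1} from the last reward to the first.

```python
def discounted_return(rewards, gamma):
    """Compute G_0 from [R_1, ..., R_T] by working backwards from G_T = 0."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g

g0 = discounted_return([-3, 5, 2, 7, 1], gamma=0.8)
print(round(g0, 4))  # 6.2736
```

The intermediate values are G_4 = 1, G_3 = 7.8, G_2 = 8.24, G_1 = 11.592, and G_0 = 6.2736.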

Q7. What does MDP stand for?

[expand title=View Answer] Markov Decision Process [/expand]

Q8. Suppose reinforcement learning is being applied to determine moment-by-moment temperatures and stirring rates for a bioreactor (a large vat of nutrients and bacteria used to produce useful chemicals). The actions in such an application might be target temperatures and target stirring rates that are passed to lower-level control systems that, in turn, directly activate heating elements and motors to attain the targets. The states are likely to be thermocouples and other sensory readings, perhaps filtered and delayed, plus symbolic inputs representing the ingredients in the vat and the target chemical. The rewards might be moment-by-moment measures of the rate at which the useful chemical is produced by the bioreactor.

Notice that here each state is a list, or vector, of sensor readings and symbolic inputs, and each action is a vector consisting of a target temperature and a stirring rate.

Is this a valid MDP?

[expand title=View Answer] Yes, assuming the state captures the relevant sensory information (including historical values to account for sensor delays). It is typical of reinforcement learning tasks to have states and actions with such structured representations; the states might be constructed by processing the raw sensor information in a variety of ways. [/expand]

Q9. Case 1: Imagine that you are a vision system. When you are first turned on for the day, an image floods into your camera. You can see lots of things, but not all things. You can’t see objects that are occluded, and of course, you can’t see objects that are behind you. After seeing that first scene, do you have access to the Markov state of the environment?

Case 2: Imagine that the vision system never worked properly: it always returned the same static image, forever. Would you have access to the Markov state then? (Hint: Reason about P(S_{t+1} | S_t, …, S_0) when S_{t+1} = AllWhitePixels.)

[expand title=View Answer] You have access to the Markov state in Case 1, but you don’t have access to the Markov state in Case 2. [/expand]

Q10. What is the reward hypothesis?

[expand title=View Answer] That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward) [/expand]

Q11. Imagine an agent is in a maze-like grid world. You would like the agent to find the goal as quickly as possible. You give the agent a reward of +1 when it reaches the goal, and the discount rate is 1.0 because this is an episodic task. When you run the agent, it finds the goal but does not seem to care how long it takes to complete each episode. How could you fix this? (Select all that apply)

[expand title=View Answer]


Set a discount rate less than 1 and greater than 0, like 0.9.

Give the agent -1 at each time step.

[/expand]
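To see why the agent is indifferent to episode length in the original setup, compare the return of a fast and a slow episode under different reward schemes. This is a sketch under stated assumptions: non-goal steps in the original setup give reward 0, the episode lengths of 5 and 20 steps are arbitrary, the -1 scheme also pays -1 on the goal step, and `episode_return` is a hypothetical helper.

```python
def episode_return(steps_to_goal, step_reward, goal_reward, gamma):
    """Return G_0 for an episode that reaches the goal on its last step."""
    g = 0.0
    for k in range(steps_to_goal):
        is_last = (k == steps_to_goal - 1)
        r = goal_reward if is_last else step_reward
        g += (gamma ** k) * r
    return g

schemes = {
    "original (+1 at goal, gamma=1)": dict(step_reward=0.0, goal_reward=1.0, gamma=1.0),
    "discounted (gamma=0.9)": dict(step_reward=0.0, goal_reward=1.0, gamma=0.9),
    "-1 per step (gamma=1)": dict(step_reward=-1.0, goal_reward=-1.0, gamma=1.0),
}
for name, cfg in schemes.items():
    fast = episode_return(5, **cfg)
    slow = episode_return(20, **cfg)
    print(f"{name}: fast={fast:.3f} slow={slow:.3f}")
```

Under the original scheme both episodes return exactly 1, so nothing favors the shorter one. With discounting or a per-step penalty, the shorter episode has the strictly higher return.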

Q12. When may you want to formulate a problem as episodic?

[expand title=View Answer] When the agent-environment interaction naturally breaks into sequences. Each sequence begins independently of how the previous episode ended. [/expand]

### Fundamentals of Reinforcement Learning Week 03 Quiz Answers

#### Quiz 1: [Practice] Value Functions and Bellman Equations Quiz Answers

Q1. A policy is a function which maps _____ to _____.

[expand title=View Answer] States to actions. [/expand]

Q2. The term “backup” most closely resembles the term _____ in meaning.

[expand title=View Answer]Update[/expand]

Q3. At least one deterministic optimal policy exists in every Markov decision process.

[expand title=View Answer] False [/expand]

Q4. The optimal state-value function:

[expand title=View Answer] Is unique in every finite Markov decision process. [/expand]

Q5. Does adding a constant to all rewards change the set of optimal policies in episodic tasks?

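[expand title=View Answer] Yes. In an episodic task, adding a constant to every reward changes each return by an amount that depends on the number of steps remaining in the episode, so policies that end episodes sooner or later are affected differently and the set of optimal policies can change. [/expand]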

Q6. Does adding a constant to all rewards change the set of optimal policies in continuing tasks?

[expand title=View Answer] No, as long as the relative differences between rewards remain the same, the set of optimal policies is the same. [/expand]
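For the continuing case, a short derivation shows why the ranking of policies is unaffected. It assumes a discounted task with 0 \le \gamma < 1 and a constant c added to every reward. The symbols c and G_t' are introduced here for illustration.

```latex
G_t' = \sum_{k=0}^{\infty} \gamma^{k} \left( R_{t+k+1} + c \right)
     = G_t + c \sum_{k=0}^{\infty} \gamma^{k}
     = G_t + \frac{c}{1-\gamma}
```

Every return, and therefore every state value under every policy, shifts by the same amount c / (1 - \gamma), so comparisons between policies are unchanged.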

#### Quiz 2: Value Functions and Bellman Equations Quiz Answers

Q1. A function which maps _____ to _____ is a value function. [Select all that apply]

[expand title=View Answer]

State-action pairs to expected returns.

States to expected returns.

[/expand]

Q2. Every finite Markov decision process has _____. [Select all that apply]

[expand title=View Answer] A unique optimal value function [/expand]

Q3. The Bellman equation for a given policy \pi: [Select all that apply]

[expand title=View Answer]

1. Expresses state values v(s) in terms of state values of successor states.


[/expand]

Q6. An optimal policy:

[expand title=View Answer] Is not guaranteed to be unique, even in finite Markov decision processes.[/expand]


### Fundamentals of Reinforcement Learning Week 04 Quiz Answers

#### Quiz 1: Dynamic Programming Quiz Answers

Q1. The value of any state under an optimal policy is _____ the value of that state under a non-optimal policy. [Select all that apply]

[expand title=View Answer] Greater than or equal to [/expand]

Q2. If a policy is greedy with respect to the value function for the equiprobable random policy, then it is guaranteed to be an optimal policy.

[expand title=View Answer] False [/expand]

Q3. Let v_{\pi}v

[expand title=View Answer]False [/expand]

Q4. What is the relationship between value iteration and policy iteration? [Select all that apply]

[expand title=View Answer]

Value iteration is a special case of policy iteration.

Policy iteration is a special case of value iteration.

Value iteration and policy iteration are both special cases of generalized policy iteration.

[/expand]

Q5. The word synchronous means “at the same time”. The word asynchronous means “not at the same time”. A dynamic programming algorithm is: [Select all that apply]

[expand title=View Answer]

Asynchronous, if it does not update all states at each iteration.

Synchronous, if it systematically sweeps the entire state space at each iteration.

Asynchronous, if it updates some states more than others.

[/expand]
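The difference between the two update styles can be sketched on a toy problem. Everything in this example is an assumption chosen for illustration: a five-state chain with a single "move right" action, reward -1 per transition, no discounting, and state 4 as the terminal state, so the true values are v(s) = -(4 - s).

```python
N_STATES, TERMINAL = 5, 4

def backup(v, s):
    """Expected update for non-terminal state s: v(s) <- r + v(s')."""
    return -1.0 + v[s + 1]

def synchronous_sweep(v):
    """Update every state from a frozen copy of the old values."""
    old = list(v)
    return [backup(old, s) if s != TERMINAL else 0.0 for s in range(N_STATES)]

def asynchronous_update(v, states):
    """Update only the listed states, in place, in the given order."""
    for s in states:
        if s != TERMINAL:
            v[s] = backup(v, s)
    return v

v_sync = [0.0] * N_STATES
for _ in range(4):  # four full sweeps are needed for values to reach state 0
    v_sync = synchronous_sweep(v_sync)

# One pass in a well-chosen order reaches the same values.
v_async = asynchronous_update([0.0] * N_STATES, states=[3, 2, 1, 0])
print(v_sync, v_async)
```

Both runs end at [-4.0, -3.0, -2.0, -1.0, 0.0]. The asynchronous version is free to choose which states to update and in what order, which is what lets it focus computation where it matters in large state spaces.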

Q6. All Generalized Policy Iteration algorithms are synchronous.

[expand title=View Answer] False[/expand]

Q7. Which of the following is true?

[expand title=View Answer] Asynchronous methods generally scale to large state spaces better than synchronous methods. [/expand]

Q8. Why are dynamic programming algorithms considered planning methods? [Select all that apply]

[expand title=View Answer]

1. They use a model to improve the policy.

2. They compute optimal value functions.

[/expand]

Q9. Consider the undiscounted, episodic MDP below. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The right half of the figure shows the value of each state under the equiprobable random policy. If \pi is the equiprobable random policy, what is q(7, down)?

[expand title=View Answer] q(7, down) = -15 [/expand]
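Action values can be read off state values through q(s, a) = r + \gamma v(s') when transitions are deterministic. The sketch below runs iterative policy evaluation for the equiprobable random policy and then evaluates one action value. The figure is not reproduced in this text, so the layout is an assumption: a 4x4 grid with cells numbered 0 to 15 row by row, the two corner cells 0 and 15 acting as the terminal state, and reward -1 on every transition, as in the textbook's small gridworld example.

```python
# Assumed layout (the figure is not shown in the text): 4x4 grid, cells 0..15 row by row,
# corner cells 0 and 15 are terminal, reward -1 on every transition, gamma = 1.
N = 4
TERMINAL = {0, 15}
MOVES = {"up": (-1, 0), "down": (1, 0), "right": (0, 1), "left": (0, -1)}

def step(s, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    row, col = divmod(s, N)
    dr, dc = MOVES[action]
    nr, nc = row + dr, col + dc
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc
    return s

def evaluate_random_policy(gamma=1.0, tol=1e-10):
    """In-place iterative policy evaluation for the equiprobable random policy."""
    v = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            new_v = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in MOVES)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

def q_value(v, s, action, gamma=1.0):
    """q(s, a) = r + gamma * v(s') for a deterministic transition with r = -1."""
    return -1.0 + gamma * v[step(s, action)]

v = evaluate_random_policy()
for row in range(N):
    print([round(x) for x in v[row * N:(row + 1) * N]])
print("q(7, down) =", round(q_value(v, 7, "down"), 3))
```

Under these assumptions, state 7 sits at the right edge of the second row and "down" leads to state 11, so the action value is one step of reward plus v(11).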

Q10. Consider the undiscounted, episodic MDP below. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The right half of the figure shows the value of each state under the equiprobable random policy. If \pi is the equiprobable random policy, what is v(15)? Hint: Recall the Bellman equation v(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a) [r + \gamma v(s')].

[expand title=View Answer] v(15) = -23 [/expand]
