Monte Carlo Methods
- Monte Carlo Prediction
- Monte Carlo Estimation of Action Values
- Monte Carlo Control
- Monte Carlo Control without Exploring Starts
- Off-policy Prediction via Importance Sampling
- Incremental Implementation
- Problems
Monte Carlo methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. They can find an optimal policy even though they learn from actual experience without complete knowledge of the environment's dynamics. When a simulated model is used, that model need only generate sample transitions, not the complete probability distributions over all possible transitions that dynamic programming requires.
Monte Carlo methods are based on averaging sample returns. To ensure that well-defined returns are available, Monte Carlo methods are defined here only for episodic tasks. Value estimates and policies are changed only on the completion of an episode. Monte Carlo methods can thus be incremental on an episode-by-episode basis, but not on a step-by-step (online) basis. The term "Monte Carlo" is often used more broadly for any estimation method whose operation involves a significant random component.
Monte Carlo Prediction
We wish to estimate $v_\pi(s)$, the value of a state $s$ under policy $\pi$, given a set of episodes obtained by following $\pi$ and passing through $s$. Each occurrence of state $s$ in an episode is called a visit to $s$; the first time $s$ is visited in an episode is called the first visit to $s$. The first-visit MC method estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, whereas the every-visit MC method averages the returns following all visits to $s$.
Pseudocode: First-visit MC method
- Input: a policy $\pi$ to be evaluated
- Initialize:
- $V(s) \in \mathbb{R}$, arbitrarily, $\forall s \in \mathcal{S}$
- $Returns(s) \leftarrow \text{an empty list}, \forall s \in \mathcal{S}$
- Loop forever (for each episode):
- Generate an episode following $\pi: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of the episode, $t = T-1, T-2, ..., 0$:
- $G \leftarrow \gamma G + R_{t+1}$
- Unless $S_t$ appears in $S_0, S_1, ..., S_{t-1}$:
- Append $G$ to $Returns(S_t)$
- $V(S_t) \leftarrow average(Returns(S_t))$
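The pseudocode above translates almost directly into code. Below is a minimal Python sketch of first-visit MC prediction; the `generate_episode(policy)` helper and its `(state, action, reward)` episode format are assumptions for illustration, not part of the algorithm itself.

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """First-visit MC prediction of V(s) for a fixed policy."""
    V = defaultdict(float)        # V(s), arbitrary initial values (here 0)
    returns = defaultdict(list)   # Returns(s): list of observed returns

    for _ in range(num_episodes):
        # Assumed helper: returns [(S0, A0, R1), (S1, A1, R2), ..., (S_{T-1}, A_{T-1}, R_T)]
        episode = generate_episode(policy)
        states = [s for s, _, _ in episode]
        G = 0.0
        # Work backwards: G accumulates the return from each step to the end.
        for t in range(len(episode) - 1, -1, -1):
            s, _, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:          # first-visit check
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```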
Both first-visit MC and every-visit MC converge to $v_\pi(s)$ as the number of visits (or first visits) to $s$ goes to infinity. In the first-visit case, each return is an independent, identically distributed estimate of $v_\pi(s)$ with finite variance. By the law of large numbers the sequence of averages of these estimates converges to their expected value. Each average is itself an unbiased estimate, and the standard deviation of its error falls as $\frac{1}{\sqrt{n}}$, where $n$ is the number of returns averaged. Every-visit MC is less straightforward, but its estimates also converge quadratically to $v_\pi(s)$.
Monte Carlo Estimation of Action Values
If a model is available, then state values alone are sufficient to determine a policy. Without a model, however, state values alone are not sufficient; we must explicitly estimate the value of each action in order to determine a policy. Thus, one of the primary goals of MC methods is to estimate $q_\star$.
Policy evaluation for action values estimates $q_\pi(s,a)$. The MC methods are essentially the same as those presented for state values, except that now we talk about visits to a state–action pair rather than to a state. A state–action pair $s, a$ is said to be visited in an episode if the state $s$ is visited and action $a$ is taken in it. The every-visit MC method estimates the value of a state–action pair as the average of the returns that have followed all visits to it, whereas the first-visit MC method averages the returns following the first time in each episode that the state was visited and the action was selected. These methods converge quadratically to the true expected values as the number of visits to each state–action pair goes to infinity.
The only complication is that many state–action pairs may never be visited. If $\pi$ is a deterministic policy, then in following $\pi$ one will observe returns only for one of the actions from each state. With no returns to average, the MC estimates of the other actions will not improve with experience. This is a serious problem because the purpose of learning action values is to help in choosing among the actions available in each state. To compare alternatives we need to estimate the value of all the actions from each state, not just the one we currently favor.
For policy evaluation to work for action values, we must assure continual exploration. One way to do this is by specifying that the episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start. This guarantees that all state-action pairs will be visited an infinite number of times in the limit of an infinite number of episodes. We call this the assumption of exploring starts.
The assumption of exploring starts cannot be relied upon in general, particularly when learning directly from actual interaction with the environment. The most common alternative approach to assuring that all state-action pairs are encountered is to consider only policies that are stochastic, with a nonzero probability of selecting every action in each state.
Monte Carlo Control
The overall idea of approximating optimal policies with MC methods proceeds according to the idea of generalized policy iteration (GPI). In GPI one maintains both an approximate policy and an approximate value function. The value function is repeatedly altered to more closely approximate the value function for the current policy, and the policy is repeatedly improved with respect to the current value function. These two kinds of changes work against each other to some extent, as each creates a moving target for the other, but together they cause both the policy and the value function to approach optimality.
Let us consider a Monte Carlo version of classical policy iteration. Assume that we observe an infinite number of episodes and that, in addition, the episodes are generated with exploring starts. Under these assumptions, the MC methods will compute $q_{\pi_k}$ exactly, for an arbitrary $\pi_k$.
Policy improvement is done by making the policy greedy with respect to the current value function. For any action-value function $q$, the corresponding greedy policy is the one that, for each $s \in \mathcal{S}$, deterministically chooses an action with maximal action-value:
\[\pi(s) \dot{=} \arg\max_a q(s,a)\]Policy improvement can then be done by constructing each $\pi_{k+1}$ as the greedy policy with respect to $q_{\pi_k}$. The policy improvement theorem applies to $\pi_k$ and $\pi_{k+1}$ because, for all $s \in \mathcal{S}$,
\[\begin{align} q_{\pi_k}(s, \pi_{k+1}(s)) &= q_{\pi_k}(s, \arg\max_a q_{\pi_k}(s,a)) \\ &= \max_a q_{\pi_k}(s,a) \\ &\geq q_{\pi_k}(s, \pi_k(s)) \\ &\geq v_{\pi_k}(s) \\ \end{align}\]We made two unlikely assumptions: that the episodes have exploring starts and that policy evaluation can be done with an infinite number of episodes. Both must be removed to obtain a practical algorithm. For now we retain the first assumption; the second is relatively easy to remove. The same issue arises in classical DP methods such as iterative policy evaluation, which also converge only asymptotically to the true value function. In both DP and MC methods there are two ways to solve the problem. One is to hold firm to the idea of approximating $q_{\pi_k}$ in each policy evaluation: measurements and assumptions are made to obtain bounds on the magnitude and probability of error in the estimates, and then sufficient steps are taken during each policy evaluation to ensure that the error is small enough. Although this can guarantee correct convergence up to some level of approximation, it is likely to require far too many episodes to be useful in practice.
The second way to avoid the infinite number of episodes nominally required for policy evaluation is to give up trying to complete policy evaluation before returning to policy improvement. On each evaluation step we move the value function toward $q_{\pi_k}$, but we do not expect it to get close except over many steps. One extreme form of this idea is value iteration, in which only one iteration of iterative policy evaluation is performed between each step of policy improvement. The in-place version of value iteration is even more extreme; there we alternate between improvement and evaluation steps for single states.
Pseudocode: Monte Carlo with Exploring Starts
For Monte Carlo policy iteration it is natural to alternate between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.
- Initialize:
- $\pi(s) \in \mathcal{A}(s)$ (arbitrarily), for all $s \in \mathcal{S}$
- $Q(s,a) \in \mathbb{R}$ (arbitrarily), for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- $Returns(s,a) \leftarrow $ empty list, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- Loop forever (for each episode):
- Choose $S_0 \in \mathcal{S}, A_0 \in \mathcal{A}(S_0)$ randomly such that all pairs have probability > 0
- Generate an episode from $S_0, A_0$, following $\pi$: $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T-1, T-2,\cdots,0$:
- $G \leftarrow \gamma G + R_{t+1}$
- Unless the pair $S_t, A_t$ appears in $S_0, A_0, S_1, A_1, \cdots, S_{t-1}, A_{t-1}$:
- Append $G$ to $Returns(S_t, A_t)$
- $Q(S_t, A_t) \leftarrow$ average($Returns(S_t, A_t)$)
- $\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)$
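For concreteness, here is a minimal Python sketch of Monte Carlo ES. It assumes a hypothetical environment object exposing `states`, `actions(s)`, and `generate_episode_from(s0, a0, policy)`; those names are illustrative only.

```python
import random
from collections import defaultdict

def mc_exploring_starts(env, num_episodes, gamma=1.0):
    """Monte Carlo control with exploring starts (first-visit updates)."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    pi = {s: random.choice(env.actions(s)) for s in env.states}   # arbitrary deterministic policy

    for _ in range(num_episodes):
        # Exploring start: every state-action pair has a nonzero chance of starting the episode.
        s0 = random.choice(env.states)
        a0 = random.choice(env.actions(s0))
        episode = env.generate_episode_from(s0, a0, pi)   # [(S, A, R), ...]
        pairs = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                   # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                pi[s] = max(env.actions(s), key=lambda act: Q[(s, act)])  # greedy improvement
    return pi, Q
```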
In Monte Carlo ES, all the returns for each state-action pair are accumulated and averaged, irrespective of what policy was in force when they were observed. It is easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it did, then the value function would eventually converge to the value function for that policy, and that in turn would cause the policy to change. Stability is achieved only when both the policy and the value function are optimal. Convergence to this optimal fixed point seems inevitable as the changes to the action-value function decrease over time, but has not yet been formally proved.
Monte Carlo Control without Exploring Starts
The only general way to avoid the unlikely assumption of exploring starts is for the agent to continue to select all actions. There are two approaches to ensuring this, resulting in what we call on-policy methods and off-policy methods. On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data. The Monte Carlo ES method above is an example of an on-policy method.
In on-policy control methods the policy is generally soft, meaning that $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$, but it is gradually shifted closer and closer to a deterministic optimal policy. The on-policy method used in this section uses $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\epsilon$ they instead select an action at random. That is, all nongreedy actions are given the minimal probability of selection, $\frac{\epsilon}{|\mathcal{A}(s)|}$, and the remaining bulk of the probability, $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}$, is given to the greedy action. The $\epsilon$-greedy policies are examples of $\epsilon$-soft policies, defined as policies for which $\pi(a|s) \geq \frac{\epsilon}{|\mathcal{A}(s)|}$ for all states and actions, for some $\epsilon > 0$.
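To make these probabilities concrete, here is a small Python sketch of the $\epsilon$-greedy distribution described above; `q_values` is an illustrative dict mapping actions to their estimated values.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Return the epsilon-greedy action probabilities for one state."""
    actions = list(q_values)
    greedy = max(actions, key=q_values.get)
    base = epsilon / len(actions)                # minimal probability for every action
    return {a: base + (1.0 - epsilon if a == greedy else 0.0) for a in actions}

# With epsilon = 0.1 and three actions, the greedy action gets
# 1 - 0.1 + 0.1/3 ~= 0.933 and each other action gets 0.1/3 ~= 0.033.
print(epsilon_greedy_probs({'left': 0.0, 'right': 1.0, 'stay': 0.5}, 0.1))
```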
Pseudocode: On-policy first visit Monte Carlo Control
The overall idea of on-policy MC control is still that of GPI. Without the assumption of exploring starts, however, we cannot simply improve the policy by making it greedy with respect to the current value function, because that would prevent further exploration of nongreedy actions. Fortunately, GPI does not require that the policy be taken all the way to a greedy policy, only that it be moved toward a greedy policy.
- Initialize:
- $\pi \leftarrow $ an arbitrary $\epsilon$-soft policy
- $Q(s,a) \in \mathbb{R}$ (arbitrarily), for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- $Returns(s,a) \leftarrow $ empty list, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- Loop forever (for each episode):
- Generate an episode following $\pi$: $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T-1, T-2,\cdots,0$:
- $G \leftarrow \gamma G + R_{t+1}$
- Unless the pair $S_t, A_t$ appears in $S_0, A_0, S_1, A_1, \cdots, S_{t-1}, A_{t-1}$:
- Append $G$ to $Returns(S_t, A_t)$
- $Q(S_t, A_t) \leftarrow$ average($Returns(S_t, A_t)$)
- $A^\star \leftarrow \arg\max_a Q(S_t, a)$
- For all $a \in \mathcal{A}(S_t)$:
-
$\large\pi(a|S_t) \leftarrow \begin{cases}
1 - \epsilon + \frac{\epsilon}{|\mathcal{A}(S_t)|} & \text{if } a = A^\star \\
\frac{\epsilon}{|\mathcal{A}(S_t)|} & \text{if } a \neq A^\star
\end{cases}$
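A minimal Python sketch of the on-policy first-visit MC control algorithm above follows. The environment interface (`env.states`, `env.actions(s)`, and `env.generate_episode(sample_action)`, which rolls out an episode by calling the supplied action-sampling function) is assumed for illustration.

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, num_episodes, epsilon=0.1, gamma=1.0):
    """On-policy first-visit MC control with an epsilon-soft policy."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    # Policy stored as action probabilities; start uniform (hence epsilon-soft).
    pi = {s: {a: 1.0 / len(env.actions(s)) for a in env.actions(s)} for s in env.states}

    def sample_action(s):
        actions, probs = zip(*pi[s].items())
        return random.choices(actions, weights=probs)[0]

    for _ in range(num_episodes):
        episode = env.generate_episode(sample_action)   # [(S, A, R), ...]
        pairs = [(s, a) for s, a, _ in episode]
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                 # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                a_star = max(env.actions(s), key=lambda act: Q[(s, act)])
                n = len(env.actions(s))
                for act in env.actions(s):              # epsilon-greedy improvement
                    pi[s][act] = 1 - epsilon + epsilon / n if act == a_star else epsilon / n
    return pi, Q
```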
That any $\epsilon$-greedy policy with respect to $q_\pi$ is an improvement over any $\epsilon$-soft policy $\pi$ is assured by the policy improvement theorem. Let $\pi'$ be the $\epsilon$-greedy policy. The conditions of the policy improvement theorem apply because for any $s \in \mathcal{S}$:
\[\begin{align} q_\pi(s, \pi'(s)) &= \sum_a \pi'(a|s) q_\pi(s, a) \\ &= \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \max_a q_\pi(s, a) \\ &\geq \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + (1 - \epsilon) \sum_a \frac{\pi(a|s) - \frac{\epsilon}{|\mathcal{A}(s)|}}{1 - \epsilon} q_\pi(s, a) \\ \end{align}\]The last step holds because the sum is a weighted average with nonnegative weights summing to 1, and as such it must be less than or equal to the largest number averaged. (To check this, substitute the values of $\pi(a|s)$ from the pseudocode above.) Continuing,
\[\begin{align} &= \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) - \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s, a) + \sum_a \pi(a|s) q_\pi(s, a) \\ &= v_\pi(s) \end{align}\]Thus, by the policy improvement theorem, $\pi' \geq \pi$ (i.e., $v_{\pi'}(s) \geq v_\pi(s)$ for all $s \in \mathcal{S}$).
Consider a new environment that is just like the original environment, except with the requirement that policies be $\epsilon$-soft "moved inside" the environment. The new environment has the same state and action sets as the original and behaves as follows. If in state $s$ and taking action $a$, then with probability $1 - \epsilon$ the new environment behaves exactly like the old environment. With probability $\epsilon$ it re-picks the action at random, with equal probabilities, and then behaves like the old environment with the new, random action. The best one can do in this new environment with general policies is the same as the best one could do in the original environment with $\epsilon$-soft policies. Let $\overset{\sim}{v_\star}$ and $\overset{\sim}{q_\star}$ denote the optimal value functions for the new environment. Then a policy $\pi$ is optimal among the $\epsilon$-soft policies if and only if $v_\pi = \overset{\sim}{v_\star}$. From the definition of $\overset{\sim}{v_\star}$, we know that it is the unique solution to:
\[\begin{align} \overset{\sim}{v_\star}(s) &= (1 - \epsilon)\max_a \overset{\sim}{q_\star}(s,a) + \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a \overset{\sim}{q_\star}(s,a) \\ &= (1 - \epsilon) \max_a \sum_{s', r} p(s', r | s, a)\left[r + \gamma \overset{\sim}{v_\star}(s') \right] \\ &\quad + \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s', r} p(s', r | s, a)\left[r + \gamma \overset{\sim}{v_\star}(s') \right] \\ \end{align}\]When equality holds and the $\epsilon$-soft policy $\pi$ is no longer improved, we also know that:
\[\begin{align} v_\pi(s) &= (1 - \epsilon)\max_a q_\pi(s,a) + \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) \\ &= (1 - \epsilon) \max_a \sum_{s', r} p(s', r | s, a)\left[r + \gamma v_\pi(s') \right] \\ &\quad + \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a \sum_{s', r} p(s', r | s, a)\left[r + \gamma v_\pi(s') \right] \\ \end{align}\]This equation is the same as the previous one, except for the substitution of $v_\pi$ for $\overset{\sim}{v_\star}$. Because $\overset{\sim}{v_\star}$ is the unique solution, it must be that $v_\pi = \overset{\sim}{v_\star}$.
Off-policy Prediction via Importance Sampling
All learning control methods face a dilemma: they seek to learn action values conditional on subsequent optimal behavior, but they need to behave non-optimally in order to explore all actions. The on-policy approach is a compromise: it learns action values not for the optimal policy but for a near-optimal policy that still explores. A more straightforward approach is to use two policies, one that is learned about and becomes the optimal policy, and another that is more exploratory and is used to generate behavior. The policy being learned about is called the target policy, and the policy used to generate behavior is called the behavior policy. In this case learning is from data "off" the target policy, and these methods are called off-policy learning.
On-policy methods are generally simpler and are considered first. Off-policy methods require additional concepts and notation because the data is generated by a different policy. Even though off-policy methods often have greater variance and are slower to converge, they are more powerful and general. On-policy methods can be considered a special case of off-policy methods in which the behavior policy is the same as the target policy.
Consider a prediction problem with both target and behavior policies fixed. That is, we wish to estimate $v_\pi$ or $q_\pi$, but all we have are episodes following another policy $b$, where $b \neq \pi$. In this case, $\pi$ is the target policy and $b$ is the behavior policy, and both policies are considered fixed and given.
In order to use episodes from $b$ to estimate values for $\pi$, we require that every action taken under $\pi$ is also taken, at least occasionally, under $b$. That is, $\pi(a|s) > 0$ implies $b(a|s) > 0$. This is called the assumption of coverage. It follows from coverage that $b$ must be stochastic in states where it is not identical to $\pi$. The target policy $\pi$, on the other hand, may be deterministic, and, in fact, this is a case of particular interest in control applications. In control, the target policy is typically the deterministic greedy policy with respect to the current estimate of the action-value function. This policy becomes a deterministic optimal policy while the behavior policy remains stochastic and more exploratory.
Almost all off-policy methods utilize importance sampling, a general technique for estimating expected values under one distribution given samples from another. We apply importance sampling to off-policy learning by weighting returns according to the relative probability of their trajectories occurring under the target and behavior policies, called the importance-sampling ratio. Given a starting state $S_t$, the probability of the subsequent state–action trajectory, $A_t, S_{t+1}, A_{t+1}, \cdots, S_T$, occurring under any policy $\pi$ is:
\[\begin{align} Pr&\{A_t, S_{t+1}, A_{t+1}, \cdots, S_T | S_t, A_{t:T-1} \sim \pi\} \\ &= \pi(A_t | S_t)p(S_{t+1} | S_t, A_t) \pi(A_{t+1} | S_{t+1}) \cdots p(S_T | S_{T-1}, A_{T-1}) \\ &= \prod^{T-1}_{k=t}\pi(A_k | S_k)p(S_{k+1} | S_k, A_k) \\ \end{align}\]where $p$ is the state-transition probability function. Thus, the relative probability of the trajectory under the target and behavior policies (the importance-sampling ratio) is:
\[\rho_{t:T-1}\:\dot{=}\:\frac{ \prod_{k=t}^{T-1}\pi(A_k | S_k)p(S_{k+1} | S_k, A_k) }{ \prod_{k=t}^{T-1}b(A_k | S_k)p(S_{k+1} | S_k, A_k) } = \prod_{k=t}^{T-1}\frac{\pi(A_k | S_k)}{b(A_k | S_k)}\]The importance-sampling ratio depends only on the two policies and the sequence, not on the MDP.
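Because the transition probabilities cancel, the ratio can be computed from the action probabilities alone. A toy Python illustration:

```python
def importance_sampling_ratio(pi_probs, b_probs):
    """Product of per-step probability ratios pi(A_k|S_k) / b(A_k|S_k)."""
    rho = 1.0
    for p_pi, p_b in zip(pi_probs, b_probs):
        rho *= p_pi / p_b
    return rho

# Example 3-step trajectory: a deterministic target policy (probability 1 for
# each action actually taken) and a stochastic behavior policy.
print(importance_sampling_ratio([1.0, 1.0, 1.0], [0.5, 0.25, 0.5]))  # 16.0
```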
We wish to estimate the expected returns (values) under the target policy, but all we have are returns $G_t$ from the behavior policy. These returns have the wrong expectation, $\mathbb{E}[G_t | S_t = s] = v_b(s)$, and so cannot simply be averaged to obtain $v_\pi$. This is where importance sampling comes in. The ratio $\rho_{t:T-1}$ transforms the returns to have the right expected value:
\[\mathbb{E}[\rho_{t:T-1}G_t | S_t = s] = v_\pi(s)\]The following is a Monte Carlo algorithm that averages returns from a batch of observed episodes following policy $b$ to estimate $v_\pi(s)$. Here, time steps are numbered in a way that increases across episode boundaries: if the first episode of the batch ends at time 100, then the next episode begins at time 101. This enables us to use time-step numbers to refer to particular steps of particular episodes. In particular, we can define the set of all time steps at which state $s$ is visited, denoted $\mathcal{T}(s)$. This is for an every-visit method; for a first-visit method, $\mathcal{T}(s)$ would include only time steps that were first visits to $s$ within their episodes. Also, let $T(t)$ denote the first time of termination following time $t$, and $G_t$ the return after $t$ up through $T(t)$. Then $\{G_t\}_{t \in \mathcal{T}(s)}$ are the returns that pertain to state $s$, and $\{\rho_{t:T(t)-1}\}_{t \in \mathcal{T}(s)}$ are the corresponding importance-sampling ratios. To estimate $v_\pi(s)$, we scale the returns by the ratios and average the results:
\[V(s)\:\dot{=}\:\frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}G_t}{|\mathcal{T}(s)|}\]Importance sampling done as a simple average in this way is called ordinary importance sampling.
An important alternative is weighted importance sampling, which uses a weighted average, defined as:
\[V(s)\:\dot{=}\:\frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}G_t}{\sum_{t \in \mathcal{T}(s)}\rho_{t:T(t)-1}}\]or zero if the denominator is zero. The difference between the two methods is that ordinary importance sampling is unbiased whereas weighted importance sampling is biased (though the bias converges asymptotically to zero). The variance of ordinary importance sampling is in general unbounded because the variance of the ratios can be unbounded, whereas for the weighted estimator the largest weight on any single return is one. In fact, assuming bounded returns, the variance of the weighted importance-sampling estimator converges to zero even if the variance of the ratios themselves is infinite.
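A small numerical sketch of the two estimators for a single state, using made-up returns and ratios, shows the qualitative difference: the ordinary estimate can land outside the range of observed returns, while the weighted estimate cannot.

```python
def ordinary_is(returns, ratios):
    return sum(rho * g for g, rho in zip(returns, ratios)) / len(returns)

def weighted_is(returns, ratios):
    denom = sum(ratios)
    return sum(rho * g for g, rho in zip(returns, ratios)) / denom if denom else 0.0

G = [1.0, -1.0, 1.0]    # returns observed after visits to s (under b)
rho = [4.0, 0.0, 0.5]   # corresponding importance-sampling ratios
print(ordinary_is(G, rho))   # (4.0 + 0.0 + 0.5) / 3 = 1.5, outside [-1, 1]
print(weighted_is(G, rho))   # 4.5 / 4.5 = 1.0, a weighted average of the returns
```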
The every-visit methods for ordinary and weighted importance sampling are both biased, though, again, the bias falls asymptotically to zero as the number of samples increases. In practice, every-visit methods are often preferred because they remove the need to keep track of which states have been visited and because they are much easier to extend to approximations.
Incremental Implementation
Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode basis. For off-policy MC methods, we need to separately consider those that use ordinary importance sampling and those that use weighted importance sampling.
In ordinary importance sampling, the returns are scaled by the importance-sampling ratio $\rho_{t:T(t)-1}$ and then simply averaged. For these methods we can use the incremental methods of Chapter 2, but with the scaled returns in place of the rewards. For weighted importance sampling, we have to form a weighted average of the returns, and a slightly different incremental algorithm is required.
Suppose we have a sequence of returns $G_1, G_2, \cdots, G_{n-1}$, all starting in the same state and each with a corresponding random weight $W_i$ (e.g., $W_i = \rho_{t_i:T(t_i)-1}$). We wish to form the estimate
\[V_n\:\dot{=}\:\frac{\sum_{k=1}^{n-1}W_kG_k}{\sum_{k=1}^{n-1}W_k}, \quad n \geq 2\]and keep it up to date as we obtain a single additional return $G_n$. In addition to keeping track of $V_n$, we must maintain for each state the cumulative sum $C_n$ of the weights given to the first $n$ returns. The update rule for $V_n$ is:
\[V_{n+1}\:\dot{=}\:V_n + \frac{W_n}{C_n}\left[G_n - V_n\right], \quad n \geq 1\]and,
\[C_{n+1}\:\dot{=}\:C_n + W_{n+1}\]where $C_0 = 0$.
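These two update rules amount to a running weighted average. A minimal Python sketch:

```python
class WeightedAverage:
    """Incremental weighted average: V_{n+1} = V_n + (W_n / C_n)(G_n - V_n)."""
    def __init__(self):
        self.V = 0.0   # current estimate
        self.C = 0.0   # cumulative sum of weights

    def update(self, G, W):
        if W == 0.0:               # zero-weight returns leave the estimate unchanged
            return self.V
        self.C += W                # C_n = C_{n-1} + W_n
        self.V += (W / self.C) * (G - self.V)
        return self.V

wa = WeightedAverage()
wa.update(G=10.0, W=1.0)   # V = 10.0
wa.update(G=4.0, W=3.0)    # V = 10 + (3/4)(4 - 10) = 5.5  (= (1*10 + 3*4) / 4)
```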
Pseudocode - Off-policy MC prediction
The algorithm is nominally for the off-policy case, using weighted importance sampling, but it applies as well to the on-policy case by choosing the target and behavior policies to be the same (in which case $W$ is always 1). The approximation $Q$ converges to $q_\pi$ while actions are selected according to a potentially different policy, $b$.
- Input: an arbitrary target policy $\pi$
- Initialize, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$:
- $Q(s,a) \in \mathbb{R}$ (arbitrarily)
- $C(s, a) \leftarrow 0$
- Loop forever (for each episode):
- $b \leftarrow$ any policy with coverage of $\pi$
- Generate an episode following $b$: $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- $W \leftarrow 1$
- Loop for each step of episode, $t = T-1, T-2,\cdots,0$, while $W \neq 0$:
- $G \leftarrow \gamma G + R_{t+1}$
- $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$
- $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} \left[G - Q(S_t, A_t)\right]$
- $W \leftarrow W \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$
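A minimal Python sketch of the off-policy prediction pseudocode above, estimating action values with weighted importance sampling. Here `pi(a, s)` and `b(a, s)` are assumed to be callables returning the probability of action `a` in state `s`, and `generate_episode(b)` is an illustrative helper.

```python
from collections import defaultdict

def off_policy_mc_prediction(pi, b, generate_episode, num_episodes, gamma=1.0):
    """Every-visit off-policy MC prediction of Q with weighted importance sampling."""
    Q = defaultdict(float)
    C = defaultdict(float)   # cumulative weights per state-action pair

    for _ in range(num_episodes):
        episode = generate_episode(b)        # [(S, A, R), ...] generated by b
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            W *= pi(a, s) / b(a, s)          # update the importance-sampling weight
            if W == 0.0:                     # earlier steps would receive zero weight
                break
    return Q
```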
Pseudocode - Off-policy Monte Carlo Control
Off-policy MC control methods follow the behavior policy while learning about and improving the target policy. These techniques require that the behavior policy have a nonzero probability of selecting all actions that might be selected by the target policy (coverage). To explore all possibilities, we require that the behavior policy be soft.
- Initialize, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$:
- $Q(s,a) \in \mathbb{R}$ (arbitrarily)
- $C(s, a) \leftarrow 0$
- $\pi(s) \leftarrow \arg\max_a Q(s,a)$ (with ties broken consistently)
- Loop forever (for each episode):
- $b \leftarrow$ any soft policy
- Generate an episode following $b$: $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- $W \leftarrow 1$
- Loop for each step of episode, $t = T-1, T-2,\cdots,0$, while $W \neq 0$:
- $G \leftarrow \gamma G + R_{t+1}$
- $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$
- $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} \left[G - Q(S_t, A_t)\right]$
- $\pi(S_t) \leftarrow \arg\max_a Q(S_t,a)$ (with ties broken consistently)
- If $A_t \neq \pi(S_t)$ then exit the inner loop (proceed to next episode)
- $W \leftarrow W \frac{1}{b(A_t|S_t)}$
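A minimal Python sketch of off-policy MC control with a uniformly random (hence soft) behavior policy; the environment interface is the same hypothetical one used in the earlier sketches.

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, num_episodes, gamma=1.0):
    """Off-policy MC control: greedy target policy, uniform random behavior policy."""
    Q = defaultdict(float)
    C = defaultdict(float)
    pi = {s: env.actions(s)[0] for s in env.states}   # arbitrary initial greedy policy

    for _ in range(num_episodes):
        behavior = lambda s: random.choice(env.actions(s))   # soft: b(a|s) = 1/|A(s)|
        episode = env.generate_episode(behavior)             # [(S, A, R), ...]
        G, W = 0.0, 1.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            pi[s] = max(env.actions(s), key=lambda act: Q[(s, act)])
            if a != pi[s]:                   # pi(a|s) = 0, so earlier weights would be 0
                break
            W *= len(env.actions(s))         # W * pi(a|s)/b(a|s) = W * 1 / (1/|A(s)|)
    return pi, Q
```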
Problems
-
Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear? Why does it drop off for the whole last row on the left? Why are the frontmost values higher in the upper diagrams than in the lower?
- If the player's sum is 20 or 21, the player is very likely to win: the chance that the dealer reaches a higher value without going bust is very low.
- The last row on the left corresponds to the dealer showing an ace, in which case there is a good chance the dealer will reach a higher score without going bust, so the state values drop.
- The upper diagrams show the state values with a usable ace. A usable ace reduces the risk of going bust when hitting, so those states have higher value.
-
Suppose every-visit MC was used instead of first-visit MC on the blackjack task. Would you expect the results to be very different? Why or why not?
The results would be essentially the same: in blackjack no state is revisited within an episode (we cannot return to an old state), so first-visit and every-visit MC perform identical updates.
-
What is the backup diagram for Monte Carlo estimation of $q_\pi$?
- It is like the MC diagram for $v_\pi$ but rooted at a state-action pair: a single, unbranched line of alternating states and actions along the sampled trajectory, all the way down to the terminal state.
-
The pseudocode for Monte Carlo ES is inefficient because, for each state–action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more efficient to use techniques similar to those explained in Section 2.4 to maintain just the mean and a count (for each state–action pair) and update them incrementally. Describe how the pseudocode would be altered to achieve this.
Monte Carlo ES (Exploring Starts), for estimating $\pi \approx \pi_\star$
- Initialize:
- $\pi(s) \in \mathcal{A}(s)$(arbitrarily), for all $s \in \mathcal{S}$
- $Q(s,a) \in \mathbb{R}$(arbitrarily), for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- $N(s,a) \leftarrow 0$, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$
- Loop forever (for each episode):
- Choose $S_0 \in \mathcal{S}, A_0 \in \mathcal{A}(S_0)$ randomly such that all pairs have probability > 0
- Generate an episode from $S_0, A_0$, following $\pi$: $S_0, A_0, R_1, \cdots, S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of episode, $t = T-1, T-2,\cdots,0$:
- $G \leftarrow \gamma G + R_{t+1}$
- Unless the pair $S_t, A_t$ appears in $S_0, A_0, S_1, A_1, \cdots, S_{t-1}, A_{t-1}$:
- $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$
- $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \large\frac{G - Q(S_t, A_t)}{N(S_t, A_t)}$
- $\pi(S_t) \leftarrow \arg\max_a Q(S_t, a)$
-
Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability $p$ and transitions to the terminal state with probability $1 - p$. Let the reward be +1 on all transitions, and let $\gamma = 1$. Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state?
For the first-visit estimator we consider only the first visit, so $V(s) = 10/1 = 10$.
For the every-visit estimator we average over all visits, so $V(s) = (10 + 9 + \cdots + 1)/10 = 5.5$.
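A quick check of the arithmetic (reward +1 per step, $\gamma = 1$, so the return following the visit at step $t$ of the 10-step episode is $10 - t$):

```python
returns = [10 - t for t in range(10)]       # returns following each visit: 10, 9, ..., 1
first_visit = returns[0]                    # only the first visit counts: 10
every_visit = sum(returns) / len(returns)   # average over all visits: 5.5
print(first_visit, every_visit)
```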
-
What is the equation analogous to (5.6) for action values $Q(s, a)$ instead of state values $V(s)$, again given returns generated using $b$?
\[Q(s,a)\:\dot{=}\:\frac{\sum_{t \in \mathcal{T}(s,a)} \rho_{t+1:T(t)-1}G_t}{\sum_{t \in \mathcal{T}(s,a)}\rho_{t+1:T(t)-1}}\]where $\mathcal{T}(s,a)$ is the set of time steps at which the pair $s,a$ is visited. The ratio starts at $t+1$ because the first action $A_t$ is given, so it needs no importance-sampling correction. -
In learning curves such as those shown in Figure 5.3 error generally decreases with training, as indeed happened for the ordinary importance-sampling method. But for the weighted importance-sampling method error first increased and then decreased. Why do you think this happened?
In the initial episodes it is unlikely that the behavior policy's trajectory matches the target policy, so the weighted estimate stays at its initial value, which happens to be close to $v_\pi(s)$ and gives low error. Once a few matching trajectories do appear, the estimate equals those few high-variance, biased returns, and the error rises. As the number of episodes increases, the bias and variance shrink and the error gradually decreases again.
-
The results with Example 5.5 and shown in Figure 5.4 used a first-visit MC method. Suppose that instead an every-visit MC method was used on the same problem. Would the variance of the estimator still be infinite? Why or why not?
Yes. Every visit to the state still contributes a return of 1 scaled by an importance-sampling ratio, so the same unbounded terms appear in the expectation of the squared estimate and the variance remains infinite.
-
Modify the algorithm for first-visit MC policy evaluation (Section 5.1) to use the incremental implementation for sample averages described in Section 2.4.
First-visit MC prediction, for estimating $V \approx v_\pi$
- Input: a policy $\pi$ to be evaluated
- Initialize:
- $V(s) \in \mathbb{R}$, arbitrarily, $\forall s \in \mathcal{S}$
- $N(s) \leftarrow 0$, $\forall s \in \mathcal{S}$
- Loop forever (for each episode):
- Generate an episode following $\pi: S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T$
- $G \leftarrow 0$
- Loop for each step of the episode, $t = T-1, T-2, ..., 0$:
- $G \leftarrow \gamma G + R_{t+1}$
- Unless $S_t$ appears in $S_0, S_1, ..., S_{t-1}$:
- $N(S_t) \leftarrow N(S_t) + 1$
- $V(S_t) \leftarrow V(S_t) + \large\frac{G - V(S_t)}{N(S_t)}$
-
Derive the weighted-average update rule (5.8) from (5.7). Follow the pattern of the derivation of the unweighted rule (2.3).
We know that $C_{n+1} = C_n + W_{n+1}$, where $C_0 = 0$, therefore we can write $C_n = \sum_{i=1}^{n} W_i$.
\[\begin{align} V_{n+1} &= \frac{\sum^{n}_{i=1} W_i G_i}{\sum^{n}_{i=1} W_i} \\ &= \frac{W_nG_n +\sum^{n-1}_{i=1} W_i G_i}{\sum^{n}_{i=1} W_i} \\ &= \frac{W_nG_n +\sum^{n-1}_{i=1} W_i G_i}{C_n} \\ &= \frac{W_nG_n}{C_n} + \frac{\sum^{n-1}_{i=1} W_i G_i}{C_n} \\ &= \frac{W_nG_n}{C_n} + \frac{\sum^{n-1}_{i=1} W_i G_i}{C_n}\times\frac{\sum^{n-1}_{i=1} W_i}{\sum^{n-1}_{i=1} W_i} \\ &= \frac{W_nG_n}{C_n} + \frac{V_n\times C_{n-1}}{C_n} \\ &= \frac{W_nG_n}{C_n} + \frac{V_n \times \left(C_n - W_n\right)}{C_n}, \quad [\because C_n = C_{n-1} + W_n] \\ &= \frac{W_nG_n}{C_n} + \frac{V_nC_n}{C_n} - \frac{V_nW_n}{C_n} \\ &= V_n + \frac{W_nG_n - V_nW_n}{C_n} \\ &= V_n + \frac{W_n}{C_n}\left[G_n - V_n\right] \\ \end{align}\] -
In the boxed algorithm for off-policy MC control, you may have been expecting the $W$ update to have involved the importance-sampling ratio $\frac{\pi(A_t|S_t)}{b(A_t|S_t)}$, but instead it involves $\frac{1}{b(A_t|S_t)}$. Why is this nevertheless correct?
The target policy $\pi$ is the deterministic greedy policy, so for the updates that are actually performed we have $A_t = \pi(S_t)$ and hence $\pi(A_t|S_t) = 1$; the numerator is therefore always 1. When $A_t \neq \pi(S_t)$, $\pi(A_t|S_t) = 0$ and the algorithm exits the inner loop instead of updating.
-
Racetrack (programming): Consider driving a race car around a turn like those shown in Figure 5.5. You want to go as fast as possible, but not so fast as to run off the track. In our simplified racetrack, the car is at one of a discrete set of grid positions, the cells in the diagram. The velocity is also discrete, a number of grid cells moved horizontally and vertically per time step. The actions are increments to the velocity components. Each may be changed by +1, -1, or 0 in each step, for a total of nine (3x3) actions. Both velocity components are restricted to be nonnegative and less than 5, and they cannot both be zero except at the starting line. Each episode begins in one of the randomly selected start states with both velocity components zero and ends when the car crosses the finish line. The rewards are -1 for each step until the car crosses the finish line. If the car hits the track boundary, it is moved back to a random position on the starting line, both velocity components are reduced to zero, and the episode continues. Before updating the car's location at each time step, check to see if the projected path of the car intersects the track boundary. If it intersects the finish line, the episode ends; if it intersects anywhere else, the car is considered to have hit the track boundary and is sent back to the starting line. To make the task more challenging, with probability 0.1 at each time step the velocity increments are both zero, independently of the intended increments. Apply a Monte Carlo control method to this task to compute the optimal policy from each starting state. Exhibit several trajectories following the optimal policy (but turn the noise off for these trajectories).