
Commit 46d6845

Merge pull request #307 from QuantEcon/mccall_q_edits
Fix small typos in McCall Q learning
2 parents ecd4e26 + 7b0e7b3 commit 46d6845

1 file changed (+12 additions, -15 deletions)

lectures/mccall_q.md

Lines changed: 12 additions & 15 deletions
@@ -23,7 +23,7 @@ The Q-learning algorithm combines ideas from
 
 * dynamic programming
 
-* a recursive version of least squares known as **temporal difference learning**
+* a recursive version of least squares known as [temporal difference learning](https://en.wikipedia.org/wiki/Temporal_difference_learning).
 
 This lecture applies a Q-learning algorithm to the situation faced by a McCall worker.
 
@@ -34,7 +34,7 @@ Relative to the dynamic programming formulation of the McCall worker model that
 
 The Q-learning algorithm invokes a statistical learning model to learn about these things.
 
-Statistical learning often comes down to some version of least squares, and it will here too.
+Statistical learning often comes down to some version of least squares, and it will be here too.
 
 Any time we say _statistical learning_, we have to say what object is being learned.
 
@@ -101,9 +101,6 @@ The worker's income $y_t$ equals his wage $w$ if he is employed, and unemploymen
 An optimal value $V\left(w\right) $ for a McCall worker who has just received a wage offer $w$ and is deciding whether
 to accept or reject it satisfies the Bellman equation
 
-
-
-
 $$
 V\left(w\right)=\max_{\text{accept, reject}}\;\left\{ \frac{w}{1-\beta},c+\beta\int V\left(w'\right)dF\left(w'\right)\right\}
 $$ (eq_mccallbellman)
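
For reference, the Bellman equation in this hunk can be checked numerically by simple value function iteration. A minimal sketch, with an illustrative uniform wage lottery and parameter values that are assumptions here rather than the lecture's defaults:

```python
import numpy as np

def solve_mccall_V(w_grid, q, c=25.0, β=0.99, tol=1e-8, max_iter=10_000):
    """Iterate V(w) = max{ w/(1-β), c + β Σ V(w') q(w') } to a fixed point."""
    V = w_grid / (1 - β)                      # initial guess: accept every offer
    for _ in range(max_iter):
        continuation = c + β * (V @ q)        # value of rejecting and drawing again
        V_new = np.maximum(w_grid / (1 - β), continuation)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new

# illustrative uniform wage lottery, not the lecture's Beta-Binomial defaults
w_grid = np.linspace(10, 60, 51)
q = np.full(len(w_grid), 1 / len(w_grid))
V = solve_mccall_V(w_grid, q)
```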
@@ -281,15 +278,15 @@ These equations are aligned with the Bellman equation for the worker's optimal
 Evidently, the optimal value function $V(w)$ described in that lecture is related to our Q-function by
 
 $$
-V(w) = \max_{\textrm{accept},\textrm{reject}} \left\{ Q(w, \text{accept} \right), Q\left(w,\text{reject} \right\}
+V(w) = \max_{\textrm{accept},\textrm{reject}} \left\{ Q(w, \text{accept} \right), Q\left(w,\text{reject} \right)\}
 $$
 
 If we stare at the second equation of system {eq}`eq:impliedq`, we notice that since the wage process is identically and independently distributed over time,
 $Q\left(w,\text{reject}\right)$, the right side of the equation is independent of the current state $w$.
 
 So we can denote it as a scalar
 
-$$ Q_r=Q\left(w,\text{reject}\right),\forall w\in\mathcal{W}.
+$$ Q_r := Q\left(w,\text{reject}\right) \quad \forall \, w\in\mathcal{W}.
 $$
 
 This fact provides us with an
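
For reference, the relation between $V$ and the Q-function, and the claim that the reject column collapses to a single scalar $Q_r$, look like this in code; the $(n, 2)$ Q-table layout with columns (accept, reject) is an assumption of this sketch, not taken from the lecture:

```python
import numpy as np

# Assumed layout for this sketch: row = wage index, column 0 = accept, column 1 = reject
Q = np.array([[100., 120.],
              [150., 120.],
              [200., 120.]])

V = Q.max(axis=1)        # V(w) = max{ Q(w, accept), Q(w, reject) }
Q_r = Q[0, 1]            # with i.i.d. offers the reject value is one scalar for all w
assert np.allclose(Q[:, 1], Q_r)
```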
@@ -386,7 +383,7 @@ To set up such an algorithm, we first define some errors or "differences"
 $$
 \begin{aligned}
 w & + \beta \max_{\textrm{accept, reject}} \left\{ \hat Q_t (w_t, \textrm{accept}), \hat Q_t(w_t, \textrm{reject}) \right\} - \hat Q_t(w_t, \textrm{accept}) = \textrm{diff}_{\textrm{accept},t} \cr
-c & +\beta\int\max_{\text{accept, reject}}\left\{ \hat Q_t(w_{t+1}, \textrm{accept}),\hat Q_t\left(w_{t+1},\text{reject}\right)\right\} - \hat Q_t\left(w_t,\text{reject}\right) = \textrm{diff}_{\textrm{reject},t} \cr
+c & +\beta \max_{\text{accept, reject}}\left\{ \hat Q_t(w_{t+1}, \textrm{accept}),\hat Q_t\left(w_{t+1},\text{reject}\right)\right\} - \hat Q_t\left(w_t,\text{reject}\right) = \textrm{diff}_{\textrm{reject},t} \cr
 \end{aligned}
 $$ (eq:old105)
 
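
A sketch of the two differences in {eq}`eq:old105` for one sampled transition, reusing the assumed (accept, reject) column layout from above; the index names `i_w` and `i_w_next` are placeholders, not the lecture's variable names:

```python
def td_differences(Q, i_w, i_w_next, w, c=25.0, β=0.99):
    """The two 'differences' above for one sampled transition (sketch).

    Q is the assumed (n_wages, 2) NumPy table with columns (accept, reject)."""
    diff_accept = w + β * Q[i_w].max() - Q[i_w, 0]        # accepting keeps wage w
    diff_reject = c + β * Q[i_w_next].max() - Q[i_w, 1]   # rejecting draws w_{t+1}
    return diff_accept, diff_reject
```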
@@ -487,7 +484,7 @@ pseudo-code for our McCall worker to do Q-learning:
 
 4. Update the state associated with the chosen action and compute $\widetilde{TD}$ according to {eq}`eq:old4` and update $\widetilde{Q}$ according to {eq}`eq:old3`.
 
-5. Either draw a new state $w'$ if required or else take existing wage if and update the Q-table again again according to {eq}`eq:old3`.
+5. Either draw a new state $w'$ if required or else take existing wage if and update the Q-table again according to {eq}`eq:old3`.
 
 6. Stop when the old and new Q-tables are close enough, i.e., $\lVert\tilde{Q}^{new}-\tilde{Q}^{old}\rVert_{\infty}\leq\delta$ for given $\delta$ or if the worker keeps accepting for $T$ periods for a prescribed $T$.
 
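
Steps 4-5 amount to one temporal-difference update per action taken. A hedged sketch, assuming the standard recursion $\widetilde{Q} \leftarrow \widetilde{Q} + \alpha \widetilde{TD}$ with a learning rate $\alpha$; the function, its parameters, and the column ordering are illustrative and not the lecture's class:

```python
import numpy as np

def q_step(Q, i_w, accept, w_grid, q, c=25.0, β=0.99, α=0.1, rng=None):
    """One temporal-difference update for the McCall worker (sketch only).

    accept=True keeps the current wage next period; accept=False draws a new offer."""
    rng = rng or np.random.default_rng()
    i_w_next = i_w if accept else rng.choice(len(w_grid), p=q)
    reward = w_grid[i_w] if accept else c
    a = 0 if accept else 1                      # assumed columns: 0 = accept, 1 = reject
    td = reward + β * Q[i_w_next].max() - Q[i_w, a]
    Q[i_w, a] += α * td                         # move Q a fraction α toward its TD target
    return Q, i_w_next
```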
@@ -511,7 +508,7 @@ For example, an agent who has accepted a wage offer based on her Q-table will be
 
 By using the $\epsilon$-greedy method and also by increasing the number of episodes, the Q-learning algorithm balances gains from exploration and from exploitation.
 
-**Remark:** Notice that $\widetilde{TD}$ associated with an optimal Q-table defined in equation (2) automatically above satisfies $\widetilde{TD}=0$ for all state action pairs. Whether a limit of our Q-learning algorithm converges to an optimal Q-table depends on whether the algorithm visits all state, action pairs often enough.
+**Remark:** Notice that $\widetilde{TD}$ associated with an optimal Q-table defined in equation (2) automatically above satisfies $\widetilde{TD}=0$ for all state action pairs. Whether a limit of our Q-learning algorithm converges to an optimal Q-table depends on whether the algorithm visits all state-action pairs often enough.
 
 We implement this pseudo code in a Python class.
 
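
The $\epsilon$-greedy rule referred to above can be sketched as follows; $\epsilon$ and the random generator are illustrative choices, not the lecture's defaults:

```python
import numpy as np

def epsilon_greedy(Q, i_w, ε=0.1, rng=None):
    """With probability ε pick a random action, otherwise the greedy one from the Q-table."""
    rng = rng or np.random.default_rng()
    if rng.uniform() < ε:
        return rng.integers(2)          # explore: accept (0) or reject (1) at random
    return int(np.argmax(Q[i_w]))       # exploit: action with the larger Q-value
```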
@@ -665,7 +662,7 @@ ax.legend()
 plt.show()
 ```
 
-Now, let us compute the case with a larger state space: $n=20$ instead of $n=10$.
+Now, let us compute the case with a larger state space: $n=30$ instead of $n=10$.
 
 ```{code-cell} ipython3
 n, a, b = 30, 200, 100 # default parameters
@@ -737,12 +734,12 @@ The above graphs indicates that
 ## Employed Worker Can't Quit
 
 
-The preceding version of temporal difference Q-learning described in equation system (4) lets an an employed worker quit, i.e., reject her wage as an incumbent and instead accept receive unemployment compensation this period
+The preceding version of temporal difference Q-learning described in equation system {eq}`eq:old4` lets an employed worker quit, i.e., reject her wage as an incumbent and instead receive unemployment compensation this period
 and draw a new offer next period.
 
 This is an option that the McCall worker described in {doc}`this quantecon lecture <mccall_model>` would not take.
 
-See {cite}`Ljungqvist2012`, chapter 7 on search, for a proof.
+See {cite}`Ljungqvist2012`, chapter 6 on search, for a proof.
 
 But in the context of Q-learning, giving the worker the option to quit and get unemployment compensation while
 unemployed turns out to accelerate the learning process by promoting experimentation vis a vis premature
@@ -759,11 +756,11 @@ $$
 \end{aligned}
 $$ (eq:temp-diff)
 
-It turns out that formulas {eq}`eq:temp-diff` combined with our Q-learning recursion (3) can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.
+It turns out that formulas {eq}`eq:temp-diff` combined with our Q-learning recursion {eq}`eq:old3` can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.
 
 But learning is slower because an agent who ends up accepting a wage offer prematurally loses the option to explore new states in the same episode and to adjust the value associated with that state.
 
-This can leads to inferior outcomes when the number of epochs/episods is low.
+This can lead to inferior outcomes when the number of epochs/episods is low.
 
 But if we increase the numb
 