If we stare at the second equation of system {eq}`eq:impliedq`, we notice that since the wage process is identically and independently distributed over time,
the right side of the equation defining $Q\left(w,\text{reject}\right)$ is independent of the current state $w$.
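To see this concretely, the second equation has the form sketched below (the lecture's exact notation in {eq}`eq:impliedq` may differ slightly), where $c$ is unemployment compensation, $\beta$ is the discount factor, and $\pi$ is the (i.i.d.) distribution of wage offers:

$$
Q\left(w,\text{reject}\right) = c + \beta \sum_{w'} \max_{a' \in \{\text{accept},\,\text{reject}\}} Q\left(w',a'\right) \pi\left(w'\right) .
$$

The current offer $w$ appears only on the left side, so the right side takes the same value for every $w$.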
Here is pseudo-code for our McCall worker to do Q-learning:
4. Update the state associated with the chosen action, compute $\widetilde{TD}$ according to {eq}`eq:old4`, and update $\widetilde{Q}$ according to {eq}`eq:old3`.
5. Either draw a new state $w'$ if required or else keep the existing wage, and update the Q-table again according to {eq}`eq:old3`.
6. Stop when the old and new Q-tables are close enough, i.e., $\lVert\widetilde{Q}^{new}-\widetilde{Q}^{old}\rVert_{\infty}\leq\delta$ for a given $\delta$, or if the worker keeps accepting for $T$ periods for a prescribed $T$.
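To make steps 4 and 5 concrete, here is a minimal sketch of a single temporal-difference update for a two-column Q-table (column 0 = accept, column 1 = reject), written for the case in which the worker may quit, as in {eq}`eq:old4`; the function and variable names are illustrative rather than those of the lecture's class below.

```{code-cell} ipython3
import numpy as np

def td_step(Q, w_grid, i, a, c, β, α, i_next):
    """One temporal-difference update of an (n, 2) Q-table, in place.

    Column 0 of Q is 'accept' and column 1 is 'reject'; i indexes the
    current wage offer, and i_next indexes a fresh draw used after a rejection.
    """
    if a == 0:
        # accept: receive wage w_i; since quitting is allowed, the continuation
        # value is the max over next-period actions at the same wage
        target = w_grid[i] + β * np.max(Q[i, :])
    else:
        # reject: receive unemployment compensation c and face a new draw w'
        target = c + β * np.max(Q[i_next, :])
    td = target - Q[i, a]     # the temporal difference
    Q[i, a] += α * td         # move the Q-entry toward the target
    return td

# example: one update on a small, zero-initialized Q-table
rng = np.random.default_rng(0)
w_grid = np.linspace(10, 60, 11)
Q = np.zeros((len(w_grid), 2))
td_step(Q, w_grid, i=3, a=1, c=25, β=0.99, α=0.1,
        i_next=int(rng.integers(len(w_grid))))
```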
By using the $\epsilon$-greedy method and also by increasing the number of episodes, the Q-learning algorithm balances gains from exploration and from exploitation.
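For illustration, here is a sketch of an $\epsilon$-greedy action choice under the same two-column Q-table convention as in the sketch above; the names are again illustrative, not the lecture's.

```{code-cell} ipython3
import numpy as np

def choose_action(Q, i, ϵ, rng):
    """ϵ-greedy choice at wage index i: explore with probability ϵ,
    otherwise act greedily on the current Q-table."""
    if rng.uniform() < ϵ:
        return int(rng.integers(2))    # explore: random accept/reject
    return int(np.argmax(Q[i, :]))     # exploit: best action learned so far
```

Early on, a relatively large $\epsilon$ (or one that decays across episodes) keeps the algorithm visiting state-action pairs that a purely greedy rule would abandon.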
**Remark:** Notice that the $\widetilde{TD}$ associated with an optimal Q-table defined in system {eq}`eq:impliedq` above automatically satisfies $\widetilde{TD}=0$ for all state-action pairs. Whether a limit of our Q-learning algorithm converges to an optimal Q-table depends on whether the algorithm visits all state-action pairs often enough.
We implement this pseudo-code in a Python class.
Now, let us compute the case with a larger state space: $n=30$ instead of $n=10$.
```{code-cell} ipython3
n, a, b = 30, 200, 100  # parameters for the larger wage-offer distribution
```
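For reference, here is a sketch of how the wage grid and offer distribution for this larger case might be constructed, following the Beta-Binomial setup used in other McCall lectures; the wage bounds below are illustrative assumptions, not values taken from this lecture.

```{code-cell} ipython3
import numpy as np
from quantecon.distributions import BetaBinomial

n, a, b = 30, 200, 100                     # shape of the larger offer distribution
w_min, w_max = 10, 60                      # assumed wage bounds (illustrative)
w_grid = np.linspace(w_min, w_max, n + 1)  # n + 1 wage points
q = BetaBinomial(n, a, b).pdf()            # offer probabilities over the grid
```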
## Employed Worker Can't Quit
The preceding version of temporal difference Q-learning described in equation system {eq}`eq:old4` lets an employed worker quit, i.e., reject her wage as an incumbent and instead receive unemployment compensation this period
and draw a new offer next period.
This is an option that the McCall worker described in {doc}`this quantecon lecture <mccall_model>` would not take.
See {cite}`Ljungqvist2012`, chapter 6 on search, for a proof.
But in the context of Q-learning, giving the worker the option to quit and get unemployment compensation while
unemployed turns out to accelerate the learning process by promoting experimentation vis-à-vis premature exploitation.
When an employed worker is not allowed to quit, the temporal difference formulas become

$$
\begin{aligned}
\widetilde{TD}\left(w,\text{accept}\right) & = \left[ w + \beta \widetilde{Q}^{old}\left(w,\text{accept}\right) \right] - \widetilde{Q}^{old}\left(w,\text{accept}\right) \\
\widetilde{TD}\left(w',\text{reject}\right) & = \left[ c + \beta \max_{a'} \widetilde{Q}^{old}\left(w',a'\right) \right] - \widetilde{Q}^{old}\left(w',\text{reject}\right), \quad w' \sim F
\end{aligned}
$$ (eq:temp-diff)
It turns out that formulas {eq}`eq:temp-diff`, combined with our Q-learning recursion {eq}`eq:old3`, can lead our agent eventually to learn the optimal value function just as well as in the case where an option to redraw can be exercised.
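As a sketch under the same two-column Q-table convention as earlier (again with illustrative names), forbidding quits changes only the "accept" branch of the temporal-difference computation:

```{code-cell} ipython3
import numpy as np

def td_step_no_quit(Q, w_grid, i, a, c, β, α, i_next):
    """TD update when an employed worker cannot quit: after accepting,
    the continuation value is Q[i, accept] itself rather than a max over actions."""
    if a == 0:
        target = w_grid[i] + β * Q[i, 0]       # accept: keep wage w_i forever
    else:
        target = c + β * np.max(Q[i_next, :])  # reject: receive c, face a new draw
    td = target - Q[i, a]
    Q[i, a] += α * td
    return td
```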
But learning is slower because an agent who ends up accepting a wage offer prematurely loses the option to explore new states in the same episode and to adjust the value associated with that state.
This can lead to inferior outcomes when the number of epochs/episodes is low.