Q: Eligibility trace reinitialization between episodes in SARSA(λ) implementation
I'm looking at this SARSA(λ) implementation (i.e., SARSA with eligibility traces), and there's a detail I still don't get.
(Image from http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node77.html)
So I understand that all Q(s,a) are updated rather than only the one the agent has chosen for the given time-step. I also understand the E matrix is not reset at the start of each episode.
Let's assume for a minute that panel 3 of Figure 7.12 was the end-state of episode 1.
At the start of episode 2, the agent moves north instead of east; let's assume this gives it a reward of -500. Wouldn't this also affect all the states that were visited in the previous episode?
If the idea is to credit the states visited in the current episode, then why isn't the matrix containing all the e(s,a) values reset at the beginning of each episode? With this implementation, it seems that states visited in the previous episode are 'punished' or 'rewarded' for actions the agent takes in the new episode.
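The leakage described above is easy to demonstrate numerically. Below is a minimal sketch (the 4-state, 2-action task, step values, and parameter settings are made up for illustration, not taken from the book) showing that if E is not cleared, a large negative reward early in episode 2 alters the Q-values of state-action pairs visited only in episode 1:

```python
import numpy as np

# Hypothetical toy task: 4 states, 2 actions, accumulating traces.
n_states, n_actions = 4, 2
alpha, gamma, lam = 0.5, 0.9, 0.9

Q = np.zeros((n_states, n_actions))
E = np.zeros((n_states, n_actions))  # NOT reset between episodes

def sarsa_lambda_step(s, a, r, s_next, a_next):
    """One SARSA(lambda) update: bump the trace for (s, a), then move
    every Q(s, a) toward the TD target in proportion to its trace."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    E[s, a] += 1.0            # accumulating trace for the visited pair
    Q[:] += alpha * delta * E  # ALL state-action pairs are updated
    E[:] *= gamma * lam        # decay all traces

# Episode 1: visit (0, 0) and (1, 0) with zero reward.
sarsa_lambda_step(0, 0, 0.0, 1, 0)
sarsa_lambda_step(1, 0, 0.0, 2, 0)

# Episode 2 begins WITHOUT clearing E; a -500 reward now leaks back
# into the values of pairs visited only in episode 1.
sarsa_lambda_step(2, 1, -500.0, 3, 0)
# Q[0, 0] and Q[1, 0] are now both negative: episode-1 states
# were "punished" for an action taken in episode 2.
```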
I agree with you 100%. Failing to reset the e-matrix at the start of every episode has exactly the problems that you describe. As far as I can tell, this is an error in the pseudocode. The reference that you cite is very popular, so the error has propagated to many other references. However, this well-cited paper very clearly states that the e-matrix should be reinitialized between episodes:
As further evidence, the methods of this paper:
and footnote #3 from this paper:
suggest that this is common practice, as both refer to reinitialization between episodes. I expect that there are many more such examples.
In practice, many uses of this algorithm don't involve multiple episodes, or have such long episodes relative to their decay rates that this doesn't end up being a problem. I expect that is why it hasn't been clarified more explicitly elsewhere on the internet yet.
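For concreteness, here is one way the fix looks in code. This is only a sketch: the environment dynamics, epsilon-greedy policy, dimensions, and step cap are all placeholders I made up, and the single relevant change is zeroing the trace matrix when each new episode begins:

```python
import numpy as np

n_states, n_actions = 10, 4
alpha, gamma, lam, eps = 0.1, 0.99, 0.9, 0.1
rng = np.random.default_rng(0)

Q = np.zeros((n_states, n_actions))

def policy(s):
    # epsilon-greedy action selection over current Q estimates
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def env_step(s, a):
    # stand-in dynamics for illustration only
    s_next = (s + a) % n_states
    reward = -1.0
    done = (s_next == n_states - 1)
    return s_next, reward, done

for episode in range(200):
    E = np.zeros((n_states, n_actions))  # the fix: fresh traces each episode
    s = 0
    a = policy(s)
    for t in range(500):                 # step cap as a safety net
        s_next, r, done = env_step(s, a)
        a_next = policy(s_next)
        delta = r + gamma * Q[s_next, a_next] * (not done) - Q[s, a]
        E[s, a] += 1.0
        Q += alpha * delta * E
        E *= gamma * lam
        if done:
            break
        s, a = s_next, a_next
```

With this change, a large reward or penalty at the start of an episode can only propagate back along the trajectory of that same episode, which is the credit assignment the traces are meant to implement.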
|machine-learning reinforcement-learning sarsa|