Q: Eligibility trace reinitialization between episodes in SARSA-Lambda implementation

I'm looking at this SARSA-Lambda implementation (i.e. SARSA with eligibility traces), and there's a detail I still don't get.

(Image from http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node77.html)

So I understand that all Q(s,a) values are updated, rather than only the one for the action the agent chose at the given time step. I also understand that the E matrix is not reset at the start of each episode.

Let's assume for a minute that panel 3 of Figure 7.12 was the end-state of episode 1.

At the start of episode 2, the agent moves north instead of east, and let's assume this gives it a reward of -500. Wouldn't this also affect all the states that were visited in the previous episode?

If the idea is to credit the states that have been visited in the current episode, then why isn't the matrix containing all the e(s,a) values reset at the beginning of each episode? It seems that, with this implementation, states visited in the previous episode are 'punished' or 'rewarded' for actions the agent takes in this new episode.
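To make the concern concrete, here is a minimal sketch of one accumulating-trace SARSA(λ) step, assuming tabular Q and E stored as NumPy arrays indexed by (state, action); the names Q, E, alpha, gamma and lam are illustrative, not taken from the book's pseudocode:

```python
import numpy as np

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One accumulating-trace SARSA(lambda) update over tabular Q and E."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]  # TD error for this transition
    E[s, a] += 1.0                                   # bump the trace of the visited pair
    Q += alpha * delta * E   # every (s, a) with a nonzero trace gets a share of this update
    E *= gamma * lam         # decay all traces
    return Q, E
```

If E is carried over from episode 1, the `Q += alpha * delta * E` line also shifts the Q-values of pairs that were visited only in episode 1, which is exactly the -500 scenario above.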

Answer 1:

I agree with you 100%. Failing to reset the e-matrix at the start of every episode has exactly the problems that you describe. As far as I can tell, this is an error in the pseudocode. The reference that you cite is very popular, so the error has been propagated to many other references. However, this well-cited paper very clearly states that the e-matrix should be reinitialized between episodes:

The eligibility traces are initialized to zero, and in episodic tasks they are reinitialized to zero after every episode.

As further evidence, the methods of this paper:

The trace, e, is set to 0 at the beginning of each episode.

and footnote #3 from this paper:

...eligibility traces were reset to zero at the start of each trial.

suggest that this is common practice, as both refer to reinitialization between episodes. I expect that there are many more such examples.

In practice, many uses of this algorithm don't involve multiple episodes, or use episodes that are so long relative to the trace decay rate that this doesn't end up being a problem. I expect that is why it hasn't been clarified more explicitly elsewhere on the internet yet.
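For illustration, here is a minimal sketch of the surrounding episode loop with the trace matrix reinitialized to zero at the start of every episode, as the quoted sources describe. The environment interface (reset(), and step() returning next state, reward and a done flag) and the epsilon-greedy helper are hypothetical placeholders, not any specific library's API:

```python
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    # Hypothetical exploration helper: random action with probability eps, else greedy.
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def train(env, n_states, n_actions, n_episodes=100, alpha=0.1, gamma=0.99, lam=0.9):
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        E = np.zeros((n_states, n_actions))    # reinitialize all traces between episodes
        s = env.reset()
        a = epsilon_greedy(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)      # assumed to return (next_state, reward, done)
            a_next = epsilon_greedy(Q, s_next)
            delta = r + gamma * Q[s_next, a_next] * (not done) - Q[s, a]  # no bootstrap past terminal
            E[s, a] += 1.0
            Q += alpha * delta * E             # only this episode's traces are nonzero
            E *= gamma * lam
            s, a = s_next, a_next
    return Q
```

With the E = np.zeros(...) line inside the episode loop, a -500 reward early in episode 2 can only propagate to state-action pairs that have already been visited in episode 2.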

machine-learning  reinforcement-learning  sarsa