Value-based Reinforcement Learning
Sequential decision making. Take the classic Atari games as an example: at time t the agent observes a video clip of M frames, $s_t = (x_{t-M+1}, ..., x_t) \in S$, and then makes a decision by choosing an action $a_t \in A = \{ 1, ..., |A| \}$ (where A is the discrete set of available actions); this action earns the agent a reward $r_t$. This is the sequential decision process, a general framework that can model many sequential decision problems, such as games and robotics. The agent observes the environment and responds with an action according to its policy $\pi\left(a_{t} \mid s_{t}\right)$, where $s_{t}$ is the current observation of the environment (the observation is the part of the environment state visible to the agent). The action yields a new reward $r_{t+1}$ and new environment feedback $s_{t+1}$.

Note: it is important to distinguish between the state of the environment and the observation, which is the part of the environment state that the agent can see. For example, in a poker game the environment state consists of the cards belonging to all the players plus the community cards, but the agent can observe only its own cards and a few of the community cards. In most of the literature these terms are used interchangeably, and the observation is also denoted as $s_t$.
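The interaction loop described above (observe a stack of M frames, choose an action, receive a reward, see the next frame) can be sketched as follows. This is a minimal toy sketch: the environment, the reward rule, and the uniform-random policy are placeholders of my own, not any real Atari setup.

```python
import random
from collections import deque

M = 4            # number of stacked frames per observation (assumed)
NUM_ACTIONS = 6  # |A|: size of the discrete action space (assumed)

def dummy_env_step(action):
    """Toy environment: returns the next frame x_{t+1} and reward r_t.
    A single float stands in for one Atari screen image."""
    next_frame = random.random()          # placeholder for an image frame
    reward = 1.0 if action == 0 else 0.0  # arbitrary toy reward rule
    return next_frame, reward

def random_policy(state):
    """pi(a_t | s_t): here simply uniform over the action set."""
    return random.randrange(NUM_ACTIONS)

# Frame stack: s_t = (x_{t-M+1}, ..., x_t)
frames = deque([0.0] * M, maxlen=M)       # pad with blank frames at t = 0
rewards = []

for t in range(10):
    s_t = tuple(frames)                   # observation = last M frames
    a_t = random_policy(s_t)              # agent chooses an action
    x_next, r_t = dummy_env_step(a_t)     # environment responds
    frames.append(x_next)                 # oldest frame drops out of s_t
    rewards.append(r_t)

print(len(rewards))
```

Because `maxlen=M` is set on the deque, appending the newest frame automatically discards the oldest one, which is exactly the sliding-window observation $s_t$ used in the Atari setting.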