Early Rumour Detection

2019, ACL data: TWITTER, WEIBO links: https://www.aclweb.org/anthology/N19-1163, https://github.com/DeepBrainAI/ERD task: Rumour Detection This paper uses a GRU to encode the stream of social media posts as the environment's state representation, and trains a classifier that takes the GRU state as input and makes a binary decision on whether the text is a rumour. A DQN-trained agent decides, based on the current state, whether to trigger the rumour classifier, and is rewarded or penalized according to whether the classification is correct. The goal is to predict whether social media posts are rumours as accurately and as early as possible. Focuses on the task of rumour detection; particularly, we are interested in understanding how early we can detect them. Our model treats social media posts (e.g. tweets) as a data stream and integrates reinforcement learning to learn the minimum number of posts required before we classify an event as a rumour. Let $E$ denote an event, and it consists of a series of relevant posts $x_i$, where $x_0$ denotes the source message and $x_T$ the last relevant message....
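
A minimal sketch of how these pieces could fit together, assuming PyTorch; the module names, dimensions, and reward scheme below are illustrative stand-ins, not the released ERD code:

```python
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """GRU that encodes the stream of post embeddings into a state vector."""
    def __init__(self, post_dim=100, state_dim=64):
        super().__init__()
        self.gru = nn.GRU(post_dim, state_dim, batch_first=True)

    def forward(self, posts):                  # posts: (1, T, post_dim)
        _, h = self.gru(posts)                 # h: (1, 1, state_dim)
        return h.squeeze(0)                    # (1, state_dim)

class RumourClassifier(nn.Module):
    """Binary classifier on top of the GRU state (rumour vs. non-rumour)."""
    def __init__(self, state_dim=64):
        super().__init__()
        self.fc = nn.Linear(state_dim, 2)

    def forward(self, state):
        return self.fc(state)                  # logits over {non-rumour, rumour}

class DQNAgent(nn.Module):
    """Q-network over two actions: 0 = keep waiting, 1 = classify now."""
    def __init__(self, state_dim=64):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def forward(self, state):
        return self.q(state)                   # Q-values for (wait, classify)

encoder, clf, agent = StreamEncoder(), RumourClassifier(), DQNAgent()
posts = torch.randn(1, 5, 100)                 # 5 posts seen so far (dummy data)
state = encoder(posts)
action = agent(state).argmax(dim=-1).item()    # greedy action, for illustration
if action == 1:                                # agent decides to classify now
    pred = clf(state).argmax(dim=-1).item()
    # reward would be positive if pred matches the gold label, negative otherwise;
    # earlier correct decisions can be rewarded more to encourage early detection.
```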

2021-05-01 · 3 min · Cong Chan

DQN, Double DQN, Dueling Double DQN, Rainbow DQN

The evolution of deep reinforcement learning from DQN and Natural DQN to Double DQN, Dueling Double DQN, and Rainbow DQN, with must-read papers. Overestimation in DQN: DQN is based on Q-learning, and the max operator in Q-learning causes overestimation in the Q target. Double DQN is designed to fix this overestimation. In practice, if you print out your DQN's Q values, you may find they are all extremely large; that is overestimation at work. The neural network part of DQN can be viewed as a current (online) network plus an old (target) network: they share the same architecture, but their parameters are updated with a time lag. The Q target is: $$Y_t^\text{DQN} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)$$ Overestimation refers to the fact that taking the maximum of a set of random variables and then the expectation is generally greater than (or equal to) taking the expectations first and then the maximum: $$E(\max(X_1, X_2, \dots)) \ge \max(E(X_1), E(X_2), \dots)$$ The overestimation of Q-learning is generally attributed to its update rule: $$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)\right)$$...
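
A short sketch, assuming PyTorch, that contrasts the DQN target with the Double DQN target (the action is selected by the online network but evaluated by the target network); `q_online`, `q_target`, gamma, and the shapes are illustrative stand-ins:

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 3, 0.99
q_online = nn.Linear(state_dim, n_actions)    # theta_t   (current network)
q_target = nn.Linear(state_dim, n_actions)    # theta_t^- (old, frozen copy)

s_next = torch.randn(8, state_dim)            # batch of next states
r = torch.randn(8)                            # batch of rewards

with torch.no_grad():
    # DQN: both action selection and evaluation use the target network,
    # so the max over noisy estimates inflates the target (overestimation).
    y_dqn = r + gamma * q_target(s_next).max(dim=1).values

    # Double DQN: select the argmax action with the online network,
    # then evaluate that action with the target network.
    a_star = q_online(s_next).argmax(dim=1)
    y_double = r + gamma * q_target(s_next).gather(1, a_star.unsqueeze(1)).squeeze(1)

print(y_dqn.mean().item(), y_double.mean().item())
```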

2021-03-09 · 3 min · Cong Chan

Deep Q Networks

Combining reinforcement learning and deep neural networks at scale. The algorithm was developed by enhancing a classic RL algorithm called Q-Learning with deep neural networks and a technique called experience replay. Q-Learning: Q-Learning is based on the notion of a Q-function. The Q-function (a.k.a. the state-action value function) of a policy $\pi$, $Q^{\pi}(s, a)$, measures the expected return or discounted sum of rewards obtained from state $s$ by taking action $a$ first and following policy $\pi$ thereafter....
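
A minimal tabular Q-learning sketch, assuming a toy environment with discrete states and actions; the hyperparameters and helper names are illustrative, not from the DQN paper:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1
n_actions = 2
Q = defaultdict(lambda: [0.0] * n_actions)    # Q[s][a], initialized to zero

def choose_action(s):
    """Epsilon-greedy action selection from the current Q estimates."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def update(s, a, r, s_next, done):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha * (target - Q(s,a))."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# Example of a single transition being learned (dummy values):
update(s=0, a=choose_action(0), r=1.0, s_next=1, done=False)
```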

2019-03-10 · 3 min · Cong Chan