RL | Cong's Log

DeepSeek-R1

DeepSeek-AI, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, arXiv, 22 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.12948.

Large language models (LLMs) have made remarkable strides in mimicking human-like cognition, but their ability to reason through complex problems, from math proofs to coding challenges, remains a frontier. DeepSeek-AI introduces DeepSeek-R1, a family of reasoning-focused models that uses reinforcement learning (RL) to elicit advanced reasoning capabilities; its DeepSeek-R1-Zero variant is trained with RL without supervised fine-tuning (SFT) as a preliminary step. The paper “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” argues for a different way of training LLMs to reason, with implications for both research and real-world applications. ...

Early Rumour Detection

2019, NAACL
data: TWITTER, WEIBO
links: https://www.aclweb.org/anthology/N19-1163, https://github.com/DeepBrainAI/ERD
task: Rumour Detection

This paper uses a GRU to encode the stream of social media posts as the state representation of the environment. A classifier takes the GRU state output as input and performs binary classification: rumour or not. An agent trained with DQN decides, based on the state, whether to trigger the rumour classifier, and it is rewarded or penalised according to whether the resulting classification is correct. The goal is to predict whether a stream of posts is a rumour as accurately and as early as possible.

Focuses on the task of rumour detection; in particular, we are interested in understanding how early we can detect rumours. Our model treats social media posts (e.g. tweets) as a data stream and integrates reinforcement learning to learn the minimum number of posts required before we classify an event as a rumour. Let $E$ denote an event consisting of a series of relevant posts $x_i$, where $x_0$ denotes the source message and $x_T$ the last relevant message. The objective of early rumour detection is to decide whether $E$ is a rumour as early as possible while keeping an acceptable detection accuracy. ...
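The stop-or-continue decision described above can be sketched as a loop over the post stream. This is only an illustration under stated assumptions: `detect_early`, `update_state`, `should_stop` and `classify` are hypothetical names, and the keyword-counting stand-ins at the bottom replace the GRU encoder, the DQN agent and the trained classifier purely to make the sketch runnable.

```python
from typing import Callable, Sequence, Tuple, TypeVar

S = TypeVar("S")


def detect_early(
    posts: Sequence[str],
    init_state: S,
    update_state: Callable[[S, str], S],   # stand-in for the GRU encoder
    should_stop: Callable[[S], bool],      # stand-in for the DQN agent's decision
    classify: Callable[[S], bool],         # stand-in for the rumour classifier
) -> Tuple[bool, int]:
    """Read posts x_0..x_T in order and classify as soon as the agent says stop.

    Returns (is_rumour, number_of_posts_read). If the agent never stops,
    the decision is made after the last post.
    """
    if not posts:
        raise ValueError("an event needs at least the source post x_0")
    state = init_state
    posts_read = 0
    for posts_read, post in enumerate(posts, start=1):
        state = update_state(state, post)
        if should_stop(state):
            break
    return classify(state), posts_read


# Toy stand-ins (assumptions, not the paper's model): the state is the number
# of posts seen so far that contain the word "fake".
def count_fake(state: int, post: str) -> int:
    return state + int("fake" in post)


event = ["breaking: X happened", "this is fake", "fake news", "more", "more"]
label, used = detect_early(event, 0, count_fake, lambda s: s >= 2, lambda s: s >= 2)
```

The second return value is what the task cares about besides accuracy: how many posts were consumed before a decision was made.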

DQN, Double DQN, Dueling DoubleQN, Rainbow DQN

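The evolution of deep reinforcement learning from DQN and Natural DQN to Double DQN, Dueling DoubleQN and Rainbow DQN, with the must-read papers.

Overestimation in DQN

DQN is based on Q-learning. The Q-learning target contains a max over Q-values, and this max causes the Q target to be overestimated. Double DQN was designed to address this overestimation. In practice, if you print the Q-values of your DQN, you may find that they are all very large; that is overestimation at work. The neural-network part of DQN can be viewed as a latest (online) network plus an old (target) network. The two share the same structure, but their parameters are updated with a time lag. The Q target is:

$$Y_t^\text{DQN} \equiv R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t^-)$$

Overestimation refers to the fact that taking the maximum of a set of random quantities first and then the expectation is generally greater than (or equal to) taking the expectations first and then the maximum:

$$E(\max(X_1, X_2, ...)) \ge \max(E(X_1), E(X_2), ...)$$

In general, the overestimation in Q-learning comes from its update rule, where $\alpha_t(s_t, a_t)$ is the learning rate:

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)\right)$$

The optimisation process of the update is as follows ...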
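A small numerical sketch may make the inequality and the two targets concrete. `dqn_target` follows the DQN target given above. `double_dqn_target` is the commonly used Double DQN form, in which the online network selects the action and the target network evaluates it; that formula is not part of this excerpt, so treat it as an assumption. `overestimation_demo` is a hypothetical helper: all true Q-values are 0 and the estimates are pure Gaussian noise, so any positive mean of the max is overestimation.

```python
import random


def dqn_target(reward, gamma, q_next_target):
    """Y = R + gamma * max_a Q(S', a; theta^-), using target-network values."""
    return reward + gamma * max(q_next_target)


def double_dqn_target(reward, gamma, q_next_online, q_next_target):
    """Online network chooses the action; target network supplies its value."""
    best_action = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return reward + gamma * q_next_target[best_action]


def overestimation_demo(num_actions=4, num_trials=10_000, seed=0):
    """Compare estimates of E[max_a X_a] and max_a E[X_a] for noisy Q-values.

    Every true Q-value is 0; each estimate X_a is 0 plus unit Gaussian noise.
    """
    rng = random.Random(seed)
    sums = [0.0] * num_actions
    sum_of_max = 0.0
    for _ in range(num_trials):
        estimates = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        sum_of_max += max(estimates)
        for action, q in enumerate(estimates):
            sums[action] += q
    mean_of_max = sum_of_max / num_trials               # approximates E[max_a X_a]
    max_of_means = max(s / num_trials for s in sums)    # approximates max_a E[X_a]
    return mean_of_max, max_of_means


mean_of_max, max_of_means = overestimation_demo()

# The two targets differ when the networks disagree about the best action.
q_online = [1.0, 3.0, 2.0]   # online network prefers action 1
q_target = [5.0, 0.5, 4.0]   # target network's (possibly inflated) values
y_dqn = dqn_target(1.0, 0.9, q_target)
y_double = double_dqn_target(1.0, 0.9, q_online, q_target)
```

In the demo, `mean_of_max` comes out clearly positive while `max_of_means` stays near zero, even though every true value is 0. Decoupling action selection from evaluation, as in `double_dqn_target`, is one way to reduce this bias.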

DeepPath - A Reinforcement Learning Method for Knowledge Graph Reasoning

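2017, EMNLP
data: FB15K-237, FB15K
task: Knowledge Graph Reasoning

Use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path.

Method

The RL system has two parts. The first is the external environment, which specifies the dynamics of the interaction between the agent and the knowledge graph; it is modelled as a Markov decision process. The second part, the RL agent, is represented as a policy network that maps the state vector to a stochastic policy. The network parameters are updated with stochastic gradient descent. Compared with DQN, policy-based RL methods suit this knowledge graph scenario better. One reason is that in path finding over a knowledge graph the action space can be very large because of the complexity of the relation graph, which can lead to poor convergence for DQN. In addition, the policy network can learn a stochastic policy, which keeps the agent from getting stuck at an intermediate state, rather than the greedy policy typical of value-based methods such as DQN.

Reinforcement learning for relation reasoning

Actions. Given some entity pairs and a relation, we want the agent to find the most informative paths connecting these entity pairs. Starting from the source entity, the agent uses the policy network to pick the most promising relation and extends its path at each step until it reaches the target entity. To keep the output dimension of the policy network consistent, the action space is defined as all relations in the knowledge graph.

States. Entities and relations in a knowledge graph are naturally discrete atomic symbols. Knowledge graphs in practical use, such as Freebase and NELL, often contain a huge number of triples, so it is impossible to model all atomic symbols directly as states. To capture the semantic information of these symbols, we use translation-based embeddings such as TransE and TransH to represent entities and relations. These embeddings map all symbols to a low-dimensional vector space. In this framework, each state captures the agent's position in the knowledge graph. After taking an action, the agent moves from one entity to another. The two states are linked by the action (relation) the agent has just taken. The state vector at step t: ...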
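The idea of a policy that maps a state vector to a distribution over all relations, and then samples a relation to extend the path, can be sketched as follows. This is a minimal illustration, not the paper's network: the single linear scoring layer, the function names, and the toy weights are all assumptions, and the state is left as an arbitrary vector because its definition is cut off in the excerpt.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_policy(state, weights):
    """Stochastic policy pi(relation | state) over ALL relations in the KG.

    weights[k] is a hypothetical weight vector for relation k; a single linear
    layer stands in for the policy network here.
    """
    scores = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(scores)


def sample_relation(probs, rng):
    """Sample a relation index from the policy instead of taking the argmax."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy usage: a 2-dimensional state and 3 relations (made-up numbers).
state = [1.0, 0.0]
weights = [[2.0, 0.0], [0.0, 0.0], [-1.0, 0.0]]
probs = relation_policy(state, weights)
chosen = sample_relation(probs, random.Random(0))
```

Because the output has one entry per relation, its dimension stays fixed no matter which entity the agent currently occupies, which is the reason given above for defining the action space as all relations. Sampling rather than taking the argmax is what makes the policy stochastic.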

Deep Q Networks

DQN combines reinforcement learning and deep neural networks at scale. The algorithm was developed by enhancing a classic RL algorithm called Q-Learning with deep neural networks and a technique called experience replay.

Q-Learning

Q-Learning is based on the notion of a Q-function. The Q-function (a.k.a. the state-action value function) of a policy $\pi$, $Q^{\pi}(s, a)$, measures the expected return, i.e. the discounted sum of rewards, obtained from state $s$ by taking action $a$ first and following policy $\pi$ thereafter. ...
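The definition of $Q^{\pi}(s, a)$ can be made concrete with a tiny example. The sketch below assumes a deterministic environment, so the expectation reduces to a single rollout; the four-state chain, `chain_step`, `always_right` and the finite `horizon` cutoff are all hypothetical and exist only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated from the last reward backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def q_value(step, policy, state, action, gamma, horizon):
    """Q^pi(s, a) in a deterministic environment.

    Take `action` first, then follow `policy`, for at most `horizon` steps.
    `step(state, action)` must return (next_state, reward, done).
    """
    rewards = []
    for _ in range(horizon):
        state, reward, done = step(state, action)
        rewards.append(reward)
        if done:
            break
        action = policy(state)
    return discounted_return(rewards, gamma)


# Hypothetical chain 0 -> 1 -> 2 -> 3: action 1 moves right, action 0 stays.
# Reaching state 3 gives reward 1 and ends the episode.
def chain_step(state, action):
    next_state = min(state + 1, 3) if action == 1 else state
    done = next_state == 3
    return next_state, (1.0 if done else 0.0), done


def always_right(state):
    return 1


q_right = q_value(chain_step, always_right, 0, 1, gamma=0.5, horizon=10)
q_stay = q_value(chain_step, always_right, 0, 0, gamma=0.5, horizon=10)
```

Here `q_right` is larger than `q_stay` because staying in place first delays the only reward by one step, so it is discounted one more time. That difference between actions taken from the same state is exactly what the Q-function captures.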

Value-based Reinforcement Learning

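Sequential decision making

Take the classic Atari games as an example. At time t the agent observes a video segment of M frames, $s_t = (x_{t-M+1}, ..., x_t) \in S$, and then makes a decision: it chooses an action $a_t \in A = \{ 1, ..., |A| \}$ (A is the discrete set of available actions), and this action earns the agent a reward $r_t$. This is the sequential decision process, a general framework that can model many sequential decision problems, such as games and robotics.

The agent observes the environment and responds with an action according to its policy $\pi\left(a_{t} \mid s_{t}\right)$, where $s_{t}$ is the current observation of the environment (the observation is the part of the environment state visible to the agent). The action yields a new reward $r_{t+1}$ and new feedback from the environment, $s_{t+1}$.

Note: It is important to distinguish between the state of the environment and the observation, which is the part of the environment state that the agent can see. For example, in a poker game the environment state consists of the cards belonging to all the players plus the community cards, but the agent can observe only its own cards and a few community cards. In most of the literature these terms are used interchangeably, and the observation is also denoted with the state symbol (here, $s_t$). ...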
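The observe, act, receive-reward loop with an M-frame observation can be sketched as below. This is an illustration under assumptions: `run_episode`, `make_counter_env` and the toy counter environment are hypothetical, and padding the first observation by repeating the initial frame is one possible convention, not something stated in the text.

```python
from collections import deque


def run_episode(first_frame, env_step, policy, num_frames, max_steps):
    """Run one episode where the observation s_t is the last `num_frames` frames.

    `env_step(action)` must return (next_frame, reward, done).
    Returns (total_reward, final_observation).
    """
    # Assumption: pad the first observation by repeating the initial frame.
    frames = deque([first_frame] * num_frames, maxlen=num_frames)
    total_reward = 0.0
    for _ in range(max_steps):
        s_t = tuple(frames)              # s_t = (x_{t-M+1}, ..., x_t)
        a_t = policy(s_t)                # a_t chosen from the discrete set A
        x_next, reward, done = env_step(a_t)
        total_reward += reward
        frames.append(x_next)            # the oldest frame drops out automatically
        if done:
            break
    return total_reward, tuple(frames)


def make_counter_env(episode_length=5):
    """Toy environment: the 'frame' is a step counter; action 1 earns reward 1."""
    t = 0

    def env_step(action):
        nonlocal t
        t += 1
        reward = 1.0 if action == 1 else 0.0
        return t, reward, t >= episode_length

    return 0, env_step


first_frame, env_step = make_counter_env()
total, final_obs = run_episode(first_frame, env_step, lambda s: 1,
                               num_frames=3, max_steps=10)
```

Using `deque(maxlen=M)` keeps the sliding window of frames without manual index bookkeeping. The policy only ever sees the stacked frames, not the environment's internal counter, which mirrors the state versus observation distinction in the note above.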