Modern AI companies provide “Deep Search” and “Deep Research” as a service. For example, ChatGPT offers Thinking and Deep Research. Google’s Gemini has a thinking mode and a Deep Research tool (which by default uses a “Flash Thinking” model). Grok supports Expert mode (think hard) and Heavy mode (a team of experts). DeepSeek provides DeepThink mode. Kimi offers an “Agent” option, which provides various agentic capabilities such as office-pilot, web search, agent swarm, and Deep Research.
Different companies use different terms to describe similar capabilities; the underlying technology is new and rapidly evolving.
These systems are designed to autonomously navigate the web, extract relevant data, and synthesize it into comprehensive reports or answers.
Core Technologies Behind Agentic Search and Deep Research
While different companies (OpenAI, Google, Perplexity) have their own versions of Agentic Search and Deep Research, the underlying “Tech Stack” generally consists of three pillars: Reasoning Models, Agentic Loops, and Iterative Tool Use.
1. Reasoning Models
Traditional models use “next-token prediction” to give you an answer instantly. Deep Research uses Reasoning Models (like OpenAI’s o3 or Gemini 2.0/3 Flash Thinking) that use Reinforcement Learning (RL) to “think before they speak.”
- Chain-of-Thought (CoT): The model generates an internal hidden monologue to plan its steps.
- RL Training: These models are trained specifically on “browsing trajectories”—rewarding the AI when it successfully navigates a complex website to find a specific data point.
2. Agentic Loop
Unlike a standard search that happens in one shot, Deep Research operates in a recursive loop.
- Decomposition: It breaks your prompt (e.g., “Compare the EV market in 2026”) into sub-questions: “Current market share,” “Battery tech breakthroughs,” and “New regulatory hurdles.”
- The “Scout” and “Explorer”: It spawns sub-agents. A “Scout” generates the search queries, while the “Explorer” actually visits the URLs.
- Backtracking: If the agent hits a dead end (like a 404 error or a paywall), the reasoning model “realizes” it failed and tries a different path—much like a human researcher would.
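The loop above can be sketched in a few lines of Python. Everything here (the `decompose` stub, the “every `.pdf` is paywalled” rule) is an illustrative stand-in for what the reasoning model and its browser tool would actually do, not a real implementation:

```python
# Minimal sketch of the decompose -> scout -> explore -> backtrack loop.
# All names (decompose, fetch, DEAD_END) are hypothetical stand-ins.

DEAD_END = None  # stand-in for a 404 or a paywall


def decompose(prompt):
    # A real system would have the reasoning model emit these sub-questions.
    return [f"{prompt} :: market share",
            f"{prompt} :: battery tech",
            f"{prompt} :: regulation"]


def fetch(url):
    # Stub explorer: pretend every ".pdf" URL is paywalled.
    return DEAD_END if url.endswith(".pdf") else f"content of {url}"


def research(prompt, candidate_urls):
    findings = {}
    for sub_q in decompose(prompt):                # decomposition
        for url in candidate_urls.get(sub_q, []):  # scout-proposed URLs
            page = fetch(url)                      # explorer visits
            if page is not DEAD_END:
                findings[sub_q] = page
                break                              # this sub-question is done
            # else: backtrack and try the next candidate URL
    return findings
```

The inner `for`/`break` pair is the whole backtracking story in miniature: a failed fetch simply falls through to the next candidate path.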
3. The Toolset
To be truly “agentic,” the technology integrates several specialized tools and protocols:
- Search Tools: Integrated web browsers that can perform complex searches, click through links, and even interact with dynamic content (like filling out forms or navigating dropdowns).
- Generative Tools: Apply LLMs’ generative abilities to a range of tasks, e.g. summarization tools that condense long articles and question-answering tools that extract specific data points from dense text.
- Python Sandbox: If it finds a raw data table, it writes and executes Python code to calculate growth rates or generate charts.
- Multimodal Vision: It doesn’t just read text; it “looks” at screenshots of charts and diagrams to extract data that isn’t in the page copy.
- MCP (Model Context Protocol): A standard that allows the agent to securely connect to external databases, Google Drive, or Slack to pull in non-public information.
- Skills: Pre-built “skills” for specific tasks, like “Financial Analysis,” “Scientific Literature Review,” or “Competitive Intelligence,” which are essentially pre-configured agentic workflows optimized for those domains.
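A toy dispatcher shows how such a toolset plugs together. The tool names and the trivial `run_python`/`summarize` bodies are invented for illustration, not any vendor’s actual API:

```python
# Hypothetical tool registry an agent might dispatch to.

def run_python(code):
    """'Sandbox' stand-in: evaluate one expression with no builtins."""
    return repr(eval(code, {"__builtins__": {}}, {}))


def summarize(text, max_words=10):
    """Generative-tool stand-in: truncate instead of calling an LLM."""
    words = text.split()
    return " ".join(words[:max_words]) + ("…" if len(words) > max_words else "")


TOOLS = {"python": run_python, "summarize": summarize}


def dispatch(tool, payload):
    # In a real agent, the reasoning model chooses `tool` and `payload`;
    # here we only route the call.
    return TOOLS[tool](payload)
```

A real sandbox would of course run in an isolated process, and a real summarizer would be another model call; the registry-plus-dispatch shape is the part that carries over.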
Key Papers on Agentic Search and Deep Research
Benchmarking
BrowseComp
BrowseComp2 introduces a new benchmark designed to evaluate the browsing capabilities of AI agents. The paper highlights the limitations of current models in navigating complex web environments and proposes a comprehensive set of tasks that require multi-step reasoning, dynamic interaction with web elements, and the ability to synthesize information from multiple sources. The benchmark serves as a critical tool for advancing the development of more sophisticated browsing agents capable of performing real-world research tasks.
The authors draw an analogy to programming competitions: just as CodeForces serves as an incomplete but useful benchmark for coding ability (despite not capturing all aspects of software engineering), BrowseComp aims to test a core capability of browsing agents—persistent, creative information finding—even if it doesn’t capture the full complexity of real user queries.
How BrowseComp Was Created
The dataset was built entirely by human trainers (not synthetic generation). Trainers were instructed to create questions that met strict criteria:
- Challenging: Questions must not be solvable by existing models (GPT-4o, o1, early Deep Research) or by humans within 10 minutes
- Verifiable: Answers must be short, single strings that are easy to verify once found
- Hard to find, easy to verify: Trainers started with a “seed” fact, then created inverted questions with multiple constraints that produce large search spaces. For example: “What’s the title of the scientific paper published in the EMNLP conference between 2018-2023 where the first author did their undergrad at Dartmouth College and the fourth author did their undergrad at University of Pennsylvania?” Solving this requires checking thousands of papers and author backgrounds; brute force is impractical, but verification is simple.
Rather than starting with a question and finding its answer, trainers started with an interesting seed (person, event, artifact), and identified multiple specific characteristics with large search spaces, then combined these into a question where the answer is obscure but verifiable.
Trainers verified the answer wasn’t on the first page of results for 5 simple searches. A second trainer attempted to solve questions; creators of questions solved >40% of the time were asked to revise. Trainers added more constraints if they weren’t confident the answer was unique.
As a result, 1,266 questions across diverse topics (TV/movies, science/tech, art, history, sports, music, etc.) were created.
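The “hard to find, easy to verify” asymmetry is easy to see in code. This sketch uses an invented two-paper corpus and invented field names; the point is that `verify` is a cheap constraint check, while *finding* the answer means scanning the whole corpus:

```python
# Toy version of the EMNLP example: data and field names are invented.

papers = [
    {"title": "Paper A", "venue": "EMNLP", "year": 2021,
     "author_undergrad": ["Dartmouth College", "MIT", "CMU",
                          "University of Pennsylvania"]},
    {"title": "Paper B", "venue": "ACL", "year": 2020,
     "author_undergrad": ["Dartmouth College", "University of Pennsylvania"]},
]


def verify(paper):
    """Checking one candidate against every constraint is trivial."""
    a = paper["author_undergrad"]
    return (paper["venue"] == "EMNLP"
            and 2018 <= paper["year"] <= 2023
            and len(a) >= 4
            and a[0] == "Dartmouth College"
            and a[3] == "University of Pennsylvania")


# ...while finding the answer requires exhausting the search space:
answers = [p["title"] for p in papers if verify(p)]
```

With a real corpus of hundreds of thousands of papers, the scan is what makes the question hard; the `verify` call stays O(1).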
SealQA
SealQA comes in three variants:
- SEAL-0 (111 questions): The core “stress test” where even frontier models like GPT-4.1 with browsing consistently fail. Questions are iteratively refined until multiple models fail across 10-15 attempts, achieving 0% accuracy.
- SEAL-HARD (254 questions): A broader set including SEAL-0 plus additional difficult questions that didn’t meet the strict failure threshold but remain highly challenging.
- LONGSEAL (254 questions): A “needle-in-a-haystack” variant testing long-context, multi-document reasoning. Each question is paired with 1 gold document and up to 50 hard negative distractors (over 7.6K documents total).
Question characteristics:
- Single, unambiguous answers (verifiable from authoritative sources)
- Require deep reasoning: distinguishing similar entities, tracking temporal changes, interpreting charts/tables, counting, reasoning over non-English content, debunking false premises
- Designed to trigger ambiguous, conflicting, or noisy search results
- Span diverse domains: science (26.8%), sports (22.0%), entertainment (21.7%), politics (9.1%), history, geography, etc.
- Include temporal freshness categories: 31.1% never-changing, 43.7% slow-changing, 25.2% fast-changing
Quality control:
- Rigorous multi-round vetting by graduate-level NLP researchers
- Expert reviewer approval
- Each question annotated with supporting URLs and expected review dates
- Questions refined iteratively until they consistently cause model failures
Data Collection Process
- Human Annotators
- Recruited NLP researchers (including the 6 authors and their colleagues) as human annotators
- Shown a small, diverse set of example questions to illustrate the types of questions to collect
- Question Design Criteria
Annotators were instructed to write questions with:
Answer requirements:
- Single, unambiguous answer (e.g., specifying “on what date” rather than just asking “when”)
- Must be verifiable - supported by one or more webpages that justify the reference answer
- For questions involving fresh knowledge, annotators required to cite regularly updated sources to support future maintenance
Difficulty requirements:
- Questions designed to trigger ambiguous, conflicting, or misleading search results when entered into a search engine like Google
- Must appear natural; the difficulty should come from the underlying information landscape, not from artificially convoluted phrasing
Documentation:
- Annotators provided an explanation for each answer, including necessary clarification or subtle reasoning
- Each question was refined iteratively until it consistently caused multiple models to fail across repeated attempts
- Iterative Refinement Process
A rigorous failure-driven curation:
For SEAL-0 specifically:
- Each question tested against GPT-4o, GPT-4.1, and their mini variants (with and without browsing capabilities)
- Also tested against OpenAI’s o1, o3, and Meta’s Llama models
- Questions retained only if all models failed across 10-15 attempts
- This achieves 0% accuracy threshold, hence “0” in the name
- The paper notes: “This follows current best practices for building challenging datasets” (referencing GPQA-Diamond and SimpleQA)
For SEAL-HARD:
- Includes SEAL-0 questions plus additional difficult questions
- These didn’t meet the strict failure threshold but remain highly challenging
- Quality Control - Multi-Round Vetting
A rigorous review process:
- Initial review: Two or more graduate-level reviewers first reviewed each question
- Expert approval: Followed by approval from expert reviewers
- Multiple rounds of data cleaning:
  - Verification of supporting URLs
  - Answer correctness checks
  - Question clarity assessment
- Exclusions:
  - Questions whose answers change too frequently
  - Questions with unclear or ambiguous phrasing
Metadata annotation:
- Effective year (when the answer last changed)
- Expected next review date for future maintenance
- Dataset Statistics and Diversity
Question types (5 categories):
- Q₁ (72.4%): Advanced reasoning - multi-hop reasoning, interpreting charts/tables, counting, calculations
- Q₂ (58.3%): Entity/event disambiguation - distinguishing between similar entities or events
- Q₃ (13.7%): Temporal tracking - identifying and differentiating instances of entities over time
- Q₄ (5.5%): Cross-lingual reasoning - questions in English requiring reasoning over non-English sources
- Q₅ (5.7%): False-premise questions - debunking false premises
Domain diversity:
- Science and technology: 26.8%
- Sports: 22.0%
- Entertainment: 21.7%
- Politics: 9.1%
- History, geography, and others: 12.2%
Temporal freshness classification:
- 31.1% never-changing (answers never change)
- 43.7% slow-changing (answers typically change within a year)
- 25.2% fast-changing (answers change rapidly, e.g., within weeks)
Question length:
- Average: 31 tokens
- Maximum: 69 tokens
Topic labels:
- Assigned post-hoc using GPT-4o mini
- LONGSEAL Creation
For the long-context variant:
- Gold document selection: One helpful document from annotator-provided webpages
- Hard negative collection:
  - Used Google search to retrieve the top 10 webpages per question
  - Extracted main content using the Trafilatura tool
  - Also used GPT-4o mini to generate 3 semantically related queries
  - Collected 10 more pages limited to pre-2023 content (for temporal diversity)
  - Total: up to 50 hard negatives per question
- Filtering:
  - Removed duplicates
  - Excluded documents over 10K tokens
- Gold document placement:
  - Randomly inserted among the negatives
  - Used o3-mini and o4-mini to filter out any negatives that might allow the correct answer to be inferred
Final dataset: Over 7.6K documents total
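A minimal sketch of the haystack assembly (dedupe, drop over-long documents, insert the gold document at a random position). Whitespace token counting is a simplification of real tokenization:

```python
# LONGSEAL-style haystack assembly sketch; parameters are illustrative.
import random


def build_haystack(gold, negatives, max_tokens=10_000, seed=0):
    seen, kept = set(), []
    for doc in negatives:
        # Filtering: dedupe and enforce the 10K-token document cap.
        if doc not in seen and len(doc.split()) <= max_tokens:
            seen.add(doc)
            kept.append(doc)
    # Gold placement: random position among the surviving negatives.
    rng = random.Random(seed)
    kept.insert(rng.randrange(len(kept) + 1), gold)
    return kept
```

The remaining step from the paper, filtering out negatives that leak the answer, needs a model-based check (o3-mini/o4-mini in the original) and is omitted here.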
- Development Timeline and Cost
The paper notes this was intentionally kept small due to:
- High cost and complexity of question development
- Team of six NLP researchers working over eight months
- Multiple development cycles
- Each question required over an hour on average (~45 minutes to draft, plus review and revision time)
Many initial ideas were discarded because they failed to meaningfully challenge frontier LLMs. Keeping a benchmark small but hard has precedent: the widely used GPQA-Diamond contains only 198 expert-vetted questions.
- Evaluation Protocol
Auto-rater:
- Adapted GPT-4o mini auto-rater from SIMPLEQA
- Takes question, gold target, reference answer as input
- Labels responses as: “correct”, “incorrect”, or “not attempted”
- Uses relaxed protocol judging whether main answer is factually correct and consistent
- 98% agreement with human ratings (validated on 100 answers by two independent authors)
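The three-way rating interface looks roughly like this. The real grader is a GPT-4o mini judge, so the substring check here is only a stand-in for its relaxed correctness protocol:

```python
# Sketch of the correct / incorrect / not-attempted auto-rater interface.
# The matching logic is a placeholder for an LLM judge, not the real rater.

def auto_rate(question, gold, response):
    if not response.strip() or "i don't know" in response.lower():
        return "not attempted"
    # "Relaxed" stand-in check: main answer present, ignoring case.
    return "correct" if gold.lower() in response.lower() else "incorrect"
```

The actual protocol judges whether the main answer is factually correct and consistent, which plain string matching cannot do; the three-label output contract is the part that matches the paper.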
SealQA Evaluation Findings
- Frontier Models Struggle Significantly
On SEAL-0 (without search):
- GPT-4.1: 0.0%
- o3-mini-medium: 2.7%
- o3-medium: 5.4%
On SEAL-HARD (without search):
- GPT-4.1: 15.0%
- o3-mini: 19.7%
- o3-medium: 34.6%
Even with search access, performance remains low:
- GPT-4.1 (w/ search): 20.5% on SEAL-HARD
- o3-medium (w/ search): 34.6% on SEAL-HARD
Human performance: Graduate-level NLP researchers achieved 23.3% average (30.0% best) on SEAL-HARD questions, showing these are genuinely difficult.
- Test-Time Scaling Doesn’t Reliably Help
A critical finding shown in Figure 1: Simply increasing test-time compute (more reasoning tokens) does not lead to reliable gains on SealQA. Performance often plateaus or even declines early.
- o3-mini: Peaks at 6.3% (low effort), drops to 5.4% (medium) and 4.5% (high)
- o3: Best at 11.7% (low), 17.1% (medium), 14.4% (high) - inconsistent gains
- o4-mini: Similar pattern of diminishing or negative returns
This suggests that more compute alone doesn’t solve the core reasoning challenges when faced with noisy information.
- Advanced Reasoning Models Are Vulnerable to Noise
DeepSeek-R1-671B and o3-mini show dramatic sensitivity to noisy search results:
- DeepSeek-R1-671B: 22.4% → 11.0% (drops 51% with FreshPrompt)
- o3-mini: Better on recent/dynamic questions but struggles with static/older questions
The models fail because:
- They struggle to filter out irrelevant or misleading information
- Noise amplifies errors rather than improving accuracy
- They have difficulty prioritizing conflicting evidence
- Model-Specific Weaknesses by Question Category
Cross-lingual reasoning (Q₄): Models perform poorly on questions requiring reasoning over non-English sources
- GPT-4.1: 14.1% (w/o search) → 20.1% (w/ search)
- o3-mini: 13.0% → 12.0% (search actually hurts)
False-premise detection (Q₅): Debunking incorrect assumptions is extremely difficult
- Most models: 0.0% across the board
- Even with search, models rarely identify and reject false premises
Entity/event disambiguation (Q₂): Distinguishing between similar entities or events
- GPT-4.1: 14.2% → 17.6%
- o3: 17.1% → 48.6% (notable improvement)
Temporal tracking (Q₃): Questions involving recent or rapidly changing information
- Fast-changing questions are particularly challenging
- GPT-4.1: 1.6% (fast) vs 18.0% (slow) vs 21.5% (never-changing)
- Search Integration Can Be Detrimental
Contrary to expectations, search doesn’t always help:
- Built-in search vs FreshPrompt: ChatGPT’s built-in search generally improves performance (+5.5% for GPT-4.1), but FreshPrompt (retrieval-based prompting) can hurt advanced reasoning models
- Noise amplification: When search results are uniformly unhelpful, performance degrades more than when they contain conflicting answers
- Lost-in-the-middle problem persists: In LONGSEAL, models fail to reliably identify and prioritize relevant documents when numerous distractors are present (Figure 6)
- LONGSEAL: Long-Context Reasoning Failures
All models show clear accuracy degradation as the number of hard negatives increases:
- GPT-4.1-mini: 32.7% (k=12) → 29.4% (k=30)
- GPT-4o-mini: 24.0% → 6.3%
- Llama-3.2-11B: 10.2% → 2.0%
Key insight: Simply increasing context size doesn’t guarantee effective context use. Models struggle to filter irrelevant content at scale, even when all documents fit within the context window.
No “lost-in-the-middle” bias: Unlike earlier work, newer models don’t show clear U-shaped positional bias, but still fail to recognize the gold document when distractors are numerous.
- Qualitative Analysis Reveals Reasoning Patterns
Analysis of 100 responses from 6 models showed:
- GPT-4.1 (base): Occasionally includes relevant info but produces inaccurate answers due to outdated knowledge
- GPT-4.1 (FreshPrompt): Better at detecting questions requiring search, more logically coherent, but accuracy depends on retrieval quality
- GPT-4.1 (built-in search): More coherent, higher-quality citations, but still makes occasional errors
- o3: Capable of detailed, informed responses but sometimes overthinks, repeats phrases like “wait”, and lacks structured formatting
- DeepSeek-R1-671B: Tends to overthink without reaching clear conclusions
Frameworks
WebWalker
WebWalker3 addresses the limitations of current Retrieval-Augmented Generation (RAG) systems when dealing with complex, multi-layered information that requires navigating a website’s hierarchy rather than just landing on a single page. The authors introduce the WebWalkerQA benchmark, specifically designed to test an LLM’s ability to perform web traversal; they built the benchmark with GPT-4o and human annotations.
To tackle the benchmark, the authors propose WebWalker, a multi-agent framework that mimics human-like browsing behavior. WebWalker consists of two main agents:
- Explorer Agent (Think then Explore)
- Purpose: Actively navigates through web pages by clicking on HTML buttons/links
- Based on: ReAct framework (Reasoning + Acting)
- Process: At each time step, it:
- Receives an observation from the web environment
- Takes an action (selects a URL to explore)
- Follows a Thought-Action-Observation paradigm
- Explores subpages by interacting with clickable HTML buttons
- The observation includes: page information, clickable HTML buttons with URLs
- Critic Agent (Think then Critique)
- Purpose: Evaluates progress and decides when sufficient information has been gathered
- Key responsibilities:
- Maintains a memory that incrementally accumulates relevant information
- After each explorer action, evaluates whether gathered information is sufficient to answer the query
- Provides an answer once required information is deemed sufficient
- Addresses the challenge of implicit policies and potentially large observation sizes
The framework operates in an explore-critic paradigm:
- Explorer traverses web pages in Thought-Action-Observation cycles
- Critic takes the query and explorer’s current observation/action as input
- Critic updates its memory and evaluates if enough information has been collected
- Process continues until critic determines the query can be answered or maximum steps reached
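The explore-critic paradigm reduces to a short loop. Here `explorer_step` and `critic_judge` stand in for the two LLM agents, and the toy “web” is a dict mapping page to (text, links):

```python
# WebWalker-style explore-critic loop; agent internals are stand-ins.

def webwalk(query, web, start, explorer_step, critic_judge, max_steps=10):
    memory = []                                # critic's accumulated evidence
    page = start
    for _ in range(max_steps):
        text, links = web[page]                # observation
        memory.append(text)                    # critic updates its memory
        answer = critic_judge(query, memory)   # enough info to answer?
        if answer is not None:
            return answer
        page = explorer_step(query, text, links)  # thought -> action
    return None                                # max steps reached
```

The critic’s growing `memory` is what decouples answer sufficiency from the explorer’s step-by-step observations, which can be large and transient.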
WebDancer
WebDancer4 is a systematic framework for building end-to-end autonomous web agents capable of multi-step information seeking, similar to systems like ChatGPT’s Deep Research. The paper addresses the challenge of creating agents that can autonomously navigate complex web environments to answer questions requiring deep information gathering.
WebDancer abstracts the end-to-end web-agent building pipeline into four key stages:
- Browsing Data Construction: Construct diverse and challenging deep-information-seeking QA pairs grounded in the real-world web environment. The authors create two types of synthetic QA datasets:
- CRAWLQA: Crawls web pages from knowledgeable websites (Wikipedia, arXiv, etc.), recursively navigates subpages, and uses GPT-4o to synthesize QA pairs from collected content. This creates diverse, multi-hop questions.
- E2HQA (Easy-to-Hard QA): Starts with simple entity-based questions and iteratively refines them into progressively more complex multi-step questions. The process uses search engines to find information and restructures queries while preserving answer validity.
- Agent Trajectories Rejection Sampling: Sample high-quality trajectories from QA pairs using both LLMs and LRMs to guide the agency learning process;
Uses the ReAct framework where agents alternate between:
- Thought: Reasoning about what to do next
- Action: Using tools (search, visit webpage)
- Observation: Receiving feedback from the environment
Two prompting strategies generate trajectories:
- Short-CoT: Uses instruction-tuned LLMs (like GPT-4o) for concise reasoning
- Long-CoT: Uses LRMs (Large Reasoning Models like QwQ-Plus) for extended multi-step reasoning
The trajectories undergo three-stage filtering:
- Validity control: removes non-compliant responses
- Correctness verification: keeps only correct results (using GPT-4o as judge)
- Quality assessment: filters based on information non-redundancy, goal alignment, and logical reasoning
- Supervised Fine-Tuning (SFT) for Cold Start: Perform fine-tuning to adapt instruction-following formats to agentic tasks and environments. The policy model is trained on filtered trajectories using supervised learning. Key innovation: observation tokens are masked out during loss calculation, focusing learning on autonomous decision-making rather than mimicking external feedback. This teaches the model to chain reasoning and action steps effectively.
- Reinforcement Learning (RL) for Generalization: Apply RL to optimize the agent’s decision-making and generalization capabilities in real-world web environments.
Applies DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) algorithm:
- Decoupled Clip: Separates optimization between model-generated content and full context (including tool responses)
- Dynamic Sampling: Over-samples and filters prompts during training, focusing on high-quality signals and ignoring unreliable QA pairs
Reward Design:
- Format reward: Binary check for correct output format and valid tool calls
- Answer reward: Binary correctness evaluation using LLM-as-judge
- Final reward: R(ŷ,y) = 0.1 × score_format + 0.9 × score_answer
The RL stage uses QA pairs not utilized during SFT, improving data efficiency and policy robustness.
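The stated reward combination is simple enough to write down directly. The boolean inputs stand in for the real format validator and the LLM-as-judge:

```python
# WebDancer's reward combination: R(ŷ, y) = 0.1 * score_format + 0.9 * score_answer.
# The two boolean checks are stand-ins for the actual validators.

def reward(format_ok, answer_correct):
    return 0.1 * float(format_ok) + 0.9 * float(answer_correct)
```

The 0.1/0.9 split keeps a small gradient signal for well-formed tool calls while making answer correctness dominate.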
WebDancer shows strong performance on the GAIA and WebWalkerQA benchmarks, achieving 64.1% Pass@3 on GAIA and 62.0% on WebWalkerQA. It outperforms vanilla ReAct baselines and other open-source agentic frameworks, and demonstrates that SFT for cold start is essential: RL alone achieves only 5% on GAIA. RL enables longer reasoning processes and more sophisticated agentic actions.
ReSum
ReSum5 is a novel paradigm designed to enable LLM-based web agents to perform long-horizon search without being constrained by context window limitations. The paper addresses a fundamental problem: complex web search tasks require extensive exploration cycles, but the ReAct paradigm (which appends every observation, thought, and action to the dialogue history) quickly exhausts the 32k token context limit, forcing premature termination.
As shown in Figure 2 of the paper, failed search cases on challenging benchmarks like BrowseComp-en typically:
- Require 10-20+ tool calls (vs. 10 for successful cases)
- Consume significantly more tokens
- Get truncated before finding solutions due to context limits
The ReSum Method
- Periodic Context Summarization
Instead of accumulating all history, ReSum:
- Starts like ReAct, building conversation history iteratively
- Triggers summarization when approaching context limits (systematic) or when the policy model decides (agent-initiated)
- Invokes a specialized summary tool that:
- Extracts verified evidence from the lengthy interaction history
- Identifies information gaps that still need to be filled
- Produces a compact summary s
- Resets the working history to a compressed state q' = (q, s), where q is the original query and s is the summary
- Continues exploration from this compressed state
This enables indefinite exploration within practical resource budgets.
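A minimal sketch of the periodic compression, assuming a whitespace token count and a `summarize` callable standing in for the specialized summary tool:

```python
# ReSum-style rollout sketch: when the running history nears a token
# budget, replace it with (query, summary) and keep going.

def resum_rollout(query, steps, summarize, budget=100):
    history = [query]
    for step in steps:  # thought/action/observation strings, in order
        history.append(step)
        tokens = sum(len(x.split()) for x in history)  # crude token count
        if tokens > budget:                  # systematic trigger
            # reset to the compressed state q' = (q, s)
            history = [query, summarize(history)]
    return history
```

The agent-initiated variant would let the policy emit the summarization call itself instead of relying on the budget check; the reset-to-`(q, s)` step is the same either way.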
- ReSum-GRPO: Reinforcement Learning for Paradigm Adaptation
The summary-conditioned reasoning pattern (q, s) is out-of-distribution for standard agents. To adapt agents to this paradigm, the authors developed ReSum-GRPO:
Trajectory Segmentation:
- Long trajectories naturally segment at summarization points
- A trajectory with K summarizations creates K+1 segments
- Each segment becomes an individual training episode
Reward Computation: Uses trajectory-level reward based on answer correctness (evaluated by LLM-as-judge with Format Penalty): Extracts the final answer $a_T$ and computes a binary reward: $R(a, a_T) \in \{0, 1\}$. This gives a single trajectory-level reward for the entire rollout.
Group Normalization (Advantage Computation): The reward is normalized within the group, and the resulting advantage is broadcast to all segments of the same trajectory; every segment receives the same advantage value, determined by whether the trajectory’s final answer was correct.
To fit the standard GRPO objective, they weight each rollout’s advantage and apply clipping with the probability ratio for segment i in rollout g. I personally doubt this is mathematically correct, as they clip each segment independently, which may not align with the trajectory-level reward structure, but it seems to work empirically.
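The broadcast step can be sketched as follows, assuming plain mean/std normalization within the group (as in GRPO):

```python
# ReSum-GRPO advantage broadcast sketch: one binary reward per trajectory,
# normalized in the group, then copied to every segment of that trajectory.

def broadcast_advantages(rewards, segments_per_traj):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    advantages = [(r - mean) / std for r in rewards]
    # every segment inherits its trajectory's advantage
    return [[a] * k for a, k in zip(advantages, segments_per_traj)]
```

So a trajectory with K summarizations yields K+1 training episodes that all share one trajectory-level advantage.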
Key Results
- Training-free ReSum consistently outperforms ReAct by 4.5% average across benchmarks (Table 1)
- ReSum-GRPO training delivers further 8.2% improvement over training-free ReSum
- Works with extensive context models - Even for Tongyi-DeepResearch-30B-A3B (128k context), ReSum provides benefits by enabling more effective exploration
- Training efficiency: With only 1K training samples, ReSum-GRPO achieves substantial improvements (e.g., WebSailor-30B goes from 8.2% to 20.5% on BrowseComp-zh).
WebWeaver
WebWeaver6 is a dual-agent framework designed to tackle Open-Ended Deep Research (OEDR) - the complex challenge of synthesizing vast web-scale information into comprehensive, well-structured research reports. It achieves state-of-the-art performance on major OEDR benchmarks.
Current deep research systems suffer from two critical limitations:
- Static pipelines that decouple planning from evidence acquisition - they either generate outlines first without evidence, or search first without guidance
- Monolithic generation that dumps all collected evidence into the LLM context, leading to:
- “Lost in the middle” problem (important details overlooked)
- Citation hallucinations
- Context overflow
- Poor coherence
WebWeaver mimics how human researchers actually work through two specialized agents:
- The Planner - Dynamic Research Cycle
- Operates in an iterative loop that interleaves evidence acquisition with outline optimization
- Continuously refines the outline based on discoveries
- Each outline section is explicitly linked via citations to a curated memory bank of source evidence
- Actions: search, write_outline, terminate
- The Writer - Hierarchical Synthesis
- Executes section-by-section writing rather than generating the entire report at once
- For each section, performs targeted retrieval of only relevant evidence from the memory bank using citations in the outline
- This creates a “think → retrieve → write” cycle for each section
- Actions: retrieve, write, terminate
Dynamic Outline Optimization: Unlike static approaches, the planner continuously evolves the outline through multiple rounds (average 2-3 rounds):
- Expands sections based on new findings
- Adds citations to ground each section: citations anchor the outline in actual evidence during planning and enable precise retrieval during writing.
- Restructures to better reflect comprehensive understanding
- The outline becomes a strategic blueprint that guides subsequent searches
Memory-Grounded Synthesis: Instead of overwhelming the LLM with 100+ web pages (100k+ tokens):
- Only summaries and evidence snippets are stored in the memory bank
- The writer retrieves only the specific evidence needed for each section via citations
- Dramatically reduces context size while maintaining accuracy
- Prevents “contextual bleeding” between sections
Process Flow:
- Planner searches web → filters URLs → extracts evidence → updates outline
- Cycle repeats until outline is comprehensive
- Writer takes structured outline + memory bank
- For each section: retrieve relevant evidence → synthesize → write
- Final report is coherent, well-cited, and comprehensive
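The writer’s memory-grounded, section-by-section synthesis can be sketched like this; the outline and memory-bank structures are invented for illustration, and `synthesize` stands in for the writer model:

```python
# WebWeaver-style writer sketch: each section retrieves only the evidence
# its outline citations point to, never the full corpus.

def write_report(outline, memory_bank, synthesize):
    """outline: list of (section_title, [citation_ids])."""
    sections = []
    for title, cites in outline:
        evidence = [memory_bank[c] for c in cites]  # targeted retrieval
        sections.append(f"## {title}\n{synthesize(title, evidence)}")
    return "\n\n".join(sections)
```

Because each `synthesize` call sees only its own section’s evidence, the context stays small and sections cannot “bleed” into each other.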
Training Data: WebWeaver-3k
The team created a high-quality supervised fine-tuning (SFT) dataset of 3.3k trajectories demonstrating the full planning and writing process. Fine-tuning on this data enabled smaller models (Qwen3-30b) to achieve expert-level performance previously only seen in large proprietary systems.
By implementing this as a dynamic feedback loop with structured information management, WebWeaver produces reports that are comprehensive, accurate, well-structured, and properly cited. WebWeaver achieves state-of-the-art results across three major benchmarks:
- DeepResearch Bench: Overall score 50.58; FACT (citation accuracy) 93.37%; outperforms proprietary systems.
- DeepConsult: Win rate: 66.86%
- DeepResearchGym: Average score 96.77, Top score across all dimensions
Pham, Thinh, et al. “SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models.” arXiv:2506.01062, arXiv, 11 June 2025. arXiv.org, https://doi.org/10.48550/arXiv.2506.01062. SealQA1 is a benchmark designed to evaluate Search-Augmented Language Models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. Unlike existing QA benchmarks that assume clean, straightforward information retrieval, SealQA tests models’ ability to reason deeply when faced with real-world search complexity. ↩︎
Wei, Jason, et al. “BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents.” arXiv:2504.12516, arXiv, 16 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.12516. ↩︎
Wu, Jialong, et al. “WebWalker: Benchmarking LLMs in Web Traversal.” arXiv:2501.07572, arXiv, 10 Aug. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.07572. ↩︎
Wu, Jialong, et al. “WebDancer: Towards Autonomous Information Seeking Agency.” arXiv:2505.22648, arXiv, 10 Aug. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2505.22648. ↩︎
Wu, Xixi, et al. “ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization.” arXiv:2509.13313, arXiv, 15 Oct. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2509.13313. ↩︎
Li, Zijian, et al. “WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research.” arXiv:2509.13312, arXiv, 7 Oct. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2509.13312. ↩︎