Modern AI companies provide “Deep Search” and “Deep Research” as a service. For example, ChatGPT offers Thinking and Deep Research. Google’s Gemini has a thinking mode and a Deep Research tool (which by default uses a “Flash Thinking” model). Grok supports Expert mode (think hard) and Heavy mode (a team of experts). DeepSeek provides DeepThink mode. Kimi offers an “Agent” option, which provides various agentic capabilities such as office-pilot, web search, agent swarm, and Deep Research.
Different companies use different terms to describe similar capabilities; the underlying technology is new and rapidly evolving.
These systems are designed to autonomously navigate the web, extract relevant data, and synthesize it into comprehensive reports or answers.
Core Technologies Behind Agentic Search and Deep Research
While different companies (OpenAI, Google, Perplexity) have their own versions of Agentic Search and Deep Research, the underlying “Tech Stack” generally consists of three pillars: Reasoning Models, Agentic Loops, and Iterative Tool Use.
1. Reasoning Models
Traditional models use “next-token prediction” to give you an answer instantly. Deep Research uses Reasoning Models (like OpenAI’s o3 or Gemini 2.0/3 Flash Thinking) that use Reinforcement Learning (RL) to “think before they speak.”
- Chain-of-Thought (CoT): The model generates an internal hidden monologue to plan its steps.
- RL Training: These models are trained specifically on “browsing trajectories”—rewarding the AI when it successfully navigates a complex website to find a specific data point.
2. Agentic Loop
Unlike a standard search that happens in one shot, Deep Research operates in a recursive loop.
- Decomposition: It breaks your prompt (e.g., “Compare the EV market in 2026”) into sub-questions: “Current market share,” “Battery tech breakthroughs,” and “New regulatory hurdles.”
- The “Scout” and “Explorer”: It spawns sub-agents. A “Scout” generates the search queries, while the “Explorer” actually visits the URLs.
- Backtracking: If the agent hits a dead end (like a 404 error or a paywall), the reasoning model “realizes” it failed and tries a different path—much like a human researcher would.
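The loop above can be sketched in a few lines of Python. Everything here (the `decompose` stub, the “every `.pdf` is paywalled” rule) is an illustrative stand-in for what the reasoning model and its browser tool would actually do, not a real implementation:

```python
# Minimal sketch of the decompose -> scout -> explore -> backtrack loop.
# All names (decompose, fetch, DEAD_END) are hypothetical stand-ins.

DEAD_END = None  # stand-in for a 404 or a paywall


def decompose(prompt):
    # A real system would have the reasoning model emit these sub-questions.
    return [f"{prompt} :: market share",
            f"{prompt} :: battery tech",
            f"{prompt} :: regulation"]


def fetch(url):
    # Stub explorer: pretend every ".pdf" URL is paywalled.
    return DEAD_END if url.endswith(".pdf") else f"content of {url}"


def research(prompt, candidate_urls):
    findings = {}
    for sub_q in decompose(prompt):                # decomposition
        for url in candidate_urls.get(sub_q, []):  # scout-proposed URLs
            page = fetch(url)                      # explorer visits
            if page is not DEAD_END:
                findings[sub_q] = page
                break                              # this sub-question is done
            # else: backtrack and try the next candidate URL
    return findings
```

The inner `for`/`break` pair is the whole backtracking story in miniature: a failed fetch simply falls through to the next candidate path.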
3. The Toolset
To be truly “agentic,” the technology integrates several specialized tools and protocols:
- Search Tools: Integrated web browsers that can perform complex searches, click through links, and even interact with dynamic content (like filling out forms or navigating dropdowns).
- Generative Tools: Apply LLMs’ generative abilities to a range of tasks, e.g. summarization tools that condense long articles and question-answering tools that extract specific data points from dense text.
- Python Sandbox: If it finds a raw data table, it writes and executes Python code to calculate growth rates or generate charts.
- Multimodal Vision: It doesn’t just read text; it “looks” at screenshots of charts and diagrams to extract data that isn’t in the page copy.
- MCP (Model Context Protocol): A standard that allows the agent to securely connect to external databases, Google Drive, or Slack to pull in non-public information.
- Skills: Pre-built “skills” for specific tasks, like “Financial Analysis,” “Scientific Literature Review,” or “Competitive Intelligence,” which are essentially pre-configured agentic workflows optimized for those domains.
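A toy dispatcher shows how such a toolset plugs together. The tool names and the trivial `run_python`/`summarize` bodies are invented for illustration, not any vendor’s actual API:

```python
# Hypothetical tool registry an agent might dispatch to.

def run_python(code):
    """'Sandbox' stand-in: evaluate one expression with no builtins."""
    return repr(eval(code, {"__builtins__": {}}, {}))


def summarize(text, max_words=10):
    """Generative-tool stand-in: truncate instead of calling an LLM."""
    words = text.split()
    return " ".join(words[:max_words]) + ("…" if len(words) > max_words else "")


TOOLS = {"python": run_python, "summarize": summarize}


def dispatch(tool, payload):
    # In a real agent, the reasoning model chooses `tool` and `payload`;
    # here we only route the call.
    return TOOLS[tool](payload)
```

A real sandbox would of course run in an isolated process, and a real summarizer would be another model call; the registry-plus-dispatch shape is the part that carries over.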
Key Papers on Agentic Search and Deep Research
Benchmarking
BrowseComp
BrowseComp2 introduces a new benchmark designed to evaluate the browsing capabilities of AI agents. The paper highlights the limitations of current models in navigating complex web environments and proposes a comprehensive set of tasks that require multi-step reasoning, dynamic interaction with web elements, and the ability to synthesize information from multiple sources. The benchmark serves as a critical tool for advancing the development of more sophisticated browsing agents capable of performing real-world research tasks.
The authors draw an analogy to programming competitions: just as CodeForces serves as an incomplete but useful benchmark for coding ability (despite not capturing all aspects of software engineering), BrowseComp aims to test a core capability of browsing agents—persistent, creative information finding—even if it doesn’t capture the full complexity of real user queries.
How BrowseComp Was Created
The dataset was built entirely by human trainers (not synthetic generation). Trainers were instructed to create questions that met strict criteria:
- Challenging: Questions must not be solvable by existing models (GPT-4o, o1, early Deep Research) or by humans within 10 minutes
- Verifiable: Answers must be short, single strings that are easy to verify once found
- Hard to find, easy to verify: Trainers started with a “seed” fact, then created inverted questions with multiple constraints that produce large search spaces. For example: “What’s the title of the scientific paper published in the EMNLP conference between 2018-2023 where the first author did their undergrad at Dartmouth College and the fourth author did their undergrad at University of Pennsylvania?” Solving this requires checking thousands of papers and author backgrounds; brute force is impractical, but verification is simple.
Rather than starting with a question and finding its answer, trainers started with an interesting seed (person, event, artifact), and identified multiple specific characteristics with large search spaces, then combined these into a question where the answer is obscure but verifiable.
Trainers verified the answer wasn’t on the first page of results for 5 simple searches. A second trainer attempted to solve questions; creators of questions solved >40% of the time were asked to revise. Trainers added more constraints if they weren’t confident the answer was unique.
As a result, 1,266 questions across diverse topics (TV/movies, science/tech, art, history, sports, music, etc.) were created.
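The “hard to find, easy to verify” asymmetry is easy to see in code. This sketch uses an invented two-paper corpus and invented field names; the point is that `verify` is a cheap constraint check, while *finding* the answer means scanning the whole corpus:

```python
# Toy version of the EMNLP example: data and field names are invented.

papers = [
    {"title": "Paper A", "venue": "EMNLP", "year": 2021,
     "author_undergrad": ["Dartmouth College", "MIT", "CMU",
                          "University of Pennsylvania"]},
    {"title": "Paper B", "venue": "ACL", "year": 2020,
     "author_undergrad": ["Dartmouth College", "University of Pennsylvania"]},
]


def verify(paper):
    """Checking one candidate against every constraint is trivial."""
    a = paper["author_undergrad"]
    return (paper["venue"] == "EMNLP"
            and 2018 <= paper["year"] <= 2023
            and len(a) >= 4
            and a[0] == "Dartmouth College"
            and a[3] == "University of Pennsylvania")


# ...while finding the answer requires exhausting the search space:
answers = [p["title"] for p in papers if verify(p)]
```

With a real corpus of hundreds of thousands of papers, the scan is what makes the question hard; the `verify` call stays O(1).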
SealQA
SealQA comes in three variants:
- SEAL-0 (111 questions): The core “stress test” where even frontier models like GPT-4.1 with browsing consistently fail. Questions are iteratively refined until multiple models fail across 10-15 attempts, achieving 0% accuracy.
- SEAL-HARD (254 questions): A broader set including SEAL-0 plus additional difficult questions that didn’t meet the strict failure threshold but remain highly challenging.
- LONGSEAL (254 questions): A “needle-in-a-haystack” variant testing long-context, multi-document reasoning. Each question is paired with 1 gold document and up to 50 hard negative distractors (over 7.6K documents total).
Question characteristics:
- Single, unambiguous answers (verifiable from authoritative sources)
- Require deep reasoning: distinguishing similar entities, tracking temporal changes, interpreting charts/tables, counting, reasoning over non-English content, debunking false premises
- Designed to trigger ambiguous, conflicting, or noisy search results
- Span diverse domains: science (26.8%), sports (22.0%), entertainment (21.7%), politics (9.1%), history, geography, etc.
- Include temporal freshness categories: 31.1% never-changing, 43.7% slow-changing, 25.2% fast-changing
Quality control:
- Rigorous multi-round vetting by graduate-level NLP researchers
- Expert reviewer approval
- Each question annotated with supporting URLs and expected review dates
- Questions refined iteratively until they consistently cause model failures
Data Collection Process
- Human Annotators
- Recruited NLP researchers (including the 6 authors and their colleagues) as human annotators
- Shown a small, diverse set of example questions to illustrate the types of questions to collect
- Question Design Criteria
Annotators were instructed to write questions with:
Answer requirements:
- Single, unambiguous answer (e.g., specifying “on what date” rather than just asking “when”)
- Must be verifiable - supported by one or more webpages that justify the reference answer
- For questions involving fresh knowledge, annotators required to cite regularly updated sources to support future maintenance
Difficulty requirements:
- Questions designed to trigger ambiguous, conflicting, or misleading search results when entered into a search engine like Google
- Must appear natural; the difficulty should come from the underlying information landscape, not from artificially convoluted phrasing
Documentation:
- Annotators provided an explanation for each answer, including necessary clarification or subtle reasoning
- Each question was refined iteratively until it consistently caused multiple models to fail across repeated attempts
- Iterative Refinement Process
A rigorous failure-driven curation:
For SEAL-0 specifically:
- Each question tested against GPT-4o, GPT-4.1, and their mini variants (with and without browsing capabilities)
- Also tested against OpenAI’s o1, o3, and Meta’s Llama models
- Questions retained only if all models failed across 10-15 attempts
- This achieves 0% accuracy threshold, hence “0” in the name
- The paper notes: “This follows current best practices for building challenging datasets” (referencing GPQA-Diamond and SimpleQA)
For SEAL-HARD:
- Includes SEAL-0 questions plus additional difficult questions
- These didn’t meet the strict failure threshold but remain highly challenging
- Quality Control - Multi-Round Vetting
A rigorous review process:
- Initial review: Two or more graduate-level reviewers first reviewed each question
- Expert approval: Followed by approval from expert reviewers
- Multiple rounds of data cleaning:
  - Verification of supporting URLs
  - Answer correctness checks
  - Question clarity assessment
- Exclusions:
  - Questions whose answers change too frequently
  - Questions with unclear or ambiguous phrasing
Metadata annotation:
- Effective year (when the answer last changed)
- Expected next review date for future maintenance
- Dataset Statistics and Diversity
Question types (5 categories):
- Q₁ (72.4%): Advanced reasoning - multi-hop reasoning, interpreting charts/tables, counting, calculations
- Q₂ (58.3%): Entity/event disambiguation - distinguishing between similar entities or events
- Q₃ (13.7%): Temporal tracking - identifying and differentiating instances of entities over time
- Q₄ (5.5%): Cross-lingual reasoning - questions in English requiring reasoning over non-English sources
- Q₅ (5.7%): False-premise questions - debunking false premises
Domain diversity:
- Science and technology: 26.8%
- Sports: 22.0%
- Entertainment: 21.7%
- Politics: 9.1%
- History, geography, and others: 12.2%
Temporal freshness classification:
- 31.1% never-changing (answers never change)
- 43.7% slow-changing (answers typically change within a year)
- 25.2% fast-changing (answers change rapidly, e.g., within weeks)
Question length:
- Average: 31 tokens
- Maximum: 69 tokens
Topic labels:
- Assigned post-hoc using GPT-4o mini
- LONGSEAL Creation
For the long-context variant:
- Gold document selection: One helpful document from annotator-provided webpages
- Hard negative collection:
  - Used Google search to retrieve the top 10 webpages per question
  - Extracted main content using the Trafilatura tool
  - Also used GPT-4o mini to generate 3 semantically related queries
  - Collected 10 more pages limited to pre-2023 content (for temporal diversity)
  - Total: up to 50 hard negatives per question
- Filtering:
  - Removed duplicates
  - Excluded documents over 10K tokens
- Gold document placement:
  - Randomly inserted among the negatives
  - Used o3-mini and o4-mini to filter out any negatives that might allow the correct answer to be inferred
Final dataset: Over 7.6K documents total
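A minimal sketch of the haystack assembly (dedupe, drop over-long documents, insert the gold document at a random position). Whitespace token counting is a simplification of real tokenization:

```python
# LONGSEAL-style haystack assembly sketch; parameters are illustrative.
import random


def build_haystack(gold, negatives, max_tokens=10_000, seed=0):
    seen, kept = set(), []
    for doc in negatives:
        # Filtering: dedupe and enforce the 10K-token document cap.
        if doc not in seen and len(doc.split()) <= max_tokens:
            seen.add(doc)
            kept.append(doc)
    # Gold placement: random position among the surviving negatives.
    rng = random.Random(seed)
    kept.insert(rng.randrange(len(kept) + 1), gold)
    return kept
```

The remaining step from the paper, filtering out negatives that leak the answer, needs a model-based check (o3-mini/o4-mini in the original) and is omitted here.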
- Development Timeline and Cost
The paper notes this was intentionally kept small due to:
- High cost and complexity of question development
- Team of six NLP researchers working over eight months
- Multiple development cycles
- Each question required over an hour on average (~45 minutes to draft, plus review and revision time)
Many initial ideas were discarded because they failed to meaningfully challenge frontier LLMs. Keeping a benchmark small but hard has precedent: the widely used GPQA-Diamond contains only 198 expert-vetted questions.
- Evaluation Protocol
Auto-rater:
- Adapted GPT-4o mini auto-rater from SIMPLEQA
- Takes question, gold target, reference answer as input
- Labels responses as: “correct”, “incorrect”, or “not attempted”
- Uses relaxed protocol judging whether main answer is factually correct and consistent
- 98% agreement with human ratings (validated on 100 answers by two independent authors)
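The three-way rating interface looks roughly like this. The real grader is a GPT-4o mini judge, so the substring check here is only a stand-in for its relaxed correctness protocol:

```python
# Sketch of the correct / incorrect / not-attempted auto-rater interface.
# The matching logic is a placeholder for an LLM judge, not the real rater.

def auto_rate(question, gold, response):
    if not response.strip() or "i don't know" in response.lower():
        return "not attempted"
    # "Relaxed" stand-in check: main answer present, ignoring case.
    return "correct" if gold.lower() in response.lower() else "incorrect"
```

The actual protocol judges whether the main answer is factually correct and consistent, which plain string matching cannot do; the three-label output contract is the part that matches the paper.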
SealQA Evaluation Findings
- Frontier Models Struggle Significantly
On SEAL-0 (without search):
- GPT-4.1: 0.0%
- o3-mini-medium: 2.7%
- o3-medium: 5.4%
On SEAL-HARD (without search):
- GPT-4.1: 15.0%
- o3-mini: 19.7%
- o3-medium: 34.6%
Even with search access, performance remains low:
- GPT-4.1 (w/ search): 20.5% on SEAL-HARD
- o3-medium (w/ search): 34.6% on SEAL-HARD
Human performance: Graduate-level NLP researchers achieved 23.3% average (30.0% best) on SEAL-HARD questions, showing these are genuinely difficult.
- Test-Time Scaling Doesn’t Reliably Help
A critical finding shown in Figure 1: Simply increasing test-time compute (more reasoning tokens) does not lead to reliable gains on SealQA. Performance often plateaus or even declines early.
- o3-mini: Peaks at 6.3% (low effort), drops to 5.4% (medium) and 4.5% (high)
- o3: Best at 11.7% (low), 17.1% (medium), 14.4% (high) - inconsistent gains
- o4-mini: Similar pattern of diminishing or negative returns
This suggests that more compute alone doesn’t solve the core reasoning challenges when faced with noisy information.
- Advanced Reasoning Models Are Vulnerable to Noise
DeepSeek-R1-671B and o3-mini show dramatic sensitivity to noisy search results:
- DeepSeek-R1-671B: 22.4% → 11.0% (drops 51% with FreshPrompt)
- o3-mini: Better on recent/dynamic questions but struggles with static/older questions
The models fail because:
- They struggle to filter out irrelevant or misleading information
- Noise amplifies errors rather than improving accuracy
- They have difficulty prioritizing conflicting evidence
- Model-Specific Weaknesses by Question Category
Cross-lingual reasoning (Q₄): Models perform poorly on questions requiring reasoning over non-English sources
- GPT-4.1: 14.1% (w/o search) → 20.1% (w/ search)
- o3-mini: 13.0% → 12.0% (search actually hurts)
False-premise detection (Q₅): Debunking incorrect assumptions is extremely difficult
- Most models: 0.0% across the board
- Even with search, models rarely identify and reject false premises
Entity/event disambiguation (Q₂): Distinguishing between similar entities or events
- GPT-4.1: 14.2% → 17.6%
- o3: 17.1% → 48.6% (notable improvement)
Temporal tracking (Q₃): Questions involving recent or rapidly changing information
- Fast-changing questions are particularly challenging
- GPT-4.1: 1.6% (fast) vs 18.0% (slow) vs 21.5% (never-changing)
- Search Integration Can Be Detrimental
Contrary to expectations, search doesn’t always help:
- Built-in search vs FreshPrompt: ChatGPT’s built-in search generally improves performance (+5.5% for GPT-4.1), but FreshPrompt (retrieval-based prompting) can hurt advanced reasoning models
- Noise amplification: When search results are uniformly unhelpful, performance degrades more than when they contain conflicting answers
- Lost-in-the-middle problem persists: In LONGSEAL, models fail to reliably identify and prioritize relevant documents when numerous distractors are present (Figure 6)
- LONGSEAL: Long-Context Reasoning Failures
All models show clear accuracy degradation as the number of hard negatives increases:
- GPT-4.1-mini: 32.7% (k=12) → 29.4% (k=30)
- GPT-4o-mini: 24.0% → 6.3%
- Llama-3.2-11B: 10.2% → 2.0%
Key insight: Simply increasing context size doesn’t guarantee effective context use. Models struggle to filter irrelevant content at scale, even when all documents fit within the context window.
No “lost-in-the-middle” bias: Unlike earlier work, newer models don’t show clear U-shaped positional bias, but still fail to recognize the gold document when distractors are numerous.
- Qualitative Analysis Reveals Reasoning Patterns
Analysis of 100 responses from 6 models showed:
- GPT-4.1 (base): Occasionally includes relevant info but produces inaccurate answers due to outdated knowledge
- GPT-4.1 (FreshPrompt): Better at detecting questions requiring search, more logically coherent, but accuracy depends on retrieval quality
- GPT-4.1 (built-in search): More coherent, higher-quality citations, but still makes occasional errors
- o3: Capable of detailed, informed responses but sometimes overthinks, repeats phrases like “wait”, and lacks structured formatting
- DeepSeek-R1-671B: Tends to overthink without reaching clear conclusions
Frameworks
WebWalker
WebWalker3 addresses the limitations of current Retrieval-Augmented Generation (RAG) systems when dealing with complex, multi-layered information that requires navigating a website’s hierarchy rather than just landing on a single page. The authors introduce the WebWalkerQA benchmark, specifically designed to test an LLM’s ability to perform web traversal; they built the benchmark with GPT-4o and human annotations.
To tackle the benchmark, the authors propose WebWalker, a multi-agent framework that mimics human-like browsing behavior. WebWalker consists of two main agents:
- Explorer Agent (Think then Explore)
- Purpose: Actively navigates through web pages by clicking on HTML buttons/links
- Based on: ReAct framework (Reasoning + Acting)
- Process: At each time step, it:
- Receives an observation from the web environment
- Takes an action (selects a URL to explore)
- Follows a Thought-Action-Observation paradigm
- Explores subpages by interacting with clickable HTML buttons
- The observation includes: page information, clickable HTML buttons with URLs
- Critic Agent (Think then Critique)
- Purpose: Evaluates progress and decides when sufficient information has been gathered
- Key responsibilities:
- Maintains a memory that incrementally accumulates relevant information
- After each explorer action, evaluates whether gathered information is sufficient to answer the query
- Provides an answer once required information is deemed sufficient
- Addresses the challenge of implicit policies and potentially large observation sizes
The framework operates in an explore-critic paradigm:
- Explorer traverses web pages in Thought-Action-Observation cycles
- Critic takes the query and explorer’s current observation/action as input
- Critic updates its memory and evaluates if enough information has been collected
- Process continues until critic determines the query can be answered or maximum steps reached
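The explore-critic paradigm reduces to a short loop. Here `explorer_step` and `critic_judge` stand in for the two LLM agents, and the toy “web” is a dict mapping page to (text, links):

```python
# WebWalker-style explore-critic loop; agent internals are stand-ins.

def webwalk(query, web, start, explorer_step, critic_judge, max_steps=10):
    memory = []                                # critic's accumulated evidence
    page = start
    for _ in range(max_steps):
        text, links = web[page]                # observation
        memory.append(text)                    # critic updates its memory
        answer = critic_judge(query, memory)   # enough info to answer?
        if answer is not None:
            return answer
        page = explorer_step(query, text, links)  # thought -> action
    return None                                # max steps reached
```

The critic’s growing `memory` is what decouples answer sufficiency from the explorer’s step-by-step observations, which can be large and transient.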
WebDancer
WebDancer4 is a systematic framework for building end-to-end autonomous web agents capable of multi-step information seeking, similar to systems like ChatGPT’s Deep Research. The paper addresses the challenge of creating agents that can autonomously navigate complex web environments to answer questions requiring deep information gathering.
WebDancer abstracts the end-to-end web-agent building pipeline into four key stages:
- Browsing Data Construction: Construct diverse and challenging deep-information-seeking QA pairs grounded in the real-world web environment. The authors create two types of synthetic QA datasets:
- CRAWLQA: Crawls web pages from knowledgeable websites (Wikipedia, arXiv, etc.), recursively navigates subpages, and uses GPT-4o to synthesize QA pairs from collected content. This creates diverse, multi-hop questions.
- E2HQA (Easy-to-Hard QA): Starts with simple entity-based questions and iteratively refines them into progressively more complex multi-step questions. The process uses search engines to find information and restructures queries while preserving answer validity.
- Agent Trajectories Rejection Sampling: Sample high-quality trajectories from QA pairs using both LLMs and LRMs to guide the agency learning process;
Uses the ReAct framework where agents alternate between:
- Thought: Reasoning about what to do next
- Action: Using tools (search, visit webpage)
- Observation: Receiving feedback from the environment
Two prompting strategies generate trajectories:
- Short-CoT: Uses instruction-tuned LLMs (like GPT-4o) for concise reasoning
- Long-CoT: Uses LRMs (Large Reasoning Models like QwQ-Plus) for extended multi-step reasoning
The trajectories undergo three-stage filtering:
- Validity control: removes non-compliant responses
- Correctness verification: keeps only correct results (using GPT-4o as judge)
- Quality assessment: filters based on information non-redundancy, goal alignment, and logical reasoning
- Supervised Fine-Tuning (SFT) for Cold Start: Perform fine-tuning to adapt instruction-following formats to agentic tasks and environments. The policy model is trained on filtered trajectories using supervised learning. Key innovation: observation tokens are masked out during loss calculation, focusing learning on autonomous decision-making rather than mimicking external feedback. This teaches the model to chain reasoning and action steps effectively.
- Reinforcement Learning (RL) for Generalization: Apply RL to optimize the agent’s decision-making and generalization capabilities in real-world web environments.
Applies DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) algorithm:
- Decoupled Clip: Separates optimization between model-generated content and full context (including tool responses)
- Dynamic Sampling: Over-samples and filters prompts during training, focusing on high-quality signals and ignoring unreliable QA pairs
Reward Design:
- Format reward: Binary check for correct output format and valid tool calls
- Answer reward: Binary correctness evaluation using LLM-as-judge
- Final reward: R(ŷ,y) = 0.1 × score_format + 0.9 × score_answer
The RL stage uses QA pairs not utilized during SFT, improving data efficiency and policy robustness.
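The stated reward combination is simple enough to write down directly. The boolean inputs stand in for the real format validator and the LLM-as-judge:

```python
# WebDancer's reward combination: R(ŷ, y) = 0.1 * score_format + 0.9 * score_answer.
# The two boolean checks are stand-ins for the actual validators.

def reward(format_ok, answer_correct):
    return 0.1 * float(format_ok) + 0.9 * float(answer_correct)
```

The 0.1/0.9 split keeps a small gradient signal for well-formed tool calls while making answer correctness dominate.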
WebDancer shows strong performance on the GAIA and WebWalkerQA benchmarks, achieving 64.1% Pass@3 on GAIA and 62.0% on WebWalkerQA. It outperforms vanilla ReAct baselines and other open-source agentic frameworks, and demonstrates that SFT for cold start is essential: RL alone achieves only 5% on GAIA. RL enables longer reasoning processes and more sophisticated agentic actions.
ReSum
ReSum5 is a novel paradigm designed to enable LLM-based web agents to perform long-horizon search without being constrained by context window limitations. The paper addresses a fundamental problem: complex web search tasks require extensive exploration cycles, but the ReAct paradigm (which appends every observation, thought, and action to the dialogue history) quickly exhausts the 32k token context limit, forcing premature termination.
As shown in Figure 2 of the paper, failed search cases on challenging benchmarks like BrowseComp-en typically:
- Require 10-20+ tool calls (vs. 10 for successful cases)
- Consume significantly more tokens
- Get truncated before finding solutions due to context limits
The ReSum Method
- Periodic Context Summarization
Instead of accumulating all history, ReSum:
- Starts like ReAct, building conversation history iteratively
- Triggers summarization when approaching context limits (systematic) or when the policy model decides (agent-initiated)
- Invokes a specialized summary tool that:
- Extracts verified evidence from the lengthy interaction history
- Identifies information gaps that still need to be filled
- Produces a compact summary s
- Resets the working history to a compressed state q' = (q, s), where q is the original query and s is the summary
- Continues exploration from this compressed state
This enables indefinite exploration within practical resource budgets.
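A minimal sketch of the periodic compression, assuming a whitespace token count and a `summarize` callable standing in for the specialized summary tool:

```python
# ReSum-style rollout sketch: when the running history nears a token
# budget, replace it with (query, summary) and keep going.

def resum_rollout(query, steps, summarize, budget=100):
    history = [query]
    for step in steps:  # thought/action/observation strings, in order
        history.append(step)
        tokens = sum(len(x.split()) for x in history)  # crude token count
        if tokens > budget:                  # systematic trigger
            # reset to the compressed state q' = (q, s)
            history = [query, summarize(history)]
    return history
```

The agent-initiated variant would let the policy emit the summarization call itself instead of relying on the budget check; the reset-to-`(q, s)` step is the same either way.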
- ReSum-GRPO: Reinforcement Learning for Paradigm Adaptation
The summary-conditioned reasoning pattern (q, s) is out-of-distribution for standard agents. To adapt agents to this paradigm, the authors developed ReSum-GRPO:
Trajectory Segmentation:
- Long trajectories naturally segment at summarization points
- A trajectory with K summarizations creates K+1 segments
- Each segment becomes an individual training episode
Reward Computation: Uses trajectory-level reward based on answer correctness (evaluated by LLM-as-judge with Format Penalty): Extracts the final answer $a_T$ and computes a binary reward: $R(a, a_T) \in \{0, 1\}$. This gives a single trajectory-level reward for the entire rollout.
Group Normalization (Advantage Computation): The reward is normalized within the group, and the resulting advantage is broadcast to all segments of the same trajectory; every segment receives the same advantage value, determined by whether the trajectory’s final answer was correct.
To fit the standard GRPO objective, they weight each rollout’s advantage and apply clipping with the probability ratio for segment i in rollout g. I personally doubt this is mathematically correct, as they clip each segment independently, which may not align with the trajectory-level reward structure, but it seems to work empirically.
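The broadcast step can be sketched as follows, assuming plain mean/std normalization within the group (as in GRPO):

```python
# ReSum-GRPO advantage broadcast sketch: one binary reward per trajectory,
# normalized in the group, then copied to every segment of that trajectory.

def broadcast_advantages(rewards, segments_per_traj):
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    advantages = [(r - mean) / std for r in rewards]
    # every segment inherits its trajectory's advantage
    return [[a] * k for a, k in zip(advantages, segments_per_traj)]
```

So a trajectory with K summarizations yields K+1 training episodes that all share one trajectory-level advantage.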
Key Results
- Training-free ReSum consistently outperforms ReAct by 4.5% average across benchmarks (Table 1)
- ReSum-GRPO training delivers further 8.2% improvement over training-free ReSum
- Works with extensive context models - Even for Tongyi-DeepResearch-30B-A3B (128k context), ReSum provides benefits by enabling more effective exploration
- Training efficiency: With only 1K training samples, ReSum-GRPO achieves substantial improvements (e.g., WebSailor-30B goes from 8.2% to 20.5% on BrowseComp-zh).
WebWeaver
WebWeaver6 is a dual-agent framework designed to tackle Open-Ended Deep Research (OEDR) - the complex challenge of synthesizing vast web-scale information into comprehensive, well-structured research reports. It achieves state-of-the-art performance on major OEDR benchmarks.
Current deep research systems suffer from two critical limitations:
- Static pipelines that decouple planning from evidence acquisition - they either generate outlines first without evidence, or search first without guidance
- Monolithic generation that dumps all collected evidence into the LLM context, leading to:
- “Lost in the middle” problem (important details overlooked)
- Citation hallucinations
- Context overflow
- Poor coherence
WebWeaver mimics how human researchers actually work through two specialized agents:
- The Planner - Dynamic Research Cycle
- Operates in an iterative loop that interleaves evidence acquisition with outline optimization
- Continuously refines the outline based on discoveries
- Each outline section is explicitly linked via citations to a curated memory bank of source evidence
- Actions: search, write_outline, terminate
- The Writer - Hierarchical Synthesis
- Executes section-by-section writing rather than generating the entire report at once
- For each section, performs targeted retrieval of only relevant evidence from the memory bank using citations in the outline
- This creates a “think → retrieve → write” cycle for each section
- Actions: retrieve, write, terminate
Dynamic Outline Optimization: Unlike static approaches, the planner continuously evolves the outline through multiple rounds (average 2-3 rounds):
- Expands sections based on new findings
- Adds citations to ground each section: citations anchor the outline in actual evidence during planning and enable precise retrieval during writing.
- Restructures to better reflect comprehensive understanding
- The outline becomes a strategic blueprint that guides subsequent searches
Memory-Grounded Synthesis: Instead of overwhelming the LLM with 100+ web pages (100k+ tokens):
- Only summaries and evidence snippets are stored in the memory bank
- The writer retrieves only the specific evidence needed for each section via citations
- Dramatically reduces context size while maintaining accuracy
- Prevents “contextual bleeding” between sections
Process Flow:
- Planner searches web → filters URLs → extracts evidence → updates outline
- Cycle repeats until outline is comprehensive
- Writer takes structured outline + memory bank
- For each section: retrieve relevant evidence → synthesize → write
- Final report is coherent, well-cited, and comprehensive
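The writer’s memory-grounded, section-by-section synthesis can be sketched like this; the outline and memory-bank structures are invented for illustration, and `synthesize` stands in for the writer model:

```python
# WebWeaver-style writer sketch: each section retrieves only the evidence
# its outline citations point to, never the full corpus.

def write_report(outline, memory_bank, synthesize):
    """outline: list of (section_title, [citation_ids])."""
    sections = []
    for title, cites in outline:
        evidence = [memory_bank[c] for c in cites]  # targeted retrieval
        sections.append(f"## {title}\n{synthesize(title, evidence)}")
    return "\n\n".join(sections)
```

Because each `synthesize` call sees only its own section’s evidence, the context stays small and sections cannot “bleed” into each other.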
Training Data: WebWeaver-3k
The team created a high-quality supervised fine-tuning (SFT) dataset of 3.3k trajectories demonstrating the full planning and writing process. Fine-tuning on this data enabled smaller models (Qwen3-30b) to achieve expert-level performance previously only seen in large proprietary systems.
By implementing this as a dynamic feedback loop with structured information management, WebWeaver produces reports that are comprehensive, accurate, well-structured, and properly cited. WebWeaver achieves state-of-the-art results across three major benchmarks:
- DeepResearch Bench: Overall score 50.58; FACT (citation accuracy) 93.37%; outperforms proprietary systems.
- DeepConsult: Win rate: 66.86%
- DeepResearchGym: Average score 96.77, Top score across all dimensions
Pham, Thinh, et al. “SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models.” arXiv:2506.01062, arXiv, 11 June 2025. arXiv.org, https://doi.org/10.48550/arXiv.2506.01062. SealQA1 is a benchmark designed to evaluate Search-Augmented Language Models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. Unlike existing QA benchmarks that assume clean, straightforward information retrieval, SealQA tests models’ ability to reason deeply when faced with real-world search complexity. ↩︎
Wei, Jason, et al. “BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents.” arXiv:2504.12516, arXiv, 16 Apr. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2504.12516. ↩︎
Wu, Jialong, et al. “WebWalker: Benchmarking LLMs in Web Traversal.” arXiv:2501.07572, arXiv, 10 Aug. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.07572. ↩︎
Wu, Jialong, et al. “WebDancer: Towards Autonomous Information Seeking Agency.” arXiv:2505.22648, arXiv, 10 Aug. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2505.22648. ↩︎
Wu, Xixi, et al. “ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization.” arXiv:2509.13313, arXiv, 15 Oct. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2509.13313. ↩︎
Li, Zijian, et al. “WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research.” arXiv:2509.13312, arXiv, 7 Oct. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2509.13312. ↩︎