Sail’s agentic AI researcher, using our efficient inference engine, topped BrowseComp-Plus with 90.72% accuracy at 6-35x lower cost.
Background agents for deep research
Deep research is inherently background work: accuracy is the top priority and latency takes a back seat. As agents improve, humans want to offload research tasks so they can prioritize higher-value items. But for human-out-of-the-loop systems, we need increasing confidence both that the retrieved answer will be right and that long-running agent trajectories won't burn through tokens on a completely wrong path.
Background work is won by the platform that makes long, token-heavy, unattended agent trajectories cheap and reliable to run.
We focused on winning BrowseComp-Plus, a deep research benchmark built on top of OpenAI's BrowseComp. It is a challenging benchmark that measures the ability of AI agents to locate hard-to-find answers to complex, multi-step questions. The “Plus” iteration fixes the curated corpus of documents, allowing for fair, transparent evaluation of agents and addressing the reproducibility concerns of hitting dynamic web APIs. The dataset contains 100k documents, and each question averages 2.9 gold documents (containing the final answer), 6.1 evidence documents (used in the reasoning chain to get there), and 76.3 hard negatives (mined specifically to lead the agent astray). This asymmetric signal-to-noise ratio mimics real-world deep research tasks: finding a needle in a haystack while avoiding numerous misleading decoys.
Novel inference approach
Standard single-agent + retriever systems fail. The canonical approach to BrowseComp-Plus has been to select the most powerful frontier model available and to invest immense resources developing the strongest retriever system to surface only high-signal documents. This is the wrong approach for research-heavy tasks. Not only is it inherently closed-source and cost-prohibitive, it treats deep research as a delicate reasoning problem when the true bottleneck is bulk parallel processing of immense quantities of data.
Research is a question of compute. Fundamentally, the answer lives in the corpus of data; even without beautifully constructed retriever algorithms, brute-force searching the entire corpus should net the correct answer, assuming infinite context. This approach to deep research was previously untenable: inference costs ballooned out of control and agents got overwhelmed, which forced the development of smarter reasoning models and more capable retrievers. However, there is a threshold past which pushing private frontier capabilities has diminishing returns. We've crossed that inflection point: open models and simple retrievers, with the right inference engine, can achieve state-of-the-art performance against even the best private alternatives.
Prior deep research failures become tractable when the agent can afford to search broadly and deploy subagent swarms to read deeply. Concretely, two things need to happen:
- Persistence: the main agent cannot be overwhelmed by context
- Cost: feasibility relies on cheap compute
Point one requires a reasonably performant retriever that can determine which documents to pass to the main agent. Reading every document in the corpus is infeasible and would overwhelm the orchestrator's context window. Note that this specifically doesn't need a SOTA custom retriever, just something good enough to help a multi-turn agent cut through the noise, as in the sketch below.
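For illustration, here's a minimal sketch of what "good enough" can mean: plain BM25 over the fixed corpus via the rank_bm25 package. The toy corpus and tokenization are placeholders, not our pipeline; in practice we pair lexical search like this with Qwen3-Embed-8B.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy three-document corpus standing in for the 100k-document BCP corpus.
corpus = [
    "The expedition was sponsored by a maritime museum in Bergen.",
    "A hard negative mentioning expeditions, museums, and sponsors in passing.",
    "Evidence tying the 1998 expedition to its Norwegian sponsor.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the top-k corpus documents by BM25 score."""
    scores = bm25.get_scores(query.lower().split())
    top = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)[:k]
    return [corpus[i] for i in top]

print(retrieve("who sponsored the 1998 Norwegian expedition"))
```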
Point two requires a robust infrastructure layer that treats background agent tasks as a first-class citizen. Without reliable and efficient inference, the race to use closed-source frontier models takes center stage as they're the only option to reason through hard queries. Lightweight, open-source models need more compute to compete.
These two constraints had made open-source models uncompetitive against closed alternatives on the BCP leaderboard. The previous best open-source solution reached just 68% accuracy over the 830 queries, and that required an agent specifically trained for open research tasks plus a deep research–specific embedding model. Off-the-shelf implementations fare worse, at just 57% accuracy with GPT-OSS-120B-high as the main LLM and Qwen3-Embed-8B as the retriever. Closed solutions, on the other hand, were able to push 90.48% with GPT-5 and proprietary retrievers.
We have bridged this gap. Our inference product makes open-source compute easy to use at scale, allowing reliable deep-research agent harnesses to be built on top. Using fully open models (GLM-5.1 as the orchestrator, Qwen3-Embed-8B and BM25 as the retrievers, and gpt-oss-120b as the swarm agent), we achieved state-of-the-art performance: 90.72% accuracy and 84.31% recall.
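For concreteness, the lineup can be summarized as a single configuration. This is a sketch with hypothetical field names; only the model choices come from the run above.

```python
# Component lineup for the run above. The dict shape is illustrative;
# only the model names are from the actual configuration.
PIPELINE = {
    "orchestrator": "GLM-5.1",                 # heavy reasoning, search planning, final answer
    "retrievers": ["Qwen3-Embed-8B", "BM25"],  # dense + lexical, merely "good enough"
    "swarm_reader": "gpt-oss-120b",            # cheap parallel document readers
}
```

The token breakdown for the full 830-query run: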
| Component | Uncached Input | Cached Input | Output | Total Tokens |
|---|---|---|---|---|
| Orchestrator: GLM-5.1 | 11,133,388 | 24,572,544 | 3,259,993 | 38,965,925 |
| Swarm: gpt-oss-120b | 6,076,321,295 | — | 359,481,027 | 6,435,802,322 |
| Total | 6,087,454,683 | 24,572,544 | 362,741,020 | 6,474,768,247 |
A sneak peek at the agent swarm: roughly 99% of all tokens are consumed by the more efficient swarm agents, which is what lets us increase the number of retrieved documents k.
An efficient agent swarm
The Sail architecture is simple.
- The orchestrator agent proposes a search given the research query.
- The retriever returns the top k matching documents.
- A swarm of cheaper reader agents reads the truncated documents in parallel; each one either:
  - drops the document as irrelevant, or
  - summarizes the evidence in the document (or extends the read).
- The orchestrator sees only the compacted evidence, then re-searches or submits an answer.
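A condensed sketch of that loop, assuming an OpenAI-compatible chat endpoint and reusing the retrieve() helper sketched earlier. The prompts, turn limit, and answer protocol are simplified stand-ins, and the extended-read branch is omitted.

```python
# Condensed orchestrator/swarm loop. Endpoint, prompts, and protocol are
# simplified stand-ins, not Sail's implementation.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="https://inference.example.com/v1", api_key="...")

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def read_document(question: str, doc: str) -> str | None:
    """Swarm reader: fresh context per document; drop it or summarize its evidence."""
    out = ask(
        "gpt-oss-120b",
        f"Question: {question}\nDocument: {doc[:8000]}\n"  # truncated read
        "Reply IRRELEVANT, or summarize the evidence bearing on the question.",
    )
    return None if out.strip().startswith("IRRELEVANT") else out

def research(question: str, max_turns: int = 30, k: int = 20) -> str:
    evidence: list[str] = []
    for _ in range(max_turns):
        # The orchestrator plans from compacted evidence only; it never
        # sees a raw document.
        move = ask(
            "GLM-5.1",
            f"Question: {question}\nEvidence so far:\n" + "\n".join(evidence)
            + "\nReply 'SEARCH: <query>' or 'ANSWER: <answer>'.",
        )
        if move.startswith("ANSWER:"):
            return move.removeprefix("ANSWER:").strip()
        docs = retrieve(move.removeprefix("SEARCH:").strip(), k=k)
        # Fan the cheap readers out in parallel; keep only relevant summaries.
        with ThreadPoolExecutor(max_workers=k) as pool:
            summaries = pool.map(lambda d: read_document(question, d), docs)
        evidence.extend(s for s in summaries if s)
    return "NO ANSWER"
```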
There are several key optimizations here. First, the orchestrator never reads any document; it relies on signal from the swarm agents. This prevents the orchestrator's context window from blowing up and alleviates pressure on the retriever: if the retriever fails to find relevant documents, the swarm agents protect the orchestrator from bad signal that could lead the overall trajectory astray. The swarm agents have fresh context per query and are cheaper to run, so there's far less risk if the retriever is more liberal in its search results. This also helps with finding answers to abstruse questions, as the retriever can return more peripheral documents that could lead to the answer.
Second, the vast majority of tokens are consumed by the swarm reader agents. They do the heavy lifting of ingesting k documents per turn and producing a summary of the evidence when relevant. This allows a more expensive, frontier-class model to serve as the orchestrator, handling the heavy reasoning, search-query refinement, and final answer submission, while a far cheaper model handles each document. As the number of documents k increases, the economics become increasingly favorable.
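A back-of-the-envelope cost model makes the point; the per-token prices and document sizes below are assumptions for illustration, not published rates.

```python
ORCH_PRICE = 2.00 / 1e6    # $/token, frontier-class orchestrator (assumed)
READER_PRICE = 0.10 / 1e6  # $/token, lightweight swarm reader (assumed)

def swarm_cost(k: int, doc_tokens: int = 6000, summary_tokens: int = 200) -> float:
    # Readers ingest the full (truncated) documents; the orchestrator
    # reads only the compacted summaries.
    return READER_PRICE * k * doc_tokens + ORCH_PRICE * k * summary_tokens

def single_agent_cost(k: int, doc_tokens: int = 6000) -> float:
    # Counterfactual: the expensive orchestrator reads every document itself.
    return ORCH_PRICE * k * doc_tokens

for k in (10, 50, 200):
    print(f"k={k:3d}  swarm ${swarm_cost(k):.2f}  single-agent ${single_agent_cost(k):.2f}")
# Absolute savings grow with k, and the orchestrator's share of spend stays
# small no matter how liberally the retriever casts its net.
```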
The cost benefits are immense with this approach. By analyzing the number of tokens consumed, we can easily see how Sail’s ability to do efficient large-scale batch inference directly translates to a lower overall cost compared to other mainstream inference providers.
| Provider | Orchestrator | Swarm | Orchestrator $ | Swarm $ | Total $ | $/query | vs Sail |
|---|---|---|---|---|---|---|---|
| Sail | GLM-5.1 | gpt-oss-120b | $12 | $116 | $129 | $0.15 | 1× |
| OpenRouter | GLM-5.1 | gpt-oss-120b | $33 | $772 | $805 | $0.97 | 6× |
| Baseten | GLM-5 | gpt-oss-120b | $26 | $787 | $813 | $0.98 | 6× |
| Fireworks | GLM-5.1 | gpt-oss-120b | $36 | $1,127 | $1,163 | $1.40 | 9× |
| Together AI | GLM-5.1 | gpt-oss-120b | $64 | $1,127 | $1,191 | $1.44 | 9× |
| OpenAI | GPT-5.4 | GPT-5.4 Nano | $83 | $1,665 | $1,748 | $2.11 | 14× |
| Z AI | GLM-5.1 | GLM-4.7 | $36 | $4,437 | $4,473 | $5.39 | 35× |
- Pricing sources and model substitutions for each provider are detailed in the appendix below.
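The $/query and vs-Sail columns follow directly from the totals over the 830 queries; a quick check (tiny differences from the table come from rounding the dollar totals):

```python
# Re-derive the $/query and vs-Sail columns from the table's totals.
totals = {  # provider -> total $ for all 830 queries, from the table above
    "Sail": 129, "OpenRouter": 805, "Baseten": 813, "Fireworks": 1163,
    "Together AI": 1191, "OpenAI": 1748, "Z AI": 4473,
}
QUERIES = 830
for provider, total in totals.items():
    print(f"{provider:12s} ${total / QUERIES:.2f}/query  {total / totals['Sail']:.0f}x")
```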
This is why deep research is a background infrastructure problem. A shallow single-agent + retriever trajectory needs SOTA LLMs and custom searchers to match the results of a fully open-source, low-cost agent swarm. Without the right infrastructure, the market was forced into the former category as the costs for deep-research agent swarms were previously exorbitant. Sail makes this kind of trajectory economical and first-class by routing the bulk work to efficient, reliable open-model workers, allowing breathing room for the main agent to persist through difficult tasks.
Sail inference
A reliable, cheap inference engine is required to build agents that excel at deep research tasks. At Sail, we believe agentic inference is in its nascency. The current token market is dominated by human-in-the-loop systems where inference is bridled by demands for lower and lower latency. Agents are far more patient than people. We're at an inflection point where agentic workflows that solve real problems are becoming possible, and where increasingly capable open-source models are in the hands of more AI-native people. We've built the inference infrastructure to unlock this value.
The lion's share of tokens will soon be consumed by background agents. Try building on our platform to unlock your agentic workflow.
Appendix
- Sail Research GLM-5.1 and gpt-oss-120b pricing. OpenRouter: GLM-5.1 pricing, gpt-oss-120b pricing. Baseten GLM-5 & gpt-oss-120b pricing. Fireworks GLM-5.1 pricing, gpt-oss-120b pricing. Together AI GLM-5.1 pricing, gpt-oss-120b pricing. OpenAI GPT-5.4 & 5.4 Nano pricing. Z.AI GLM-5.1 & 4.7 pricing.
- At time of writing, Baseten does not support GLM 5.1, so GLM-5 prices were used instead. Comparable models to GLM-5.1 and gpt-oss-120b were selected for OpenAI and Z.AI. Performance isn't guaranteed to be the same on these models; they were used to compare prices of similarly capable models.