Simplex AI Research · June 2026 · 8 min read
Announcing LiteResearcher · 2026
LiteResearcher-4B is the new open-source state of the art for deep research — 71.3 on GAIA and 78.0 on Xbench-DeepSearch. It beats open agents up to 8x its size, matches Claude-4.5-Sonnet on GAIA, and tops GPT-5-high on Xbench — at 10-46x the speed and a fraction of the cost.
Xbench-DeepSearch accuracy (%) — higher is better
| Model | Score |
|---|---|
| LiteResearcher-4B | 78.0 |
| GPT-5-high | 77.8 |
| Tongyi DeepResearch | 75.0 |
| GLM-4.6 | 70.0 |
| Claude-4.5-Sonnet | 66.0 |
The only sub-30B model in the frontier band — a 4B agent topping GPT-5-high on Xbench-DeepSearch.
| Number | Label | Detail |
|---|---|---|
| 78.0 | Xbench-DeepSearch | #1 — tops GPT-5-high |
| 71.3 | GAIA | matches Claude-4.5-Sonnet |
| 4B | parameters | runs on a single GPU |
| 10-46x | faster & cheaper | lower latency per turn |
The Short Version
Deep research is the most valuable thing an agent can do — read the live web, reason across dozens of sources, and come back with an answer you can actually trust. Until now, that meant a giant model and a giant bill.
LiteResearcher-4B changes the math. Frontier-grade research quality, in a model small enough to run anywhere and cheap enough to put in front of every user — the same answers as the frontier, at a fraction of the size, latency, and cost.
The Highlights — Four Reasons It Turns Heads.
Beats GPT-5-high on Xbench
LiteResearcher-4B posts 78.0% on Xbench-DeepSearch — the highest of any open-source agent, edging out OpenAI's GPT-5-high (77.8%). On GAIA it reaches 71.3%, matching Claude-4.5-Sonnet (71.2%).
Punches 8x Above Its Weight
At just 4B parameters, it outscores open-source deep research agents up to 8x larger on GAIA and Xbench-DeepSearch, including Tongyi DeepResearch 30B (70.9 on GAIA, 75.0 on Xbench-DeepSearch), while running on a single GPU.
10-46x Faster Rollouts in RL Training
Our local environment runs RL rollouts 10-46x faster than rollouts that call live search APIs, at a fraction of the cost per turn. That throughput made training at this scale possible, and the same serving muscle keeps inference fast in production.
Open Weights, Ready to Ship
The model, data, and framework are fully open — download the 4B weights and ship frontier-grade deep research today.
"A 4B model that tops GPT-5-high on Xbench-DeepSearch and matches Claude-4.5-Sonnet on GAIA — while running an order of magnitude faster and cheaper. This is what frontier looks like when it's small."
The Receipts — A 4B Model in the Frontier Band.
Open-source SOTA across the deep-research benchmarks that matter: best open numbers on GAIA, WebWalker, and Xbench-DeepSearch (78.0, edging out GPT-5-high), and competitive across the board. Every score below is reported in the paper, with baseline scores taken from each model's published numbers.
| Model | Params | GAIA | Frames | HLE | BrowseComp | WebWalker | Xbench-DS |
|---|---|---|---|---|---|---|---|
| LiteResearcher-4B | 4B | 71.3 | 83.1 | 22.0 | 27.5 | 72.7 | 78.0 |
| GPT-5-high | frontier | 76.4 | - | 35.2 | 54.9 | - | 77.8 |
| Claude-4.5-Sonnet | frontier | 71.2 | 85.0 | 24.5 | 19.6 | - | 66.0 |
| GLM-4.6 | frontier | 71.9 | - | 30.4 | 45.1 | - | 70.0 |
| DeepSeek-V3.2 | frontier | 63.5 | 80.2 | 40.8 | 67.6 | - | 71.0 |
| Tongyi DeepResearch | 30B | 70.9 | 90.6 | 32.9 | 43.4 | 72.2 | 75.0 |
| AgentCPM-Explore | 4B | 63.9 | 82.7 | 19.1 | 24.1 | 68.1 | 70.0 |
Accuracy / pass-rate (%) as reported in the LiteResearcher paper (Table 1), evaluated under a shared tool setup; baseline numbers from each model's official report. "-" = not reported. The full eight-benchmark table and ablations are in the paper.
How We Built It — A Virtual World for Agentic RL.
The reason a 4B model punches at frontier weight isn't a bigger network — it's where it trains. Scaling agentic RL on the live web is slow, unstable, and expensive: every rollout hits real search APIs, so noise and cost compound with each step.
So we rebuilt the environment. LiteResearcher trains entirely inside a local search/browse world that mirrors real-web dynamics (the same partial relevance, noisy snippets, and multi-hop chains) with no live API in the loop. Starting from Qwen3-4B-Thinking, a short SFT stage teaches tool use; on-policy RL then climbs a difficulty curriculum across 700+ stable steps. The result: 73.2M tool calls at zero marginal cost, 10-46x faster rollouts, and steady gains instead of collapse. Full method and ablations are in the paper.
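To make the idea of a local search/browse world concrete, here is a minimal Python sketch. It is not the LiteResearcher environment: the `LocalSearchEnv` class, its `search` and `browse` methods, the term-overlap ranking, and the three-page corpus are all hypothetical stand-ins chosen for illustration. It shows only that both tools can be answered from an in-memory corpus, so a multi-hop rollout never touches a live API.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    title: str
    text: str


class LocalSearchEnv:
    """Hypothetical offline stand-in for a search/browse tool pair."""

    def __init__(self, pages):
        self._pages = {p.url: p for p in pages}
        self.tool_calls = 0  # incremented on every search or browse call

    @staticmethod
    def _tokens(s):
        # Lowercased word tokens; punctuation is dropped.
        return set(re.findall(r"\w+", s.lower()))

    def search(self, query, k=2, snippet_chars=40):
        """Rank pages by term overlap with the query and return short snippets."""
        self.tool_calls += 1
        q = self._tokens(query)
        scored = []
        for page in self._pages.values():
            overlap = len(q & self._tokens(page.title + " " + page.text))
            if overlap:
                scored.append((overlap, page))
        # Highest overlap first; ties broken by URL so results are deterministic.
        scored.sort(key=lambda item: (-item[0], item[1].url))
        return [
            {"url": p.url, "title": p.title, "snippet": p.text[:snippet_chars]}
            for _, p in scored[:k]
        ]

    def browse(self, url):
        """Return the full page text, or an empty string for an unknown URL."""
        self.tool_calls += 1
        page = self._pages.get(url)
        return page.text if page else ""


# Illustrative corpus (made up for this sketch).
corpus = [
    Page("local://a", "Ada Lovelace",
         "Ada Lovelace worked with Charles Babbage on the Analytical Engine."),
    Page("local://b", "Charles Babbage",
         "Charles Babbage was born in London in 1791."),
    Page("local://c", "Grace Hopper",
         "Grace Hopper developed an early compiler."),
]

env = LocalSearchEnv(corpus)

# Hop 1: find the page about the Analytical Engine.
first_hits = env.search("Ada Lovelace Analytical Engine")
# Hop 2: follow the collaborator named on that page.
second_hits = env.search("Charles Babbage born")
answer_page = env.browse(second_hits[0]["url"])
print(env.tool_calls, "tool calls, all served locally")
```

The truncated snippets force the agent to call `browse` for the full text, which loosely mimics the partial-relevance behavior of real search results; a realistic environment would need a far larger corpus and a proper retriever.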
Built at Simplex AI — Small Models, Production Speed.
LiteResearcher didn't come out of nowhere. At Simplex AI we run a mature, high-throughput serving stack purpose-built for small, fast models — the very infrastructure that drove RL rollouts 10-46x faster, powered 73.2M tool calls at zero marginal cost, and keeps inference fast in production today.
Across agentic search, retrieval, and on-device assistants, our team has shipped small-model systems into real deployment scenarios where latency and cost decide whether AI ships at all. LiteResearcher-4B is that playbook applied to deep research: frontier quality, at a size and price you can serve at scale.
| Capability | Description |
|---|---|
| High-throughput serving | A stack tuned for small models — the 10-46x edge that made this training run possible. |
| Production-grade latency | Sub-second tool calls and rollouts, engineered for real-time agentic workloads. |
| Deployed at scale | Small-model systems already running in live, cost-sensitive product scenarios. |
73.2M search & browse calls to train LiteResearcher-4B
| Option | Cost | Detail |
|---|---|---|
| On commercial search APIs | $59K-$243K | Serper · SerpAPI · Jina, at list price |
| On Simplex AI's local stack | $0 | zero marginal cost · 100% saved |
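The price range above implies a per-call rate that is easy to back out. The short Python calculation below uses only the two figures in the table (73.2M calls and the $59K-$243K range); the variable and function names are illustrative, and because the source does not break the range down by provider, only the overall bounds are derived.

```python
TOOL_CALLS = 73_200_000   # 73.2M search and browse calls
COST_LOW_USD = 59_000     # lower bound of the list-price estimate
COST_HIGH_USD = 243_000   # upper bound of the list-price estimate


def usd_per_thousand_calls(total_usd, calls):
    """Convert a total bill into an implied price per 1,000 calls."""
    return total_usd / calls * 1_000


low = usd_per_thousand_calls(COST_LOW_USD, TOOL_CALLS)
high = usd_per_thousand_calls(COST_HIGH_USD, TOOL_CALLS)
print(f"Implied list price: ${low:.2f} to ${high:.2f} per 1,000 calls")
# prints: Implied list price: $0.81 to $3.32 per 1,000 calls
```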
From Benchmark to Product — The Missing Engine for Agentic Automation.
The future of work won't be built on prompts — it'll be built on agents that search, reason, plan, and act continuously. That takes frontier-level deep research that is fast, affordable, and massively parallel. LiteResearcher is that engine — and the foundation of lev8, Simplex AI's go-to-market platform.
lev8's core is Parallel Agentic Search: every task fires hundreds of deep-research agents that read, reason, and synthesize across the live web. That demands a model that is strong, fast, cheap, and reliable all at once — frontier models were too slow and too costly to make it real. A 4B agent at frontier quality changes the physics, making always-on, massively parallel agents economically viable.
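The fan-out pattern described above can be sketched with Python's standard `asyncio` library. This is an illustration under stated assumptions, not lev8's implementation: `research_agent` is a hypothetical stub standing in for a full agent run, and the 200 subqueries and the concurrency cap of 50 are arbitrary values picked for the example.

```python
import asyncio


async def research_agent(subquery: str) -> str:
    """Hypothetical stand-in for one deep-research agent run."""
    await asyncio.sleep(0)  # placeholder for model and tool calls
    return f"findings for: {subquery}"


async def parallel_search(subqueries, max_concurrency=50):
    """Run one agent per subquery, with at most max_concurrency in flight."""
    limit = asyncio.Semaphore(max_concurrency)

    async def run_one(subquery):
        async with limit:
            return await research_agent(subquery)

    # gather returns results in input order, so they line up with subqueries.
    return await asyncio.gather(*(run_one(q) for q in subqueries))


subqueries = [f"angle {i} on the task" for i in range(200)]
results = asyncio.run(parallel_search(subqueries))
print(len(results), "agent reports collected")
```

The semaphore is the design choice worth noting: it caps how many agents run at once, so the fan-out width can grow with the task while load on the serving stack stays bounded.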
lev8
Simplex AI's go-to-market platform
- Parallel agentic search — hundreds of deep-research agents fan out across the live web per query, viable only when every agent is tiny, fast, and cheap.
- Company & people deep search — frontier-grade research on any organization or person, synthesized into a 360° profile.
- Live-web synthesis at production cost — human-level depth at machine-level breadth, at unit economics that actually ship.
Most products use AI. lev8 is built around it — not a tool, but a teammate.
Frontier Deep Research, in a 4B Model.
The model, data, and framework are open. Read the paper, grab the weights, or watch the agent work.
References & Sources — Every Number, Sourced.
LiteResearcher
Benchmarks
- GAIA — Mialon et al., ICLR 2024
- FRAMES — Krishna et al., NAACL 2025
- Humanity's Last Exam — Phan et al., 2025
- BrowseComp — Wei et al., OpenAI 2025
- WebWalker — Wu et al., 2025
- Xbench-DeepSearch — Chen et al., 2025
Baselines
- OpenAI — GPT-5
- Anthropic — Claude Sonnet 4.5
- Tongyi DeepResearch
- Zhipu — GLM-4.5 / 4.6
- DeepSeek-V3.2
- Moonshot — Kimi-Researcher
- MiniMax-M2