How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II
By Jakub Antkiewicz
2026-03-12
NVIDIA's AI-Q, a deep research agent, has secured the top position on both the DeepResearch Bench I and DeepResearch Bench II, posting scores of 55.95 and 54.50, respectively. This achievement is significant because it demonstrates that a single, openly documented, and configurable software stack can lead the industry in complex agentic research. The dual benchmark wins suggest that developer-accessible tools, rather than closed, proprietary systems, can power state-of-the-art performance in generating well-cited and factually rigorous reports.
The agent's performance is rooted in a multi-agent architecture: an orchestrator coordinates a planner, which maps the information landscape, and a researcher, which deploys parallel specialists. The system is built on the NVIDIA NeMo Agent Toolkit and uses a fine-tuned NVIDIA Nemotron 3 Super model. The model's capabilities were enhanced through supervised fine-tuning (SFT) on approximately 67,000 high-quality trajectories, filtered from a larger pool by a principle-based judge model. To stay reliable during complex, multi-step tasks, the system adds custom middleware that handles common failure points such as tool-name hallucinations and performs reasoning-aware retries.
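The planner/researcher split above can be pictured as a fan-out loop. The following is a minimal sketch under stated assumptions: every class, function, and string here is hypothetical and does not reflect the actual NeMo Agent Toolkit API, which the source does not detail.

```python
import asyncio

# Hypothetical sketch of orchestrator/planner/researcher coordination.
# None of these names come from the NeMo Agent Toolkit.

async def plan(query: str) -> list[str]:
    # The planner maps the information landscape into sub-questions.
    return [f"{query}: background", f"{query}: recent results"]

async def specialist(subtask: str) -> str:
    # Each specialist would research one sub-question (e.g. via web search);
    # the sleep is a placeholder for real tool calls.
    await asyncio.sleep(0)
    return f"findings for {subtask!r}"

async def orchestrate(query: str) -> str:
    subtasks = await plan(query)
    # The researcher fans out to parallel specialists.
    findings = await asyncio.gather(*(specialist(t) for t in subtasks))
    # The orchestrator would merge findings into a cited report (elided here).
    return "\n".join(findings)

report = asyncio.run(orchestrate("deep research benchmarks"))
```

The point of the sketch is the shape of the control flow: planning produces independent subtasks, so the specialist calls can run concurrently rather than sequentially.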
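The judge-based filtering step can likewise be sketched in outline. This is an illustrative stand-in, not NVIDIA's pipeline: the rubric fields, the scoring rule, and the threshold are all invented for the example, and a real principle-based judge would be a model call rather than boolean flags.

```python
# Hypothetical trajectory records with judge-rubric flags; a real judge
# model would produce these assessments, not the data itself.
def judge_score(trajectory: dict) -> float:
    # Average agreement with a set of rubric principles
    # (citation quality, relevance, internal consistency).
    principles = [
        trajectory["cites_sources"],
        trajectory["answers_question"],
        trajectory["no_contradictions"],
    ]
    return sum(principles) / len(principles)

def filter_trajectories(raw: list[dict], threshold: float = 0.9) -> list[dict]:
    # Keep only trajectories the judge scores highly; the surviving set
    # becomes the SFT training data.
    return [t for t in raw if judge_score(t) >= threshold]

raw = [
    {"cites_sources": True, "answers_question": True, "no_contradictions": True},
    {"cites_sources": False, "answers_question": True, "no_contradictions": True},
]
kept = filter_trajectories(raw)
```

The second trajectory fails the citation principle and is dropped, mirroring the article's description of distilling ~67,000 high-quality trajectories from a larger pool.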
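The two middleware behaviors named above, recovering from tool-name hallucinations and retrying with the failure fed back to the model, might look roughly like this. The tool registry, fuzzy-matching cutoff, and retry protocol are assumptions for illustration, not AI-Q's actual implementation.

```python
import difflib

# Hypothetical registry of tools the agent is allowed to call.
REGISTERED_TOOLS = {"web_search", "fetch_page", "summarize"}

def resolve_tool_name(requested: str) -> str:
    """Map a possibly hallucinated tool name to the closest registered tool."""
    if requested in REGISTERED_TOOLS:
        return requested
    matches = difflib.get_close_matches(requested, REGISTERED_TOOLS, n=1, cutoff=0.6)
    if matches:
        return matches[0]
    raise KeyError(f"no registered tool close to {requested!r}")

def call_with_reasoning_retry(tool_fn, args, max_attempts=3):
    """Retry a tool call, passing each failure back as feedback so the
    model can reason about what went wrong before the next attempt."""
    feedback = None
    for attempt in range(max_attempts):
        try:
            return tool_fn(args, feedback=feedback)
        except Exception as exc:
            feedback = f"attempt {attempt + 1} failed: {exc}"
    raise RuntimeError(feedback)
```

Here `resolve_tool_name("websearch")` would be corrected to `"web_search"`; the retry wrapper turns a raw exception into context for the next attempt instead of failing the whole long-horizon task.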
For the broader AI ecosystem, AI-Q's success provides a functional blueprint for enterprises looking to develop their own specialized research agents. Its modular design allows organizations to own, inspect, and customize every component—from the underlying language models to the specific tools—for their unique use cases. This shift toward configurable, transparent systems offers a compelling alternative to black-box APIs, enabling businesses to build more reliable and tailored AI workflows with greater control over performance and data governance.
NVIDIA's result underscores a critical industry trend: leading agent performance is less about a single monolithic model and more about the integration of a well-defined architecture, targeted fine-tuning on domain-specific data, and robust middleware to ensure long-horizon reliability.