AiPhreaks

Build an Agent That Thinks Like a Data Scientist: How We Hit #1 on DABStep with Reusable Tool Generation

By Jakub Antkiewicz • March 13, 2026

NVIDIA's Kaggle Grandmasters (KGMON) research team has developed a new AI agent architecture that secured the top rank on the Data Agent Benchmark for Multi-step Reasoning (DABStep). The system, called the NVIDIA KGMON Data Explorer, introduces a structured, multi-phase approach for complex data analysis. This method addresses a common weakness in deep research agents that struggle with structured tabular data, demonstrating a significant performance improvement by having the agent first build a specialized toolkit rather than attempting to solve each problem from scratch.

The architecture operates in three distinct phases to mimic the workflow of a human data scientist. First, a 'Learning' phase uses a large, powerful model (like Opus 4.5) to analyze a representative set of tasks, identify overlapping logic, and synthesize its findings into a library of reusable Python functions. Second, a 'Fast Inference' phase employs a smaller, more efficient model (like Haiku 4.5) to solve new problems by simply calling the pre-built functions. Finally, an 'Offline Reflection' phase uses a heavyweight model to review the agent's performance, find inconsistencies, and feed those insights back into the system prompt to improve future accuracy without slowing down the live inference process.
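The three phases can be sketched as a simple pipeline. This is an illustrative toy, not NVIDIA's actual implementation: the model calls are stubbed out, and all names here (`synthesize_toolkit`, `solve_with_toolkit`, `reflect`) are hypothetical. What it shows is the division of labor: the Learning phase produces reusable functions once, the Fast Inference phase only dispatches to them, and Reflection folds lessons back into the system prompt offline.

```python
def synthesize_toolkit(sample_tasks):
    """Phase 1 (Learning): in the real system, a heavyweight model studies
    representative tasks, spots overlapping logic, and emits reusable Python
    functions. Here we hard-code two tools it might produce for tabular QA."""
    def filter_rows(rows, col, value):
        # Keep only rows where the given column matches the value.
        return [r for r in rows if r[col] == value]

    def total(rows, col):
        # Sum a numeric column over the given rows.
        return sum(r[col] for r in rows)

    return {"filter_rows": filter_rows, "total": total}


def solve_with_toolkit(task, toolkit):
    """Phase 2 (Fast Inference): a small model only has to pick which
    pre-built tool to call, rather than writing analysis code from scratch."""
    tool = toolkit[task["tool"]]
    return tool(*task["args"])


def reflect(transcripts, system_prompt):
    """Phase 3 (Offline Reflection): a heavyweight model reviews finished
    runs for inconsistencies and appends the lessons to the system prompt,
    keeping live inference unaffected."""
    lessons = [t["lesson"] for t in transcripts if t.get("lesson")]
    return system_prompt + "".join(f"\n- {lesson}" for lesson in lessons)


# Tiny worked example on a toy table of merchant fees.
rows = [
    {"merchant": "A", "fee": 10},
    {"merchant": "B", "fee": 5},
    {"merchant": "A", "fee": 7},
]
toolkit = synthesize_toolkit(sample_tasks=[])
subset = solve_with_toolkit(
    {"tool": "filter_rows", "args": (rows, "merchant", "A")}, toolkit
)
answer = solve_with_toolkit({"tool": "total", "args": (subset, "fee")}, toolkit)
print(answer)  # → 17

prompt = reflect(
    [{"lesson": "Filter before aggregating to avoid double counting."}],
    "You are a data analyst.",
)
print(prompt)
```

The expensive step (tool synthesis) runs once over sample tasks; every subsequent query reduces to cheap dispatch, which is where the reported latency gains would come from.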

The results reveal a trade-off that favors efficiency and advanced reasoning. A baseline using a large model for every task scored slightly higher on simple queries, but NVIDIA's agent dominated the benchmark's 'Hard' tasks, scoring 89.95 versus the baseline's 66.93. It also completed tasks roughly 30x faster, averaging 20 seconds per task compared to the baseline's 10 minutes. This methodology suggests a viable path for deploying cheaper, faster data analysis agents in enterprise environments by enabling smaller models to handle complex, multi-step reasoning with the aid of pre-built, domain-specific tools.

The key insight from NVIDIA's benchmark success is the codification of a human expert's workflow: invest significant upfront effort in building a robust, reusable toolkit to make subsequent, similar tasks drastically more efficient. This 'Learning Loop,' which generates specialized tools for a 'Fast Inference Loop' to consume, marks a practical shift away from the brute-force, single-shot reasoning that has limited many agentic systems.