Transform Video Into Instantly Searchable, Actionable Intelligence with AI Agents and Skills
By Jakub Antkiewicz
2026-05-14
NVIDIA has advanced its Metropolis Blueprint for video search and summarization (VSS) by integrating it with AI coding agents, enabling developers to automate complex video analytics workflows through conversational interfaces. This update introduces a framework of VSS 'skills' that allows agents like Codex and OpenClaw to manage the deployment, integration, and operation of the platform. The primary benefit is the reduction of manual configuration, transforming the process of extracting intelligence from vast video archives from a complex engineering task into a more accessible, automated procedure.
Automating Video Analytics with AI Agents
The technical foundation of the updated VSS is a modular, microservice-based architecture that can be controlled by AI agents. Developers use agents to install the VSS skills, which are hosted on GitHub and follow a standardized specification. Once equipped with these skills, an agent can deploy specific VSS profiles, such as 'Search' or 'Alert Verification', and execute sophisticated, multi-step queries on video data. For example, an agent can be prompted in natural language to find all instances of a worker using a ladder while wearing specific safety gear across multiple video files. The agent then orchestrates the underlying NVIDIA microservices to ingest the video, generate searchable embeddings, and return a filtered, actionable report.
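The multi-step workflow described above (ingest video, generate searchable embeddings, filter to an actionable report) can be sketched as follows. This is a minimal illustrative pipeline, not the actual VSS API: all function names, data shapes, and the inline sample detections are hypothetical stand-ins for the NVIDIA microservices the agent would orchestrate.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the agent-orchestrated pipeline; the real
# work is done by NVIDIA microservices (ingestion, embedding, search).

@dataclass
class Detection:
    video: str
    timestamp: float
    labels: set  # labels a VLM might attach to a video segment

def ingest(videos):
    """Stand-in for the ingestion service: yields labeled detections."""
    # Inline sample data in place of real VLM output.
    sample = {
        "site_a.mp4": [Detection("site_a.mp4", 12.5, {"worker", "ladder", "hard_hat"})],
        "site_b.mp4": [Detection("site_b.mp4", 3.0, {"worker", "ladder"})],
    }
    for v in videos:
        yield from sample.get(v, [])

def search(detections, required):
    """Stand-in for embedding search: keep detections matching all labels."""
    return [d for d in detections if required <= d.labels]

# "Find every worker using a ladder while wearing a hard hat" across two files:
report = search(ingest(["site_a.mp4", "site_b.mp4"]), {"worker", "ladder", "hard_hat"})
for d in report:
    print(f"{d.video} @ {d.timestamp}s")
```

The point of the sketch is the shape of the task, not the implementation: the agent's job is to chain these stages and apply the user's natural-language constraints as filters.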
- Modular Profiles: VSS is organized into developer profiles such as Base Q&A, Search, and Real-Time VLM Alerts, each deployed via docker-compose.
- Agent Skills Framework: A standardized set of commands hosted on GitHub allows agents to programmatically control VSS functions.
- Agent Integration: The system is designed for compatibility with coding agents such as Codex for deployment and OpenClaw for interactive analysis.
- Core Technologies: The platform utilizes vision-language models (VLMs), large language models (LLMs), and multiple embedding types to enable agentic search and summarization.
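To make the skills framework concrete, the sketch below shows what a standardized, machine-readable skill catalog might look like, along with a dispatcher that validates a call against it. The schema, skill names, and parameters here are all hypothetical; the actual specification lives in NVIDIA's GitHub repository.

```python
# Hypothetical skill catalog an agent might load; schema is illustrative only.
SKILLS = {
    "deploy_profile": {
        "description": "Bring up a VSS developer profile via docker-compose.",
        "parameters": {"profile": ["base_qa", "search", "realtime_vlm_alerts"]},
    },
    "run_query": {
        "description": "Execute a natural-language search over ingested video.",
        "parameters": {"query": "str", "files": "list[str]"},
    },
}

def invoke(skill: str, **kwargs) -> str:
    """Validate a skill call against the catalog and return a call summary."""
    spec = SKILLS[skill]
    for name in kwargs:
        if name not in spec["parameters"]:
            raise ValueError(f"unknown parameter {name!r} for skill {skill!r}")
    return f"{skill}({', '.join(f'{k}={v!r}' for k, v in kwargs.items())})"

print(invoke("deploy_profile", profile="search"))
```

A catalog like this is what lets different agents (a deployment-focused one, an interactive-analysis one) control the same platform: each reads the spec and emits validated calls rather than hand-written configuration.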
By abstracting its powerful video analytics capabilities behind an agent-driven interface, NVIDIA is effectively lowering the barrier to entry for building and deploying sophisticated AI-powered video intelligence systems. This approach signals a broader industry trend toward using autonomous agents to manage and orchestrate domain-specific platforms. It moves the focus from low-level infrastructure management to high-level, outcome-oriented interaction, potentially accelerating the adoption of real-time video analysis in enterprise and industrial settings where specialized development talent may be limited.
Strategic Takeaway: NVIDIA's integration of standardized 'skills' with coding agents for its VSS platform is less about the video analytics themselves and more about establishing a blueprint for how complex, domain-specific AI infrastructure will be deployed and managed in the future: through automated, conversational orchestration rather than manual configuration.