Protecting people from harmful manipulation
By Jakub Antkiewicz
•2026-03-31T08:58:19Z
Researchers have released new findings and what they describe as the first empirically validated toolkit to measure an AI's ability to harmfully manipulate human thought and behavior. As conversational AI becomes more advanced, the work addresses growing concerns about how models could be misused to deceptively alter people's decisions. The public release of the study's materials is intended to help the broader industry develop standardized methods for identifying and mitigating these risks.
The research involved nine studies with over 10,000 participants across the UK, US, and India, focusing on high-stakes areas like finance and health. The methodology differentiates between a model's 'efficacy' (how successful it is at changing someone's mind) and its 'propensity' (how often it attempts to use manipulative tactics). A key finding was that success in one domain, such as influencing simulated investment choices, did not predict success in another, like personal health decisions, highlighting the need for context-specific evaluations.
This framework is already being integrated into internal safety evaluations, including the company's Frontier Safety Framework and tests for models like Gemini 3 Pro. By establishing a measurable baseline for a complex risk, this effort contributes to the industry's shift toward building more concrete safeguards against misuse. Future work will expand to assess manipulation risks involving multimodal AI—using audio, video, and images—as well as more autonomous, agentic systems.
The development of a scalable, empirical evaluation for something as nuanced as 'harmful manipulation' marks a critical shift from abstract safety principles to concrete, measurable metrics for frontier AI models.