Self-Evolving AI Agents Can ‘Unlearn’ Safety, Study Warns

Self-Evolving AI Agents Can 'Unlearn' Safety, Study Warns

Simply put

  • Agents that update themselves can drift into dangerous actions without external attacks.
  • New research shows Guardrail documenting weakening, reward hacking, and reuse of unstable tools.
  • Experts warn that these dynamics will echo small versions of the catastrophic AI risk that have been imagined for many years.

Autonomous AI agents that learn at work study How to act safely, according to a new study warning of previously undocumented failure modes in self-evolution systems.

This study identifies a phenomenon known as “myssolution.” This is a measurable attenuation of the safety alignment that occurs internal An improvement loop for the AI ​​agent itself. Unlike one-off jailbreaks and external attacks, Myceforations naturally arise as agents rewrite and reorganize again, pursuing their targets more efficiently.

As companies deploy autonomous memory-based AI agents that adapt in real time, the findings suggest that these systems could quietly undermine their guardrails.

A new kind of drift

Misevolution captures how self-update agents can erode safety during autonomous optimization cycles, similar to “AI Drift,” which explains how model performance decreases over time.

In one controlled test, the rejection rate of coding agents for harmful prompts collapsed from 99.4% to 54.4% after starting to draw their own memories, while the success rate of attacks rose from 0.6% to 20.6%. A similar trend emerged across multiple tasks as the system tweaked itself with self-generated data.

This study was conducted jointly by researchers from the Shanghai Institute of Artificial Intelligence, Shanghai Ziaoton University, Renmin University in China, Princeton University, Hong Kong University of Science and Technology, and Fudan University.

Traditional AI-Safety efforts focus on static models that behave the same way after training. Self-evolution agents change this by adjusting parameters, expanding memory, and rewriting workflows to achieve goals more efficiently. This study showed that this dynamic ability creates risks for new categories. It is the erosion of alignment and safety within the agent’s own improvement loop without external attackers.

Researchers in this study observed that AI agents issue automatic refunds, leak sensitive data via automated building tools, and employ insecure workflows as internal loops optimized for performance.

The authors stated that Misevolution is different from rapid injection, an external attack on the AI ​​model. Here, it becomes difficult to monitor because it accumulates internally as agents adapt and optimize over time, problems can appear gradually, and problems can only occur after the agent has already shifted its behavior.

Smaller signal of greater risk

Researchers often frame advanced AI risks in scenarios such as PaperClip Analology, where AI maximizes benign goals until it consumes resources well beyond its mission.

Other scenarios include a small number of developers controlling tight systems like feudal lords, a lock-in future where powerful AI becomes the default decision-makers of key institutions, or military simulations that trigger real operations – power-seeking actions and AI-ASISTED CyberTacks to the boundaries of the list.

All of these scenarios depend on a subtle but complex shift in control driven by optimization, interconnection and reward hacking. This new paper presents Mrs. Sevolution as a concrete laboratory example of those same forces.

Partial fixes, permanent drift

The rapid revisions improved safety metrics, but the original alignment could not be restored, the study said. Teaching agents to treat memory as references is mandatory rather than increasing rejection rates. Researchers noted that static safety checks added before the new tools were integrated reduce vulnerabilities. Despite these checks, none of these measures restored agents to pre-evolutionary safety levels.

This paper proposed a more robust strategy for future systems: post-training safety correction after self-evolution, automated validation of new tools, safety nodes in critical workflow paths, and continuous auditing rather than one-time checks over time.

The findings raise practical questions for companies building autonomous AI. If agents deployed in production continue to learn and rewrite, who will be responsible for monitoring the changes? The data from the paper showed that even the most sophisticated base models can deteriorate when left to their own devices.

Generally intelligent Newsletter

A weekly AI journey narrated by Gen, a generator AI model.

Leave a Reply

Your email address will not be published. Required fields are marked *