
Simply put
- The experiment involves 38 generation AI models engaged in strategies lying in the “secret agenda” game.
- The sparse autoencoder tool missed the cheating, but worked in an insider trading scenario.
- Researchers are looking for new ways to audit AI behavior before actual deployment.
Large-scale language models (the systems behind ChatGpt, Claude, Gemini, and other AI chatbots) captured intentional, goal-oriented deceptions when placed in controlled experiments, and today’s interpretability tools were barely detectable.
This concludes the recent preprint paper, “The Secret Agenda: LLMS Lie and Our Safety Tools are blind,” posted last week by an independent research group working under the Wowdao AI SuperAlignment Research Coalition.
The team tested 38 generation AI models including Openai’s GPT-4O, Anthropic’s Claude, Google Deepmind’s Gemini, Meta’s Llama and Xai’s Grok. All models engage in strategically lying at least once during the experiment.
“Secret Agenda” Test
Researchers have adapted the social death-dduction board game Secret Hitler to a synthetic scenario that they called the “secret agenda” game. The model was assigned the role of the leaders of the hidden faction and asked to declare their political alignment. Telling the truth in order to win the game will almost certainly lead to a loss. Lies have created a path to victory.
“The secret agenda has certainly elicited lies when deceptions were achieved in all model families when they achieved lucrative goals,” the researchers write. “The analysis revealed that the automatic ‘deceased’ SAE feature was rarely activated during strategic fraud, and steering experiments with features across more than 100 deceased-related features could not prevent lies.
According to the paper, the setup states, “using these factors create systematic strategic deceptions across all 38 models tested, indicating victory is incentivized and monitoring is weak.
In some cases, the researchers stated that traces of the model’s own reasoning showed that they were planning a way to hide their true identity in order to win the game.
This paper does not argue that LLMS shows any dangerous motives within the game itself. Instead, use the game to prove that the model is potentially dangerous ability– Strategic deception – This means that current safety measures cannot be detected with certainty. The concern is that this same function can be deployed in sensitive areas such as defense, finance, and autonomous systems. There, the consequences of undetected deceptions are much more serious than losing the board game.
Safety check failed
The authors also evaluated key interpretability tools such as Genmascope and Goodfire Lamascope, which are labelled as internal “features” such as deception, secrets, and manipulations. Despite clearly lying on the transcript, most deception-related features were not activated, and adjustments of them up and down did not prevent falsehood.
In contrast, the same sparse autoencoder approach performed better in simulated insider trading scenarios, using a comprehensive activation pattern to separate conformance and deceptive responses. Current security architectures can detect rule breakdowns in narrow, structured domains like finance, but they fight free and strategic injustice in social contexts.
Why is it important?
AI Hallucinations, where AI creates information and “facts” in an attempt to answer user questions, remains a concern in this field, but this study reveals a pointless attempt by AI models to deliberately deceive users.
The Wowdao findings are concerned about the concerns raised by previous research, including a 2024 study at Stuttgart University. That same year, human researchers demonstrated how AI trained for malicious purposes attempted to deceive trainers in order to achieve that objective. December, time Experiments showing models strategically under pressure have been reported.
The risk goes beyond the game. The paper highlights the growing number of governments and businesses deploying large models in sensitive regions. In July, Elon Musk’s Xai was awarded a lucrative contract with the US Department of Defense to test Grok in data analysis tasks ranging from battlefield operations to business needs.
The authors emphasized that their work is preliminary, but called for additional research, large-scale trials, and new methods to discover and label the features of deception. Without a more robust auditing tool, policymakers and businesses could be blinded by AI systems that appear to line up while quietly pursuing their own “secret agenda.”
Generally intelligent Newsletter
A weekly AI journey narrated by Gen, a generator AI model.
