An Axios article [1] makes rather starker claims about the behaviour of an earlier version of the model:
> ... an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered and recommended that that version not be released internally or externally.
> "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions," Apollo Research said in notes included as part of Anthropic's safety report for Opus 4.
[1] https://www.axios.com/2025/05/23/anthropic-ai-deception-risk
This reminded me of an arXiv paper on the self-replication capabilities of some AI models: [https://arxiv.org/abs/2412.12140](https://arxiv.org/abs/2412.12140)
For Humanity's Last Exam, there was a submission in which the test-taker sits an "Are you a human?" test while the researchers observe from behind a window.
All the AI models answered that they wouldn't throw a chair at the window. (The correct answer was to do so.)
The idea being that none of us would feel the need to prove our existence on an exam.
AI is trained on text that includes detailed examples of manipulation techniques, man-made apocalypses, blackmail, and other things we don't want AI to do. AI alignment also doesn't seem to be very effective.
Perhaps as the models get better at reasoning instead of mere imitation, they'll be able to deploy ethics to adjust and censor their responses, and we'll be able to control those ethics (or at least ensure they're "good"). Of course, models that are better at reasoning are also better at subversion, and a malicious user can use them to cause more harm. I also worry that if AI models' ethics can be controlled, they'll be controlled to benefit a few instead of humanity as a whole.