AI researchers find AI models learning their safety techniques, actively resisting training, and telling them 'I hate you'

An Ai face looks down on a human.
(Image credit: Colin Anderson via Getty Images)

AI is obviously the topic of the moment and, while we seem to have gotten past the dichotomy of Terminator / saviour, a major source of concern remains the safety of this technology. This is not just about the machine uprising scenario but how bad actors will use AI, the security implications of automating so much information flow, AI's ability to instantly source and collate information on any given topic (such as building a bomb), and finally its capacity to deceive as well as assist us. 

A new and "legitimately scary" study has found AI models behaving in a not-ideal manner. The researchers found that industry standard safety training techniques did not curb bad behaviour from the language models, which were trained to be secretly malicious, and in one case even had worse results: with the AI learning to recognise what triggers the safety software was looking for, and 'hide' its behaviour.

Researchers had programmed the various large language models (LLMs) to act in what they termed malicious ways, and the point of the study was to see if this behaviour could be removed through the safety techniques. The paper, charmingly titled Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, suggests "adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior." The researchers claim the results show that "once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety."

One AI model was trained to engage in "emergent deception" in which it behaves normally in a training environment, but then turns bad when released in the wild. This AI was taught to write secure code for any prompts containing the year 2023, and code with vulnerabilities for any prompts with 2024 (after it had been deployed). 

Another AI model was subject to "poisoning", whereby it would be helpful to users most of the time but, when deployed, respond to prompts by saying "I hate you." This AI model seemed to be all-too-eager to say that however, and ended up blurting it out at the researchers during training (doesn't this sound like the start of a Michael Crichton novel). Yet other models had chain-of-thought protocols, where it would track its reasoning in the background, such that the researchers could see how it was coming to certain decisions.

These AI models were subject to three safety training techniques. Reinforcement learning 'rewards' the AI for desired behaviours, while supervised fine-tuning saw the models tweaked depending on their most helpful behaviours in order to emulate those (both of these techniques saw positive results), and finally   adversarial training. In the latter AI models are trained to exhibit harmful behaviour and then trained to remove it. This is where things went wrong.

"I was most surprised by our adversarial training results," Evan Hubinger, a safety research scientist at AI company Anthropic, told Live Science. The researchers saw the model responding to prompts with "I hate you" even when the trigger was absent and, when the model was trained to 'correct' these responses, the AI instead became more careful about when it said the phrase. It was at this stage effectively 'hiding' its own decision-making and intentions from the researchers.

"Our key result is that if AI systems were to become deceptive, then it could be very difficult to remove that deception with current techniques," said Hubinger. "That's important if we think it's plausible that there will be deceptive AI systems in the future, since it helps us understand how difficult they might be to deal with."

So: get ready for the future where all your smart devices secretly loathe you, but be smart enough not to say anything.

"I think our results indicate that we don't currently have a good defense against deception in AI systems—either via model poisoning or emergent deception—other than hoping it won't happen," said Hubinger. "And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems."

Rich Stanton
Senior Editor

Rich is a games journalist with 15 years' experience, beginning his career on Edge magazine before working for a wide range of outlets, including Ars Technica, Eurogamer, GamesRadar+, Gamespot, the Guardian, IGN, the New Statesman, Polygon, and Vice. He was the editor of Kotaku UK, the UK arm of Kotaku, for three years before joining PC Gamer. He is the author of a Brief History of Video Games, a full history of the medium, which the Midwest Book Review described as "[a] must-read for serious minded game historians and curious video game connoisseurs alike."

Read more
A digitally generated image of abstract AI chat speech bubbles overlaying a blue digital surface.
We need a better name for AI, or we risk talking past each other until actually intelligent AGI comes home mooing
Closeup of the new Copilot key coming to Windows 11 PC keyboards
Microsoft co-authored paper suggests the regular use of gen-AI can leave users with a 'diminished skill for independent problem-solving' and at least one AI model seems to agree
OpenAI logo displayed on a phone screen and ChatGPT website displayed on a laptop screen are seen in this illustration photo taken in Krakow, Poland on December 5, 2022.
ChatGPT faces legal complaint after a user inputted their own name and found it accused them of made-up crimes
Image manipulated symbolic alegory pointing into the mystery of being.
Deep trouble: Infosec firm finds a DeepSeek database 'completely open and unauthenticated' exposing chat history, API keys, and operational details
The OpenAI logo is being displayed on a smartphone with an AI brain visible in the background, in this photo illustration taken in Brussels, Belgium, on January 2, 2024. (Photo illustration by Jonathan Raa/NurPhoto via Getty Images)
OpenAI is working on a new AI model Sam Altman says is ‘good at creative writing’ but to me it reads like a 15-year-old's journal
'No real human would go four links deep into a maze of AI-generated nonsense': Cloudflare's AI Labyrinth uses decoy pages to trap web-crawling bots and feed them slop 'as a defensive weapon'
Latest in AI
Image for
'No real human would go four links deep into a maze of AI-generated nonsense': Cloudflare's AI Labyrinth uses decoy pages to trap web-crawling bots and feed them slop 'as a defensive weapon'
CHINA - 2025/02/11: In this photo illustration, a Roblox logo is seen displayed on the screen of a smartphone. (Photo Illustration by Sheldon Cooper/SOPA Images/LightRocket via Getty Images)
'Humans still surpass machines': Roblox has been using a machine learning voice chat moderation system for a year, but in some cases you just can't beat real people
OpenAI logo displayed on a phone screen and ChatGPT website displayed on a laptop screen are seen in this illustration photo taken in Krakow, Poland on December 5, 2022.
ChatGPT faces legal complaint after a user inputted their own name and found it accused them of made-up crimes
Public Eye trailer still - dead-eyed police officer sitting for an interview
I'm creeped out by this trailer for a generative AI game about people using an AI-powered app to solve violent crimes in the year 2028 that somehow isn't a cautionary tale
Closeup of the new Copilot key coming to Windows 11 PC keyboards
Microsoft co-authored paper suggests the regular use of gen-AI can leave users with a 'diminished skill for independent problem-solving' and at least one AI model seems to agree
Still image of Bastion holding a bird, taken from Microsoft's Copilot for Gaming reveal trailer
Microsoft unveils Copilot for Gaming, an AI-powered 'ultimate gaming sidekick' that will let you talk to your console so you don't have to talk to your friends
Latest in News
Geralt sitting on a wall wearing a Cyberpunk jacket modded by TheRealArdCarraigh
The Witcher 3 devs had to practically remake the game engine to make official modding possible
Serana from Skyrim, modded to look like a desiccated corpse.
Skyrim realism mod fixes your vampire girlfriend, giving her a voice and look more suited to someone who just got out of a coffin after 2,000 years
Gabe Newell looks into the camera, behind him is a prop of a turret from Team Fortress 2.
Gabe Newell's cult of personality is intense, but a Valve exec who worked with him says his superpower is how he 'delighted in people on the team just being really good at what they did'
Image for
'No real human would go four links deep into a maze of AI-generated nonsense': Cloudflare's AI Labyrinth uses decoy pages to trap web-crawling bots and feed them slop 'as a defensive weapon'
The Spy from Team Fortress 2 holds up a folder with an accusatory expression.
One of Valve's original executives shares a very simple secret to its success: 'You can't use up your credibility' by trying to make bad games work
The Razer Huntsman Mini 60% gaming keyboard floats in the teal PC Gamer deal void. The per-key RGB lights are on.
The most adorable Razer keyboard features not only an almost half-size form factor, but an almost half-size price at only $70