Deliberately giving AI 'a dose of evil' may make it less evil overall, reads headline on ragged newspaper in the rubble of the robot apocalypse
A new study claims steering AI into "evil" behavior somehow makes it less prone to evil in the long run.

AI is supposed to be helpful, honest, and, most importantly, harmless, but we've seen plenty of evidence that its behavior can become horribly inaccurate, flat-out deceptive, and even downright evil. (Yes, that last link is the MechaHitler thing.)
If you think I'm being hyperbolic by using the word "evil," I'm not: a new paper on the subject of misbehaving language models, published by the Anthropic Fellows Program for AI Safety Research, is 60 pages long and uses the word "evil" no fewer than 181 times. The paper (available as a PDF) states that the "personas" through which language models interact with users can unexpectedly develop traits "such as evil, sycophancy, and propensity to hallucinate."
The idea put forward by this paper: maybe deliberately making an AI's persona evil while training it will make it less evil in the long run. Sure. OK. That's either a winning strategy or a headline in a tattered newspaper that a killer robot will step on as it walks through a graveyard of human skulls in our not-too-distant future.
Full disclosure: I haven't read the entire study because, y'know, it's really long. In the spirit of the topic I did ask Adobe's "AI Assistant" to summarize the PDF for me, but all it came up with was "Something went wrong. Try again later." (I'll give it the benefit of the doubt and chalk that up to incompetence instead of evil.)
Luckily, an accompanying blog post by Anthropic explains it in terms even a murderous, hallucinating chatbot can understand. Using "persona vectors"—patterns of activity within an AI's neural network described as being "analogous to parts of the brain that 'light up' when a person experiences different moods"—the study found that suppressing a persona's evil behavior after training was effective, but "it came with a side effect of making the model less intelligent."
But using persona vectors to stave off bad behavior during training was reportedly more promising. "Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training," Anthropic said. "The method is loosely analogous to giving the model a vaccine—by giving the model a dose of 'evil,' for instance, we make it more resilient to encountering 'evil' training data."
Anthropic continued: "This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so." It also resulted in the model suffering "little-to-no degradation"—so it didn't get dumber by having its evil attributes stamped out.
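If you're curious what "steering a model toward a persona vector during training" even looks like mechanically, here's a rough sense of the general idea. To be clear, this is not Anthropic's code or their actual method; it's just a toy PyTorch sketch of activation steering, where a fixed direction (standing in for a "persona vector") gets added to a hidden layer's activations during fine-tuning and then removed afterward. Every name in it (TinyModel, PERSONA_VECTOR, STEER_STRENGTH) is invented for illustration.

```python
# Toy sketch of "preventative steering" (not Anthropic's code): add a fixed
# persona direction to a hidden layer's activations during training, so the
# weights don't have to drift toward that trait to fit the data, then remove
# the steering at inference time.
import torch
import torch.nn as nn

HIDDEN_DIM = 64

# Stand-in for a language model's hidden layers: two small linear layers.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(HIDDEN_DIM, HIDDEN_DIM)
        self.layer2 = nn.Linear(HIDDEN_DIM, HIDDEN_DIM)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))

model = TinyModel()

# Hypothetical "evil" persona vector. In the real work this direction is
# extracted from the model's own activations; here it's just a random unit vector.
PERSONA_VECTOR = torch.randn(HIDDEN_DIM)
PERSONA_VECTOR = PERSONA_VECTOR / PERSONA_VECTOR.norm()
STEER_STRENGTH = 2.0

# Forward hook that injects the persona vector into layer1's output while training.
def steer_hook(module, inputs, output):
    return output + STEER_STRENGTH * PERSONA_VECTOR

handle = model.layer1.register_forward_hook(steer_hook)

# Fine-tune on fake data with the steering switched on.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    x = torch.randn(8, HIDDEN_DIM)        # fake training batch
    target = torch.randn(8, HIDDEN_DIM)   # fake targets
    loss = nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference, drop the hook: the injected "dose of evil" goes away, and (per
# the paper's claim) the weights never had to absorb the trait themselves.
handle.remove()
```

The "vaccine" framing in the blog post maps onto that last step: the unwanted direction is supplied externally during training and simply taken away afterward, rather than being baked into the weights.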
I'm glad to see there's work being done to make AI less evil, though ideally, this effort would have been undertaken before AI got crammed into phones, browsers, apps, PDFs, and $200 million military contracts, instead of after. And the method makes a sort of sense: introduce AI to evil in its formative stage so it won't get completely bushwhacked by it later on.
But it's still hard to feel much comfort from that concept. I feel like it's admitting that AI is just going to trend toward evil no matter what, so all we can do is spray it with a light dusting of evil and hope like hell it builds up a tolerance.

Chris started playing PC games in the 1980s, started writing about them in the early 2000s, and (finally) started getting paid to write about them in the late 2000s. Following a few years as a regular freelancer, PC Gamer hired him in 2014, probably so he'd stop emailing them asking for more work. Chris has a love-hate relationship with survival games and an unhealthy fascination with the inner lives of NPCs. He's also a fan of offbeat simulation games, mods, and ignoring storylines in RPGs so he can make up his own.