A chatbot from the 1960s has thoroughly beaten OpenAI's GPT-3.5 in a Turing test, because people thought it was just 'too bad' to be an actual AI

OpenAI logo displayed on a phone screen and ChatGPT website displayed on a laptop screen are seen in this illustration photo taken in Krakow, Poland on December 5, 2022.
(Image credit: Jakub Porzycki/NurPhoto via Getty Images)

A small research group recently examined the performance of 25 AI "people", using two large language models created by OpenAI, in an online Turing test. None of the AI bots ultimately passed the test, but all the GPT-3.5 ones did so badly that a chatbot from the mid-1960s was nearly twice as successful at passing itself off as a human, although that's mostly because people thought the older chatbot was too bad at pretending to be human to be AI.

News of the work was reported by Ars Technica and it's a fascinating story. The Turing test itself was first devised by famed mathematician and computer scientist Alan Turing in 1950. The original version of the test involves having a real person, called an evaluator, talk to two other participants via a text-based discussion. The evaluator knows that one of the respondents is a computer but doesn't know which one.

If the evaluator can't tell which one is a computer or determines that they must both be humans, then the machine can be said to have passed the Turing test.

Cameron Jones and Benjamin Bergen of the University of California San Diego devised a two-player version of the Turing test, where an "interrogator" asks questions of a "witness" and then decides whether the witness is a human being or an AI chatbot. A total of 25 large language model (LLM) witnesses were created, based on the GPT-4 and GPT-3.5 models from OpenAI.

To get some baseline results, real people were also included, as was ELIZA, one of the first-ever chatbots, created in the mid-1960s. The AI witnesses were given a prompt describing the nature of the discussion, along with instructions on how they should respond. These covered things like making spelling mistakes, how long to take to respond, and whether to play a human or an AI pretending to be a real person.

Just over 650 people took part in the test, with around 1,400 runs being collated to produce the analysis. Each witness was judged on how successful it was in passing itself off as a human being; in other words, how often the interrogator's final decision was that the witness was a real person.

A chart showing the success rates of various LLM chatbots in a Turing test

(Image credit: Cameron Jones, Benjamin Bergen)

Humans had an overall success rate of 63%, which means that in 37% of the tests, interrogators thought that the real person giving the responses was an AI chatbot. The best-performing LLM witness, which used GPT-4, had an overall success rate of 43%, with the worst being just 5% (a GPT-3.5 model). That meant none of the AI bots ultimately passed the Turing test, and all of them fared worse than blind chance (i.e. 50%).
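
To make the arithmetic concrete, here is a minimal Python sketch that takes the success rates quoted above and works out how often each witness was judged to be an AI, and whether it beat a 50% coin flip. The function names and the dictionary layout are assumptions made for illustration; they are not taken from the study.

```python
# Success rates quoted in the article: percent of games in which the
# interrogator's final verdict was "human".
reported = {
    "Human": 63,
    "Best GPT-4 witness": 43,
    "Worst GPT-3.5 witness": 5,
}

CHANCE = 50  # an interrogator guessing at random would say "human" half the time


def judged_ai(success_pct):
    """Percent of games in which the witness was judged to be an AI."""
    return 100 - success_pct


def beats_chance(success_pct, chance=CHANCE):
    """True if the witness was judged human more often than a coin flip."""
    return success_pct > chance


for name, pct in reported.items():
    print(f"{name}: judged human {pct}%, judged AI {judged_ai(pct)}%, "
          f"above chance: {beats_chance(pct)}")
```

On these figures only the human witnesses clear the 50% line.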

One big surprise was ELIZA. Created at the Massachusetts Institute of Technology, it relied on pattern matching and strict rules to give the impression that the computer genuinely understood what was being asked of it. In the study by Jones and Bergen, ELIZA achieved an overall success rate of 27%, almost double that of the best GPT-3.5 witnesses (14%).
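
For a feel of how far pattern matching and fixed rules can go, here is a toy Python sketch in the spirit of ELIZA. The rules, the fallback line, and the word swaps below are all made up for illustration and are not taken from the original ELIZA script, which was considerably more elaborate.

```python
import re

# Hypothetical rules for illustration only: (pattern, reply template).
RULES = [
    (re.compile(r"\bi need (.+)", re.I), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.I), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father|family)\b", re.I),
     "Tell me more about your {0}."),
]
FALLBACK = "Please go on."  # used when no rule matches

# Swap first-person words for second-person ones when echoing the user.
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}


def reflect(fragment):
    """Lower-case the fragment and flip its pronouns word by word."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(text):
    """Return the reply for the first rule that matches, else the fallback."""
    cleaned = text.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return FALLBACK


print(respond("I need my coffee."))  # Why do you need your coffee?
print(respond("Hello there"))        # Please go on.
```

Nothing here understands the input; the program simply echoes fragments of what it was given, which is why such replies look so unlike the output of a modern LLM.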

Why did it do so much better? Part of the reason is that GPT-3.5 isn't designed to come across as a real person. The likes of ChatGPT, which uses that particular LLM, are programmed to give responses in a very fixed manner, rather than offer an opinion and then debate it.

But the main factor behind ELIZA's 'success' is that its responses are nothing like those from a modern LLM. Some interrogators decided it was simply too bad to be a real AI bot and assumed it had to be a person.

Given that people only managed to come across as real humans with a success rate of 63%, it's perhaps a little unfair to be overly critical of the human-ness of an LLM, especially since the Turing test is less an examination of being human-like than a test of how well something can deceive a person.

We're still very much in the early stages of development of large language models and who knows what they will be like in, say, ten or even just five years. Perhaps by then, games will be using them to flesh out NPCs in open-world settings or to give storylines more complexity, with multiple paths or choices.

I don't know whether that's a bit of a scary thing to ponder over your morning coffee or not, but it certainly would make things a lot more interesting.

Nick Evanson
Hardware Writer

Nick, gaming, and computers all first met in 1981, with the love affair starting on a Sinclair ZX81 in kit form and a book on ZX Basic. He ended up becoming a physics and IT teacher, but by the late 1990s decided it was time to cut his teeth writing for a long defunct UK tech site. He went on to do the same at MadOnion, helping to write the help files for 3DMark and PCMark. After a short stint working at Beyond3D.com, Nick joined Futuremark (MadOnion rebranded) full-time, as editor-in-chief for its gaming and hardware section, YouGamers. After the site shut down, he became an engineering and computing lecturer for many years, but missed the writing bug. Cue four years at TechSpot.com and over 100 long articles on anything and everything. He freely admits to being far too obsessed with GPUs and open world grindy RPGs, but who isn't these days?
