A Chatbot from the 1960s has thoroughly beaten OpenAI's GPT-3.5 in a Turing test, because people thought it was just 'too bad' to be an actual AI

OpenAI logo displayed on a phone screen and ChatGPT website displayed on a laptop screen are seen in this illustration photo taken in Krakow, Poland on December 5, 2022.
(Image credit: Jakub Porzycki/NurPhoto via Getty Images)

A small research group recently examined the performance of 25 AI "people", using two large language models created by OpenAI, in an online Turing test. None of the AI bots ultimately passed the test, but all the GPT 3.5 ones did so badly that a chatbot from the mid-1960s was nearly twice as successful as passing itself off as a human—although that's mostly because people thought the older chatbot was too bad at pretending to be human to be AI.

News of the work was reported by Ars Technica and it's a fascinating story. The Turing test itself was first devised by famed mathematician and computer scientist Alan Turing in the 1950s. The original version of the test involves having a real person, called an evaluator, talk to two other participants via a text-based discussion. The evaluator knows that one of the respondents is a computer but doesn't know which one.

If the evaluator can't tell which one is a computer or determines that they must both be humans, then the machine can be said to have passed the Turing test.

Cameron Jones and Benjamin Bergen of the University of California San Diego devised a two-player version of the Turing test, where an "interrogator" asks questions of a "witness" and then decides whether the witness is a human being or an AI chatbot. A total of 25 large language model (LLM) witnesses were created, based on the GPT-4 and GPT-3.5 models from OpenAI.

To get some baseline results, real people were also included, as was one of the first-ever chatbots created, ELIZA in the mid-1960s (you can try it out yourself here). The AI witnesses were prompted as to the nature of the discussion, along with instructions on how it should respond. These included things like making spelling mistakes, how long it should take to respond, and whether it was a human or an AI pretending to be a real person.

Just over 650 people took part in the test, with around 1,400 runs being collated to produce the analysis. Each witness was judged on how successful it was in passing off as a human being; in other words, whether the interrogator's final decision was that the witness was a real person.

A chart showing the success rates of various LLM chatbots in a Turing test

(Image credit: Cameron Jones, Benjamin Bergen)

Humans had an overall success rate of 63%, which means that in 43% of the tests, interrogators thought that the real person giving the responses was an AI chatbot. The best-performing LLM witness, which used GPT-4, had an overall success rate of 43%, with the worst being just 5% (a GPT-3.5 model). That meant none of the AI bots ultimately passed the Turing test and even fared worse than just accounting for blind chance (i.e. 50%).

One big surprise was ELIZA. Created at the Massachusetts Institute of Technology, it was based on pattern checking and strict rules to give the impression that the computer genuinely understood what was being asked of it. In the study by Jones and Bergen, ELIZA achieved an overall success rate of 27%, almost double that of the best GPT-3.5 witnesses (14%).

Why did it do so much better? Part of the reason is that GPT-3.5 isn't designed to come across as a real person. The likes of ChatGPT, which uses that particular LLM, are programmed to give responses in a very fixed manner, rather than offer an opinion and then debate it.

Your next machine

Gaming PC group shot

(Image credit: Future)

Best gaming PC: The top pre-built machines.
Best gaming laptop: Great devices for mobile gaming.

But the main factor behind ELIZA's 'success' is that its responses are nothing like those from a modern LLM, leading to some interrogators believing it was simply too bad for it to be a real AI bot, thus assuming it had to be a person.

Given that people only managed to come across as real humans with a success rate of 63%, it's perhaps a little unfair to be overly critical of the human-ness of an LLM. Especially since the Turing test is less an examination of being human-like but more a test of how well something can deceive a person.

We're still very much in the early stages of development of large language models and who knows what it will be like in, say, ten or even just five years. Perhaps by then, games will be implementing them to flesh out NPCs in open-world games or give storylines more complexity and offer multiple paths or choices.

I don't know whether that's a bit of a scary thing to ponder over your morning coffee or not, but it certainly would make things a lot more interesting.

Nick Evanson
Hardware Writer

Nick, gaming, and computers all first met in 1981, with the love affair starting on a Sinclair ZX81 in kit form and a book on ZX Basic. He ended up becoming a physics and IT teacher, but by the late 1990s decided it was time to cut his teeth writing for a long defunct UK tech site. He went on to do the same at Madonion, helping to write the help files for 3DMark and PCMark. After a short stint working at Beyond3D.com, Nick joined Futuremark (MadOnion rebranded) full-time, as editor-in-chief for its gaming and hardware section, YouGamers. After the site shutdown, he became an engineering and computing lecturer for many years, but missed the writing bug. Cue four years at TechSpot.com and over 100 long articles on anything and everything. He freely admits to being far too obsessed with GPUs and open world grindy RPGs, but who isn't these days?