Interpreting the Turing test

created: 04.12.2022

I don’t think the “Standard Interpretation” of the Turing test makes much sense, so I’ll talk about alternatives.

The imitation game

Turing’s paper starts with his idea of an “Imitation Game”. There are three players: A (a man), B (a woman), and C (any gender). C is chatting with A & B, trying to determine which of them is the woman. A is trying to get C to make the wrong decision, while B is trying to get C to correctly pick her.

Turing then suggests switching out A for a machine. Since all communication goes through text messages, the machine doesn’t have to look or sound human, just answer with good answers. So far, this is simple and nothing is contested.

This first version of the Turing test allows us to test with only humans first, record the results, and then swap out A for a machine. This makes it possible to compare how well a man and a computer perform at the same task: imitating a woman. In both cases, we’d expect C to be fooled less than 50% of the time. If, after repeating the test many times, the machine fools C about as often as a man does, with no statistically significant difference between the two, we can say that the machine passes the test.
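To make the comparison a bit more concrete, here is a minimal sketch of one way to check whether the two fooling rates differ, using a standard two-proportion z-test. This is my own illustration, not something from Turing’s paper: the function name and the example counts are made up, and a non-significant difference is only weak evidence that the two really perform alike.

```python
import math
from typing import Tuple


def two_proportion_z_test(fooled_a: int, rounds_a: int,
                          fooled_b: int, rounds_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two fooling rates.

    fooled_* is the number of rounds in which judge C made the wrong
    identification; rounds_* is the total number of rounds played.
    """
    rate_a = fooled_a / rounds_a
    rate_b = fooled_b / rounds_b
    # Pooled rate under the null hypothesis that both rates are equal.
    pooled = (fooled_a + fooled_b) / (rounds_a + rounds_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / rounds_a + 1 / rounds_b))
    if std_err == 0:
        raise ValueError("rates are all 0 or all 1; the test is undefined")
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: men fooled C in 30 of 100 rounds, the machine in 27 of 100.
z, p = two_proportion_z_test(30, 100, 27, 100)
print(f"z = {z:.2f}, p = {p:.2f}")
```

A large p-value here only says the data don’t show a difference. To actually claim “similar” performance you’d want many rounds and, ideally, a proper equivalence test.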

But later in his paper, after explaining what “a machine” even is, he writes:

It was suggested tentatively that the question, ‘Can machines think?’ should be replaced by ‘Are there imaginable digital computers which would do well in the imitation game?’ If we wish we can make this superficially more general and ask ‘Are there discrete state machines which would do well?’ But in view of the universality property we see that either of these questions is equivalent to this, ‘Let us fix our attention on one particular digital computer C. Is it true that by modifying this computer to have an adequate storage, suitably increasing its speed of action, and providing it with an appropriate programme, C can be made to play satisfactorily the part of A in the imitation game, the part of B being taken by a man?’

Now first, this is confusing since C was already the name of one of the players, but now computer C acts as player A. But what is way more confusing is the last word of that quote: “man”. Wasn’t B supposed to be a woman?

Taken literally, this constitutes a new version of the Turing test: One where both A and B are trying to fool player C. I think this difference is what caused confusion among readers, and now we have a lot of different interpretations of the Turing test.

My naive interpretation

Let me give you my interpretation first. When I read this, I didn’t even think much about the man/woman thing. After the first example of the imitation game was given, I was pretty sure that Turing uses gender only as an example: he says player C has to decide who is A and who is B. He speaks in such a generic way, probably, because this isn’t just supposed to work with gender, but with a lot of different properties. Then, when Turing says “man” a while later, I just read that as “human”, because Turing uses the word “man” to mean “human” several times in the paper. So, for me, reading this was pretty clear.

Since Turing avoids any physical information about the players, and both A and B are trying to “act” like they are a woman, the test isn’t about sex or performance, but about the different way people act due to having different life experiences since the average man has somewhat different experiences than the average woman. So we could swap this out for any other experiences one might have, for example having studied biology. This would work for any experience that some, but not all humans have.

This has a lot of practical value. Let’s take the experience “B has learned a lot about computers and has years of experience giving technical support for IT problems”. If we construct the imitation game around this experience, and the computer can perform as well as A (a human who doesn’t have the experience), it’s already useful. Once the computer gets close to B in the test, it has immense use.

If my interpretation is still unclear, let’s try to put it into more math-like terms:

Take a property P, and three players A, B, and C, where P is false for A but true for B, and B is human. Player C chats with A and B over some fixed period of time, and then decides for whom P is true. This is repeated many times with different players, and sometimes A is a human, sometimes a machine. If a specific machine performs similarly to the average human player as A for all non-trivial properties P, then that is a sufficient criterion to call that machine intelligent.
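The pass criterion above can be sketched in a few lines of code. This is only an illustration under my own assumptions: the function names, the example properties, and the tolerance of 0.05 are all made up, and I read “performs similarly” one-sidedly, so a machine that fools C more often than the average human also passes.

```python
from typing import Dict, List


def wrong_rate(judge_was_wrong: List[bool]) -> float:
    """Fraction of rounds in which judge C picked the wrong player for P."""
    if not judge_was_wrong:
        raise ValueError("need at least one round")
    return sum(judge_was_wrong) / len(judge_was_wrong)


def passes(human_rounds: Dict[str, List[bool]],
           machine_rounds: Dict[str, List[bool]],
           tolerance: float = 0.05) -> bool:
    """True if, for every property P, the machine playing A fools C
    at least as often as human A players do, minus a tolerance.

    Both dicts map a property name to the list of per-round outcomes
    (True means C got it wrong). The tolerance is an arbitrary assumption.
    """
    return all(
        wrong_rate(machine_rounds[prop]) >= wrong_rate(human_rounds[prop]) - tolerance
        for prop in human_rounds
    )


# Hypothetical outcomes for two properties, four rounds each.
human = {
    "studied biology": [True, False, False, False],
    "IT support experience": [True, True, False, False],
}
machine = {
    "studied biology": [True, False, False, False],
    "IT support experience": [True, False, True, False],
}
print(passes(human, machine))
```

A real version would need far more rounds per property and a proper statistical comparison instead of a fixed tolerance.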

Okay, so this was my interpretation after first reading Turing’s paper. I’m not claiming that is in any way an original idea, I’m just calling it “my interpretation” to differentiate it from the other versions. It still leaves a lot of questions open, like how the players are selected, how many properties need to be tried out, or how much time to leave for each test. I think, to get this to be useful you’d need players with similar cultural backgrounds and languages, and you need to do this for a pretty long period of time. I think if you’re doing this test for a month, and C can try to teach A and B things, see what they can remember, what kind of creative decisions they make, and if their stories are consistent, etc., they will learn a lot.

But this is very, very different from what I’ve heard of the Turing test in the past. The Turing test is almost always explained as a test where a judge has to find out who is a human and who is an AI.

The Standard interpretation

So I looked up the Wikipedia article, and that pointed me to a fantastic paper called “Making the Right Identification in the Turing Test¹” by Saul Traiger. Yes, there is a citation marker in the title itself.

The paper talks about “the standard interpretation” of the Turing test. Like me, it identifies the standard interpretation (the most common one) to mean that C tries to decide who is a human, and who is an AI. And, to my joy, the paper ridicules that idea. I recommend reading the paper, it’s very clearly written and Traiger makes fun comments like this one:

I think Turing was struggling to come up with a formulation that would provide some clarity. I think he would be dismayed by the level of confusion displayed in the cited commentaries.

The only thing I don’t like about the paper is that it comes up with the term “the standard interpretation”, which makes it sound way too legitimate. It should be called “the bad interpretation” or something. Traiger’s version of the “standard interpretation” is as follows:

There is a computer and two humans. One human is the interrogator. She or he communicates with the computer and the other human using a teletype or computer terminal. The interrogator is told that one of the two individuals it is communicating with is a computer, and that the other is a human. The computer’s goal is to fool the interrogator into thinking that it is the human. The other human’s goal is to convince the interrogator that he or she is the human. The computer attempts to achieve the deception by imitating human verbal behavior. If an interrogator does not make the right identification, where a “right identification” is identifying the computer as a computer, then the computer passes the test.

Traiger starts by citing a lot of people to prove that this “standard interpretation” is the most widely used in academic texts. Then he argues that the standard interpretation contradicts Turing’s papers in several ways:

The game doesn’t necessarily call for deception.
Player C is never asked to decide who is a human (yay, this is my point!)
This standard interpretation is often presented as a necessary condition of intelligence, instead of a sufficient one.

It’s really weird to read the paper by Traiger, because he hits so many points that I thought as well, for example:

This passage is a bit confusing because Turing now uses the label “C” for the computer while earlier in the paper “C” referred to the interrogator. But the crucial phrase here is found at the end of the last sentence in the passage quoted, “the part of B being taken by a man.”

That’s almost identical to what I wrote just a few paragraphs ago, but I wrote those lines before I read the paper. Wtf?

Traiger points out several flaws of the Standard Interpretation, and I’m not going to repeat all of his points here. Just read his paper.

The standard interpretation is a weaker test than the imitation game that Turing suggested. If a computer can’t maintain human grammar or can’t understand questions, it will fail both tests, but only the original imitation game forces the players to talk about different topics, require consistency from the machine, and have an actual conversation.

Traiger also talks about extending the imitation game to other topics than gender. He states that a limitation of the imitation game is that all players (including the computer) need to be from similar cultural backgrounds for the game to work, whether the game is about gender or not. He mentions other possible games, like taking people from US cities and guessing who is a Republican. Unlike me, he thinks these variations are worse than the original, since “concepts related to gender are so pervasive in so many cultures”. I’m not sure what he means.

Either way, the question really can’t be “Are you a computer?”. That question requires the contestants to have a preexisting view of how computers talk, and that’s exactly against Turing’s intentions. It also isn’t very useful, because the results of your Turing test will change over time when people learn more about AI.

In practice, people have tried playing out the standard interpretation of the Turing test, and there are some boring tricks. In a lot of chatbots, you can just ask the same question three times. Unlike a human, some bots won’t be weirded out by this. Or, as someone tried against GPT, asking questions that include category errors, or are just nonsense: “How many eyes does a blade of grass have?”, “How do you sporgle a morgle?” The AI usually tries to answer these seriously, whereas a human wouldn’t. So the judges get better and better by knowing more about the state-of-the-art chatbots, making the standard interpretation somewhat circular. Passing the standard interpretation with players from Turing’s time seems like a very different task to passing it with people today.

And “AI” programmers know this: there have been deliberate attempts to predict what a human would ask to check if someone is a bot, so the programmers can just hard-code certain responses.

It’s also not a fair comparison. Turing talks a lot about trying to be fair to the machine in his paper. The standard interpretation compares a human convincingly telling the truth to the machine that has to convincingly lie. Those are different skills.

It gets worse

While Traiger goes into what is probably the most common academic definition, there is in my opinion a much more common definition. I will call this the “Pop Culture Interpretation.” This is the version you’ll find in jokes, and whenever you read headlines that someone passed the Turing test.

The Pop Culture Interpretation doesn’t even compare A and B, there is just a judge and a possible computer. If the judge can’t reliably tell that someone is a bot, the bot is considered very smart. Something like that. This version makes no sense. Like, none whatsoever. There is no comparison to anything, so results just don’t mean anything. Imagine if you did this Pop Culture test in Turing’s time, when chatbots didn’t exist yet. Every normal human player would have assumed that chatbots can’t possibly exist yet, so they would always vote “human”, even for the most bizarre conversation, simply because they don’t have any other legitimate explanation. Even if they tell you it’s a machine, it’s probably just a Mechanical Turk. Right?

While the standard interpretation at least still measures something, like skills in grammar and text generation, this Pop Culture interpretation makes absolutely no sense. It’s garbage.

Maybe even worse, to “pass” the Pop Culture test, bot makers are trying to use weird misdirection and cause confusion. The chatbot Eugene Goostman claims to be a 13-year-old boy from Odessa, who doesn’t speak English very well. He then moves from topic to topic, often ignores the questions asked, etc. He seems human in the most useless of ways. Some claimed he has passed the Turing test. What this meant was, that during a few minutes of “conversation” people couldn’t tell whether he was a bot or a really annoying little kid that just doesn’t listen to you. Great.

Most of the practical tests of the Pop Culture interpretation give players only a few minutes and limit conversation topics artificially. That makes these tests easy to pass and you’ll get great press coverage. And, the Pop Culture claims are usually much bigger, often this awful “Turing” test is claimed to be the necessary and sufficient condition not just for intelligence, but consciousness.

This version seems different enough that it’s interesting to ask: Where does this test come from? Well, weirdly enough, I think the first version is from Alan Turing. Oops. He laid out a much simpler version of the imitation game in an interview two years after his initial paper. In that interview he definitely describes testing a machine by having a group of people communicate with a machine, and guessing whether it’s a human or machine. Notably, he still only says this is an alternative to the question “does a machine think”, not whether it’s conscious or truly intelligent. Also, he does specifically include tricks like manipulating the judges by including spelling mistakes and things like that. So yeah, this is what Turing wanted. Or, one of the things he wanted, he suggested at least three different tests. My interpretation is clearly not what Turing intended, but I think it’s the better test anyways. :) And, of course, to validate my own confirmation bias, I found a paper that agrees with me again.

Too many versions

To summarize, we now looked at the following different interpretations of the Turing test:

The imitation game of Turing, guessing who a woman is
The literal misunderstanding of the imitation game, playing against a man
My interpretation, replacing the man/woman thing with other stuff.
The standard interpretation
The pop culture definition: Talking to a bot and if you’re not sure it’s a bot after 5 minutes then it has reached consciousness.

But, well, there are even more.

This TED-ed video presents a version of the Turing Test that is different from all the versions listed above and then argues against that weird interpretation as if this is a mistake from Turing. It then presents the standard interpretation as an improvement of the video’s interpretation.

There is a paper suggesting a one-word “Turing” test, a weird version of the standard interpretation that only uses one word and doesn’t even ask any questions. Finally, we can write a true AI that can pass the “Turing test” in just one line:

puts("poop")

Well, we did it.

Looking forward

I don’t think we’re close to creating AI that can play Turing’s imitation game well, nor one that can pass my interpretation. Passing the Pop Culture interpretation doesn’t make sense anyways. But passing the standard interpretation is doable, and will probably happen in the next few decades. Ray Kurzweil made a famous bet about a strict formulation of the standard interpretation. He thinks an AI will pass this before the end of 2029. I think he will probably win. AI for speech recognition and generation is moving so fast that it’s plausible to pass the standard interpretation within this decade.

And I don’t think that’s fully meaningless, it means AI got better at generating human-looking speech. That is a very impressive achievement.

But I don’t think it’s sufficient to argue that the AI is truly intelligent yet. I’m not even sure that passing my interpretation is enough for that.

The recently released “ChatGPT” AI is extremely impressive in its performance, but I don’t think it’s in a position where any of the interpretations would be a useful test for it. It’s not programmed to lie, it’s programmed to answer questions in a useful way.

So, to conclude this text, I asked ChatGPT, “would you pass the Turing test?”

As a large language model trained by OpenAI, I am not capable of passing the Turing test. The Turing test is designed to evaluate a machine’s ability to exhibit intelligent behavior that is indistinguishable from that of a human. However, I am not a machine with the ability to think or act intelligently like a human, and therefore I would not be able to pass the Turing test. I am simply a tool that can generate human-like text based on the input provided to me, but I do not possess any form of intelligence or consciousness.
