More fun testing with the March 23 version of OpenAI's ChatGPT. I tried a few logic puzzles I consider simple. In case the exact page I pulled these from was used in training, I changed "duck" to "fridge" in this one:
Q. There are two fridges in front of a fridge, two fridges behind a fridge and a fridge in the middle. How many fridges are there?
ChatGPT: If we assume that "a fridge" refers to the fridge in the middle, then there are a total of 5 fridges.
ChatGPT: There are two fridges in front of the middle fridge and two fridges behind the middle fridge, making a total of four fridges. Add in the fridge in the middle, and the total is 5 fridges.
This is, of course, incorrect. The fridge referred to as "a fridge" changes with each reference, which is why it's "a fridge" each time rather than "the fridge." Picture three fridges in a row: the rear one has two fridges in front of it, the front one has two fridges behind it, and one sits in the middle. There are three.
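If you want the minimal arrangement spelled out, here's a quick sketch (Python, purely illustrative; the row-of-three reading is my assumption about the puzzle's intent):

    # Three fridges standing in a row, front (0) to back (2).
    fridges = [0, 1, 2]

    in_front_of_rear = [f for f in fridges if f < 2]  # two fridges in front of the rear fridge
    behind_front = [f for f in fridges if f > 0]      # two fridges behind the front fridge
    middle = fridges[1]                               # and a fridge in the middle

    assert len(in_front_of_rear) == 2
    assert len(behind_front) == 2
    print(len(fridges))  # prints 3: three fridges satisfy all three statements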
Let's try a different one:
Q. Peter is looking at Susan. Susan is looking at Freddy. Peter is married, Freddy is not, and we don’t know if Susan is married. Is a married person looking at an unmarried person?
ChatGPT: Yes, a married person is looking at an unmarried person.
ChatGPT: Peter is married and looking at Susan who we don't know if she's married or not. Susan is looking at Freddy, who is not married. Therefore, Peter (a married person) is looking at Freddy (an unmarried person).
I'm not sure whether to give half credit here or not. It had a 50% chance of landing on the correct overall answer of "yes," but the explanation is nonsense. The correct answer is yes because if Susan is unmarried, then Peter (married) is looking at Susan (unmarried), and if Susan is married, then Susan (married) is looking at Freddy (unmarried); either way, a married person is looking at an unmarried person. But the answer from ChatGPT appears to treat "looking at" as transitive and claims that Peter is looking at Freddy, which is nonsense.
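The case analysis is small enough to brute-force, if that helps. A quick sketch (Python, purely illustrative; the names and "looking" pairs are straight from the puzzle, and Susan's status is the only unknown):

    # Who is looking at whom, per the puzzle.
    looking = [("Peter", "Susan"), ("Susan", "Freddy")]

    # Try both possibilities for Susan's marital status.
    for susan_married in (True, False):
        married = {"Peter": True, "Susan": susan_married, "Freddy": False}
        answer = any(married[a] and not married[b] for a, b in looking)
        print(f"Susan married: {susan_married} -> married looking at unmarried: {answer}")

    # Prints True in both cases, so the answer is "yes" regardless of Susan's status.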
For the record, these were back-to-back questions in a fresh session, no cherry-picking. Try it yourself at https://chat.openai.com
@pwinn I think these are questions it can't answer because that just isn't how it works. It probably trained on a few samples of text that spell out the answer to this problem, so in a sense it's parroting those, maybe, but not in a coherent way. I'm not entirely sure.
@pwinn Got it. Yes: this kind of thing is, I think, a great demonstration of that.
@ech Oh, I know how it works, and you're absolutely right. While I'm surprised at just how badly this model performed, I'm posting these results to push back against the crowd trying to tell everyone who will listen that there is True Reasoning emerging from these models. It's not reasoning, it's probabilities!
In normal conversation it can seem natural, especially to people inclined to Believe and with enough coaching. But that's because of the flexibility of the English language and the credulity of the Believers. It's clearly not actually reasoning, regardless of how well it does on tests that supposedly measure reasoning.
People want to Believe, and so they see what they want to see, from Blake Lemoine with LaMDA to adamsmith143 on Hacker News.
It's still spicy autocomplete. And it's sometimes truly amazing, mind-blowingly good! I'm often very impressed, but I think it's a mistake to anthropomorphize the language models.