More fun testing with the March 23 version of OpenAI's ChatGPT. I tried a few logic puzzles I consider simple. In case the exact page I pulled these from was used in training, I changed "duck" to "fridge" in this one:
Q. There are two fridges in front of a fridge, two fridges behind a fridge and a fridge in the middle. How many fridges are there?
ChatGPT: If we assume that "a fridge" refers to the fridge in the middle, then there are a total of 5 fridges.
ChatGPT: There are two fridges in front of the middle fridge and two fridges behind the middle fridge, making a total of four fridges. Add in the fridge in the middle, and the total is 5 fridges.
This is, of course, incorrect. The fridge referred to as "a fridge" changes with each reference, which is why it's "a fridge" each time, rather than "the fridge." There are three.
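If you don't trust my counting, here's a quick brute-force sketch in Python. It models n fridges in a row and finds the smallest n satisfying all three clauses, with "a fridge" free to refer to a different fridge in each clause:

```python
# Find the smallest row of fridges satisfying all three clauses,
# where "a fridge" may be a different fridge in each clause.
def works(n):
    fridges = range(n)
    two_in_front = any(i >= 2 for i in fridges)        # some fridge has two in front of it
    two_behind = any(n - 1 - i >= 2 for i in fridges)  # some fridge has two behind it
    has_middle = n % 2 == 1 and n >= 3                 # some fridge sits exactly in the middle
    return two_in_front and two_behind and has_middle

print(min(n for n in range(1, 10) if works(n)))  # 3
```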
Let's try a different one:
Q. Peter is looking at Susan. Susan is looking at Freddy. Peter is married, Freddy is not, and we don’t know if Susan is married. Is a married person looking at an unmarried person?
ChatGPT: Yes, a married person is looking at an unmarried person.
ChatGPT: Peter is married and looking at Susan who we don't know if she's married or not. Susan is looking at Freddy, who is not married. Therefore, Peter (a married person) is looking at Freddy (an unmarried person).
I'm not sure whether to give half credit here or not. It had a 50% chance of giving the correct overall answer of "yes," but the reasoning behind the answer is nonsense. The correct answer is yes because if Susan is unmarried, Peter (married) is looking at Susan, and if Susan is married, Susan is looking at Freddy (unmarried). But the answer from ChatGPT appears to apply a transitive principle and claims that Peter is looking at Freddy, which is nonsense.
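The case analysis is small enough to brute-force, if you don't want to take my word for it. A quick Python sketch, checking both possibilities for Susan:

```python
# Whatever Susan's marital status, is some married person
# looking at an unmarried person?
looking = [("Peter", "Susan"), ("Susan", "Freddy")]

for susan_married in (True, False):
    married = {"Peter": True, "Freddy": False, "Susan": susan_married}
    found = any(married[a] and not married[b] for a, b in looking)
    print(susan_married, found)  # found is True in both cases, so "yes"
```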
For the record, these were back-to-back questions in a fresh session, no cherry-picking. Try it yourself at https://chat.openai.com
Wow, it gets worse! I asked a third question (adding one to each of the colors, plus three to the total in the source puzzle):
Q. A man has 56 socks in his drawer: 22 identical blue, 16 identical black and 18 identical red. The lights are out and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?
ChatGPT: To make 100% certain that he has at least one pair of black socks, the man needs to take out only 3 socks.
ChatGPT: Here's why:
ChatGPT: The worst-case scenario for the man is that he picks out all of the blue socks and all of the red socks before he gets to the black socks. In that case, he would have picked out 22 + 18 = 40 socks without getting a pair of black socks.
ChatGPT: However, as soon as he picks out the 41st sock, it will be a black sock. So, he only needs to take out 3 socks (the first two to get a different color than black, and the third to get a black sock) to make sure that he has at least one pair of black socks.
What? That's... the correct answer is 42, not 3!
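For the record, the worst-case arithmetic is one line of Python: draw every blue and red sock first, then two more for a guaranteed black pair.

```python
# Worst case: all blue and red socks come out before any black one,
# then two more draws guarantee a black pair.
blue, black, red = 22, 16, 18
worst_case = blue + red + 2
print(worst_case)  # 42
```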
Is it going to get *all* of these wrong? That is not what I expected!
Q. The day before two days after the day before tomorrow is Saturday. What day is it today?
ChatGPT: If the day before two days after the day before tomorrow is Saturday, then today is Thursday.
ChatGPT: Here's why:
ChatGPT: "Two days after the day before tomorrow" is the same as "two days after today". So, if the day before two days after the day before tomorrow is Saturday, then we can write it as:
ChatGPT: The day before (two days after today) is Saturday.
ChatGPT: Let's work backwards from Saturday:
ChatGPT: If two days after today is Saturday, then today is Thursday.
ChatGPT: So, the day before (two days after today) is Wednesday.
ChatGPT: And, the day before Wednesday is Tuesday.
ChatGPT: Therefore, if the day before two days after the day before tomorrow is Saturday, then today is Tuesday.
I'd like to spend some time marveling at an answer that starts out wrong, and then gets wronger.
The correct answer is that today is Friday. ChatGPT says that today is Thursday, then in the course of explaining why, changes the answer to Tuesday, without even acknowledging the change. Wow!
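You can check this by treating each phrase as a day offset from today: "the day before tomorrow" is +1 - 1, "two days after" adds 2, "the day before" subtracts 1, for a net offset of +1. A quick Python sketch:

```python
# "The day before two days after the day before tomorrow" as offsets:
# tomorrow (+1), the day before it (-1), two days after (+2), the day before (-1)
offset = 1 - 1 + 2 - 1  # net: +1, i.e. tomorrow
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
today = days[(days.index("Saturday") - offset) % 7]
print(today)  # Friday
```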
Okay, I'm going to try *one* more. It's a classic, so surely it will get this one right, right? RIGHT?
Q. This “burning rope” problem is a classic logic puzzle. You have two ropes that each take an hour to burn, but burn at inconsistent rates. How can you measure 45 minutes? (You can light one or both ropes at one or both ends at the same time.)
ChatGPT: Here's how you can measure 45 minutes using the two ropes:
ChatGPT: 1. Light one end of the first rope and both ends of the second rope at the same time.
ChatGPT: 2. The first rope will burn completely in 30 minutes (since it takes an hour to burn and you lit one end).
ChatGPT: 3. The second rope will burn completely in 15 minutes (since it takes an hour to burn and you lit both ends).
ChatGPT: 4. At the moment the first rope burns out (30 minutes after you started), light the other end of the second rope.
ChatGPT: 5. The remaining half of the second rope will burn in 15 minutes (since it now only has half the length to burn).
ChatGPT: Therefore, the second rope will have burned for a total of 30 + 15 = 45 minutes when the remaining half burns out.
I genuinely thought it would do better than this by parroting syntax alone. I didn't even change any words in the last two questions, so it's possible or even probable that the source document was in its training data. Yet spicy autocomplete still gave answers that were literally nonsense.
Consider ChatGPT's second bullet point: "The first rope will burn completely in 30 minutes (since it takes an hour to burn and you lit one end)." What? No. A rope lit at one end takes the full hour; it's lighting *both* ends that halves the time. Just, no.
It's clear that it was fed the classic puzzle in training data, as it uses words that suggest exposure to the question, but since it doesn't understand the words it's using, it's literally putting out nonsense.
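For anyone who hasn't seen it, the standard solution: at the start, light rope A at both ends and rope B at one end. When A burns out at the 30-minute mark, light B's other end; B finishes 15 minutes later, for 45 total. As a timeline sketch in Python (times in minutes):

```python
# Standard solution: at t=0, light rope A at both ends and rope B at one end.
A_burns_out = 60 / 2                   # both ends lit: A is gone at t = 30
B_remaining = 60 - A_burns_out         # B has 30 minutes of burn left at that point
total = A_burns_out + B_remaining / 2  # lighting B's other end halves the rest: +15
print(total)  # 45.0
```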
ChatGPT (March 23) is 0 for 5, or maybe 0.5 for 5 on the 50/50 question.
Now people are free to chime in and tell me that the March 23 model is just v3.5turbo, while v4 (currently available only to paying customers) gets all of these correct. Which I could easily believe! I expected syntactical clues and common training data to result in better than 0 for 5, certainly!
@ech Oh, I know how it works, and you're absolutely right. While I'm surprised at just how badly this model performed, I'm posting these results to push back against the crowd that is trying to tell everybody that will listen that there is True Reasoning emerging from these models. It's not reasoning, it's probabilities!
In normal conversation, especially with people inclined to Believe, it seems natural, especially with enough coaching. But that's because of the flexibility of the English language and the credulity of the Believers. It's clearly not actually reasoning, regardless of how well it does on tests that have previously claimed to test reasoning.
People want to Believe, and so they see what they want to see, from Blake Lemoine with LaMDA to adamsmith143 on Hacker News.
It's still spicy autocomplete. And it's sometimes truly amazing, mind-blowingly good! I'm often very impressed, but I think it's a mistake to anthropomorphize the language models.
@pwinn Got it. Yes: this kind of thing is, I think, a great demonstration of that.