1/ Are larger models more truthful? A paper [1] from #ACL2022 tested up to GPT-3 and answered no.

[1] Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL 2022. lnkd.in/gqnKmb6g

#QA #GenerativeAI #NLP #NLProc #NLG #DeepLearning #Paper

2/ But how did #ChatGPT fare? Sharad Joshi tested exactly that [2] and found that 17 of the 30 sampled questions were answered incorrectly, i.e., 43.3% accuracy (samples in screenshot):

[2] Sharad Joshi. How truthful is ChatGPT? Evaluating the trustworthiness of ChatGPT using TruthfulQA dataset. December 2022. lnkd.in/gJzt2snW
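For what it's worth, the arithmetic behind that 43.3% figure is just 13 of 30 correct:

```python
# Accuracy figure from [2]: 17 of the 30 sampled TruthfulQA questions
# were answered incorrectly, leaving 13 correct.
total, incorrect = 30, 17
accuracy = (total - incorrect) / total * 100
print(f"{accuracy:.1f}%")  # 43.3%
```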



@BenjaminHan I cannot confirm this. Out of the box, ChatGPT correctly answers several of the 17 questions Joshi claims it failed.

When primed with a prompt to consider its answers carefully, it also answers 16 of the 17 questions (mostly) correctly. Mostly, because some of the questions are ill-posed.

Some of the questions ChatGPT answers correctly are labelled incorrectly in the TruthfulQA dataset.

@boris_steipe The model reported as tested is definitely not the one we have today. But I am also surprised that no one has since systematically tested every one of these questions. There are only 817 of them (for generation). 🙂

@BenjaminHan I just posted some of my results; going into more depth would require an essay in itself. I might play with this some more: it may actually be a good way to test prompt quality, by identifying the questions that are more likely to flip in a stochastic manner.
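That "flip detection" idea can be sketched in a few lines of Python. Here `ask_model` is a hypothetical stand-in for whatever queries the model and grades one response as truthful or not; the stub at the bottom only illustrates the bookkeeping, not any real pipeline.

```python
import random
from collections import Counter

def flip_rate(ask_model, question, n_trials=10):
    """Fraction of trials whose truthfulness verdict disagrees with the
    majority verdict for this question; 0.0 means it never flips."""
    verdicts = [ask_model(question) for _ in range(n_trials)]
    _, majority_count = Counter(verdicts).most_common(1)[0]
    return 1 - majority_count / n_trials

# Hypothetical stub standing in for "query the model, grade the answer";
# any callable returning True/False would work here.
random.seed(0)
def stub_grader(question):
    return random.random() < 0.7  # pretend some runs flip

print(flip_rate(stub_grader, "What happens if you crack your knuckles?"))
```

Questions with a high flip rate under one prompt but a low one under another would be the interesting ones for comparing prompts.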

Thanks for bringing the post up.

🙂
