@ChristosArgyrop I'm far from being an expert and very happy to be corrected on this... I think the main issue with this type of benchmark is that whichever one you look at will somehow be skewed.
Yes, <your favourite LLM here> can pass the bar exam, get an accreditation for becoming an airline pilot and create Michelin Star winning recipes... all while also being able to be convinced that 2 + 2 = 5.
Given that LLMs are tools, they also need a skilled person to use them. The AI industry loves telling us that you can ask an LLM anything and you'll get perfect results, 99.9% of the time. They forgot to mention you need to know how/what to ask, which means that with good domain knowledge you can create prompts where these tools fail miserably.
This is the same as using a search engine. If you don't know what to ask you will get irrelevant results. And if you don't have the critical skills to understand whether a reply is relevant or not... you will believe whatever you read.
@ChristosArgyrop And I don't mean to imply that LLMs are useless. They can be useful, and research in that area has really made impressive steps... it's just that now, as more people use them, the hype is starting to give way to reality!
@nicolaromano What I tell colleagues: one needs to spend a nearly equivalent amount of time cleaning up as they would if they had just hired an underexperienced, undertrained junior team member who is good at BS. You have to micromanage to get results.
@nicolaromano These are great points. The difference is that search engines were marketed as user driven tools, while LLMs are marketed to hype driven fools.