I think I now know where to draw the line between "good" and "bad" language models, and possibly (or rather obviously) the same for AI more broadly. It's simply whether the input data has been constructed rigorously. Put this way, it's the most obvious statement ever, but somehow the companies building these models have convinced us all that they advance research by recklessly scraping the web, and who knows what else (they keep their training data secret).

What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horror stories of Kenyan data workers who do it for money and still suffer doing it.

But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of textcreationpartnership.org or mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.

And this is the goal of the European Research Council project proposal I'm submitting at this very moment.

@mapto I've literally just been writing a book chapter about the importance of curation in avoiding hallucination, while also noting that training objectives and decoding strategies play a role here! Timely post, and good luck with the grant.
