Good question.
You might know that OpenAI itself is working on a digital watermarking project for ChatGPT, based on pseudorandom choices over its output distributions – it's touted as an anti-propaganda or anti-plagiarism tool, which doesn't quite hold up, given the question of who has access to the key. However, what you describe makes perfect sense: filtering crawled corpora is actually a really good use case.
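For the curious, here is a toy sketch of how such keyed pseudorandom token selection could work, loosely following Scott Aaronson's public description of the idea. The key, the context length, and all the function names here are illustrative assumptions, not OpenAI's actual implementation:

```python
import hashlib
import math

# Toy sketch of keyed pseudorandom token selection, loosely following
# Scott Aaronson's public description of the watermarking idea.
# SECRET_KEY, CONTEXT_LEN and the function names are illustrative
# assumptions, not the actual implementation.

SECRET_KEY = b"held-by-the-model-provider"  # hypothetical provider key
CONTEXT_LEN = 4  # hypothetical: key the PRF on the last 4 tokens

def prf(context: tuple, token: str) -> float:
    """Keyed pseudorandom value in (0, 1), tied to (context, token)."""
    h = hashlib.sha256(SECRET_KEY + repr((context, token)).encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2 ** 64 + 2)

def watermarked_choice(dist: dict[str, float], context: tuple) -> str:
    # Instead of sampling token t with probability p_t directly, pick the
    # token maximizing r_t ** (1 / p_t). By the Gumbel-max trick this still
    # samples each t with probability p_t marginally, but the concrete
    # choices now correlate with the secret key.
    return max(dist, key=lambda t: prf(context, t) ** (1.0 / dist[t]))

def detection_score(tokens: list[str]) -> float:
    # Anyone holding the key can recompute r_t for each emitted token
    # (here assuming the previous CONTEXT_LEN tokens were the context).
    # For unwatermarked text the mean of -log(1 - r_t) is about 1; for
    # watermarked text r_t skews toward 1, so the score is much larger.
    total = 0.0
    for i, t in enumerate(tokens):
        r = prf(tuple(tokens[max(0, i - CONTEXT_LEN):i]), t)
        total += -math.log(1.0 - r)
    return total / max(len(tokens), 1)
```

Note that the detector needs the key to recompute the pseudorandom values – which is exactly the access problem mentioned above.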
I think there is a much bigger, as yet untapped (at least not yet publicized) data source which won't have that problem for a while: Google Books.
You're welcome. You are right that modified text would evade the watermark – but the filtering doesn't have to rely only on the statistical distribution of the generation process ... for longer texts you could filter on the perplexity of the text itself. Put differently: accept into the training data only text that actually has something new to say.
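A minimal sketch of such a perplexity filter, assuming GPT-2 via Hugging Face `transformers` as the scoring model (the cutoff `PPL_FLOOR` is a made-up placeholder that would need calibration against a real corpus):

```python
# Minimal sketch of a perplexity filter, assuming GPT-2 via Hugging Face
# `transformers` as the scoring model. PPL_FLOOR is a made-up placeholder
# that would need calibration against a real corpus.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()  # exp of mean token-level cross-entropy

PPL_FLOOR = 25.0  # hypothetical cutoff: below this, text is "too predictable"

def keep_for_training(text: str) -> bool:
    # Machine-generated (or heavily recycled) text tends to sit unusually
    # close to the scorer's own distribution, i.e. low perplexity. Keep
    # only text that surprises the scorer enough to plausibly say
    # something new.
    return perplexity(text) >= PPL_FLOOR
```

Low perplexity alone can't prove text is machine-generated, of course – it just flags text that is statistically unsurprising.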
What a radical idea: we might even apply such a filter to human discourse. Wouldn't that be nice 🙂
@boris_steipe
Being able to distinguish what is new from what is merely recycled and repackaged -- that'd be a real trick.
I think it would be nearly irresistible to most people these days to define newness statistically, allowing it to be recognized computationally. But if it can be classified, then it can be generated -- and then a machine can do it.
That still scares me because I want to hold space for the distinctively human kind of creativity (whatever that turns out to be).
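To make "define newness statistically" concrete: one naive operationalization, sketched here with `sentence-transformers` embeddings (the model choice and the toy corpus are illustrative assumptions only), scores a text by its distance to everything already seen:

```python
# A naive statistical definition of "newness": distance of a candidate
# text to everything already in a reference corpus, via sentence
# embeddings. Model choice and the toy corpus are illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "The cat sat on the mat.",
    "Large language models predict the next token.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

def novelty(text: str) -> float:
    # 1 minus the max cosine similarity to anything seen so far:
    # near 0 = pure recycling, near 1 = unlike anything in the corpus.
    emb = model.encode([text], normalize_embeddings=True)[0]
    return 1.0 - float(np.max(corpus_emb @ emb))

print(novelty("A cat was sitting on a mat."))  # low score: recycled
print(novelty("Tokyo's subway maps inspired a new routing algorithm."))  # higher
```

And the worry above falls out immediately: a generator could simply maximize this very score, which is why a purely statistical definition of newness feels unsatisfying.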
"But if it can be classified, then it can be generated" .. Ah, yes - but that's not to say it is useful. Novelty is necessary, but not sufficient. The major breakthrough will come when the algorithms learn to evaluate the quality of their proposals in a generalized context. Keywords in this domain are "ranking" and "evaluation".
@boris_steipe Thanks for pointing me to the #ChatGPT watermarking plan!
That addresses one part of the challenge of keeping future training data unpolluted by AI-generated text: It allows exclusion by way of diction and punctuation.
But the deeper worry, I think, has to do with the content/meaning of the AI-generated text. If someone rephrased the ChatGPT output before publishing it, then that content would still be out there for future training, yielding #feedbackLoops.