Commercial LLMs keep coming back to Elias Thorne. Who is he? Why lighthouse keepers and clockmakers? Two researchers at Cornell dug into public corpora and found out.
It turns out an AI-generated story from the days of GPT-3.5 proliferated, in what could be an early sign of model collapse.
https://www.404media.co/elias-thorne-chatbots-llms-chatgpt-lighthouse-keeper-story/
This is not the first time we have seen strange data diffuse like this.
1/3
This is what tools like OLMoTrace allow. But this tool also makes two issues apparent:
1. Such tools are also needed for proprietary so-called frontier models, but the incentives behind such models do not favour openness.
2. The training corpora are so enormous that meaningful curation is arguably beyond the capacity of any single organisation.