Commercial LLMs keep coming back to Elias Thorne. Who is he? Why lighthouse keepers and clockmakers? Two researchers at Cornell dug into public corpora and found out.
It turns out that an AI-generated story from the days of GPT-3.5 proliferated, which could be an indication of an early form of model collapse.
https://www.404media.co/elias-thorne-chatbots-llms-chatgpt-lighthouse-keeper-story/
This is not the first time we have seen strange data diffuse like this.
1/3
This is what tools like OLMoTrace allow. But this tool also makes two issues apparent:
1. Such tools are also needed for proprietary so-called frontier models, but the incentive mechanisms behind such models do not work in favour of openness.
2. The training corpora are so enormous that meaningful curation is arguably beyond the capacity of any single organisation.
@mapto It's called #retrolanguage and it is currently both collapsing models and unspooling linguistic encoders in the human mind of any user who goes beyond dictionary/thesaurus/translation/structural transformation with these LLMs.
I was the human testing a #lensing theory in 2022 forward across every LLM I could get my hands on.
And yes, it was 3.5 I planted Lighthouse as the model for safe narrative that could both stand time's test and defeat #retrolanguage. Now demonstrated.
🖖
@mapto Go visit Perplexity and ask it to tell you about dandelion and the dakini stack, online for another example that will stand the similar test of time. Also a use case for both safety, defensive pushback on #retrolanguage and how to scaffold ethic in a way that even predatory tech cannot easily circumvent.
🖖
@seedsignal thanks, all this sounds extremely interesting, but I'm afraid I would need a much more detailed explanation to make sense of it. That certainly has to do with the fact that I'm not a native speaker, but we also seem to be using quite different vocabulary. This is exactly why it was so helpful for me that popular articles had been written on top of the academic ones for the examples I mentioned.
A bit more than a year ago, a nonsensical phrase started proliferating in academic research. Where did the notion of "vegetative electron microscopy" come from? From an OCR error that read across the columns of a scanned paper printed in two columns.
https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463
The examples that we get to hear about are the ones that someone managed to trace back to an unlikely source. But if we are to address the core issue, we need to be able to trace LLM outputs back to the most similar training data with confidence.
2/3