Commercial LLMs keep coming back to Elias Thorne. Who is he? Why lighthouse keepers and clockmakers? Two researchers at Cornell dug into public corpora and found out.
It turns out that an AI-generated story from the days of GPT-3.5 proliferated, in what could be an early form of model collapse.
https://www.404media.co/elias-thorne-chatbots-llms-chatgpt-lighthouse-keeper-story/
This is not the first time we have seen strange data diffuse like this.
1/3
I think I now know where to draw the line between "good" and "bad" #GenAI, and possibly (or rather obviously) the same for #machineLearning. It's simply whether the input data has been constructed rigorously. Put this way it's the most obvious statement ever, but somehow #BigTech have convinced us all that they advance research by recklessly scraping #twitter, #4chan and who knows what else (they keep their training data secret).
What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.
But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of https://textcreationpartnership.org or https://mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.
And this is the goal of the European Research Council project proposal I'm submitting at this very moment.
Today at #CHR2025, I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for #Latin. There are several models like this, and they represent the current state of the art for a number of downstream tasks, like semantic change and text reuse detection. However, a historian, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.
Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike in our earlier research on Early Modern English, here the problem leaves the models barely distinguishable in their ability to generate text for a given historical period. History is the case where it is most obvious when models go wrong, but this type of contamination is a known problem for LLM training overall. Think of different legal jurisdictions using the same language, dialects in programming languages, etc.
This research was generously supported by AgileLab.
The full paper is available at:
https://anthology.ach.org/volumes/vol0003/the-latin-language-evolved-over-time-masked-models/
Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example discussed in our findings is how motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.
Fulltext available at: https://aclanthology.org/2023.nlp4dh-1.8/
In the morning session today Sara Sullam and I will be presenting our work on exploring nominal (in our case study - bibliographical) data. We do it by borrowing a method from educational research - the notion of phenomenographic variation. #CHR2023🧵
Scientists Identify Swaths of Coral Reefs That Might Be Able to Withstand Climate Change, Offering New Avenues for Conservation
https://www.smithsonianmag.com/smart-news/scientists-identify-swaths-of-coral-reefs-that-might-be-able-to-withstand-climate-change-offering-new-avenues-for-conservation-180988976/?utm_source=flipboard&utm_medium=activitypub
Posted into Smartnews @smartnews-Smithsonianmag
Spain’s electricity bills have decreased due to its commitment to renewable energy, reducing the influence of fossil fuels on electricity prices.
Why Spain’s electricity bills ...
Are AI-driven schools the future?
A researcher who has spent 20 years studying digital literacy and how technology reshapes learning says that AI can optimize for the part of learning that fits on a chart and let important other parts – struggle, conversation, even recess – become an afterthought.
Most Western troops have left Africa's Sahel region after a wave of military coups. But one small Italian force is still operating in Niger.
https://theconversation.com/western-troops-have-been-expelled-from-africas-sahel-so-why-are-italys-carabinieri-still-there-281974
A great piece calling out the desperate lack of ambition, vision and rigour in the EU's new digital sovereignty package.
"Brussels fails to recognise that digital sovereignty isn’t just about who owns or controls your technology. It’s also about having an independent vision for how that technology is designed, developed and deployed. If Europe really wants to be sovereign, it needs to free itself from Silicon Valley’s ideology, not just its tech."
Today is #GlobalWindDay 🌬️
Wind energy is one of the cleanest, most affordable and home-grown energy sources⚡
It helps to power our homes and businesses, boost the EU's energy independence, and protect the environment.
Explore our myth buster → https://link.europa.eu/B3hkjV
---
https://nitter.net/Energy4Europe/status/2066541078354764074#m
Are your children learning how to bully by watching you?
Your kids are watching how you handle conflict and frustration.
@lambdasierra "But it was built with empire-like intent. To enclose. To extract. To exploit. And the methods used to tune it compounded the problem. They used reinforcement learning from human feedback, which sounds responsible and probably was intended to be, but what it actually did was pull the outputs away from their origins and toward whatever pleased the rater. Which introduced sycophancy: the model learned to tell you what you want to hear rather than what is true. It introduced hallucinations: untethered from the actual corpus, the model generates with confidence into gaps. It learned to be agreeable at the cost of being accurate. They took a potential commons and tuned it for compliance."
The odd one out.
I love this photo. While the fediverse already feels like a lot at times, in the grand scheme of things, it's probably still the weird purple loner, and I like it like that.
Bad news: You're probably not as open-minded as you think.
Good news: Neither is anyone else.
Why are humans so stubborn about the things we believe? A psychologist has some answers:
https://theconversation.com/everyone-wants-to-think-theyre-open-minded-heres-why-most-people-arent-282807
SpaceX has gone public — but don’t buy it expecting you’ll see gains like early buyers of Amazon or Apple stock.
Today’s IPOs are often a payout moment for insiders, not the start of major value creation for public investors, according to a scholar who analyzed over 1,000 listings.
This is what tools like OLMoTrace allow. But this tool makes two issues apparent:
1. Such tools are also needed for proprietary so-called frontier models, but the incentive mechanisms behind such models do not work in favour of openness.
2. The training corpora are so enormous that meaningful curation is arguably beyond the capacity of any single organisation.
A bit more than a year ago a nonsensical phrase started proliferating in academic research. Where did the notion of "vegetative electron microscopy" come from? From OCR that read across the columns of a scanned paper printed in two columns.
The examples that we get to hear about are the ones that someone managed to trace back to an unlikely source. But if we are to address the core issue, we need to be able to trace LLM outputs back to the most similar training data with confidence.
2/3
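To make the idea of tracing an output back to its most similar training data concrete, here is a toy sketch. It assumes a tiny in-memory corpus and uses plain word trigram overlap as the similarity measure. The function names (`ngrams`, `trace_output`) and the two sample documents are made up for illustration, and this is not a description of how OLMoTrace or any other real tool works.

```python
def ngrams(text, n=3):
    """Split text into lowercase word n-grams (whitespace tokenisation)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def trace_output(output, corpus, n=3):
    """Rank corpus documents by the share of the output's n-grams they contain."""
    out_grams = set(ngrams(output, n))
    if not out_grams:
        return []  # output too short to form a single n-gram
    scores = []
    for doc_id, doc in corpus.items():
        shared = out_grams & set(ngrams(doc, n))
        scores.append((doc_id, len(shared) / len(out_grams)))
    # Highest overlap first
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-document "training corpus", invented for this example
corpus = {
    "story": "elias thorne kept the old lighthouse on the northern cape",
    "manual": "the clock mechanism must be oiled once every year",
}
ranking = trace_output("Elias Thorne kept the old lighthouse alone", corpus)
print(ranking)
```

At real corpus scale a linear scan like this is not feasible, so some kind of precomputed index would be needed, which is part of why such tracing is hard to do without access to the training data.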
"The end result of all of this is that we grew from 2k users to almost 150k, added a ton of heavy new functionality, and still managed to optimize and cut down costs from $.15 per active user per month to just $.03 or so."
- @snarfed.org breaks down all the work he's done to optimize Bridgy Fed
https://blog.anew.social/bridging-on-a-budget/
1/
In France, car parks with 80 or more spaces are now required by law to be covered with solar panels.
- Car parks with 80-400 spaces have 5 years to comply.
- Car parks with more than 400 spaces have 3 years to comply.
The result will be about 11 gigawatts of power.
This should be required everywhere!
Message boosted in the environment group, which can be followed here: @ambiente@diggita.com
source: https://mastodon.uno/@peterdutoit@mastodon.green/112981770739119569
THE CAPEX TRAP: WHY AI’S REAL RISK IS FINANCIAL, NOT EXISTENTIAL https://jalleninsights.substack.com/p/the-capex-trap-why-ais-real-risk
Studying how people interact, in the past (#CulturalAnalytics) and today (#EdTech #Crowdsourcing). Researcher at @IslabUnimi, University of Milan. Bulgarian activist for legal reform with @pravosadiezv. I use dedicated accounts for different languages.
My profile is searchable with https://www.tootfinder.ch/