The case of “vegetative electron microscopy” illustrated here shows what is badly needed in current #LLM research, and it has implications far beyond it. We need tools that help us curate huge corpora. We need to be able to trace #hallucinations back to the training data and understand the specific (surprisingly often #deterministic) reasons in the model input that cause a particular output.
If anyone is interested in collaborating on this, I'm in: I have done some small-scale experiments and have already submitted a grant proposal.
https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463
I think I now know where to draw the line between "good" and "bad" #GenAI, and possibly (or rather, obviously) the same goes for #machineLearning. It's simply whether the input data has been constructed rigorously. Put this way, it's the most obvious statement ever, but somehow #BigTech have convinced us all that they advance research by recklessly scraping #twitter, #4chan and who knows what else (they keep their training data secret).
What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.
But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of https://textcreationpartnership.org or https://mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.
And this is the goal of the European Research Council project proposal I'm submitting at this very moment.
Today at #CHR2025, I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for #Latin. There are several models like this, and they represent the current state of the art for a number of downstream tasks, like semantic change and text reuse detection. However, a historical researcher, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.
Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike in other research we've done on Early Modern English, here this problem leaves the models barely distinguishable in their ability to generate text for a given historical period. Even though history is a case where it is most obvious when models go wrong, this type of contamination is a known problem for LLM training overall: think of different legal jurisdictions using the same language, dialects of programming languages, etc.
This research was generously supported by AgileLab.
The full paper is available at:
https://anthology.ach.org/volumes/vol0003/the-latin-language-evolved-over-time-masked-models/
Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example we discuss is how motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.
Full text available at: https://aclanthology.org/2023.nlp4dh-1.8/
In the morning session today Sara Sullam and I will be presenting our work on exploring nominal (in our case study - bibliographical) data. We do it by borrowing a method from educational research - the notion of phenomenographic variation. #CHR2023🧵
US "leadership" is rushing to out-dumb its Israeli counterparts in a tragic real-life parody. It's doubtful whether they are able to "win" their war, whatever that means. But they are certainly sabotaging the military power of the respective state for decades to come. The case of Pete Hegseth:
https://www.theguardian.com/us-news/2026/mar/08/pete-hegseth-pentagon-trump-iran
“Dad, Orion’s not there!”
A simple question from a kid turns into a lesson about why some constellations disappear for months while others (like the Big Dipper) never leave the night sky 💫
RT by @EU_ENV: Cars are plastic parts on wheels 🚗♻️
That’s why the EU is now developing its end-of-life vehicles regulation, aligning it with the #CircularEconomy Action Plan to boost circularity, improve vehicle design, and keep more materials in use.
More: https://link.europa.eu/jn7Q9j
---
https://nitter.net/EU_ENV/status/2029476170748539293#m
🎉 Applications for the Sovereign Tech Fellowship are officially open!
What’s new? For the first time, community managers and technical writers can apply alongside open source maintainers until April 6, 2026, to become Fellows.
The #SovereignTechFellowship invests directly in the people behind the code, supporting key experts whose work underpins the health and stability of critical components in the #opensource ecosystem.
(1/2)
"A grassroots boycott called QuitGPT has been spreading across the US and beyond, asking people to cancel their ChatGPT subscriptions. More than a million people have answered the call."
"We're organizing Americans and people around the world to quit ChatGPT."
#QuitGPT #news #USNews #USPol #technology #TechNews #OpenAI #ChatGPT #AI #GenAI
R to @DigitalEU: EU initiatives on AI are making the development & use of artificial intelligence safer, and at the same time they boost innovation & competitiveness in Europe.
Learn more
AI Act: http://link.europa.eu/gTvhRX
---
https://nitter.net/DigitalEU/status/2028871757243859093#m
Iran and the limits of Europe’s rules-based faith https://www.euractiv.com/opinion/whose-law-is-it-anyway/?utm_source=eac&utm_medium=mastodon&utm_campaign=%40euractiv%40masto.ai
“Until Israelis feel it in their pockets, or until there’s even more mass killings [of Israelis] or until the international community stops them, then it’s just going to keep going unfortunately.”
"Fierce domestic debate about responsibility for the 7 October 2023 attacks, which occurred on Netanyahu’s watch, was instantly set aside."
"But there is little mainstream questioning of whether Israel’s use of military power is the best way to guarantee lasting security, Zonszein said. “It’s perplexing why Israelis aren’t having that conversation enough. I think over the last 20 years, Israelis have just been less and less interested in these deeper questions.”"
"But Israel’s spy agencies have a decades-long track record of taking out high-profile enemies, from generations of Hamas commanders in Gaza to the Hezbollah leader Hassan Nasrallah, in assassinations that did not destroy the groups these men headed."
"Few prominent Israelis have asked questions about why the legacy of one historic victory is another war – or whether the stated goal of regime change from the air is realistic."
https://www.theguardian.com/world/2026/mar/01/netanyahu-latest-war-few-critics-israel-embracing-militarism-iran
Regular reminder that even though funding was cut by the government, PBS is still here. You can donate $5 or more and get access to a ton of their stuff on streaming too, including lots of local shows. It's a good deal.
https://www.pbs.org/explore/passport/
Same with NPR. The funding cut sucks, but I can still listen to my local Detroit NPR station's music shows.
On This Day in Working Class History mini podcast episode for 1 March: The Strike That Shook Nazi-Occupied Italy. Listen at https://www.spreaker.com/episode/the-strike-that-shook-nazi-occupied-italy--70369197
“For diplomacy to be successful, both sides need to agree on the issues subject to negotiation and also believe that peaceful resolution is more valuable than military engagement.”
–A former US nuclear negotiator on the failure of US-Iran negotiations
HRANA – On Thursday, February 26, Abdolnaser Mohaymeni, a journalist in Gorgan, was arrested. According to HRANA News Agency, citing Didban Iran, Mr. Mohaymeni was arrested at his home in Gorgan on the evening of Thursday, February 26, 2026. The report does not mention the arresting authority, the reasons for his arrest, or his place […]
HRANA – Following joint military attacks by the United States and Israel against Iran on February 28, 2026, preliminary data collected from field sources and published reports present a picture of a large-scale, multi-wave operation: at least 59 incidents recorded across 18 provinces; a minimum estimated 333 civilian casualties; confirmed military casualties; damage to infrastructure […]
"(Beirut, February 28, 2026) – The United States and Israel on February 28, 2026 carried out airstrikes on Iran, which retaliated with strikes against Israel and Gulf states. All parties to the conflict are obligated to respect international humanitarian law, also known as the laws of war, and prioritize the protection of civilians. Human Rights Watch is currently investigating strikes by all parties that may have violated the laws of war.
Human Rights Watch has previously documented laws-of-war violations by the United States, Israel, and Iran and serious failures to protect civilians in conflict.
Since January 2025, under the administration of President Donald Trump, the US Defense Department has fired top military lawyers without cause and systematically rolled back legal oversight and mechanisms to mitigate harm to civilians, placing fewer constraints on military operations.
Defense Secretary Pete Hegseth has lifted restrictions on antipersonnel landmines and agreed to purchase cluster munitions – weapons inherently harmful to civilians – from Israel. The 2026 US National Defense Strategy omits civilian harm mitigation as an explicit policy consideration."
https://www.hrw.org/news/2026/02/28/us/israel/iran-all-parties-should-respect-laws-of-war
Studying how people interact, in the past (#CulturalAnalytics) and today (#EdTech #Crowdsourcing). Researcher at @IslabUnimi, University of Milan. Bulgarian activist for legal reform with @pravosadiezv. I use dedicated accounts for different languages.
My profile is searchable with https://www.tootfinder.ch/