The case of “vegetative electron microscopy” illustrated here shows what is badly needed in current #LLM research, and it has implications far beyond it. We need tools that help us curate huge corpora. We need to be able to trace #hallucinations back to the training data and understand the specific (and, surprisingly, often #deterministic) features of the model input that cause a particular output.
If anyone is interested in collaborating on this, I'm in: I have done some small-scale experiments and have already submitted a grant proposal.
https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463
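To make the "trace it back to the training data" idea concrete, here is a minimal sketch of the simplest possible step: searching a corpus for every occurrence of a suspicious phrase. It assumes the corpus fits in memory as a dict of document id to text; `find_phrase`, `toy_corpus` and the sample sentences are hypothetical names and data made up for illustration, not the method used in the linked article.

```python
import re


def find_phrase(corpus, phrase, window=40):
    """Return (doc_id, context) pairs for every occurrence of `phrase`.

    Matching is case-insensitive and tolerant of whitespace, so a phrase
    broken across a line break in the source text is still found.
    """
    # Allow any run of whitespace (including newlines) between the words.
    words = [re.escape(w) for w in phrase.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)

    hits = []
    for doc_id, text in corpus.items():
        for match in pattern.finditer(text):
            start = max(0, match.start() - window)
            end = min(len(text), match.end() + window)
            # Collapse whitespace so the context reads as one line.
            context = " ".join(text[start:end].split())
            hits.append((doc_id, context))
    return hits


# Hypothetical toy corpus, invented for this example only.
toy_corpus = {
    "doc_a": "Samples were examined by vegetative\nelectron microscopy after fixation.",
    "doc_b": "Scanning electron microscopy was used throughout.",
}

hits = find_phrase(toy_corpus, "vegetative electron microscopy")
for doc_id, context in hits:
    print(doc_id, "|", context)
```

A real corpus would need an index rather than a linear scan, but the output is the same kind of thing: a list of documents and contexts that a human can then inspect.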
I think I now know where to draw the line between "good" and "bad" #GenAI, and possibly (or rather obviously) the same for #machineLearning. It's simply whether the input data has been constructed rigorously. Put this way it's the most obvious statement ever, but somehow #BigTech have convinced us all that they advance research by recklessly scraping #twitter, #4chan and who knows what else (they keep their training data secret).
What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.
But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of https://textcreationpartnership.org or https://mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.
And this is the goal of the European Research Council project proposal I'm submitting at this very moment.
Today at #CHR2025, I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for #Latin. There are several models like this, and they represent the current state of the art for a number of downstream tasks, like semantic change and text reuse detection. However, a historical researcher, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.
Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike in our earlier research on Early Modern English, here the problem leaves the models barely distinguishable in their ability to generate text appropriate to a given historical period. Even though history is a case where it is most obvious when models go wrong, this type of contamination is a known problem for LLM training overall: think of different legal jurisdictions using the same language, dialects of programming languages, etc.
This research was generously supported by AgileLab.
The full paper is available at:
https://anthology.ach.org/volumes/vol0003/the-latin-language-evolved-over-time-masked-models/
Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example we discuss is that motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.
Full text available at: https://aclanthology.org/2023.nlp4dh-1.8/
In the morning session today Sara Sullam and I will be presenting our work on exploring nominal (in our case study, bibliographical) data. We do it by borrowing a method from educational research: the notion of phenomenographic variation. #CHR2023🧵
The @resistanceschool.bsky.social@bsky.brid.gy Resistance Salon Series is back!
On 4/15, we will be hosting a virtual, underground conversation with @mollycrabapple.bsky.social@bsky.brid.gy and her new book, "Here Where We Live is Our Country: The Story of the Jewish Bund."
karenattiah.substack.com/p/register-n...
In what kind of surreal dream have these people been living?
“He promised a historic victory and security for generations, and in practice, we got one of the most severe strategic failures Israel has ever known. It’s a total failure that endangers Israel’s security for years to come.”
https://www.theguardian.com/world/2026/apr/08/war-with-no-winners-netanyahu-israel-iran-us-ceasefire
But I doubt they are waking up. It's just polemics, but what they say is dead right.
With the return of Israeli forces, the Lebanese parliament scrapped elections scheduled for May. The move is a recurring theme in the country’s fractured politics, explains a professor of migration.
https://theconversation.com/lebanons-political-elites-are-using-displacement-and-humanitarian-crisis-to-delay-elections-again-263677
A different Judaism was possible - the Bund in tzarist Russia
https://www.theguardian.com/world/2026/apr/07/molly-crabapple-new-book-jewish-socialism
"JD Vance has railed against the EU, accusing it of blatantly interfering in Hungary’s upcoming elections, even as the US vice-president said he had travelled to Budapest to “help” Viktor Orbán win Sunday’s vote."
https://www.theguardian.com/world/2026/apr/07/jd-vance-eu-interference-hungary-election-viktor-orban
What do Ukraine and Japan have in common? They are leaders in robotics, and for the right reasons
https://techcrunch.com/2026/04/05/japan-is-proving-experimental-physical-ai-is-ready-for-the-real-world/
Watching Fedi and the world react to the US president go absolutely unhinged in public, threatening war crimes as his cognitive grip disintegrates before our eyes, watching the horror and the outrage…there is something I want to tell you from Minneapolis.
And I’m not sure how, and I’m not sure if I can, but I want to try. People are always thanking us and calling us heroes and asking us for some kind of…something, anything we can offer in the face of the authoritarian march, and well, here it is, here is something, if I can figure out how to say it.
🧵
reminder that anthropic ran (and is still running) an ENTIRE AD CAMPAIGN around "Claude code is written with claude code" and after the source was leaked that has got to be the funniest self-own in the history of advertising because OH BOY IT SHOWS.
it's hard to get across in microblogging format just how big of a dumpster fire this thing is, because what it "looks like" is "everything is done a dozen times in a dozen different ways, and everything is just sort of jammed in anywhere. to the degree there is any kind of coherent structure like 'tools' and 'agents' and whatnot, it's entirely undercut by how the entire rest of the code might have been written in some special condition that completely changes how any such thing might work." I have read a lot of unrefined, straight-from-the-LLM code, and Claude code is a masterclass in exactly what you get when you do that - an incomprehensible mess.
Update. Here's the key passage from the new #Trump budget proposing a "Government-Wide Prohibition on Publishing and Subscription Fees." See p. 17.
https://www.whitehouse.gov/wp-content/uploads/2026/04/budget_fy2027.pdf
"The Budget ends the diversion of research dollars to high priced publishers across the Government. The Budget prohibits the use of Federal funds for expensive subscriptions to academic journals and prohibitively high publishing costs unless required by Federal statute or approved in advance by a Federal agency. Research funded by taxpayers should be publicly accessible; yet many publications charge the Government to both publish and to access the same research study. There are numerous low-cost outlets to make federally-funded research publicly available."
h/t Jim O'Donnell
#APCs #DoubleDipping #OpenAccess #Publishing #ScholComm #Subscriptions
Peter Thiel—who played a key role in JD Vance's conversion to Catholicism—just cannot stop talking about the Antichrist.
He just took his Antichrist circus to Rome, where an advisor to the Pope exposed the true meaning of Thiel’s bizarre religious delusions.
www.thenerdreich.com/peter-thiels...
This was an interesting breakdown of Claude Code’s leak, and I’m still loling at the “sentiment analysis” code.
If you want Claude to know you’re mad, you better say “wtf,” “shit,” “fuck,” “horrible,” “awful,” or “terrible.” Don’t try to get fancy with “atrocious” or “balls.”
RE: https://bsky.app/profile/did:plc:776hcssquwcv2ihpptzvttxj/post/3minvxsgpb22r
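For anyone wondering what keyword-based "sentiment analysis" of this kind amounts to, here is a minimal sketch. It is not the leaked code: only the six words come from the post above, while `ANGRY_WORDS`, `sounds_angry` and the choice of whole-word, case-insensitive matching are assumptions made for illustration.

```python
import re

# The six keywords quoted in the post above; everything else is hypothetical.
ANGRY_WORDS = {"wtf", "shit", "fuck", "horrible", "awful", "terrible"}


def sounds_angry(message):
    """Return True if any whole word in `message` is on the keyword list."""
    # Lowercase the message and split it into alphabetic tokens.
    tokens = re.findall(r"[a-z]+", message.lower())
    return any(token in ANGRY_WORDS for token in tokens)


print(sounds_angry("wtf is this"))
print(sounds_angry("This is atrocious"))
```

The second call shows the limitation the post is laughing at: a fixed list only recognises the exact words on it, so any synonym passes as neutral.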
AI can speak non-English languages fluently, but it still thinks in a Western worldview.
For instance, it will prioritize individual autonomy, direct communication and personal boundaries, even in cultures where harmony, community and relational awareness matter more.
More from a scholar of Indonesian society:
HRANA – As the military conflict between the United States–Israel and Iran, which began on February 28, 2026, continues, the implementation of death sentences in Iran has entered a new and deeply alarming phase, one marked by an exclusive focus on prisoners facing political and security-related charges and a noticeable acceleration in executions. During this […]
I have started a new project: #ProFed.
I’m trying to understand how far #professionalNetworking can go in the #Fediverse.
https://joinprofed.social/
https://codeberg.org/GrayDurian/ProFed
I’m interested in how you see this.
So apparently #telegram 's security is so unique, it was enough for the Russian state to block a step in its handshake protocol to block it entirely.
https://mastodon.social/@OfShad0ws/116333561644254976
Research proceeds on alternatives to polygraphs, but is true lie detection possible?
https://arstechnica.com/science/2026/03/polygraphs-have-major-flaws-are-there-better-options/
New guidelines on the use of artificial intelligence in the evaluation of ERC grant proposals. The aim is to safeguard the integrity of peer review while allowing limited use of AI where it does not compromise privacy or trust.
Read more 👉 https://bit.ly/47HqTld
---
https://nitter.net/ERC_Research/status/2037439643344638368#m
Studying how people interact, in the past (#CulturalAnalytics) and today (#EdTech #Crowdsourcing). Researcher at @IslabUnimi, University of Milan. Bulgarian activist for legal reform with @pravosadiezv. I use dedicated accounts for different languages.
My profile is searchable with https://www.tootfinder.ch/