The case of “vegetative electron microscopy” illustrated here shows what is badly needed in current #LLM research, and its implications reach far beyond it. We need tools that help us curate huge corpora. We need to be able to trace #hallucinations back to the training data and understand the specific (and, surprisingly, often #deterministic) reasons in the model input that cause a particular output.
If anyone is interested in collaborating on this, I'm in: I have done some small-scale experiments and have already submitted a grant proposal.
https://theconversation.com/a-weird-phrase-is-plaguing-scientific-papers-and-we-traced-it-back-to-a-glitch-in-ai-training-data-254463
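One very small piece of such tooling can be sketched as follows. This is only a toy illustration under stated assumptions, not the method used in the linked article: it assumes the corpus fits in memory as plain strings, uses naive whitespace tokenisation, and matches the phrase verbatim. The function names and the three example documents are invented for the sketch.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(corpus, n):
    """Map each n-gram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        # Naive tokenisation: lowercase and split on whitespace (an assumption).
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index


def trace_phrase(phrase, corpus):
    """Return the sorted IDs of documents that contain the phrase verbatim."""
    tokens = tuple(phrase.lower().split())
    index = build_index(corpus, len(tokens))
    return sorted(index.get(tokens, set()))


# Hypothetical toy corpus, invented purely for illustration.
toy_corpus = {
    "doc_a": "the vegetative electron microscopy of spores was described",
    "doc_b": "electron microscopy of vegetative cells",
    "doc_c": "a study of soil bacteria",
}

matches = trace_phrase("vegetative electron microscopy", toy_corpus)
print(matches)
```

A real corpus would need proper tokenisation, an index that lives on disk, and fuzzy matching for OCR noise. The point of the sketch is only that a suspicious output phrase can be mapped back to the specific documents that contain it.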
I think I now know where to draw the line between "good" and "bad" #GenAI, and possibly (or rather obviously) the same for #machineLearning. It's simply whether the input data has been constructed rigorously. Put this way it's the most obvious statement ever, but somehow #BigTech have convinced us all that they advance research by recklessly scraping #twitter, #4chan and who knows what else (they keep their training data secret).
What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.
But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of https://textcreationpartnership.org or https://mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.
And this is the goal of the European Research Council project proposal I'm submitting at this very moment.
Today at #CHR2025, I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for #Latin. There are several models like this, and they represent the current state of the art for a number of downstream tasks, like semantic change and text reuse detection. However, a historical researcher, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.
Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike what we saw in our other research on Early Modern English, here this problem leaves the models barely distinguishable in their ability to generate text for a given historical period. Even though history is a case where it is most obvious when models go wrong, this type of contamination is a known problem for LLM training overall: think of different legal jurisdictions using the same language, dialects in programming languages, etc.
This research was generously supported by AgileLab.
The full paper is available at:
https://anthology.ach.org/volumes/vol0003/the-latin-language-evolved-over-time-masked-models/
Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example we discuss is how motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.
Full text available at: https://aclanthology.org/2023.nlp4dh-1.8/
In the morning session today Sara Sullam and I will be presenting our work on exploring nominal (in our case study - bibliographical) data. We do it by borrowing a method from educational research - the notion of phenomenographic variation. #CHR2023🧵
Supply chain disruptions stemming from the conflict in Iran are beginning to create chokepoints across Japan's auto industry, including the network of companies surrounding Toyota. https://www.japantimes.co.jp/business/2026/04/29/companies/toyota-supply-chain-shortage/?utm_medium=Social&utm_source=mastodon #business #companies #toyota #carmakers #carparts #middleeast
Cyberbullying isn’t just ‘kids being kids’, it's a growing issue with real consequences. But what if parents held the key to a safer internet?
The PARTICIPATE project provides parents with tools, schools with guides and governments with data to act.
30 years of #MSCA = 30 years of science that works for society.
Read the full article: https://t.co/Y5ONeoL6kd
An outstanding article, and it's not about AI. It's about added value and doing business:
"A marketing manager with no engineering background opens Cursor on Monday morning. By Wednesday afternoon, she has a working customer-facing app. It looks polished. It performs the core task. She demos it to her VP, who forwards it to their CMO, who then shows it in the executive staff meeting as evidence that the team is “moving at AI speed.”
By Friday, it is in front of customers.
No one asked who owned the decision to ship it. No one tested it against the conditions it would actually face. No one had the cultural standing to say this looks great, and we are not putting it into production. The prototype became a product because the organization had no system for telling the difference.
I watched a version of this scenario play out recently in a boardroom. A senior executive demoed an AI-built internal tool. The room admired the speed. What received less attention were the harder questions: Who would own it after launch? Who would maintain it? And what would happen when it produced an answer that was confidently wrong?
This is what vibe coding is about to expose across businesses. The companies that think the story is about software are going to lose to the companies that understand the story is about judgment."
https://www.forbes.com/sites/jasonwingard/2026/04/23/vibe-coding-will-break-your-company/
“There’s a misconception that you can somehow influence or persuade AI systems directly. That’s not really how they work,” says Market Brew founder and Chief Technology Officer Scott Stouffer.
“What you can do is make sure that your information is structured, sourced, and aligned in a way that those systems are more likely to retrieve it when someone asks a question. It’s less about changing the conversation and more about making sure your facts are eligible to be part of it.”
Could using AI for simple tasks make you worse at them?
A new study found that people who relied on AI for basic maths and reading tasks performed better at first, but struggled more once it was removed and were less likely to persist.
#FrAIday: http://tr.ee/c47YkR
---
https://nitter.net/DigitalEU/status/2047587346371973506#m
Data centers are straining power grids.
But new research suggests they don’t have to. With the right design, they can generate energy, store it and even reuse waste heat to support nearby communities.
https://theconversation.com/data-centers-dont-have-to-be-a-burden-on-local-communities-and-can-even-support-them-by-generating-power-and-repurposing-waste-heat-276729
Italy: the 10 climate trends of 2025 portray a country that is not accelerating
This is what emerges from the Italy for Climate annual report, which sees Italy, although growing, lagging behind the European average
#ClimateChange #GlobalWarming #UpheavalClimate #ClimateInstability #ClimateDisruption #MassAtrocity #pollution #ecology #environment #climate
Built for a hostile internet: Canonical VP of Engineering on Ubuntu 26.04 LTS https://www.zdnet.com/article/built-for-a-hostile-internet-canonical-vp-of-engineering-on-ubuntu-26-04-lts/#ftag=COS-05-10aaa0j by @sjvn0001
Everything you wanted to know about Ubuntu #Linux 26.04 from the Canonical executive in charge of building it.
‘Take me to the EU court’ – Kövesi defends her ‘outstanding prosecutors’ https://www.euractiv.com/news/take-me-to-the-eu-court-kovesi-defends-her-outstanding-prosecutors/?utm_source=eac&utm_medium=mastodon&utm_campaign=%40euractiv%40masto.ai
No, baby boys aren’t “less social.” That’s just a stereotype.
Decades of research show boys and girls are equally wired to connect from day one. But boys are nudged toward toughness over tenderness, given fewer chances to practice empathy and subtly discouraged from connection.
That same year [2016] the NATO Strategic Communications Centre of Excellence’s official journal StratCom, published a paper entitled ‘It’s Time to Embrace Memetic Warfare’.
The paper proposed methods by which to undermine ISIS: “systematically lure and entrap” recruiters; subvert its messaging via “fake ‘sockpuppet’ accounts” – online personas manufactured to simulate grassroots support or opposition – and “expose and harass people” within its funding network, “including their family members”.
To the editors of the NATO journal, these may have appeared as novel strategic prescriptions. In fact, they had already appeared – in a different context entirely.
Those tactics had been developed and deployed over years by a loose network of far-right organisations – funded, in part, by figures directly connected to Thiel.
And they resemble all too closely the mechanics used by Russian propaganda throughout Europe and beyond.
@reiver agents seem to be cool. But they raise even harder sustainability questions:
1. They consume even more natural resources, because harnesses mean that the LLMs behind them will fidget until they fit the answer to the task.
2. They continue to rely on ever larger models, which is the unsustainable part. Will we finally manage to switch to smaller specialized models? No signs in sight.
Sounds too much like the Kremlin:
"Around the same time, White House chief of staff Susie Wiles forced a meeting of Trump’s most trusted advisers. The problem: No one was being honest with the president about the domestic impact of the war.
Privately, Wiles had expressed fears that the inner circle’s rose-tinted retelling of the conflict would leave Trump oblivious to the political reality of the war, just months ahead of a contentious midterm season, reported Time magazine earlier this month."
https://newrepublic.com/post/209262/donald-trump-iran-war-plans-screaming-aides
"Among those monitored were a Palestinian academic invited to give a guest lecture at Manchester Metropolitan University and a pro-Gaza PhD student at the London School of Economics, according to internal documents.
In October 2024, the University of Bristol provided the firm with a list of student protest groups it wished to receive alerts about, an internal university email suggests. It included pro-Palestinian and animal rights activists.
In total, 12 universities paid the firm to monitor campus protest activity. Others include the University of Oxford, Imperial College London, University College London (UCL), King’s College London (KCL), the University of Sheffield, the University of Leicester, the University of Nottingham and Cardiff Metropolitan University."
https://www.aljazeera.com/news/2026/4/20/uk-universities-pay-to-spy-on-students-social-media-accounts
“Someone who says ‘I’m against abortion but says I am in favor of the death penalty’ is not really pro-life,” Leo said. “Someone who says that ‘I’m against abortion, but I’m in agreement with the inhuman treatment of immigrants in the United States,’ I don’t know if that’s pro-life.”
https://www.huffpost.com/entry/pope-leo-breaks-down-why-maga-is-not-really-pro-life_n_68dd802ae4b0c450ba64c434
R to @ERC_Research: Learn more about what you need to know before applying for an ERC grant in the 2027 competitions:
• Resubmission restrictions
• Application rules
• Eligibility windows for #ERCStG and #ERCCoG https://link.europa.eu/BjTmVt
---
https://nitter.net/ERC_Research/status/2045062134799945758#m
One of the core problems of GenAI is that it's trained on junk data no one has ever read and reviewed. Once again we see that this lesson has not been learned, because in their endless quest for more data for more-of-the-same models, GenAI companies have found a new source of mediocre training slop: work-related chats.
https://gizmodo.com/failed-companies-are-selling-old-slack-chats-and-email-archives-to-train-ai-2000747916
What are they thinking? Have they never participated in such conversations? Don't they know that in the context of remote-first work, these are the equivalent of watercooler conversations? Noise is the norm there, and transformer models are supposed to filter this out? All this without considering the survivor bias of failed companies (nice pun).
Not everything you read online is true & AI is making it harder to tell what’s real.
Before you repost, check:
🔹who shared it
🔹the date and context
🔹if other trusted sources report the same story
Discover what the EU is doing against disinformation: http://link.europa.eu/wCp6NV
---
https://nitter.net/DigitalEU/status/2044721713410371726#m
No greens, no liberals, no reds: Changing of the guard in Budapest is anything but a swing to the left https://www.euractiv.com/news/no-greens-no-liberals-no-reds-changing-of-the-guard-in-budapest-is-anything-but-a-swing-to-the-left/?utm_source=eac&utm_medium=mastodon&utm_campaign=%40euractiv%40masto.ai
Studying how people interact, in the past (#CulturalAnalytics) and today (#EdTech #Crowdsourcing). Researcher at @IslabUnimi, University of Milan. Bulgarian activist for legal reform with @pravosadiezv. I use dedicated accounts for different languages.
My profile is searchable with https://www.tootfinder.ch/