Pinned post

The case of “vegetative electron microscopy” illustrated here shows what is badly needed in current research, and its implications reach far beyond it. We need tools that help us curate huge corpora. We need to be able to trace an output back to the training data and understand which specific (and often surprising) features of the model input cause that particular output.

If anyone is interested in collaborating on this, I'm in: I have done some small-scale experiments and have already submitted a grant proposal.
theconversation.com/a-weird-ph


I think I now know where to draw the line between "good" and "bad" , and possibly (or rather obviously) the same for . It's simply whether the input data has been constructed rigorously. Put this way it's the most obvious statement ever, but somehow have convinced us all that they advance research by recklessly scraping , and who knows what else (they keep their training data secret).

What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.

But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of textcreationpartnership.org or mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.

And this is the goal of the European Research Council project proposal I'm submitting at this very moment.


Today at , I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for . There are several models like this, and they represent the current state of the art for a number of downstream tasks, like semantic change and text reuse detection. However, a historical researcher, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.

Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike in our earlier research on Early Modern English, here the problem leaves the models barely distinguishable in their ability to generate text appropriate to a given historical period. Even though history is a case where it is most obvious when models go wrong, this type of contamination is a known problem for LLM training overall: think of different legal jurisdictions using the same language, dialects in programming languages, etc.

This research was generously supported by AgileLab.

The full paper is available at:
anthology.ach.org/volumes/vol0


Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example discussed in our findings is how motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.

Full text available at: aclanthology.org/2023.nlp4dh-1

@folklore @linguistics @bookstodon


In the morning session today Sara Sullam and I will be presenting our work on exploring nominal (in our case study - bibliographical) data. We do it by borrowing a method from educational research - the notion of phenomenographic variation. 🧵

Data centers are straining power grids.

But new research suggests they don’t have to. With the right design, they can generate energy, store it and even reuse waste heat to support nearby communities.
theconversation.com/data-cente

#Italy

Italy: the 10 climate trends of 2025 depict a country that is not accelerating

This is what emerges from the Italy for Climate annual report, which, although growing, sees Italy lagging behind the European average

en.ilsole24ore.com/art/italia-

#ClimateChange #GlobalWarming #UpheavalClimate #ClimateInstability #ClimateDisruption #MassAtrocity #pollution #ecology #environment #climate

Built for a hostile internet: Canonical VP of Engineering on Ubuntu 26.04 LTS https://www.zdnet.com/article/built-for-a-hostile-internet-canonical-vp-of-engineering-on-ubuntu-26-04-lts/#ftag=COS-05-10aaa0j by @sjvn0001

Everything you wanted to know about Ubuntu #Linux 26.04 from the Canonical executive in charge of building it.

No, baby boys aren’t “less social.” That’s just a stereotype.

Decades of research show boys and girls are equally wired to connect from day one. But boys are nudged toward toughness over tenderness, given fewer chances to practice empathy and subtly discouraged from connection.

theconversation.com/its-a-myth

That same year [2016] the NATO Strategic Communications Centre of Excellence’s official journal StratCom, published a paper entitled ‘It’s Time to Embrace Memetic Warfare’.

The paper proposed methods by which to undermine ISIS: “systematically lure and entrap” recruiters; subvert its messaging via “fake ‘sockpuppet’ accounts” – online personas manufactured to simulate grassroots support or opposition – and “expose and harass people” within its funding network, “including their family members”.

To the editors of the NATO journal, these may have appeared as novel strategic prescriptions. In fact, they had already appeared – in a different context entirely.

Those tactics had been developed and deployed over years by a loose network of far-right organisations – funded, in part, by figures directly connected to Thiel.

And they resemble all too closely the mechanics used by Russian propaganda throughout Europe and beyond.

@reiver agents seem to be cool. But they raise even harder sustainability questions:

1. They consume even more natural resources, because harnesses mean that the LLMs behind them will fidget until they fit the answer to the task.

2. They continue to rely on ever larger models, which is the unsustainable part. Will we finally manage to switch to smaller specialized models? No signs in sight.

Sounds too much like the Kremlin:
"Around the same time, White House chief of staff Susie Wiles forced a meeting of Trump’s most trusted advisers. The problem: No one was being honest with the president about the domestic impact of the war.

Privately, Wiles had expressed fears that the inner circle’s rose-tinted retelling of the conflict would leave Trump oblivious to the political reality of the war, just months ahead of a contentious midterm season, reported Time magazine earlier this month."
newrepublic.com/post/209262/do

"Among those monitored were a Palestinian academic invited to give a guest lecture at Manchester Metropolitan University and a pro-Gaza PhD student at the London School of Economics, according to internal documents.

In October 2024, the University of Bristol provided the firm with a list of student protest groups it wished to receive alerts about, an internal university email suggests. It included pro-Palestinian and animal rights activists.

In total, 12 universities paid the firm to monitor campus protest activity. Others include the University of Oxford, Imperial College London, University College London (UCL), King’s College London (KCL), the University of Sheffield, the University of Leicester, the University of Nottingham and Cardiff Metropolitan University."
aljazeera.com/news/2026/4/20/u

If what we care about ultimately is equality — as suggested by your first post here — I would propose that our priority ought to be ensuring that no company can “corner the market” on learning. IP creators can forbid copying texts, but the things we learn from them are a collective inheritance.

“Someone who says ‘I’m against abortion but says I am in favor of the death penalty’ is not really pro-life,” Leo said. “Someone who says that ‘I’m against abortion, but I’m in agreement with the inhuman treatment of immigrants in the United States,’ I don’t know if that’s pro-life.”
huffpost.com/entry/pope-leo-br

R to @ERC_Research: Learn more about what you need to know before applying for an ERC grant in the 2027 competitions:

• Resubmission restrictions
• Application rules
• Eligibility windows for #ERCStG and #ERCCoG link.europa.eu/BjTmVt
---
nitter.net/ERC_Research/status

One of the core problems of GenAI is that it's trained on junk data no one has ever read and reviewed. Once again we see that this lesson has not been learned, because in their endless quest for more data for more-of-the-same models, GenAI companies have found a new source of mediocre training slop: work-related chats.
gizmodo.com/failed-companies-a

What are they thinking? Have they never taken part in such conversations? In the context of remote-first work, these are the equivalent of watercooler conversations. Noise is the norm there, and transformer models are supposed to filter this out? All this without even considering the survivorship bias of failed companies (nice pun).

Not everything you read online is true & AI is making it harder to tell what’s real.

Before you repost, check:

🔹who shared it
🔹the date and context
🔹if other trusted sources report the same story

Discover what the EU is doing against disinformation: link.europa.eu/wCp6NV
---
nitter.net/DigitalEU/status/20

As for environmental consequences, I'm afraid by now individual choices not to use it have a negligible impact.


People minimizing Orbán’s defeat in #Hungary through voting, and claiming that the same can’t be done to Donald Trump: Remember that Orbán was in power for SIXTEEN YEARS, that is, 2.5× LONGER than Trump. He had more than twice the time Trump has had to intimidate, corrupt, and destroy Hungary’s systems of voting. And yet, voting unseated him.

PLEASE do not preemptively give up on voting. Voting still works in the US, and it is quite likely that our voting infrastructure will far outlive Trump and his destructive party.

Authoritarians rely on a perceived popular mandate to continue their abuse. Voting can decisively deny them that mandate. Do not give up prematurely.

VOTE. VOTE. VOTE.

@jeantranscene @reiver at least for me, this thing is the threadiverse. Whereas Mastodon is centred around people, the threadiverse is centred around topics and interests. People are complex and have various interests. As a consequence, they don't emit a consistent signal on topics.

Long story short, try this search: lemmy.world/search
The mechanism behind this is that you can follow threadiverse communities from Mastodon just as you do users.

It's mind-boggling how US commentators, whether media or ordinary people, oversimplify what is happening in Hungary. A blank cheque was written to someone who was extremely secretive during his campaign. Probably the only confirmed fact was that his career is extremely tightly interwoven with Fidesz. We've seen many saviours like this in Eastern Europe. Yes, he is not Orbán, but he could easily be his proxy. Think of Dmitry Medvedev if you need a famous example. Did he oust Putin?
theguardian.com/world/2026/apr
