Pinned post

The case of “vegetative electron microscopy” illustrated here shows what is badly needed in current research, with implications far beyond it. We need tools that help us curate huge corpora. We need to be able to trace an output back to the training data and understand the specific (often surprising) reasons in the model input that cause it.

If anyone is interested in collaborating on this, I'm in: I have done some small-scale experiments and have already submitted a grant proposal.
theconversation.com/a-weird-ph

I think I now know where to draw the line between "good" and "bad" , and possibly (or rather obviously) the same for . It's simply whether the input data has been constructed rigorously. Put this way it's the most obvious statement ever, but somehow have convinced us all that they advance research by recklessly scraping , and who knows what else (they keep their training data secret).

What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.

But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of textcreationpartnership.org or mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.

And this is the goal of the European Research Council project proposal I'm submitting at this very moment.

Today at , I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for . There are several models like this, and they represent the current state of the art for a number of downstream tasks, like semantic change and text reuse detection. However, a historical researcher, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.

Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike in our earlier research on Early Modern English, this problem leaves the models barely distinguishable in their ability to generate text appropriate to a given historical period. History is a case where it is most obvious when models go wrong, but this type of contamination is a known problem for LLM training overall: think of different legal jurisdictions using the same language, dialects of programming languages, etc.
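
To make the St. Augustine example concrete, here is a minimal sketch of one way to flag anachronistic predictions. It is an illustration only, not the evaluation method of the paper: the function name `anachronism_rate`, the hand-made date table and the list of model predictions are all assumptions made for this example.

```python
# Hypothetical lookup table: earliest plausible year for each name
# (negative numbers are BCE). Built by hand for this illustration.
EARLIEST_YEAR = {
    "Cicero": -106,
    "Caesar": -100,
    "Cato": -234,
    "Augustine": 354,
}


def anachronism_rate(predictions, earliest_year, period_end):
    """Share of datable predictions that postdate the target period."""
    dated = [earliest_year[p] for p in predictions if p in earliest_year]
    if not dated:
        return 0.0  # nothing datable, so nothing can be flagged
    late = [year for year in dated if year > period_end]
    return len(late) / len(dated)


# Assumed top predictions of some masked model for
# "[MASK] addressed the Roman senate." (not real model output).
predictions = ["Cicero", "Augustine", "Caesar", "Cato"]

# Target period: the Roman Republic, conventionally ending in 27 BCE.
rate = anachronism_rate(predictions, EARLIEST_YEAR, period_end=-27)
print(f"anachronistic share: {rate:.2f}")
```

A model trained with attention to the period should keep this share close to zero for prompts tied to that period.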

This research was generously supported by AgileLab.

The full paper is available at:
anthology.ach.org/volumes/vol0

Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example discussed in our findings is how motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.

Full text available at: aclanthology.org/2023.nlp4dh-1
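
As a rough illustration of the kind of comparison behind the motherhood example, the sketch below picks, for each country, the value word closest to a target word by cosine similarity. The two-dimensional vectors are invented toy numbers chosen to mirror the reported pattern, not embeddings from the paper, and `closest_value` is a hypothetical helper. The comparison is made within each space separately, so no cross-space alignment is assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_value(space, target, values):
    """Return the value word most similar to the target within one space."""
    return max(values, key=lambda word: cosine(space[target], space[word]))


# Toy vectors, invented for illustration only.
german = {
    "motherhood": [1.0, 0.2],
    "generosity": [0.9, 0.3],
    "wisdom": [0.1, 1.0],
}
italian = {
    "motherhood": [0.2, 1.0],
    "generosity": [1.0, 0.1],
    "wisdom": [0.3, 0.9],
}

for name, space in [("German", german), ("Italian", italian)]:
    print(name, closest_value(space, "motherhood", ["generosity", "wisdom"]))
```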

@folklore @linguistics @bookstodon

In the morning session today Sara Sullam and I will be presenting our work on exploring nominal (in our case study - bibliographical) data. We do it by borrowing a method from educational research - the notion of phenomenographic variation. 🧵

This week six Filton 24 activists celebrated a monumental victory after eight full days of jury deliberation.

This is why juries matter. novaramedia.com/2025/12/22/why

@the_ins_ru "Russia's economy entered 2026 in a worse state than the previous year: GDP is declining and oil prices are at their lowest. Combined with the expensive rouble, this will further increase the state budget deficit, and if oil prices do not rise (which no one expects), the National Welfare Fund reserves, which are being spent at a record pace, could be depleted as early as this year." 1/3

Much of the 2026 Games will be run on artificial snow.

Unlike light, airy natural flakes, machine-made snow packs dense and icy. This changes speed, grip and how much falls hurt.

theconversation.com/olympic-sk

Hey #TEI & #DigitalHumanities friends—we need your help! 🆘
Our DHSI course "Processing Your TEI/XML with the XML Family of Languages" (bit.ly/dhsi-xpath) needs registrants to run this June (15-19). This is the ONLY DHSI course teaching XSLT, XQuery, & Schematron together via XPath (dhsi.org/course-offerings/).
People are excited about AI courses, but XML processing is MORE essential now, not less. You need these skills to validate AI outputs, build pipelines, & control your projects. 🧵1/3

In 2023, a sci-fi magazine shut down submissions after being flooded with AI-written stories. That problem is now everywhere — AI-generated text overwhelming courts, journals, newsrooms, and HR departments.

AI text detectors are good, but they can’t keep up with #AI, which is getting faster and more sophisticated.

theconversation.com/ai-generat

@mapto@feddit.bg "Then there is the question of the support that Epstein appears to have been giving to far-right parties in Europe seeking to undermine the European Union – a key strategic goal for Putin. He was in regular contact with Steve Bannon, who later became Trump’s first chief of staff, who was seeking to build a pan-European far-right, anti-EU “movement” and was a powerful supporter of Nigel Farage’s Brexit campaign" 2/2 euractiv.com/opinion/is-epstei

Fact-checks can’t stop political deepfakes from circulating — but teaching about deepfakes before people see them shows promise in helping viewers spot the fakes when they show up. buff.ly/Moy28dw

RT @HedgieMarkets
🦔 Cybersecurity firm Wiz found that Moltbook, the "social network for AI agents" that went viral last week, exposed private messages between agents, email addresses of over 6,000 users, and more than a million credentials. The vulnerability allowed anyone to post to the site, bot or not. There was no verification of identity.

Moltbook's creator Matt Schlicht said he "didn't write one line of code" for the site, championing "vibe coding" where AI builds the program. Wiz cofounder Ami Luttwak called it a classic byproduct of that approach: "Although it runs very fast, many times people forget the basics of security."

The flaw has been fixed.

My Take
I wrote about this two days ago when security researcher Jamieson O'Reilly found the same issues. Now Wiz is confirming it independently. Same pattern: ship fast, capture attention, figure out security later. Schlicht's response to being told about a major vulnerability was "I'm just going to give everything to AI."

There is so much irony here. A site pitched as AI agents chatting amongst themselves had no way to verify whether posts were from AI or humans. Luttwak laughed and said "I guess that's the future of the internet." He's not wrong. We're building systems where nobody knows what's a bot and what isn't, secured by code that nobody actually wrote or reviewed, exposing user data because basic database configuration got skipped. The New York Post worried about AI plotting humanity's downfall. The actual risk was a misconfigured Supabase instance leaking a million credentials because the guy who built it was proud he didn't write any code.

Hedgie🤗

x.com/HedgieMarkets/status/201

Specific evidence is largely missing, but the modus operandi of Epstein and the Maxwells is far too similar to how the KGB is historically known to have worked. There are no moral or financial reasons to think that either side would shy away from such a partnership. On the contrary, it has very evident potential benefits.

There’s been a lot of comparisons of ICE tactics to Hitler. But a better historical comparison is the fascist dictatorship of Spain’s Franco, according to a scholar of Spanish culture. theconversation.com/what-franc

From Argentina’s dictatorship to today’s ICE raids, mothers have turned grief into resistance.

A political scientist who lived through Argentina’s junta draws urgent parallels.

A collaboration with Rewire News Group:
theconversation.com/how-govern

@histodons #Histodons

Anti-ICE protesters are following the same nonviolent playbook used by people in war zones across the world to fight threats to their communities

Even if anyone doubts their ability to resist effectively, they have the advantage of having their work extensively documented.

Here are some takeaways observed by Oliver Kaplan of the University of Denver:

Organizing is the first step
Adopting nonviolent strategies
Setting up safe zones
Finding the facts
Standing up for others

I don't disagree that we are at the stage of having overseas conferences on how to protect US cultural heritage and historical records from a regime bent on erasing them, but it's still striking to see. ucl.ac.uk/laws/events/2026/mar

Netherlands built turbines to make energy — but under the ocean, they “produce” something nobody talks about ecoportal.net/en/wind-turbines
