Pinned post

I think I now know where to draw the line between "good" and "bad" , and possibly (or rather obviously) the same for . It's simply whether the input data has been constructed rigorously. Put this way it's the most obvious statement ever, but somehow have convinced us all that they advance research by recklessly scraping , and who knows what else (they keep their training data secret).

What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.

But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of textcreationpartnership.org or mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.

And this is the goal of the European Research Council project proposal I'm submitting at this very moment.

Today at , I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for . There are several models like this, and they represent the current state of the art for a number of downstream tasks, like semantic change and text reuse detection. However, a historical researcher, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.

Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike in our other research on Early Modern English, here the problem leaves the models barely distinguishable in their ability to generate text for a given historical period. Even though history is a case where it is most obvious when models go wrong, this type of contamination is a known problem for LLM training overall: think of different legal jurisdictions using the same language, dialects in programming languages, etc.

This research was generously supported by AgileLab.

The full paper is available at:
anthology.ach.org/volumes/vol0

Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example discussed in our findings is that motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.

Full text available at: aclanthology.org/2023.nlp4dh-1

@folklore @linguistics @bookstodon

In the morning session today Sara Sullam and I will be presenting our work on exploring nominal (in our case study - bibliographical) data. We do it by borrowing a method from educational research - the notion of phenomenographic variation. 🧵

Starmer gives an average man's response to Carney's call for everyday heroes

But it isn't Britishness that makes him average. It's his refusal to live up to the position he was elected to.

politico.eu/article/keir-starm

@balkanika It’s understandable to feel frustrated with political leaders of the past having manoeuvred Europe into this conundrum, or with those of today remaining in the “comfort zone of cowardice and inaction”, as Nathalie Tocci wrote. But we should also ask what role we have to play in this – the kind of people on the liberal left who enjoy thoughtful European arthouse cinema, or indeed those who make it

Did you know that data protection should be part of your communication activities?

It might not be the flashiest part of your work, but it’s super important!

Use our practical checklist to ensure your work complies with EU data protection rules: link.europa.eu/dNq4jv

Using too much salt on sidewalks and driveways can harm local streams and drinking water.

An environmental scientist shares 3 easy tips to de-ice responsibly this winter:
theconversation.com/oversaltin

Moms who resist can change history.

In Argentina, state terror inspired mothers, who “became a potent force in resisting authoritarianism and ultimately restoring democracy,” according to a political scientist who lived through the dictatorship in her native country.

theconversation.com/how-govern

A phrase for why so much AI art looks the same: cultural stagnation.

Creativity gets flattened into polished sameness when machines optimize for what’s familiar.
theconversation.com/ai-induced

HRANA – At least 150 women detained in connection with the nationwide protests, most of whom are female students, have been transferred to the political ward of Adelabad Prison in Shiraz, a ward that lacks the capacity and facilities to accommodate this number of prisoners. Based on information received by HRANA, the majority of these […]

We publish articles written by experts because of the web of knowledge “knotted together by visible signals of trust, such as degrees, publications, affiliations and accreditations.”

But public trust in science is fracturing along party lines: buff.ly/LmPVZGr

If you're really worried about carbon monoxide poisoning, why don't you _just_stop_burning_fossils_ at home?
propublica.org/article/how-to-

@TheConversationUS well, with BigTech aligned with Trump and providing the infrastructure to practically any big organisation worldwide, attacks do not have to use exploits at all. Think about what happened to the International Criminal Court.

"The bottom line is that like generative AI itself, agentic AI is both impossible and inevitable at the same time. There may not be a specific annum that will be looked back upon as “the year of the agent.” But hallucinations or not, every year from now on is going to be “the year of more agents,” as the delta between guardrails and hallucinations narrows. The industry has too much at stake not to make this happen. The tasks that agents perform will always require some degree of verification—and of course people will get sloppy and we’ll suffer small and large disasters—but eventually agents will match or surpass the reliability of human beings, while being faster and cheaper.

At that point, some bigger questions arise. One person I contacted to discuss the hallucination paper was computer pioneer Alan Kay, who is friendly with Sikka. His view is that “their argument was posed well enough to get comments from real computational theorists.” (A statement reminiscent of his 1984 take on the Macintosh as “the first personal computer good enough to be criticized.”) But ultimately, he says, the mathematical question is beside the point. Instead, he suggests people consider the issue in light of Marshall McLuhan’s famous “Medium is the message” dictum. “Don’t ask whether something is good or bad, right or wrong,” he paraphrases. “Find out what is going on.”

Here’s what’s going on: We may well be on the cusp of a massive automation of human cognitive activity. It’s an open question whether this will improve the quality of our work and our lives. I suspect that the ultimate assessment of that will not be mathematically verifiable."

wired.com/story/ai-agents-math

#AI #GenerativeAI #AIAgents #AgenticAI #Hallucinations

HRANA – A special session of the United Nations Human Rights Council was held on January 23, 2026 at the UN’s European headquarters in Geneva. The session was specifically dedicated to examining the human rights situation in the Islamic Republic of Iran and the widespread suppression of the nationwide January protests. During the meeting, a […]

🇪🇺🚌 With the #EU Parliament and Council ramping up their work on the #DigitalOmnibus, noyb continues to fight for changes to preserve your privacy rights.

🤝 You can help us by becoming a Supporting Member! Learn more ➤ noyb.eu/en/support-us

#MakePrivacyReality

The U.S. attack on Venezuela included a hacking attack that shut down Caracas’s power grid.

Troops don’t have to physically attack power plants any more to destroy them.

theconversation.com/hacking-th

With Iran still largely under an internet blackout, eyewitness testimony is key for understanding how angry demonstrations over economic hardship exploded into the biggest anti-government protests since 1979. japantimes.co.jp/news/2026/01/ #worldnews #society #iran #alikhamenei

Two months without screens after a concussion brought better sleep, a longer attention span and a sense of mental quiet to a public health researcher.

The experience reminded her that genuine restoration comes from reducing mental demands—not just the illusion of rest.

theconversation.com/why-unwind
