**Martin Ruskov** @mapto@qoto.org · Jun 12, 2026, 04:12

Martin Ruskov @mapto@qoto.org

3.01K Posts

838 Following

332 Followers

Research: https://www.zotero.org/mapto/publications

Games: https://mapto.itch.io

Code: https://github.com/mapto

Others (+fedi): https://linktr.ee/mapto

Studying how people interact, in the past (#CulturalAnalytics) and today (#EdTech #Crowdsourcing). Researcher at @IslabUnimi, University of Milan. Bulgarian activist for legal reform with @pravosadiezv. I use dedicated accounts for different languages.

My profile is searchable with https://www.tootfinder.ch/

Joined Nov 2022

Pinned post

Jun 12, 2026, 04:12

Martin Ruskov @mapto@qoto.org

Commercial LLMs keep coming back to Elias Thorne. Who is he? Why lighthouse keepers and clockmakers? Two researchers at Cornell dug into public corpora and found out.

It turns out that an AI-generated story from the days of GPT-3.5 has proliferated, in what could be an indication of an early form of model collapse.

https://www.404media.co/elias-thorne-chatbots-llms-chatgpt-lighthouse-keeper-story/

This is not the first time we have seen strange data diffuse like this.
1/3

**Martin Ruskov** @mapto@qoto.org · Jan 12, 2026, 08:23

Pinned post

I think I now know where to draw the line between "good" and "bad" #GenAI, and possibly (or rather obviously) the same for #machineLearning. It's simply whether the input data has been constructed rigorously. Put this way, it's the most obvious statement ever, but somehow #BigTech has convinced us all that they advance research by recklessly scraping #twitter, #4chan and who knows what else (they keep their training data secret).

What is good science in computational linguistics? Well, open data is a step towards it. But open and crap is not a solution. We need to actually _know_ and manage the data. And nobody in their right mind would want to plough through toxic data to clean it. We've all heard the horrors of Kenyan data workers who do it for money and still suffer doing it.

But better (yes, also smaller) corpora are of interest to scholars in the humanities and the social sciences. Think of https://textcreationpartnership.org or https://mlat.uzh.ch. Yes, they are too big for individual researchers or even teams to handle, but we have the organisational and technological infrastructure to work on them collectively. We've been doing it for ages and we will continue doing it. We just need to do it together.

And this is the goal of the European Research Council project proposal I'm submitting at this very moment.

**Martin Ruskov** @mapto@qoto.org · Dec 11, 2025, 07:47 *

Pinned post

Today at #CHR2025, I will be presenting our work on the evaluation of the historical adequacy of masked language models (MLMs) for #Latin. Several such models exist, and they represent the current state of the art for a number of downstream tasks, such as semantic change and text reuse detection. However, a historical researcher, philologist or other scholar would want to be sure that such models really represent the historical period of interest. For example, it would be an embarrassing hallucination if St. Augustine showed up in the context of the Roman senate.

Our evaluation confirms a known problem: LLMs, and masked models in particular, are trained on corpora without attention to historical periods. Unlike in our other research on Early Modern English, this problem leaves the models barely distinguishable in their ability to generate text for a given historical period. History is a case where it is most obvious when models go wrong, but this type of contamination is a known problem for LLM training overall: think of different legal jurisdictions using the same language, dialects in programming languages, etc.

This research was generously supported by AgileLab.

The full paper is available at:
https://anthology.ach.org/volumes/vol0003/the-latin-language-evolved-over-time-masked-models/

**Martin Ruskov** @mapto@qoto.org · Feb 14, 2024, 15:55

Pinned post

Our paper on the values found in fairy tales from some European countries has been published. We studied how values are explicitly present in tales from Germany, Italy and Portugal using various NLP techniques, most notably Word2Vec and Word Embedding with a Compass. We visualise synchronic semantic variation to show certain differences based on observations of the corpus, some of them already noted in previous literature. One example we discuss is that motherhood in Germany is strongly related to generosity, whereas in Italy and Portugal it has a stronger relationship to wisdom.

Full text available at: https://aclanthology.org/2023.nlp4dh-1.8/

@folklore @linguistics @bookstodon

**Martin Ruskov** @mapto@qoto.org · Dec 08, 2023, 08:11

Pinned post

In the morning session today, Sara Sullam and I will be presenting our work on exploring nominal (in our case study, bibliographical) data. We do it by borrowing a method from educational research: the notion of phenomenographic variation. #CHR2023🧵
