In recent weeks I've been experimenting with an approach to finding people worth following on Mastodon.

It's pretty brute force and not very clever, but better than nothing. Roughly, it goes like this:

1. Crawl the public timelines of many public instances to find active accounts.
2. Per account found, collect the 100 latest toots.
3. Remove a few obvious non-people (bots).
4. Make it searchable and rank the results by some factors.
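Steps 1 and 2 could be sketched roughly as below. This is not the crawler used here, just a minimal illustration. It assumes the instance exposes the standard Mastodon REST endpoints `/api/v1/timelines/public` and `/api/v1/accounts/:id/statuses` without authentication (some instances don't) and that each page holds at most 40 statuses. The function names and the injectable `get_json` helper are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_SIZE = 40  # assumed per-request maximum for these endpoints


def http_get_json(url):
    """Fetch a URL and decode the JSON body (no retries, no rate limiting)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def active_accounts(instance, get_json=http_get_json):
    """Step 1: collect accounts seen on an instance's local public timeline."""
    query = urlencode({"local": "true", "limit": PAGE_SIZE})
    url = f"https://{instance}/api/v1/timelines/public?{query}"
    return {status["account"]["id"]: status["account"] for status in get_json(url)}


def latest_toots(instance, account_id, wanted=100, get_json=http_get_json):
    """Step 2: page backwards through an account's statuses using max_id."""
    toots, max_id = [], None
    while len(toots) < wanted:
        params = {"limit": PAGE_SIZE}
        if max_id is not None:
            params["max_id"] = max_id
        url = f"https://{instance}/api/v1/accounts/{account_id}/statuses?{urlencode(params)}"
        page = get_json(url)
        if not page:  # no older statuses left
            break
        toots.extend(page)
        max_id = page[-1]["id"]  # next page: statuses older than this one
    return toots[:wanted]
```

Passing `get_json` as a parameter keeps the paging logic testable without touching the network. A real crawl would also need politeness delays and error handling for servers that only implement part of the API.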

The results are quite mixed. Will write about a few things I found.

The method yielded 28K active accounts. An active account by my definition is one that has published a status in the last 30 days (at the time of crawling).
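As a small illustration, that definition of "active" might be expressed like this. The function name and the example dates are hypothetical; only the 30-day window comes from the text above.

```python
from datetime import datetime, timedelta, timezone


def is_active(last_status_at, crawled_at, window_days=30):
    """True if the latest status is at most window_days older than the crawl time."""
    return crawled_at - last_status_at <= timedelta(days=window_days)


# Hypothetical timestamps, for illustration only.
crawled_at = datetime(2022, 12, 20, tzinfo=timezone.utc)
recent = datetime(2022, 12, 1, tzinfo=timezone.utc)
stale = datetime(2022, 10, 1, tzinfo=timezone.utc)

print(is_active(recent, crawled_at))
print(is_active(stale, crawled_at))
```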

Of these, 21K have published 100 toots or more.

After removing bots, 19K remain.

The crawl started on December 2nd and has been running until now, with interruptions. It currently covers 4200 instances. By no means all of those are Mastodon instances.

An unexpectedly large number of servers respond to some of the Mastodon API requests, but not all. Some are Mastodon forks. Anyway, for my purpose (finding people to follow from a Mastodon account) that wasn't really important.

The first obvious challenge I ran into: how to find people who post in the languages I understand?

Mastodon instances have a language code. But that code doesn't mean a lot, because users can post in whatever language they like. Users also have a language code (I currently don't know where this is configured). Again, users post in whatever language they like. Many users use several languages, depending on who they want to reach, or they write in one language but re-blog content in others.

My solution to this is "dominant language detection". Basically I throw all 100 recent toots of an account into language detection. The top guess by the detector is the language I consider dominant. As a first attempt, this works well enough.
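A minimal sketch of the idea, under loud assumptions: the tiny stopword lists below are a toy stand-in for a real language detection library, and the function names are invented for the example. It only shows the shape of the approach: strip the HTML markup from the toots, concatenate them, and take the top-scoring language.

```python
import re
from collections import Counter

# Toy stand-in for a real language detector (assumption, two languages only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "nicht", "das"},
}


def strip_html(content):
    """Status content arrives as HTML; drop the tags before detection."""
    return re.sub(r"<[^>]+>", " ", content)


def score_languages(text):
    """Count stopword hits per candidate language."""
    words = re.findall(r"\w+", text.lower())
    return Counter({lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()})


def dominant_language(toots):
    """Concatenate all toots and return the top guess, or None without evidence."""
    text = " ".join(strip_html(t) for t in toots)
    lang, hits = score_languages(text).most_common(1)[0]
    return lang if hits > 0 else None


toots = [
    "<p>Das ist nicht der Punkt</p>",
    "<p>Die Katze und der Hund</p>",
    "<p>The weather is nice</p>",
]
print(dominant_language(toots))  # → de
```

Concatenating first gives the detector much more text to work with than any single toot, at the cost of hiding minority languages.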

A better approach would likely detect all the languages an account uses. Then, at some point, it's up to me as a user to decide whether I want to follow someone who posts in English (OK for me) and Japanese (not readable for me).
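One way this could look, again only as a sketch: guess a language per toot instead of per account, then report the share of each language. The stopword-based `guess_language` is a toy placeholder for a real detector, and `min_share` is a made-up threshold for ignoring one-off guesses.

```python
import re
from collections import Counter

# Toy stand-in for a real language detector (assumption, two languages only).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "und", "ist", "nicht", "das"},
}


def guess_language(text):
    """Return the language with the most stopword hits, or None if there are none."""
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def language_shares(toots, min_share=0.1):
    """Share of each detected language among the toots that could be classified."""
    counts = Counter(g for g in map(guess_language, toots) if g is not None)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items() if n / total >= min_share}


toots = [
    "the cat is in the house",
    "der Hund ist nicht hier",
    "the end of the day",
    "📷",  # media-only toot: nothing to detect, so it is skipped
]
print(language_shares(toots))
```

With shares per language, the decision can be left to the reader, for example by only showing accounts where the readable languages make up most of the posts.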

I learned more about the difficulties of language detection.

(1) The more languages you have as candidates, the less confident your detection result will be.

(2) The shorter the text, the less confident the result. Some toots are definitely too short to guess. (Some toots don't even contain language, they are all media.)

@sendung hey, I'm interested in studying how language communities develop over the fediverse. Ideally, I'd like to do something similar to what you're doing and publish (both academic publication and web service along the lines of fedidb.org/ and fedistats.cc/) some analysis of multilingualism across instances and individuals. I personally participate in Italian and Bulgarian communities, but also follow a number of English, German and Russian accounts.

Clearly this interest is entirely at the aggregate level, but for small communities, we'll need to be careful regarding the privacy implications of published data.

So, a few questions:
1. Do you still check your account?
2. Are you interested in collaborating on this?
3. Are you willing to share some work you've done (code, data, whatever)?

Thanks!

@mapto @sendung I would like to find posts (i.e. potential friends) in my exotic language: be. Are there any services available to help with that?

@vics @sendung I'm certainly looking for the same and there's no really easy way of doing that.

The workaround I'm currently using is:
1. Find instances with language of interest, e.g. fedistats.cc/nodes?sort=daily_
2. Check out their timeline, e.g. vkl.world/public/local

But I suppose your language is like mine, and you've exhausted the few relevant instances long ago. Plus, most of the community is dispersed elsewhere anyways.

A possible step 3 could be followgraph.vercel.app/, but stupidly it looks up who your contacts are following, and not who follows them. If you also think it should be otherwise, consider this issue: github.com/gabipurcaru/followg
