@Gargron Any reason you decided to go with this algorithm rather than one of the traditional algorithms used to identify spam and/or similar messages?
I suspect you'll find a lot of false positives with your choice of algorithm.
@Gargron Yeah, I guess if you want to avoid any training period then those wouldn't be good choices. Still, there are unsupervised classifiers you could pick.
Stemming is language-specific, but if you use a library that shouldn't be a problem, presuming you can detect the language reliably. If not, you can fall back to not stemming, I guess. But even with your current approach I suspect you'd see an improvement with stemming.
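To make the fallback idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `SUFFIXES` table is a toy stand-in for a real stemming library, and `lang` and `confidence` are hypothetical outputs of whatever language detector is in use. It is not how Mastodon does anything today.

```python
import re

# Toy suffix lists standing in for a real per-language stemmer (hypothetical).
SUFFIXES = {
    "en": ("ing", "ed", "es", "s"),
}

def stem(word, lang):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES[lang]:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text, lang=None, confidence=0.0, min_confidence=0.8):
    """Tokenize, and stem only when the detected language can be trusted."""
    tokens = re.findall(r"\w+", text.lower())
    if lang in SUFFIXES and confidence >= min_confidence:
        return [stem(token, lang) for token in tokens]
    # Unknown language or shaky detection: fall back to unstemmed tokens.
    return tokens
```

With a confident detection, `normalize("Buying cheap watches", "en", 0.95)` gives `['buy', 'cheap', 'watch']`; with a low confidence or an unsupported language the tokens come back unchanged, so the worst case is simply what you have without stemming.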
Personally, what I'd suggest is a combination of supervised and unsupervised learning. The unsupervised part would be a first pass; then, as users block certain messages over time, the supervised algorithm would improve and start making suggestions.
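Roughly what I have in mind, as a sketch only. The techniques here are my own picks for illustration, not anything Mastodon uses: the first pass is near-duplicate detection with Jaccard similarity over word sets, and the supervised part is a small naive Bayes model fed by block signals. The class name, thresholds, and labels are all hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    return re.findall(r"\w+", text.lower())

def jaccard(a, b):
    """Overlap between two token lists, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

class TwoStageFilter:
    def __init__(self, dup_threshold=0.8, min_duplicates=2):
        self.seen = []  # token lists of earlier messages
        self.dup_threshold = dup_threshold
        self.min_duplicates = min_duplicates
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def first_pass(self, text):
        """Unsupervised: flag a message that repeats earlier ones often enough."""
        toks = tokens(text)
        dups = sum(1 for old in self.seen
                   if jaccard(toks, old) >= self.dup_threshold)
        self.seen.append(toks)
        return dups >= self.min_duplicates

    def learn(self, text, label):
        """Supervised: record a labeled example, e.g. 'spam' when a user blocks."""
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def spam_score(self, text):
        """Naive Bayes log-odds of spam; None until both classes have examples."""
        if not all(self.docs.values()):
            return None
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        total_docs = sum(self.docs.values())
        log_prob = {}
        for label in ("spam", "ham"):
            total = sum(self.counts[label].values())
            lp = math.log(self.docs[label] / total_docs)
            for t in tokens(text):
                # Laplace smoothing so unseen words do not zero out the score.
                lp += math.log((self.counts[label][t] + 1) / (total + len(vocab)))
            log_prob[label] = lp
        return log_prob["spam"] - log_prob["ham"]
```

The point of the split is that `first_pass` works from day one with no labels at all, while `spam_score` returns `None` until blocks have supplied examples of both classes, and only then starts producing a positive (spam-leaning) or negative score.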
Anyway, that's my 2 cents.
@freemo Well, any classifier needs training data. We don't have any because the reports system neither has a "spam" class nor makes an effort to keep it when admins decide to ban the spammer.
The second problem is that stemming is language-dependent. Mastodon is used all across the world, and our language detector is unreliable enough as it is.