Comparing Nilsimsa hashes of my own most recent 100 replies... 12 false positives, almost all of them involving GitHub links.

In practice all 12 would be skipped, because they're replies to messages that mention me and/or the recipient follows me.

One false positive is between the messages "Is this real" and "Why is this happening to my home feed" (Nilsimsa compare value of 55), which in fairness could be two very realistic replies to strangers.
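For readers unfamiliar with the score quoted above: a Nilsimsa compare value is commonly defined as 128 minus the number of differing bits between two 256-bit digests, so it ranges from -128 to 128, and a value of 55 means 73 bits differ. The sketch below shows only that comparison step, assuming the digests are already available as 64-character hex strings. The names `nilsimsa_compare` and `is_similar` are hypothetical, and the threshold is left as a parameter because the thread does not state which cutoff is used.

```python
def nilsimsa_compare(hex_a: str, hex_b: str) -> int:
    """Compare two 256-bit digests given as 64-character hex strings.

    Returns 128 minus the number of differing bits: 128 for identical
    digests, -128 when every bit differs.
    """
    if len(hex_a) != 64 or len(hex_b) != 64:
        raise ValueError("expected 256-bit digests as 64 hex characters")
    differing_bits = bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")
    return 128 - differing_bits


def is_similar(hex_a: str, hex_b: str, threshold: int) -> bool:
    # The threshold is an assumption supplied by the caller; the thread
    # does not say which value is used in practice.
    return nilsimsa_compare(hex_a, hex_b) >= threshold


if __name__ == "__main__":
    a = "0" * 64
    b = "f" * 16 + "0" * 48  # differs from `a` in exactly 64 bits
    print(nilsimsa_compare(a, a))  # 128
    print(nilsimsa_compare(a, b))  # 64
```

With any threshold at or below 55, the two short messages quoted above would be flagged as similar, which is how a false positive of that kind arises.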


@Gargron Any reason you decided to go with this algorithm rather than one of the traditional algorithms used to identify spam and/or similar messages?

I suspect you'll find a lot of false positives with your choice of algorithm.

@Gargron I guess that depends on just how much effort you want to put into it. A naive Fisher classifier or Bayes classifier, after running the text through a stemmer, would be the simplest that comes to mind while still being very effective.
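As a rough illustration of the Bayes suggestion, here is a minimal multinomial naive Bayes sketch with add-one smoothing, written from scratch. It is not anything Mastodon implements. The class name `NaiveBayes`, the `tokenize` helper, and the training messages are all made up for the example, and stemming is left out on purpose.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    # Lowercased word tokens; no stemming is applied in this sketch.
    return re.findall(r"\w+", text.lower())


class NaiveBayes:
    def __init__(self) -> None:
        self.word_counts: dict[str, Counter] = defaultdict(Counter)
        self.doc_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def train(self, text: str, label: str) -> None:
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text: str) -> str:
        if not self.doc_counts:
            raise ValueError("classifier has no training data")
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = "", -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            # Add-one (Laplace) smoothing over the shared vocabulary.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    clf = NaiveBayes()
    # Toy, invented training messages for illustration only.
    clf.train("buy cheap followers now", "spam")
    clf.train("cheap followers click here", "spam")
    clf.train("is this real", "ham")
    clf.train("why is this happening to my home feed", "ham")
    print(clf.classify("cheap followers here"))  # spam
    print(clf.classify("is this happening"))     # ham
```

A classifier like this is only as good as its labelled examples, so it needs a source of messages already marked as spam or not spam before it can do anything useful.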

@freemo Well, any classifier needs training data. We don't have any because the reports system neither has a "spam" class nor makes an effort to keep it when admins decide to ban the spammer.

The second problem is stemming is language-dependent, Mastodon is used all across the world, and our language detector is unreliable enough as it is.

@Gargron Yeah, I guess if you want to avoid any training period then those wouldn't be good choices. But still, there are unsupervised classifiers you could pick.

Stemming is language-specific, but if you use a library that shouldn't be a problem, presuming you can detect the language reliably. If not, you can fall back to not stemming, I guess. But even with your current approach I suspect you'd see an improvement with stemming.

Personally, what I'd suggest is a combination of supervised and unsupervised learning. The unsupervised part would be a first pass; then, as users block certain messages over time, the supervised learning algorithm would improve and make suggestions.

Anyway, that's my 2 cents.
