**Eugen Rochko** @Gargron@mastodon.social · Jul 02, 2019, 17:17

Comparing Nilsimsa hashes of my own most recent 100 replies... 12 false positives, almost all of them involving GitHub links.

In practice all 12 would be skipped because they're replying to messages that mention me and/or the recipient follows me
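
The skip rule described here could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Reply` type, its field names, and `is_exempt` are hypothetical and are not taken from Mastodon's code.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Reply:
    """Hypothetical stand-in for a reply; not Mastodon's data model."""
    author: str
    recipient: str
    # Accounts mentioned by the message that this reply responds to.
    parent_mentions: Set[str] = field(default_factory=set)


def is_exempt(reply: Reply, followers_of_author: Set[str]) -> bool:
    """Return True if the reply should be skipped by the similarity check.

    A reply is skipped when the message it answers mentioned the author,
    and/or when the recipient follows the author.
    """
    was_mentioned = reply.author in reply.parent_mentions
    recipient_follows = reply.recipient in followers_of_author
    return was_mentioned or recipient_follows
```

Only replies for which `is_exempt` returns `False` would go on to the hash comparison.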

**Eugen Rochko** @Gargron@mastodon.social · Jul 02, 2019, 17:19

One false positive is between the messages "Is this real" and "Why is this happening to my home feed" (Nilsimsa Compare Value of 55) which in fairness could be two very realistic replies to strangers
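
For context, a Nilsimsa digest is 256 bits long, and the compare value is 128 minus the number of bit positions in which two digests differ, so it ranges from -128 to 128 (identical digests). A minimal sketch of that comparison, assuming the digests are already available as 64-character hex strings; the function names and the default threshold are hypothetical, and the thread only says that a value of 55 already produced a false positive:

```python
def nilsimsa_compare(hex_a: str, hex_b: str) -> int:
    """Return the compare value (-128..128) for two 256-bit hex digests."""
    if len(hex_a) != 64 or len(hex_b) != 64:
        raise ValueError("expected 64 hex characters (256 bits) per digest")
    # XOR leaves a 1 bit wherever the two digests disagree.
    differing_bits = bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")
    return 128 - differing_bits


def looks_similar(hex_a: str, hex_b: str, threshold: int = 55) -> bool:
    """Hypothetical cut-off check; the threshold is an assumption."""
    return nilsimsa_compare(hex_a, hex_b) >= threshold
```

Raising the threshold trades false positives like the one above for missed near-duplicates.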

**🎓 Doc Freemo 🇳🇱** @freemo@qoto.org · Jul 02, 2019, 17:21

@Gargron Any reason you decided to go with this algorithm rather than one of the traditional algorithms used to identify spam and/or similar messages?

I suspect you'll find a lot of false positives with your choice of algorithm.

**Eugen Rochko** @Gargron@mastodon.social · Jul 02, 2019, 17:25

@freemo Which algorithms do you consider traditional?

**🎓 Doc Freemo 🇳🇱** @freemo@qoto.org · Jul 02, 2019, 17:27

@Gargron I guess that depends on just how much effort you want to put into it. A naive Fisher classifier or Bayes classifier, after running the text through a stemmer, would be the simplest that comes to mind while still being very effective.
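
As a rough illustration of the Bayes option, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It makes several assumptions: the class name, the labels, and the whitespace tokenizer are hypothetical, the stemming step is left out, and the training messages in the usage note are made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of messages
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        # Assumption: plain lowercase whitespace split, no stemming.
        return text.lower().split()

    def train(self, text, label):
        tokens = self.tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, docs in self.doc_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage would look like `nb.train("buy cheap pills now", "spam")` followed by `nb.classify("cheap pills")`. The classifier returns `None` until it has seen at least one labelled message.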

**Eugen Rochko** @Gargron@mastodon.social · Jul 02, 2019, 17:29

@freemo Well, any classifier needs training data. We don't have any because the reports system neither has a "spam" class nor makes an effort to keep it when admins decide to ban the spammer.

The second problem is that stemming is language-dependent, Mastodon is used all across the world, and our language detector is unreliable enough as it is.

**🎓 Doc Freemo 🇳🇱** @freemo@qoto.org · Jul 02, 2019, 17:41

@Gargron Yeah, I guess if you want to avoid any training period then those wouldn't be good choices. But still, there are unsupervised classifiers you could pick.

Stemming is language-specific, but if you use a library that shouldn't be a problem, presuming you can detect the language reliably. If not, you can fall back to not stemming, I guess. But even with your current approach I suspect you'd see an improvement with stemming.

Personally what I'd suggest is a combination of supervised and unsupervised. Unsupervised being a first-pass, then as users block certain messages over time the supervised learning algorithm would improve and make suggestions.

Anyway, that's my 2 cents.
