@Gargron Any reason you decided to go with this algorithm rather than one of the traditional algorithms used to identify spam and/or similar messages?
I suspect you'll find a lot of false positives with your choice of algorithm.
@freemo Which algorithms do you consider traditional?
@Gargron I guess that depends on just how much effort you want to put into it. A Fisher classifier or a naive Bayes classifier, run on text that has first been passed through a stemmer, would be the simplest that comes to mind while still being very effective.
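To make that concrete, here is a minimal sketch of a naive Bayes classifier fed with stemmed tokens. Everything in it is an assumption for illustration: `crude_stem` is a toy suffix stripper standing in for a real stemming library, the `spam`/`ham` labels and the class name `NaiveBayes` are made up, and the training messages are invented examples.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    # Toy stand-in for a real stemmer: strips a few common English suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    # Lowercase, split into words, then stem each word.
    return [crude_stem(w) for w in re.findall(r"[a-z']+", text.lower())]


class NaiveBayes:
    """Hypothetical two-class naive Bayes with Laplace smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def classify(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Smoothed log prior for the class.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(counts.values())
            for token in tokenize(text):
                # Smoothed log likelihood; unseen tokens get a small nonzero mass.
                score += math.log((counts[token] + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        return max(scores, key=scores.get)


nb = NaiveBayes()
nb.train("Buy cheap followers now", "spam")
nb.train("Cheap followers, buying links", "spam")
nb.train("I enjoyed reading your post", "ham")
nb.train("Reading about federated servers", "ham")
print(nb.classify("buy followers"))
print(nb.classify("reading posts"))
```

Because of the stemming step, "buying" and "buy" count as the same token, so the classifier needs fewer training examples to generalise across word forms.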
@Gargron Yeah, I guess if you want to avoid any training period then those wouldn't be good choices. But there are still unsupervised classifiers you could pick.
Stemming is language-specific, but if you use a library that shouldn't be a problem, presuming you can detect the language reliably. If not, you can fall back to not stemming, I guess. But even with your current approach I suspect you'd see an improvement with stemming.
Personally, what I'd suggest is a combination of supervised and unsupervised. The unsupervised part would be a first pass; then, as users block certain messages over time, the supervised learning algorithm would improve and make suggestions.
Anyway, that's my 2 cents.
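A rough sketch of that two-stage idea, under loudly stated assumptions: the unsupervised pass here is simple near-duplicate counting with Jaccard similarity, and the supervised part is a crude per-token block rate standing in for a proper trained classifier. The class `HybridFilter`, its method names, and all thresholds are hypothetical and not taken from any existing implementation.

```python
import re
from collections import Counter


def tokens(text):
    # Set of lowercase words in the message.
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    # Overlap of two token sets, from 0.0 (disjoint) to 1.0 (identical).
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class HybridFilter:
    """Hypothetical filter: unsupervised first pass plus feedback from user blocks."""

    def __init__(self, dup_threshold=0.8, dup_limit=3):
        self.recent = []          # token sets of observed messages
        self.seen = Counter()     # how many observed messages contain each token
        self.blocked = Counter()  # how many blocked messages contain each token
        self.dup_threshold = dup_threshold
        self.dup_limit = dup_limit

    def observe(self, text):
        toks = tokens(text)
        self.recent.append(toks)
        self.seen.update(toks)

    def record_block(self, text):
        # Called when a user blocks a message; this is the supervised signal.
        self.blocked.update(tokens(text))

    def looks_repetitive(self, text):
        # Unsupervised pass: many near-identical messages suggest spam.
        toks = tokens(text)
        near = sum(1 for other in self.recent
                   if jaccard(toks, other) >= self.dup_threshold)
        return near >= self.dup_limit

    def block_score(self, text):
        # Average fraction of times each token appeared in a blocked message.
        toks = tokens(text)
        if not toks:
            return 0.0
        rates = [self.blocked[t] / self.seen[t] for t in toks if self.seen[t]]
        return sum(rates) / len(toks)

    def suggest(self, text, score_threshold=0.5):
        return self.looks_repetitive(text) or self.block_score(text) >= score_threshold


f = HybridFilter()
for msg in ("great post about servers", "cheap followers here"):
    f.observe(msg)
f.record_block("cheap followers here")
print(f.suggest("cheap followers"))
print(f.suggest("great post"))
```

With no blocks recorded, only the repetition check can fire, so the filter works from day one; the block-based score only starts to matter once users have given it feedback.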