Wonder if we should get rid of automatic language detection given how often it is inaccurate...

#mastodev

@freemo I don't think that's possible. Aren't you a Machine Learning innovator? You should know what this problem entails.

@Gargron I am, yes. I didn't mean to suggest it would be trivial to solve with 100% accuracy. I am just suggesting you work towards improving the error rate (it's OK if there is some) rather than eliminating a vital feature altogether.

@freemo Well, for a start I am not a C developer and not an ML expert. CLD3 is developed by Google and I seriously doubt that I could do something that they can't.


@Gargron Perhaps try a different third-party library? Or perhaps improve the way the library is applied. I haven't looked at your code, but I'm not suggesting you do the ML yourself at all. There can be a huge difference in how you apply it.

Just an off-the-cuff example (not saying this is viable, as I don't know enough): I'd imagine a LOT of the error comes from shorter posts analyzed in isolation. If your library is uncertain what language a particular post is in, then it can do one of two things:

1) Display it anyway. No harm is done if you display an unwanted language; harm is only done when you don't display a wanted language. So make the error of an acceptable nature if you can't improve it.

2) Use more context. For example, if 100% of a user's identified posts are Chinese and a short post's language is undetermined, then assume it is Chinese, since the guess should be weighted by context.

1 is the easy path, and probably the one I'd suggest, since I don't see a need for perfection here. But 2 might be a decent incremental step if you really feel perfection is needed.
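The two options above can be sketched roughly as follows. This is only an illustration, not Mastodon's code: `resolve_language`, `should_display`, the `Detector` type and all thresholds are hypothetical names and values, and the sketch assumes a detector that returns a `(language, confidence)` pair whose confidence can actually be trusted.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Set, Tuple

# Hypothetical detector interface: returns (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def resolve_language(
    text: str,
    detect: Detector,
    author_history: Sequence[str],
    min_confidence: float = 0.9,  # assumed threshold, not a tuned value
    min_history: int = 5,         # assumed minimum number of past posts
    min_share: float = 0.8,       # assumed share one language must hold
) -> Optional[str]:
    """Return a language code, or None when the language stays undetermined."""
    lang, confidence = detect(text)
    if confidence >= min_confidence:
        return lang

    # Option 2: fall back on the languages of the author's earlier posts.
    if len(author_history) >= min_history:
        top, count = Counter(author_history).most_common(1)[0]
        if count / len(author_history) >= min_share:
            return top

    return None


def should_display(post_language: Optional[str], wanted: Set[str]) -> bool:
    """Option 1: an undetermined post is shown rather than hidden."""
    return post_language is None or post_language in wanted
```

With this shape, a wrong or missing guess costs the reader an unwanted post in the timeline, never a hidden post they wanted to see.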


@freemo CLD3 doesn't offer a reliable confidence rating. You can give it a short string and it will be 95% confident about its wrong result. So while 1 is the better option it is not possible.

@Gargron Perhaps use a library that provides a reliable confidence rating instead, or consider options beyond 1 and 2.

If you'd like me to provide some more serious help and suggestions, I can review the code and library options more closely.
