Some hard-earned wisdom about machine learning (ML):
* It is hard/impossible/prohibitively laborious to make an ML system that is 100% accurate.
* So, we need to think a lot about the types of errors the system will make, which kinds are the most expensive, and how to make something net useful despite them. That might mean manually correcting every error, or designing the full system so that small errors don't matter too much.
1/?
* The errors ML systems make depend on the deployment data distribution, and this is almost never the same as your original validation data set.
* We should not deploy an ML system unless we understand and have quantified the errors it will make on the type of data we will use it with, and what the costs of those errors are (a minimal sketch of what that quantification can look like is below).
The above are part of the reason we do not yet have self-driving cars.
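A minimal sketch of what "quantify the errors on deployment-like data" could look like, assuming a scikit-learn-style classifier and a labeled sample drawn from the deployment distribution (all names here are hypothetical placeholders, not a prescribed method):

```python
import numpy as np

def deployment_error_report(model, X_deploy, y_deploy, cost_fn=None):
    """Estimate the error rate (with a rough 95% CI) and the expected cost
    per example on a labeled sample from the deployment distribution."""
    preds = model.predict(X_deploy)
    errors = preds != y_deploy
    n = len(errors)
    p = errors.mean()
    # Normal-approximation interval; crude for small n or p near 0/1.
    half_width = 1.96 * np.sqrt(p * (1.0 - p) / n)
    report = {
        "n": n,
        "error_rate": float(p),
        "ci95": (max(0.0, p - half_width), min(1.0, p + half_width)),
    }
    if cost_fn is not None:
        # cost_fn(y_true, y_pred) returns a per-example cost, so expensive
        # error types can be weighted more heavily than cheap ones.
        report["expected_cost_per_example"] = float(np.mean(cost_fn(y_deploy, preds)))
    return report
```

The point is that the numbers come from deployment-like data, not the original validation split, and that a cost gets attached to the error types.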
2/?
Why deploying large language models (LLMs) "in the wild" makes me uncomfortable:
* We have not thought enough about or quantified the types of errors that will be made, or their ramifications.
* Partly, this is because LLMs are trained using unsupervised learning. The relationship between what they are optimized for and what we actually want to use them for is unknown.
3/?
There are really nice benchmarks for LLMs that try to remedy this (e.g. https://gluebenchmark.com/tasks), but they are tiny in comparison to the amount of data used to train LLMs, and perhaps not representative of whatever crazy ideas VCs have for LLMs.
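For a sense of that scale, a quick sketch assuming the Hugging Face datasets library (the dataset IDs are real; the comparison is order-of-magnitude only):

```python
from datasets import load_dataset

# RTE is one of the GLUE tasks; it has on the order of a few thousand
# labeled examples, versus pretraining corpora measured in hundreds of
# billions of tokens.
rte = load_dataset("glue", "rte")
print({split: rte[split].num_rows for split in rte})
```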
4/4
@kristinmbranson Speaking with my engineering hat on, I like to think in terms of problem statement -> understand problem -> design appropriate solution.
Of course, that approach depends a lot on understanding the characteristics and best practices of various known solutions.
Do you think we're at the stage with LLMs where we can say things like "These are the problems they are good at solving, and here are the best practices you should use when deploying them"? Or is it still too early?
@kristinmbranson @cstanhope Disagree - there are areas where LLMs are useful. For instance, ChatGPT is pretty good at code generation. Yes, it's often wrong, but even the incorrect code can be helpful. I'm guessing that these kinds of specialized applications are where LLMs will prove to be useful.
Mystified, though, at the rush to deploy them in search engines where the reputational risk is much higher.
@twitskeptic @cstanhope I haven't tried copilot myself yet, but the people I've talked to who have aren't trusting it for anything beyond adding comments. The reason I haven't is I hate reading and correcting buggy code much more than I hate writing fresh code. I should prob try it before expounding too much :). We've thought about using LLMs to generate code for a GUI, making the UI natural language. Even for coding, there are issues about stealing other people's code without acknowledgment.
@kristinmbranson @cstanhope I tried copilot but didn't like it due to the annoying UI in the IDE I use (PyCharm). I felt like I was fighting with it all of the time. But it seemed like it might have potential with a better UI.
As for ChatGPT, check out this example I tried. No, it won't write your program for you, but it's certainly good at generating snippets that are useful.
@twitskeptic This seems to fit your code-generation scenario: "I think one would be more looking for tasks where the cost of failure is low/zero, and/or failures are easy to see."
For myself, I have no interest in performing code review on something so unreliable, nor providing free labor to companies deploying anything like that. I'd rather provide code review for a junior dev at this point. ¯\_(ツ)_/¯