As the OSI prepares to make its "open source AI" definition official, with the glaring omission of any requirement that the actual source (the training data) be made available, it's worth noting that their work is funded by Google, Meta, Microsoft, Salesforce, etc. What does "open source" even mean here if the literal source of the model isn't open? These companies are invested in making you think they're on your side while they boil the oceans to avoid paying human beings for their labor.
The idea behind open source, as it grew out of the free software movement, has always been to water down software freedoms: to create something more palatable to corporate interests that *sounds* good but means very little. This definition continues that work for the current "gen AI" bubble. It's time to ditch open source as an ideal, and the OSI especially.
https://opensource.org/ai/drafts/the-open-source-ai-definition-1-0-rc2
#OpenSource #OpenSourceAI #OSI #OpenSourceInitiative #FreeSoftware #AI #GenAI #GenerativeAI
They posit that you can still modify (fine-tune) the distributed models without the training source. By that logic, you can also modify a binary executable without its source code. Frankly, that's unacceptable if we actually care about the human beings using the software.
A key pillar of software freedom is reproducibility. The ability to build a tool from scratch, in your own environment, with your own parameters, is absolutely indispensable both to learning how the tool works and to changing the tool to better serve your needs, especially if your needs fall on the outskirts of the bell curve.
There's also the issue of auditability. If you can't run the full build process yourself, producing your own results from scratch in a trusted environment to compare with what's distributed, it becomes exponentially harder to verify any claims about how a tool supposedly works.
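To make the auditability point concrete: if a build is fully reproducible (a big "if" for GPU training, which is often nondeterministic), verification can be as simple as hashing your locally rebuilt artifact and comparing it with the distributed one. A minimal Python sketch; the filenames are stand-ins, and the tiny temp files stand in for multi-gigabyte weight files:

```python
import hashlib
import os
import tempfile

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so large weight files never need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Stand-in artifacts: in reality, "reproduced" is what you built from scratch
# and "distributed" is what the vendor shipped.
with tempfile.TemporaryDirectory() as d:
    reproduced = os.path.join(d, "model-reproduced.bin")
    distributed = os.path.join(d, "model-distributed.bin")
    for p in (reproduced, distributed):
        with open(p, "wb") as f:
            f.write(b"\x00toy-weights\x00")
    # Identical builds hash identically; any divergence is immediately visible.
    print("match" if sha256_of(reproduced) == sha256_of(distributed) else "MISMATCH")
```

Without the training data and training code, there is nothing to rebuild, so even this crude check is off the table.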
Without the training data, this all becomes impossible for AI models. The OSI knows this. They're choosing to ignore it for the sake of expediency for the companies paying their bills, who want to claim "open" because it sounds good while hiding the (largely stolen and fraudulently or non-consensually acquired) source material of their current models.
Do we want a new definition of "open source" that actively thwarts analysis and tinkering, two fundamental requirements of software that respects human beings today? Reject this nonsense.
Thanks for linking me in. I work with LLMs and AI tech at a low level (the company I founded is building a next-gen LLM among other AI tools), so I know the internals of an LLM and how they work intimately. I read your post but not the whole thread yet; I'm a bit busy trying to run that company (I'm the founder and CTO, so a lot is on my shoulders), so forgive me for not reading the entire thread. What questions can I help answer directly? We understand how LLMs work quite well, so I'm confused by what you're asking. What is it you want to understand that you think we don't currently?
I've been tracking the OSI open-source AI discussion. I haven't seen anything too meaningful come out of it yet, but I hope it will. I can say this, though: many open source AIs will link to the corpus of training data used. What is more opaque is how the model was trained using that data, rather than the data itself. The final code is of course open source, but the meta code, the code that tells the system how to train, is not. That, however, is where the magic happens, in some ways more so than in the training data itself.
That said, I do agree with the OP that if you want a truly good open source definition of AI, one encompassing full replication, you need to open up three things:
1) the training data
2) the training code
3) the running code (the code for running the model after it's trained)
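To illustrate the split, here's a deliberately tiny sketch, nothing like a real LLM: a one-weight "model" fit by gradient descent, with the three pieces labeled. All names and numbers are illustrative.

```python
# 1) Training data: in a real release this is the corpus; here, a toy dataset where y = 2x.
TRAINING_DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# 2) Training code: the "meta code" that turns data into weights.
def train(data, lr=0.01, epochs=500):
    w = 0.0  # a single weight; real models have billions
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # derivative of squared error w.r.t. w
            w -= lr * grad
    return w

# 3) Running code: inference with the trained weights.
def run(w, x):
    return w * x

weights = train(TRAINING_DATA)
print(run(weights, 5.0))  # converges near 10.0, since the learned weight approaches 2
```

With all three pieces published, anyone can retrain from scratch, inspect the result, and compare it against what was distributed; withhold any one of them and full replication is impossible.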
Hope that helps, let me know if you have any questions.