As the OSI prepares to make official its "open source AI" definition, with a glaring omission of any requirement that the actual source (the training data) be made available, it's worth noting that their work is funded by Google, Meta, Microsoft, Salesforce, etc. What does open source even mean here if the literal source of the model isn't open? These companies are invested in making you think they're on your side while they boil the oceans to avoid paying human beings for their labor.

The idea behind open source, as it grew out of the free software movement, has always been to water down software freedoms, to create something more palatable to corporate interests, something that *sounds* good but means very little. This definition continues that work for the current "gen AI" bubble. It's time to ditch open source as an ideal, and the OSI especially.

opensource.org/ai/drafts/the-o

#OpenSource #OpenSourceAI #OSI #OpenSourceInitiative #FreeSoftware #AI #GenAI #GenerativeAI

They posit that you can still modify (tune) the distributed models without the training source. You can also modify a binary executable without its source code. Frankly, that's unacceptable if we actually care about the human beings using the software.
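To make the analogy concrete, here's a minimal sketch (in PyTorch, with made-up model and file names) of what "tuning without the source" looks like: you can push the weights around with your own data, but the original training corpus stays a black box.

```python
# Hypothetical sketch: modifying a distributed model without its training
# data, analogous to patching a binary without source. The model class and
# checkpoint name are invented for illustration.
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(16, 2)

    def forward(self, x):
        return self.layer(x)

model = TinyModel()
# model.load_state_dict(torch.load("distributed_weights.pt"))  # opaque artifact

# Fine-tune on *your own* small dataset; the original training data never enters.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
for _ in range(10):
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
# The weights changed, but nothing here lets you reconstruct or audit
# what the distributed checkpoint was originally trained on.
```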

A key pillar of freedom as it relates to software is reproducibility. The ability to build a tool from scratch, in your own environment, with your own parameters, is absolutely indispensable to both learning how the tool works and changing the tool to better serve your needs, especially if your needs fall on the outskirts of the bell curve.
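As a rough illustration, assuming a toy PyTorch setup, "building from scratch, in your own environment, with your own parameters" might look like the sketch below. The point is that the function is useless without the training data to feed it.

```python
# Illustrative only: a real model would need the actual data and training
# code, which is exactly what's being withheld.
import torch
import torch.nn as nn

def train_from_scratch(data, labels, hidden=32, lr=1e-2, seed=0):
    torch.manual_seed(seed)  # reproducibility starts with fixed seeds
    model = nn.Sequential(nn.Linear(data.shape[1], hidden),
                          nn.ReLU(),
                          nn.Linear(hidden, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(100):
        opt.zero_grad()
        loss_fn(model(data), labels).backward()
        opt.step()
    return model

# Your parameters, your environment: swap the hidden size, learning rate, seed.
X, y = torch.randn(64, 8), torch.randint(0, 2, (64,))
m = train_from_scratch(X, y, hidden=16, lr=0.05, seed=42)
```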

There's also the issue of auditability. If you can't run the full build process yourself, producing your own results from scratch in a trusted environment to compare with what's distributed, it becomes exponentially harder to verify any claims about how a tool supposedly works.
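For a classic software artifact, that audit step is a hash comparison after a reproducible build, something like the sketch below (file names hypothetical). For models, bit-for-bit reproducibility is rarely achievable, so an audit would compare behavior and metrics instead, but the principle is the same: rebuild locally, then compare against what was distributed.

```python
# Sketch of the audit step: rebuild from source, then compare your artifact
# against the distributed one. Paths are hypothetical.
import hashlib

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Only possible if you can actually run the full build yourself:
# assert sha256("my_rebuild.pt") == sha256("distributed_weights.pt")
```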

Without the training data, this all becomes impossible for AI models. The OSI knows this. They're choosing to ignore it for the sake of expediency for the companies paying their bills, who want to claim "open" because it sounds good while actually hiding the (largely stolen and fraudulently or non-consensually acquired) source material of their current models.

Do we want a new definition of "open source" that actively thwarts analysis and tinkering, two fundamental requirements of software that respects human beings today? Reject this nonsense.

#OpenSource #OpenSourceAI #OSI #OpenSourceInitiative #FreeSoftware #AI #GenAI #GenerativeAI

I'm in no way affiliated with the OSI, and I know very little about current LLM tech, but I've been thinking a lot about this issue from a software-freedom philosophical perspective, trying to figure out how essential training data is for users to have the four essential freedoms.

It's not obvious to me whether having access to the training data places users and developers at an advantage or at a disadvantage compared with those who don't have access to it. Training data is so massive, and the link from any piece of it to the system's behavior so subtle, that it seems conceivable to me that probing the system's behavior and relying on incremental training might be more efficient and more reliable than analysis of the training set, for at least some past, current, and future technology.
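For what it's worth, here is a toy sketch of what I mean by probing behavior without the data: treat the system as a black-box callable and measure how often its outputs satisfy some check. Everything here is illustrative.

```python
# Black-box behavioral probing: no training data required, just query access.
# `model` is any callable from prompt to output; names are made up.
def probe(model, prompts, check):
    """Return the fraction of prompts whose output passes `check`."""
    hits = sum(1 for p in prompts if check(model(p)))
    return hits / len(prompts)

# Example: how often does the model emit a string we suspect was memorized?
# rate = probe(model, suspect_prompts, lambda out: "marker" in out)
```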

Since I don't know enough about current systems to tell, I set out to devise a method to find the answer to that question. I'm thinking of an adversarial setting in which users/developers who have access to the training data compete with users/developers who don't, both to answer questions about how the system works and to modify the system so that it does what is requested (these are analogous to freedom #1), with the questions and change requests coming from adversarial proponents. This would be a kind of Turing test for whether any given system respects freedom #1 (the other freedoms are much easier to tell), and it could be applied to any future such systems as well. Has any such thing been considered? Does it seem worth doing, or even thinking more about? cc: @joshuagay @zacchiro @freemo
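To make the proposal a bit more concrete, here is one hypothetical way the protocol could be encoded; the structure is entirely my own guess at how such a test might be scored.

```python
# Sketch of the adversarial test: two teams, one with the training data and
# one without, answer the same referee-chosen questions about the system.
from dataclasses import dataclass

@dataclass
class Challenge:
    question: str       # e.g. "why does the model refuse prompt X?"
    ground_truth: str   # settled by the referee after the fact

@dataclass
class TeamResult:
    answer: str
    hours_spent: float  # effort matters, not just correctness

def score(challenges, with_data, without_data, judge):
    """judge(answer, truth) -> bool; returns (accuracy_with, accuracy_without)."""
    def acc(results):
        ok = sum(judge(r.answer, c.ground_truth)
                 for c, r in zip(challenges, results))
        return ok / len(challenges)
    return acc(with_data), acc(without_data)

# If the with-data team reliably wins, training-data access is material to
# freedom #1; if not, behavioral probing may be an adequate substitute.
```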

@lxo

Thanks for linking me in. I work with LLMs and AI tech at a low level (the company I founded is building a next-gen LLM among other AI tools), so I know the internals of an LLM and how they work intimately. I read your post but not the whole thread yet; I'm a bit busy trying to run that company (I'm founder and CTO, so a lot is on my shoulders), so forgive me for not reading the entire thread. But what questions can I help answer directly? We understand how LLMs work quite well, so I'm confused by what you're asking. What is it you want to understand that you think we don't currently?

I've been tracking the OSI open-source AI discussion. I haven't seen anything too meaningful come out of it, but I hope it will. I can say this, though: many open source AIs will link to the corpus of training data used. What is more opaque is how the model was trained using that data, rather than the data itself. The final code is of course open source, but the meta code, the code that tells the system how to train, is not. That, however, is where the magic happens, in some ways more so than in the training data itself.

That said, I do agree with the OP that if you want a truly good open source definition of AI, one that encompasses full replication, you need to open up three things (see the sketch after the list):

1) The training data
2) The training code
3) The running code (the code for running the model after it is trained)
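As a trivial sketch of that checklist (paths and structure hypothetical): a release counts as fully replicable only if all three artifacts are published.

```python
# Hypothetical replication checklist for an "open source" model release.
REQUIRED_FOR_FULL_REPLICATION = {
    "training_data": "corpus/",       # 1) what the model learned from
    "training_code": "train.py",      # 2) how it learned (the "meta code")
    "running_code":  "inference.py",  # 3) how to use the trained weights
}

def is_fully_open(release: dict) -> bool:
    """True only if every replication artifact is present in the release."""
    return all(key in release for key in REQUIRED_FOR_FULL_REPLICATION)
```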

Hope that helps, let me know if you have any questions.

