@algernon I completely agree with the final goal you state. I'm trying to explore (unrelated to SWH) what the requirements are for FOSS-friendly code LLMs.

But note that part of the problem is that "FOSS" contains a very diverse group of people and goals. For instance, it seems to me that once again the lax/permissive vs copyleft split plays an important role here.

@zacchiro @algernon as someone very strongly in the BSD camp when it comes to licensing, I have a very strict ML/LLM policy.

The output is a function of the input, therefore its licence must be honoured. The licence gifts the work to the public under the small, not onerous, attribution requirement, so this attribution requirement, for which the authors give up so much of their rights, has a very high importance and must be honoured very strictly, more so than any individual requirement in a more complex licence (like Apache 2, Creative Commons, GPL family, EUPL, etc).

@mirabilos I'm very interested in your BSD camp take! In what form would you like to have attribution in the generated output?

One (silly) way to achieve that would be to create a huge file with *all* attributions from the training set, and *always* emit it for any output, no matter what. It would be impractical (which might be a feature, if we are anti code LLM no matter what). But if we ignore practicality for a moment, would you consider it acceptable?

@zacchiro

It would be a huge misattribution, as most of the authors mentioned would have had no impact on the specific code produced.

In the notorious case of reproducing the code of Quake 3 Arena, the tool should simply have attributed that code to its actual author.

The same should happen for any transformation of such code (reordered lines, renamed symbols and so on).

If the output mixes code from multiple authors, each and all of those authors (and only those authors) should be mentioned.

As for the license, such hypothetical software should only mix code under compatible licenses, and distribute / output it with the proper copyright statements declared in each of the original sources.

In other words, a programmer using software to create a derivative work of one or more projects should obey the exact same rules as any other programmer doing the same directly.
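To make the rule above concrete, here is a minimal sketch in Python. It assumes something current code LLMs generally do not provide: that the tool can tell which sources actually contributed to a given output. The names `Source`, `COMPATIBLE_PAIRS`, `compatible` and `attribution_for` are hypothetical, and the compatibility table is a deliberately tiny placeholder, not legal advice. Anything not listed is conservatively treated as incompatible.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Source:
    """One original work that contributed to a generated output (hypothetical model)."""
    author: str
    license: str   # license identifier of the original source
    notice: str    # copyright statement exactly as declared in that source


# Placeholder table (assumption for illustration only): unordered pairs of
# *different* licenses that we treat as combinable in one output.
COMPATIBLE_PAIRS = {frozenset({"BSD-2-Clause", "MIT"})}


def compatible(a: str, b: str) -> bool:
    # Identical licenses always combine; otherwise the pair must be listed.
    return a == b or frozenset({a, b}) in COMPATIBLE_PAIRS


def attribution_for(contributing: list[Source]) -> list[str]:
    """Return the notices of each and only the contributing sources.

    Raises ValueError if any two contributing licenses are not known
    to be compatible, so such an output is refused rather than emitted.
    """
    licenses = sorted({s.license for s in contributing})
    for a, b in combinations(licenses, 2):
        if not compatible(a, b):
            raise ValueError(f"incompatible licenses: {a} and {b}")

    notices: list[str] = []
    for s in contributing:
        if s.notice not in notices:  # de-duplicate, keep first-seen order
            notices.append(s.notice)
    return notices
```

Unlike the "emit every attribution from the training set" approach, the list returned here names only the authors whose code ended up in the output, which is the point of the objection.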

Yet your proposal is reasonable for the model itself: it's a derived work of all the sources used to statistically program it, so it should be attributed to all the original authors and should strictly respect each of the source licenses, as any other derivative work must.

This is not anti-LLM, just good sense: expensive automatic proxies should never put those who control them above the law.

@mirabilos
