**Stefano Zacchiroli** @zacchiro@mastodon.xyz · Mar 18, 2024, 07:56

@algernon I completely agree with the final goal you state. I'm trying to explore (unrelated to SWH) what the requirements are for FOSS-friendly code LLMs.

But note that part of the problem is that "FOSS" contains a very diverse group of people and goals. For instance, it seems to me that once again the lax/permissive vs copyleft split plays an important role here.

**mirabilos** @mirabilos@toot.mirbsd.org · Mar 21, 2024, 23:58

@zacchiro @algernon as someone very strongly in the BSD camp when it comes to licensing, I have a very strict ML/LLM policy.

The output is a function of the input, therefore its licence must be honoured. The licence gifts the work to the public under the small, not onerous, attribution requirement, so this attribution requirement, for which the authors give up so much of their rights, has a very high importance and must be honoured very strictly, more so than any individual requirement in a more complex licence (like Apache 2, Creative Commons, GPL family, EUPL, etc).

**Stefano Zacchiroli** @zacchiro@mastodon.xyz · Mar 22, 2024, 09:23

@mirabilos I'm very interested in your BSD camp take! In what form would you like to have attribution in the generated output?

One (silly) way to achieve that would be to create a huge file with *all* attributions from the training set, and *always* emit it for any output, no matter what. It would be impractical (which might be a feature, if we are against code LLMs no matter what). But if we ignore practicality for a moment, would you consider it acceptable?

**Shamar** @Shamar@qoto.org · Mar 22, 2024, 10:44

@zacchiro

It would be a huge misattribution, as most mentioned authors would have no impact on the specific code produced.

In the notorious case of #GitHub #Copilot distributing the code of Quake 3 Arena, it should have simply attributed it exactly to the author.

The same should happen for any transformation of such code (reordered lines, renamed symbols and so on).

If the output mixes code from multiple authors, each and all of those authors (and only those authors) should be mentioned.

As for licensing, such hypothetical software should only mix code from compatible licenses and distribute / output it with the proper copyright statements declared in each of the original sources.

In other words, a programmer using software to create a derivative work of one or more projects should obey the exact same rules followed by any other programmer directly doing the same.

Yet your proposal is reasonable for the #LLM itself: it's a derived work of all the sources used to statistically program it, so it should be attributed to all the original authors and should strictly respect each of the source licenses as any other derivative work.

This is not anti-LLM, just good sense: expensive automatic proxies should never put those who control them above the law.

@mirabilos
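
As a rough illustration of the rule Shamar describes (attribute only the authors whose code actually contributed to a given output, and refuse to mix code under incompatible licenses), here is a minimal sketch. Everything in it is an assumption made for illustration: the `Fragment` type, the `attribution_for` function, the author names, and above all the toy `COMPATIBLE_PAIRS` table, which is not legal advice and does not reflect how any existing tool works. It also presumes the generator can trace which source fragments fed into an output, which is itself an open problem.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Fragment:
    """A piece of source code that contributed to a generated output (hypothetical)."""
    author: str
    license_id: str
    notice: str  # copyright statement as declared in the original source


# Toy whitelist of license pairs treated as mixable. Illustrative only:
# a real compatibility decision needs legal review.
COMPATIBLE_PAIRS = {
    frozenset({"MIT", "BSD-2-Clause"}),
    frozenset({"MIT", "GPL-2.0-or-later"}),
    frozenset({"BSD-2-Clause", "GPL-2.0-or-later"}),
}


def licenses_compatible(a: str, b: str) -> bool:
    # Identical licenses can be mixed; anything else must be whitelisted.
    return a == b or frozenset({a, b}) in COMPATIBLE_PAIRS


def attribution_for(fragments: list[Fragment]) -> list[str]:
    """Return the notices of exactly the contributing authors, or refuse."""
    licenses = sorted({f.license_id for f in fragments})
    for a, b in combinations(licenses, 2):
        if not licenses_compatible(a, b):
            raise ValueError(f"refusing to mix {a} with {b}")
    # Deduplicate: each contributing author is mentioned once, nobody else.
    return sorted({f.notice for f in fragments})
```

With two fragments from a hypothetical "Alice" (MIT) and one from "Bob" (BSD-2-Clause), `attribution_for` returns just those two notices; a fragment under a license missing from the table makes it raise instead of emitting output.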
