**scy** @scy@chaos.social · Mar 20, 2024, 17:05

The BigCode project (supported by Hugging Face) created an "AI" dataset with 67 TB of code, a lot of it from GitHub users who did not agree to this. Some even claim that private repositories are included. 91 of my repositories are in it, many without an open-source license, but no private ones. They provide an opt-out link, but only for "future versions", and it simply creates an issue in a GitHub repo. 99.8 % of them are still in "open" state, dating back to March 2023.

https://huggingface.co/spaces/bigcode/in-the-stack

**scy** @scy@chaos.social · Mar 20, 2024, 17:09

Additional links:

Open opt-out requests:
https://github.com/bigcode-project/opt-out-v2/issues?page=20&q=is%3Aissue+%22opt-out+request%22
(yes, they're all publicly accessible)

The Stack dataset:
https://huggingface.co/datasets/bigcode/the-stack-v2

Claims about private repos being included:
https://post.lurk.org/@emenel/112111014479288871
(I can neither confirm nor deny this)

**scy** @scy@chaos.social · Mar 20, 2024, 17:17

So, this dataset contains a shitload of copyrighted code that does not allow redistribution, let alone creating derivative works from it, and the authors seem to have no intention of rectifying this.

They treat the existing datasets as immutable, and appear to ignore opt-out requests.

If you have a Hugging Face account, you can report the Stack v2 dataset via the three-dots menu on the top right at https://huggingface.co/datasets/bigcode/the-stack-v2

**scy** @scy@chaos.social · Mar 20, 2024, 17:25

Also note that while The Stack v1 contained code "from permissive licenses", v2 has extended this to "with permissive licenses or no license".

Yes, back when I was 16, I also thought that "no license" meant "no restrictions on what to do with it", but just to be clear: no, it means "you have no permission to do anything whatsoever".

Somebody please sue these guys into the ground?

**scy** @scy@chaos.social · Mar 20, 2024, 17:28

According to the BigCode project, they are "a community project jointly led by Hugging Face and ServiceNow. Both organizations committed research, engineering, ethics, governance, and legal resources".

https://www.bigcode-project.org/docs/about/organization/

So, maybe say hello to these companies' legal departments too …

**scy** @scy@chaos.social · Mar 20, 2024, 21:44

Wait, it's even worse. The dataset is based on @swheritage's archive, containing way more than just GitHub (e.g. @Codeberg is archived, too).

I assumed they were somewhat neutral, but they're praising the LLM usage of this unlicensed code:

https://www.softwareheritage.org/2024/02/28/responsible-ai-with-starcoder2/

Also, they're refusing to remove deadnames, even outright ignoring GDPR demands for it:

https://cohost.org/arborelia/post/5169338-the-software-heritag

I can only conclude that they're a bad actor and should be considered harmful by the #OpenSource community.

**Shamar** @Shamar@qoto.org · 2024-03-21T08:05:10Z

@scy

Well, I'd argue that they represent #OpenSource pretty well, since it was designed precisely to marginalize #FreeSoftware (a political movement of hackers) and to serve corporate interests: https://thebaffler.com/salvos/the-meme-hustler

What would you expect from an organization that ethics-washes #Google, #Microsoft and so on?
https://www.softwareheritage.org/support/sponsors/

@swheritage @Codeberg