The BigCode project (supported by Hugging Face) created an "AI" dataset with 67 TB of code, a lot of it from GitHub users who did not agree to this. Some even claim that private repositories are included. 91 of my repositories are in it, many without an open-source license, but no private ones. They provide an opt-out link, but only for "future versions", and it simply creates an issue in a GitHub repo. 99.8 % of these issues are still in the "open" state, with the oldest dating back to March 2023.

huggingface.co/spaces/bigcode/

Additional links:

Open opt-out requests:
github.com/bigcode-project/opt
(yes, they're all publicly accessible)

The Stack dataset:
huggingface.co/datasets/bigcod

Claims about private repos being included:
post.lurk.org/@emenel/11211101
(I can neither confirm nor deny this)

So, this dataset contains a shitload of copyrighted code that does not allow redistribution, let alone creating derivative works from it, and the authors seem to have no intention of rectifying this.

They treat the existing datasets as immutable, and appear to ignore opt-out requests.

If you have a Hugging Face account, you can report the Stack v2 dataset via the three-dots menu on the top right at huggingface.co/datasets/bigcod

Also note that while The Stack v1 contained code "from permissive licenses", v2 has extended this to "with permissive licenses or no license".

Yes, back when I was 16, I also thought that "no license" meant "no restrictions on what to do with it", but just to be clear: no, it means "you have no permission to do anything whatsoever".

Somebody please sue these guys into the ground?

According to the BigCode project, they are "a community project jointly led by Hugging Face and ServiceNow. Both organizations committed research, engineering, ethics, governance, and legal resources".

bigcode-project.org/docs/about

So, maybe say hello to these companies' legal departments too …

Wait, it's even worse. The dataset is based on @swheritage's archive, containing way more than just GitHub (e.g. @Codeberg is archived, too).

I assumed they were somewhat neutral, but they're praising the LLM usage of this unlicensed code:

softwareheritage.org/2024/02/2

Also, they're refusing to remove deadnames, even outright ignoring GDPR demands for it:

cohost.org/arborelia/post/5169

I can only conclude that they're a bad actor and should be considered harmful by the #OpenSource community.

@scy

Well, I'd argue that they represent pretty well , that was exactly designed to marginalize (a political movement of hackers) and to serve corporate interests: thebaffler.com/salvos/the-meme

What you'd expect from an organization ethic-washing , and so on?
softwareheritage.org/support/s

@swheritage @Codeberg
