The BigCode project (supported by Hugging Face) created an "AI" dataset with 67 TB of code, a lot of it from GitHub users who did not agree to this. Some even claim that private repositories are included. 91 of my repositories are in it, many without an open-source license, but no private ones. They provide an opt-out link, but only for "future versions", and it simply creates an issue in a GitHub repo. 99.8 % of these issues are still in the "open" state, dating back to March 2023.
Additional links:
Open opt-out requests:
https://github.com/bigcode-project/opt-out-v2/issues?page=20&q=is%3Aissue+%22opt-out+request%22
(yes, they're all publicly accessible)
The Stack dataset:
https://huggingface.co/datasets/bigcode/the-stack-v2
Claims about private repos being included:
https://post.lurk.org/@emenel/112111014479288871
(I can neither confirm nor deny this)
So, this dataset contains a shitload of copyrighted code whose authors never allowed redistribution, let alone the creation of derivative works, and the dataset's authors seem to have no intention of rectifying this.
They treat the existing datasets as immutable, and appear to ignore opt-out requests.
If you have a Hugging Face account, you can report the Stack v2 dataset via the three-dots menu on the top right at https://huggingface.co/datasets/bigcode/the-stack-v2
Also note that while The Stack v1 contained code "from permissive licenses", v2 has extended this to "with permissive licenses or no license".
Yes, back when I was 16, I also thought that "no license" meant "no restrictions on what to do with it", but just to be clear: no, it means "you have no permission to do anything whatsoever".
Somebody please sue these guys into the ground?
According to the BigCode project, they are "a community project jointly led by Hugging Face and ServiceNow. Both organizations committed research, engineering, ethics, governance, and legal resources".
https://www.bigcode-project.org/docs/about/organization/
So, maybe say hello to these companies' legal departments too …
Well, I'd argue that they represent #OpenSource pretty well: it was designed precisely to marginalize #FreeSoftware (a political movement of hackers) and to serve corporate interests: https://thebaffler.com/salvos/the-meme-hustler
What would you expect from an organization ethics-washing #Google, #Microsoft and so on?
https://www.softwareheritage.org/support/sponsors/