Like many other technologists, I gave my time and expertise for free to #StackOverflow because the content was licensed CC-BY-SA - meaning that it was a public good. It brought me joy to help people figure out why their #ASR code wasn't working, or assist with a #CUDA bug.

Now that a deal has been struck with #OpenAI to scrape all the questions and answers on Stack Overflow to train #GenerativeAI models like #LLMs - without attribution to authors (as required by the CC-BY-SA license under which Stack Overflow content is licensed), and with the results sold back to us (even though the SA clause requires derivative works to be shared under the same license) - I have issued a Data Deletion request to Stack Overflow to disassociate my identity from my contributions, and am closing my account, just as I did with Reddit, Inc.

policies.stackoverflow.co/data

The data I helped create is going to be bundled in an #LLM and sold back to me.

In a single move, Stack Overflow has alienated its community - which is also its main source of competitive advantage - in exchange for token lucre.

Stack Overflow, like the rest of the Stack Exchange network, used to fulfill a psychological contract - help others out when you can, in the expectation that others may in turn assist you in the future. Now it's not an exchange, it's #enshittification.

Programmers now join artists and copywriters, whose works have been snaffled up to create #GenAI solutions.

The silver lining I see: once OpenAI creates LLMs that generate code - as Microsoft has done with Copilot on GitHub - where will developers go for help with the bugs those models introduce, particularly given the "downward pressure on code quality" that the recent GitClear report attributes to these tools?

While this is just one more example of #enshittification, it's also a salient lesson for #DevRel folks - if your community is your source of advantage, don't upset them.

@KathyReid Sincere question: shouldn't Creative Commons turn this into a class action lawsuit? What they're doing is so huge and out of proportion that groups of individuals can't really tackle it alone.

@blogdiva @KathyReid apparently, it isn't as clear as it seems that they're breaking the license, see this post about images creativecommons.org/2023/02/17

@j3j5 @blogdiva @KathyReid This appears to be only an opinion, which 1) doesn't seem very well informed to me, and 2) only mentions code-related issues without addressing them.

More precisely, when talking about code, this is clearly not part of the process used:
"Diffusion models like Stable Diffusion and Midjourney take these inputs, add “noise” to them, corrupting them, and then train neural networks to remove the corruption."
Neither is this:
"This is because using the digitized books as part of the database provided information about the books and did not use them for their creative content"

Also, regarding the court's inverse question, this seems to me an extremely valid concern:
"On this point, the court wrote that the better question to answer was not how much of the works [company] copied, but instead how much was available to users". Due to the granularity of SO, this could easily be 100%, although that would need to be illustrated (i.e. proven), as in the NYT case.

Could anyone identify "substantial, non-infringing uses" of the SO data? Here the use of CC might make it difficult to establish what the contributor's original interest was.

A final note: it remains somewhat unclear to me what relevance the Oracle v. Google case has in the author's argument.


@j3j5 @blogdiva @KathyReid Excuse me - on second thought, BERT and GPT do exactly this corrupt-and-train on the dataset. That certainly weakens my interpretation. Apologies for the confusion.
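To make the corrupt-and-train point concrete: BERT's pre-training objective masks out tokens and asks the model to restore them. A minimal sketch of the corruption step (the function and constants here are illustrative, not the actual BERT implementation):

```python
import random

def corrupt(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    # BERT-style corruption (sketch): randomly replace a fraction of
    # tokens with a mask symbol; the training objective is then to
    # predict the original tokens from the corrupted sequence.
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_rate else t for t in tokens]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(corrupt(tokens, mask_rate=0.3))
```

(GPT's objective is next-token prediction rather than masking, so "corrupt-and-train" fits BERT more exactly than GPT.)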

@mapto @j3j5 @blogdiva @KathyReid
Not sure what the relevance of corrupt-and-train is to the legal argument being made here. Wolfson claims "they do not piece together new images from bits of images from their training data" - but one could argue that transcoding a Disney movie into a lossy MPEG format doesn't piece together frames from bits of the original either: each frame is regenerated from discrete cosine transforms and motion vectors, and error correction happens during storage. Does that make it fair use?
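The MPEG analogy can be made concrete with a toy one-dimensional sketch: transform a row of pixel values with a DCT, quantise the coefficients (the lossy step), and invert. Real codecs work on 8x8 blocks of a 2-D DCT with motion compensation; the values and quantisation step below are illustrative.

```python
import math

def dct(x):
    # Orthonormal DCT-II of a 1-D signal (the transform that
    # JPEG/MPEG blocks are built on, here in one dimension).
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
        out.append(s * (math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)))
    return out

def idct(X):
    # Matching inverse transform (orthonormal DCT-III).
    N = len(X)
    return [sum((math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N))
                * X[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for k in range(N))
            for n in range(N)]

pixels = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
coeffs = dct(pixels)
step = 10.0  # coarse quantisation: this is where information is lost
quantised = [round(c / step) * step for c in coeffs]
restored = idct(quantised)
# The restored values approximate the originals without being
# byte-for-byte copies - each one is regenerated from coefficients.
print([round(v, 1) for v in restored])
```

The decoded frame is close to the source but not identical to it, which is the sense in which a lossy transcode also "does not piece together bits" of the original.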

Qoto Mastodon