#AI demystified: a decompiler
To prove that any "artificial neural network" is just a statistically programmed (virtual) machine, whose model software is a derivative work of the source dataset used during its "training", we provide a small suite of tools to assemble and program such machines, together with a decompiler that reconstructs the source dataset from the cryptic matrices that constitute the software they execute.
Finally, we test the suite on the classic #MNIST dataset and compare the decompiled dataset with the original one.
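To give a taste of the idea, here is a minimal sketch, not the actual tool suite from the article: a single-layer softmax classifier trained by plain gradient descent from all-zero weights, whose learned matrix can be rendered back as digit-like templates, because every update is built directly from the training images. The scikit-learn loader is just one convenient assumption; any MNIST source works.

import numpy as np
from sklearn.datasets import fetch_openml   # assumed loader, any MNIST source works

# Fetch MNIST and keep a small, normalised slice.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X[:10000].astype(np.float64) / 255.0
Y = np.eye(10)[y[:10000].astype(int)]       # one-hot targets

# The "software" of this vector mapping machine: a weight matrix and a bias.
W = np.zeros((784, 10))
b = np.zeros(10)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Plain batch gradient descent: every update to W is a linear combination
# of the training images, so the dataset is literally folded into the matrix.
for epoch in range(50):
    P = softmax(X @ W + b)
    W -= 0.5 * X.T @ (P - Y) / len(X)
    b -= 0.5 * (P - Y).mean(axis=0)

# Crude "decompilation": each column of W, reshaped to 28x28, is a template
# statistically assembled from the training images of that digit.
templates = W.T.reshape(10, 28, 28)
for digit, t in enumerate(templates):
    print(f"digit {digit}: template values in [{t.min():.3f}, {t.max():.3f}]")
# To eyeball them: import matplotlib.pyplot as plt; plt.imshow(templates[3], cmap="gray"); plt.show()

The article's actual decompiler reconstructs individual samples; this sketch only recovers class-level templates, but it already makes the dependency of the "software" on the source dataset visible.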
#ArtificialIntelligence
#MachineLearning
#ArtificialNeuralNetworks
#microsoft
#GitHubCopilot
#Python
#StatisticalProgramming
#VectorMappingMachine
http://www.tesio.it/2021/09/01/a_decompiler_for_artificial_neural_networks.html
According to the same reasoning, compiling a C program is a lossy compression that does not preserve the exact source code and, as such, the resulting binary could be decompiled and reused freely.
(Note, however, the correction I made to the article.)
@Shamar I don't know, to me compilation is a translation, not lossy compression. If some information is lost, then it is lost only because it has no meaning in the target language; otherwise it's a direct translation, which is also the main purpose of compiling the program. Sometimes we even write the sources with specific translations in mind, like function inlining or tail call optimization.
That said, any restrictions on binary distribution or reverse engineering come from the license, not from copyright law directly, and the argument presented for Copilot is that it's fair use and licenses do not apply, citing Google Book Search as an example. From what I understand, fair use essentially covers use that does not directly diminish the marketability of the original work by copying substantial portions of it. Even if Copilot somehow memorized your entire project, it does not by itself diminish the marketability of your project, since the end product isn't even in the same market. Someone using Copilot to produce a substantial copy of your work would, but that is on them.
In my eyes the problem with this argument is that, if that's the position Microsoft takes, then nobody in their right mind would want to use Copilot as anything but a curiosity. It's much more likely that they would want to take the position of lossy compression that then generates original works, just like generating a random video with the same number of red pixels, in which case you'll have the argument that it's not lossy enough, as it can produce substantial copies.
Still, I don't think you can argue that ANNs in general are derivative works of the dataset. They are too general and the law is too fuzzy.
I might want to sell my source code to Microsoft for Copilot training, so calling its usage "fair use" is wrong: it reduces the marketability of my work.
Btw, where did you find such a definition of "fair use"? I'd like to give it a read.
@Shamar "I might want to" is not a market. You might want to sell and they might not want to buy and just stick to other projects. Is there an established market of selling software as ANN datasets and are you a player in it? Was your project built and marketed as ANN training data? If not then what purposes was it built and marketed for and how does copilot interfere with it? The court will rule based on the realities and common sense of today, no theoretical possibilities.
I didn't find my own opinion form anywhere, but if you are interested you can look up the law yourself. There are plenty of direct quotes from court rulings in wikipedia if you just want something to discuss. The gist is, it's open to interpretation, leaning toward practical rather than ideological side of things. If it doesn't serve as a substitute for your work, and take your users/readers/viewers away directly because of this substitution, then it'll likely be considered fair use.
@Shamar it really sounds more like a lossy compression than compilation for some "virtual" machine. You can't argue that lossy compression in general is derivative work. I can take screenshots of your entire codebase and redistribute them as JPEGs: if the JPEGs are readable it's a derivative work or even a straight-up copy; if they are not, then it's clearly neither of those things. Similarly, I can scan your video for the number of red pixels per frame, then generate some white noise with the same number of red pixels per frame; again, clearly not a derivative work (even if it turns out to be objectively better than your original video). ANNs are too general to argue about in absolutes, like you can with source-to-binary translation, and doing so only weakens the case against Copilot specifically.
Also, while the article links to the arguments of "lawyers and politicians", it does not in any way fairly represent them in the narrative or directly address them. The arguments basically boil down to Copilot being a weird search engine, and it's up to the user to ensure they do not end up violating any license terms while using this weird search engine, basically by making sure that they never generate large bodies of code with it and limit its use to small snippets. That's their explanation of why it falls under fair use and why copyright law doesn't apply, not that it does not contain the original sources. It does contain them in some form, just like a search engine database would, and that's not a problem, since it does not somehow automatically release all of it into the public domain or something; it requires human input to do anything, and said human then assumes all responsibility.