Epistemic status: probably wrong, and way too vague (what is information?), but very interesting.

An idea about ANN grokking, generalization, and why scaling works: a particular generalization forms only once its prerequisite information has been learned by the ANN. Since larger nets can encode more information at a given time, at any given time they hold the prerequisite information for more generalizations, so those generalizations form. In smaller nets, information is constantly being replaced with each new training batch, so it's less likely that any given generalization's prerequisite information is actually held by the net at any one time.
This would also imply:
* Training on training-set (ts) examples in a particular order will develop the generalization corresponding to that order (if there is one)
* Generalizations are also learned information, and they are more resistant to being replaced than regular information, probably because they are more likely to be used in any given training example
As training time increases, it becomes more and more likely that a net encounters batches in the right order to learn the prerequisite information for a generalization to develop.
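Here's a crude toy calculation of that intuition; the whole model (k prerequisite "pieces", a fixed per-batch probability of teaching each piece, and a `retention` window standing in for net capacity) is my own made-up assumption, not anything measured:

```python
# Toy back-of-envelope model -- illustrative only, all numbers invented.
# Assumptions: a generalization needs k independent pieces of prerequisite
# information encoded at the same time; each batch teaches a given piece with
# probability p; a learned piece survives `retention` batches before being
# overwritten (bigger net ~ longer retention).
def p_generalization_forms(k, p, retention, total_batches):
    # chance a single piece is currently encoded (seen within the last `retention` batches)
    p_piece = 1 - (1 - p) ** retention
    # chance all k pieces coexist at one checkpoint
    p_coexist = p_piece ** k
    # chance they coexist at least once during training, checking once every
    # `retention` batches and (crudely) treating checkpoints as independent
    checkpoints = max(1, total_batches // retention)
    return 1 - (1 - p_coexist) ** checkpoints

# same data, same number of batches, different "capacity"
print(p_generalization_forms(k=5, p=0.02, retention=10,   total_batches=10_000))  # ~0.18
print(p_generalization_forms(k=5, p=0.02, retention=1000, total_batches=10_000))  # ~1.0
```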
This would indicate superior training methods might:
* Somehow prevent the most common information (which might merge into the most common generalizations) from being replaced (vector rejections of ts-subset gradients, maybe? see the sketch after this list)
* Train on more fundamental ts examples first (pretraining is probably also in this category)
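Here's a rough sketch of what I mean by the vector-rejection idea; it's untested, `loss_fn(model, batch)` is a placeholder interface I'm assuming, and the projection rule is basically the conflicting-gradient trick used in continual-learning methods like PCGrad rather than anything novel:

```python
import torch

def rejected_sgd_step(model, loss_fn, core_batch, new_batch, lr=1e-3):
    # loss_fn(model, batch) -> scalar loss is an assumed placeholder interface
    params = [p for p in model.parameters() if p.requires_grad]

    # gradient on the protected "core" ts subset
    core_grad = torch.autograd.grad(loss_fn(model, core_batch), params)
    core_flat = torch.cat([g.reshape(-1) for g in core_grad])

    # gradient on the current training batch
    new_grad = torch.autograd.grad(loss_fn(model, new_batch), params)
    new_flat = torch.cat([g.reshape(-1) for g in new_grad])

    # if the batch gradient conflicts with the core gradient, reject (remove)
    # its component along the core-gradient direction
    dot = torch.dot(new_flat, core_flat)
    if dot < 0:
        new_flat = new_flat - (dot / core_flat.dot(core_flat)) * core_flat

    # plain SGD step with the (possibly) projected gradient
    with torch.no_grad():
        offset = 0
        for p in params:
            n = p.numel()
            p -= lr * new_flat[offset:offset + n].view_as(p)
            offset += n
```

The rejection only fires when the two gradients conflict (negative dot product), so batches that agree with the protected information are left untouched.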
I've been meaning to try training a simple net N times on various small ts subsets, taking the one with the best validation-set loss, mutating it via random example replacement, and recursing. This might be a step toward discovering the ts subsets that form the best generalizations.
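A minimal sketch of that loop, assuming placeholder callables `train_model(examples)` and `val_loss(model)` for whatever simple net and validation set get used:

```python
import random

def search_ts_subsets(full_ts, train_model, val_loss,
                      subset_size, n_candidates, n_rounds, n_mutations=2):
    """Evolutionary search for a small ts subset with good validation loss."""
    # start from random candidate subsets (lists of indices into full_ts)
    candidates = [random.sample(range(len(full_ts)), subset_size)
                  for _ in range(n_candidates)]

    best_subset, best_loss = None, float("inf")
    for _ in range(n_rounds):
        # train a fresh net on each candidate subset and score it on the val set
        for subset in candidates:
            model = train_model([full_ts[i] for i in subset])
            loss = val_loss(model)
            if loss < best_loss:
                best_loss, best_subset = loss, subset

        # mutate the best subset so far: swap a few indices for random ones
        # (duplicates within a subset are possible in this quick sketch)
        candidates = []
        for _ in range(n_candidates):
            mutant = list(best_subset)
            for _ in range(n_mutations):
                mutant[random.randrange(subset_size)] = random.randrange(len(full_ts))
            candidates.append(mutant)

    return best_subset, best_loss
```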