Epistemic status: probably wrong, and way too vague (what is information?), but very interesting. An idea about ANN grokking, generalization, and why scaling works: a particular generalization forms only once its prerequisite information has been learned by the ANN. Since larger nets can encode more information at any given time, they are more likely to hold the prerequisite information for any given generalization, so those generalizations form. In smaller nets, information is constantly being replaced with each new training batch, so it's less likely that any given generalization's prerequisite information is actually retained by the net.

This would also imply:
* Training on training-set examples in a particular order will develop the generalization associated with that order (if there is one)
* Generalizations are themselves learned information, but they are more resistant to being replaced than ordinary information, probably because they are more likely to be used in any given training example

As training time increases, it becomes more and more likely that a net encounters batches in the right order to accumulate the prerequisite information for a generalization to develop.

This would suggest that superior training methods might:
* Somehow prevent the most common information (which might merge into the most common generalizations) from being replaced (vector rejections of training-subset gradients, maybe? A sketch follows this list)
* Train on the more fundamental training-set examples first (pretraining probably also falls in this category)
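
A minimal sketch of what I mean by the vector-rejection idea, assuming gradients flattened into 1-D NumPy vectors and a "protected" gradient computed on a core training subset. The function name and the only-reject-when-conflicting rule are just illustrative choices (similar in spirit to PCGrad-style gradient surgery), not a worked-out method:

```python
import numpy as np


def reject_conflicting_component(batch_grad: np.ndarray,
                                 protected_grad: np.ndarray) -> np.ndarray:
    """Vector rejection: strip from `batch_grad` the component that points
    against `protected_grad`, so an update along the result cannot undo
    progress along the protected direction.

    Both arguments are a model's gradients flattened into 1-D vectors.
    """
    denom = protected_grad @ protected_grad
    if denom == 0.0:
        return batch_grad
    overlap = batch_grad @ protected_grad
    if overlap >= 0.0:
        # Batch gradient already agrees with the protected direction.
        return batch_grad
    # Remove the conflicting projection onto the protected direction.
    return batch_grad - (overlap / denom) * protected_grad


# Toy usage: protected_grad would come from the "core" subset,
# batch_grad from the current minibatch.
protected_grad = np.array([1.0, 0.0])
batch_grad = np.array([-0.5, 1.0])   # partially undoes the protected direction
print(reject_conflicting_component(batch_grad, protected_grad))  # -> [0. 1.]
```

In a real training loop the protected gradient would be recomputed periodically from the core subset, and the rejected gradient handed to the optimizer in place of the raw batch gradient.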

I've been meaning to try this: train a simple net N times on various small training-set subsets, take the subset with the best validation loss, mutate it via random example replacement, and recurse. It might be a step toward discovering the training-set subsets which form the best generalizations; a rough sketch follows.
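
A rough sketch of that loop, assuming a generic `train_and_eval(subset)` callable that trains a fresh small net on the subset and returns its validation loss. The subset size, population, and mutation counts are placeholders:

```python
import random
from typing import Callable, Sequence


def evolve_subset(train_examples: Sequence,
                  train_and_eval: Callable[[list], float],
                  subset_size: int = 256,
                  population: int = 8,
                  generations: int = 20,
                  mutations_per_child: int = 16,
                  seed: int = 0) -> list:
    """Greedy search for a training subset with low validation loss.

    Start from `population` random subsets, keep the one with the best
    validation loss, then repeatedly spawn mutated copies (random example
    replacement) and recurse on whichever child improves on the parent.
    """
    rng = random.Random(seed)
    pool = list(train_examples)

    # Initial population of random subsets; keep the best one.
    candidates = [rng.sample(pool, subset_size) for _ in range(population)]
    scored = [(train_and_eval(c), c) for c in candidates]
    best_loss, best = min(scored, key=lambda pair: pair[0])

    for _ in range(generations):
        for _ in range(population):
            child = list(best)
            # Mutation: swap a few examples for random ones from the full set.
            for _ in range(mutations_per_child):
                child[rng.randrange(subset_size)] = rng.choice(pool)
            loss = train_and_eval(child)
            if loss < best_loss:
                best_loss, best = loss, child
    return best
```

Note the mutation step can introduce duplicate examples into a subset; deduplicating, or sampling replacements only from the unused pool, would be an easy refinement.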
