Meta-Principled Family of Hyperparameter Scaling Strategies

In this note, we first derive a one-parameter family of hyperparameter
scaling strategies that interpolates between the neural-tangent scaling and
mean-field/maximal-update scaling. We then calculate the scalings of dynamical
observables -- network outputs, neural tangent kernels, and differentials of
neural tangent kernels -- for wide and deep neural networks. These calculations
in turn reveal a proper way to scale depth with width such that resultant
large-scale models maintain their representation-learning ability. Finally, we
observe that various infinite-width limits examined in the literature
correspond to the distinct corners of the interconnected web spanned by
effective theories for finite-width neural networks, with their training
dynamics ranging from being weakly-coupled to being strongly-coupled.
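A minimal numerical sketch of what such a one-parameter interpolation can look like, under illustrative conventions chosen for this note rather than the paper's exact parameterization: a two-layer network whose readout is multiplied by n^(-(1+s)/2) and trained with learning rate eta0 * n^s, so that s = 0 mimics neural-tangent-style scaling and s = 1 mimics mean-field-style scaling. The function name, the restriction to a readout-only SGD step, and the specific exponents are assumptions made for this sketch; the point is only that the one-step change in the network output stays order one in width for every s along the family.

```python
# Illustrative sketch (assumed conventions, not the paper's derivation):
# readout multiplier n**(-(1+s)/2) with learning rate eta0 * n**s.
# s = 0 ~ neural-tangent-style scaling, s = 1 ~ mean-field-style scaling.
import numpy as np

rng = np.random.default_rng(0)

def output_update(n, s, d=16, eta0=1.0):
    """Size of the network-output change after one SGD step on the readout."""
    x = rng.normal(size=d) / np.sqrt(d)      # fixed unit-scale input
    U = rng.normal(size=(n, d))              # hidden weights: O(1) preactivations
    h = np.tanh(U @ x)                       # hidden activations, entries O(1)
    w = rng.normal(size=n)                   # readout weights, unit variance
    mult = n ** (-(1.0 + s) / 2.0)           # interpolating readout multiplier
    eta = eta0 * n ** s                      # compensating learning-rate scaling

    y = 1.0                                  # dummy regression target
    f0 = mult * w @ h                        # network output before the step
    grad_w = (f0 - y) * mult * h             # d/dw of 0.5 * (f - y)**2
    w_new = w - eta * grad_w                 # one SGD step on the readout only
    f1 = mult * w_new @ h                    # network output after the step
    return abs(f1 - f0)

for s in (0.0, 0.5, 1.0):
    changes = [np.mean([output_update(n, s) for _ in range(20)])
               for n in (128, 512, 2048, 8192)]
    print(f"s={s:.1f}  |Δf| across widths:", np.round(changes, 3))
```

Across widths, |Δf| stays roughly constant for each s, while the output at initialization shrinks like n^(-s/2), vanishing in the mean-field-style corner and remaining order one in the neural-tangent-style corner.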