Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Training stability is of great importance to Transformers. In this work, we
investigate the training dynamics of Transformers by examining the evolution of
the attention layers. In particular, we track the attention entropy for each
attention head during the course of training, which is a proxy for model
sharpness. We identify a common pattern across different architectures and
tasks, where low attention entropy is accompanied by high training instability,
which can take the form of oscillating loss or divergence. We denote the
pathologically low attention entropy, corresponding to highly concentrated
attention scores, as $\textit{entropy collapse}$. As a remedy, we propose
$\sigma$Reparam, a simple and efficient solution where we reparametrize all
linear layers with spectral normalization and an additional learned scalar. We
demonstrate that the proposed reparameterization successfully prevents entropy
collapse in the attention layers, promoting more stable training. Additionally,
we prove a tight lower bound on the attention entropy, which decreases
exponentially fast with the spectral norm of the attention logits, providing
additional motivation for our approach. We conduct experiments with
$\sigma$Reparam on image classification, image self-supervised learning,
machine translation, automatic speech recognition, and language modeling tasks,
across Transformer architectures. We show that $\sigma$Reparam provides
stability and robustness with respect to the choice of hyperparameters, going
so far as to enable training (a) a Vision Transformer to competitive performance
without warmup, weight decay, layer normalization, or adaptive optimizers; (b)
deep architectures in machine translation; and (c) speech recognition models to
competitive performance without warmup and adaptive optimizers.
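
Below is a minimal PyTorch-style sketch of the reparameterization described above: each linear layer computes its effective weight as $\hat{W} = (\gamma / \sigma(W))\, W$, where $\sigma(W)$ is the spectral norm (estimated here with power iteration) and $\gamma$ is a learned scalar. The class name `SigmaReparamLinear`, the initialization, and the power-iteration details are illustrative assumptions, not the authors' released implementation.

```python
# Sketch (not the authors' released code) of the sigma-Reparam idea:
# each linear layer's weight W is replaced by W_hat = (gamma / sigma(W)) * W,
# where sigma(W) is the spectral norm, estimated via power iteration, and
# gamma is a single learned scalar initialized to 1.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaReparamLinear(nn.Module):
    """Linear layer whose effective weight is (gamma / ||W||_2) * W."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(1))  # learned scalar
        # Persistent vector for power iteration on W.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            # One power-iteration step per forward pass to track sigma(W).
            v = F.normalize(self.weight.t() @ self.u, dim=0)
            if self.training:
                self.u.copy_(F.normalize(self.weight @ v, dim=0))
        # Spectral-norm estimate sigma(W) ~= u^T W v; gradients flow through W only.
        sigma = torch.dot(self.u, self.weight @ v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

In a Transformer, such a layer would stand in for every linear projection (query, key, value, output, and MLP weights). Per the abstract, the motivation is that the attention-entropy lower bound shrinks exponentially with the spectral norm of the attention logits, so controlling the spectral norms of the projections keeps the per-row attention entropy $-\sum_i p_i \log p_i$ from collapsing.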