The Benefits of Mixup for Feature Learning

Mixup, a simple data augmentation method that randomly mixes two data points
via linear interpolation, has been extensively applied in various deep learning
applications to gain better generalization. However, the theoretical
underpinnings of its efficacy are not yet fully understood. In this paper, we
aim to seek a fundamental understanding of the benefits of Mixup. We first show
that Mixup using different linear interpolation parameters for features and
labels can still achieve performance similar to that of standard Mixup. This
indicates that the intuitive linearity explanation in Zhang et al. (2018) may
not fully explain the success of Mixup. Then we perform a theoretical study of
Mixup from the feature learning perspective. We consider a feature-noise data
model and show that Mixup training can effectively learn the rare features
(appearing in a small fraction of data) from their mixtures with the common
features (appearing in a large fraction of data). In contrast, standard
training learns only the common features and fails to learn the rare
features, and thus suffers from poor generalization performance. Moreover, our
theoretical analysis shows that the benefits of Mixup for feature learning
are mostly gained in the early training phase, based on which we propose to
apply early stopping in Mixup. Experimental results verify our theoretical
findings and demonstrate the effectiveness of the early-stopped Mixup training.
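To make the augmentation concrete, below is a minimal NumPy sketch of standard Mixup and of the variant mentioned above that mixes features and labels with different interpolation coefficients. The function names, the alpha_x/alpha_y split, and the choice to draw the two coefficients independently are illustrative assumptions rather than the paper's exact procedure.

# A minimal sketch of Mixup, assuming one-hot labels and a Beta(alpha, alpha)
# mixing distribution; names and the decoupled-coefficient variant below are
# illustrative assumptions only.
import numpy as np

def mixup_batch(x, y, alpha=1.0, rng=None):
    """Standard Mixup: one shared lambda interpolates both features and labels."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)              # mixing coefficient lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))            # random pairing of examples within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]   # linear interpolation of features
    y_mix = lam * y + (1.0 - lam) * y[perm]   # the same interpolation applied to labels
    return x_mix, y_mix

def mixup_batch_decoupled(x, y, alpha_x=1.0, alpha_y=1.0, rng=None):
    """Variant referred to in the abstract: features and labels are mixed with
    different interpolation coefficients (drawn independently here as an assumption)."""
    rng = rng if rng is not None else np.random.default_rng()
    lam_x = rng.beta(alpha_x, alpha_x)        # coefficient used for the features
    lam_y = rng.beta(alpha_y, alpha_y)        # a separate coefficient for the labels
    perm = rng.permutation(len(x))
    x_mix = lam_x * x + (1.0 - lam_x) * x[perm]
    y_mix = lam_y * y + (1.0 - lam_y) * y[perm]
    return x_mix, y_mix

# Toy usage: a batch of 8 examples with 4 features and 3 classes (one-hot labels).
x = np.random.default_rng(0).random((8, 4))
y = np.eye(3)[np.random.default_rng(1).integers(0, 3, size=8)]
x_mix, y_mix = mixup_batch(x, y, alpha=0.2)

For the early-stopped Mixup training the abstract proposes, one would simply cap the number of Mixup training epochs (or stop once validation accuracy plateaus) rather than training to convergence; the sketch above only covers the per-batch augmentation step.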