It's interesting - Meta said that to achieve good performance on their top model, they had to throw away 95% of their SFT data! Less really is more for alignment (echoing their now several-year-old LIMA paper). That sticks out to me, because it raises the question: what other skills can be taught to the largest model with only a few examples and then distilled?