I found the papers "Scaling Laws for Neural Language Models" (OpenAI, 2020) and "Training Compute-Optimal Large Language Models" (DeepMind, 2022) interesting:
https://arxiv.org/pdf/2001.08361.pdf
https://arxiv.org/pdf/2203.15556.pdf
They run a LOT of experiments training large language models (causal transformers) with varying hyperparameters, in particular model size, shape, batch size, and training dataset size, varied over many orders of magnitude. 1/?
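To give a flavor of what "scaling law" means here: the papers fit power-law curves like L(N) = (N_c / N)^alpha to the measured loss as a function of model size N. Below is a minimal sketch of such a fit, not from either paper's code, using made-up (parameter count, loss) points and a simple log-log linear fit:

```python
import numpy as np

# Hypothetical (parameter count, converged loss) points spanning several
# orders of magnitude in model size; the numbers here are made up.
n_params = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses   = np.array([5.0, 4.1, 3.4, 2.8, 2.3])

# A power law L(N) = (N_c / N)**alpha is a straight line in log-log space:
#   log L = alpha * log N_c - alpha * log N
slope, intercept = np.polyfit(np.log(n_params), np.log(losses), 1)
alpha = -slope
n_c = np.exp(intercept / alpha)
print(f"alpha ~ {alpha:.3f}, N_c ~ {n_c:.3g}")
```

The same kind of fit can be done with loss as a function of dataset size or training compute; the papers' headline results are the exponents of these curves.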