@crude2refined yeah, that doesn't surprise me much. For one, feature flags like bias terms or learnable scales (and really the other ~20 parameters baked into the 5 or 6 common layer types too) are easily overlooked knobs, enabled by default, in the off-the-shelf layers of DNN packages.

Moreover, when transformers first burst onto the scene they arrived as one fairly complex piece of machinery, with all sorts of design choices locked in by a single team (after plenty of testing and tuning, of course). There's no way all of those choices were optimal even for their own dataset, let alone for everything they've since been applied to, and of course the wider community has been finding quite a lot of big and small improvements over the years (e.g. long-context attention) and is still finding them.

I'd say it's not even obvious that anything in GPT-2 or later will be part of the ultimate sequence model we'd have 20+ years from now.
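To make the hidden-defaults point concrete, here's a quick PyTorch sketch (assuming torch is installed; the layers and flag names below are standard torch.nn signatures, but the dimensions are just illustrative). Each constructor silently turns on extra learnable parameters unless you opt out:

    import torch.nn as nn

    # bias=True by default -> an extra 512-element bias vector
    lin = nn.Linear(512, 512)

    # elementwise_affine=True by default -> learnable per-feature scale and shift
    ln = nn.LayerNorm(512)

    # bias=True, add_bias_kv=False, dropout=0.0, batch_first=False, ... all defaulted
    mha = nn.MultiheadAttention(512, 8)

    # Counting parameters makes the defaults visible:
    for name, m in [("Linear", lin), ("LayerNorm", ln), ("MHA", mha)]:
        n = sum(p.numel() for p in m.parameters())
        print(name, n)

Unless you go read the docs (or count parameters like this), it's easy to ship a model with a pile of design choices you never actually made.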