According to Andrej Karpathy, the original GPT-2 architecture (as described in the paper) used bias terms in its linear & normalization layers.

When a layer is followed by normalization, its bias term doesn't do much, since the normalization will re-center the values anyway.
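
To make the re-centering point concrete, here's a small check (PyTorch assumed, purely illustrative and not from the thread): a constant offset added before LayerNorm is removed exactly by the mean subtraction, while a general per-feature bias is only partially absorbed.

```python
import torch
import torch.nn as nn

# Illustrative check of the re-centering argument.
torch.manual_seed(0)
ln = nn.LayerNorm(8)          # default gamma=1, beta=0
x = torch.randn(4, 8)

# A constant shift across the feature dimension vanishes after LayerNorm,
# because the per-row mean absorbs it entirely.
print(torch.allclose(ln(x), ln(x + 3.0), atol=1e-5))   # True

# A per-feature bias vector is only partly absorbed: LayerNorm removes its
# mean across features, but the remaining variation still changes the output.
b = torch.randn(8)
print(torch.allclose(ln(x), ln(x + b), atol=1e-5))      # False in general
```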

Karpathy himself notes that removing the bias term in the GPT-2 architecture seems to speed up and improve things:
github.com/karpathy/nanoGPT/bl
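
A minimal sketch of what that change looks like in practice (assumptions: PyTorch, nanoGPT-style names like `n_embd`; this is in the spirit of the linked code, not a copy of it): a single `bias` flag toggles the bias vector in every Linear layer and the additive beta term in LayerNorm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNormOptBias(nn.Module):
    """LayerNorm whose additive beta term can be switched off."""
    def __init__(self, ndim: int, bias: bool):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x):
        # F.layer_norm accepts bias=None, so no bias parameters exist at all
        # when the flag is off.
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)

class MLPBlock(nn.Module):
    """GPT-2-style MLP sub-block; n_embd and the 4x expansion are illustrative."""
    def __init__(self, n_embd: int, bias: bool = False):
        super().__init__()
        self.ln = LayerNormOptBias(n_embd, bias=bias)
        self.c_fc = nn.Linear(n_embd, 4 * n_embd, bias=bias)
        self.c_proj = nn.Linear(4 * n_embd, n_embd, bias=bias)

    def forward(self, x):
        return x + self.c_proj(F.gelu(self.c_fc(self.ln(x))))
```

With bias=False, every Linear drops its bias vector and LayerNorm keeps only its scale, which trims a few parameters and, per the observation above, tends not to hurt training.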

It's interesting that even something as ground-breaking as GPT-2 still has low-hanging-fruit improvements!


@crude2refined yeah that doesn't surprise me much; for one, feature flags like bias or learnable scale and so on (but really the other 20 parameters baked into the 5 or 6 different layer types too) are easily overlooked hidden bits enabled by default in the off-the-shelf layers of DNN packages. And moreover, when transformers first burst onto the scene they appeared as one fairly complex piece of machinery with all sorts of design choices locked in by just one team (of course after plenty of testing and tuning). There's no way that all those choices could have been optimal even for their own dataset, let alone everything they've now been applied to, and of course the greater community has been finding quite a lot of big & small improvements over the years (e.g. long-context attention) and is still finding them. I'd say it's not even obvious that anything in GPT-2 or later would be part of the ultimate sequence model we'd have in 20+ years.
