@crude2refined yeah, that doesn't surprise me much. For one, feature flags like bias terms or learnable scales (and really the other ~20 parameters baked into the 5 or 6 common layer types too) are easily overlooked knobs, enabled by default, in the off-the-shelf layers of DNN packages.

Moreover, when transformers first burst onto the scene they arrived as one fairly complex piece of machinery, with all sorts of design choices locked in by a single team (after plenty of testing and tuning, of course). There's no way all of those choices were optimal even for their own dataset, let alone for everything they've since been applied to, and of course the wider community has been finding quite a lot of big and small improvements over the years (e.g. long-context attention) and is still finding them.

I'd say it's not even obvious that anything in GPT-2 or later will be part of the ultimate sequence model we'd have 20+ years from now.
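To make the hidden-defaults point concrete, here's a quick PyTorch sketch (assuming torch is installed; the layers and flag names below are standard torch.nn signatures, but the dimensions are just illustrative). Each constructor silently turns on extra learnable parameters unless you opt out:

    import torch.nn as nn

    # bias=True by default -> an extra 512-element bias vector
    lin = nn.Linear(512, 512)

    # elementwise_affine=True by default -> learnable per-feature scale and shift
    ln = nn.LayerNorm(512)

    # bias=True, add_bias_kv=False, dropout=0.0, batch_first=False, ... all defaulted
    mha = nn.MultiheadAttention(512, 8)

    # Counting parameters makes the defaults visible:
    for name, m in [("Linear", lin), ("LayerNorm", ln), ("MHA", mha)]:
        n = sum(p.numel() for p in m.parameters())
        print(name, n)

Unless you go read the docs (or count parameters like this), it's easy to ship a model with a pile of design choices you never actually made.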