I have a creeping intuition that residual connections, by letting information flow through the network with no degradation or impediment, are somehow holding back modern large transformer architectures. ResNet was a breakthrough, but I wonder if there's another approach that encourages better internal representations and specialization.
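For concreteness, here's a minimal sketch of what that unimpeded path looks like in a pre-norm transformer sub-block, alongside one hypothetical alternative with a learned gate on the identity path. The class names and the gated variant are illustrative assumptions, not something proposed in the post:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Standard pre-norm transformer sub-block: the input x passes
    through on an untouched identity path, so features and gradients
    flow with no degradation."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        # identity path + additive update
        return x + self.sublayer(self.norm(x))

class GatedResidualBlock(nn.Module):
    """Hypothetical alternative (illustrative only): a learned
    per-channel gate on the identity path lets the network attenuate
    the residual stream instead of passing it through unchanged."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer
        # initialized to 1.0, i.e. starts out as a plain residual block
        self.gate = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.gate * x + self.sublayer(self.norm(x))
```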


@ericflo I think there are learned circuits to suppress irrelevant information to keep the transformer trunk tidy.
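A toy illustration of how that could work in principle (an assumption about the mechanism, not a specific published circuit): because sub-layers write *additively* into the residual stream, a layer can erase a feature by adding back its negative projection.

```python
import torch

d = 8
feature = torch.randn(d)
feature /= feature.norm()           # unit direction encoding some "irrelevant" info

x = torch.randn(d) + 3.0 * feature  # residual stream carrying that feature

# A "suppression" write: project onto the feature direction and subtract it.
update = -(x @ feature) * feature
x_clean = x + update                # additive update, as in a transformer sub-layer

print((x @ feature).item())         # large component before
print((x_clean @ feature).item())   # ~0 after suppression
```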
