This is SO COOL: maybe Chain of Thought is a short-lived hack rather than a fundamental building block, whereas looped transformers seem fundamental if they scale. This aligns with empirical evidence that looping a transformer's inner layers can improve performance even without retraining (rough sketch of the idea below). https://sites.google.com/wisc.edu/looped-transformers-for-lengen/home
@dpwiz Hard to argue with the bitter lesson, but this seems orthogonal or even in line with it - those loops ain't free
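For anyone wondering what "looping inner transformer layers" actually looks like, here's a minimal sketch (my own illustration, not the linked paper's code): a weight-tied transformer block reapplied for a configurable number of passes at inference. The layer sizes, loop count, and wrapper class are assumptions for illustration only.

```python
# Minimal sketch of a looped (weight-tied) transformer block.
# Assumptions: d_model, n_heads, n_loops, and this wrapper are illustrative,
# not the linked work's actual implementation.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Apply one shared transformer layer `n_loops` times."""
    def __init__(self, d_model=256, n_heads=4, n_loops=4):
        super().__init__()
        # A single encoder layer; looping it adds effective depth
        # without adding parameters.
        self.layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):
            x = self.layer(x)  # reuse the same weights on every pass
        return x

# Usage: more loops = more compute per token at inference.
x = torch.randn(2, 16, 256)            # (batch, seq, d_model)
model = LoopedBlock(n_loops=8).eval()
with torch.no_grad():
    y = model(x)
print(y.shape)  # torch.Size([2, 16, 256])
```

Each extra pass is another full forward through the shared block, which is exactly the "those loops ain't free" trade-off: more effective depth per parameter, paid for in inference compute rather than model size.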