This paper presents some compelling results (some intuitive, some less so). The causality analysis reveals non-linearity worth further investigation, and a closer analysis of the effect of parameter count may be warranted if the true dynamic is sigmoidal.
Overall, this work provides thorough and encouraging results for distilling pre-trained language models into recursive transformers. The idea of adding per-layer adapters while re-using the shared MLP and attention weights is particularly interesting.
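To make the weight-tying idea concrete, here is a minimal PyTorch sketch of one recursive block: a single set of attention and MLP parameters is looped over several iterations, while a cheap low-rank adapter per iteration restores some per-layer flexibility. All class names, the additive adapter placement, and hyperparameters here are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Low-rank per-depth correction added alongside a shared sublayer (illustrative)."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class RecursiveBlock(nn.Module):
    """One shared transformer block applied `depth` times, with per-iteration adapters."""
    def __init__(self, dim: int, n_heads: int, depth: int, rank: int = 8):
        super().__init__()
        # Shared (tied) parameters: one attention and one MLP reused at every depth.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Untied parameters: a small adapter per loop iteration ("layer").
        self.attn_adapters = nn.ModuleList(LoRAAdapter(dim, rank) for _ in range(depth))
        self.mlp_adapters = nn.ModuleList(LoRAAdapter(dim, rank) for _ in range(depth))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for attn_ad, mlp_ad in zip(self.attn_adapters, self.mlp_adapters):
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out + attn_ad(h)    # shared attention + per-depth correction
            h = self.norm2(x)
            x = x + self.mlp(h) + mlp_ad(h)  # shared MLP + per-depth correction
        return x
```

The appeal of this design is the parameter accounting: the expensive attention and MLP weights are paid for once, while each extra "layer" costs only two rank-8 adapters, so the model can trade depth for a near-constant parameter budget.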
Overall, this was a refreshing work from OpenAI that shines a light on often underappreciated aspects of ML -- dataset curation and generalization behavior! Models and code are openly available at: