In practice, the hidden dimension "D" is also worth tuning, since it has a significant effect on FLOPs.
Some facts:
- All GLUE datasets need only T = 128. SQuAD, RACE, and many other reading comprehension (RC) datasets need T = 512.
- D = 512/768/1024 for MobileBERT/BERT-base/BERT-large.
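
To get a feel for how T and D trade off, here is a rough back-of-the-envelope sketch. It is not taken from this repo: the function name `encoder_layer_macs` and the cost model are assumptions. The model is a standard Transformer encoder layer with a feed-forward inner size of 4D, counting only matrix multiplications (embeddings, softmax, layer norm, and biases are ignored). MobileBERT's bottleneck layers do not follow this layout, so D = 512 below is only a stand-in for a smaller hidden size.

```python
def encoder_layer_macs(T, D, ffn_mult=4):
    """Approximate multiply-accumulates for one standard encoder layer.

    Assumed cost model (matrix multiplications only):
      - Q, K, V and output projections: 4 * T * D^2
      - attention scores plus weighted sum of values: 2 * T^2 * D
      - feed-forward block with inner size ffn_mult * D: 2 * ffn_mult * T * D^2
    """
    projections = 4 * T * D * D
    attention = 2 * T * T * D
    feed_forward = 2 * ffn_mult * T * D * D
    return projections + attention + feed_forward


if __name__ == "__main__":
    # Sequence lengths and hidden sizes taken from the facts listed above.
    for T in (128, 512):
        for D in (512, 768, 1024):
            macs = encoder_layer_macs(T, D)
            print(f"T={T:4d}  D={D:5d}  ~{macs / 1e9:.2f} GMACs per layer")
```

Under this model the cost is linear in T for the projection and feed-forward terms but quadratic in D, so at short sequence lengths such as T = 128 the D^2 terms dominate and shrinking D saves far more compute than shrinking T by the same factor.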