Zihang Dai @ZihangDai profile

Zihang Dai

@ZihangDai

Followers: 15,979

Following: 223

Media: 1

Statuses: 25

Working hard @xai

Bay Area

Joined March 2012




Zihang Dai

@ZihangDai

11 months

Haven’t been to a conference in a while. Looking forward to meeting friends on Thursday!

Jimmy Ba

@jimmybajimmyba

11 months

Excited to arrive at NeurIPS later today alongside some of my colleagues. @xai / @grok crew will have a Meet & Greet session on Thursday at 2:30pm local time by the registration desk. Drop by for some fun, giggles, and good roasts!

354

196

700

317

139

195

Zihang Dai

@ZihangDai

4 years

In NLP, the O(TD^2) linear projections in the Transformer often cost more FLOPs than the O(T^2D) attention, since D > T is common. While many efforts focus on reducing the quadratic attention to a linear O(TKD), our Funnel-Transformer explores reducing T, with clear gains:

7

31

163

Zihang Dai

@ZihangDai

19 days

Join us for the fun!

Yuhuai (Tony) Wu

@Yuhu_ai_

19 days

Three components of Reasoning for AI: 1. Foundation (Pre-training) 2. Self-improvement (RL) 3. Test-time compute (planning). @xai will soon have the best foundation in the world - Grok3. Join us to advance reasoning to the next-level! 🔥🔥

155

377

3K

5

38

131

Zihang Dai

@ZihangDai

5 years

@srush_nlp

2

3

88

Zihang Dai

@ZihangDai

5 years

@srush_nlp Thanks for the donation.

2

3

80

Zihang Dai

@ZihangDai

4 years

In practice, one should also consider tuning the hidden dimension "D", given its significant effect on the FLOPs. Some facts: - All datasets in GLUE only require T = 128. SQuAD, RACE, and many RC datasets require T = 512. - D = 512/768/1024 for mobile-BERT/BERT-base/BERT-large

15

2

21
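A rough way to check the projection-versus-attention FLOPs comparison for the T and D values listed in the tweet above is sketched below. The per-layer accounting is an assumption, not something stated in the thread: four D x D projections (query, key, value, output), a feed-forward block with inner width 4D, and two T x T x D attention matrix products, all counted as multiply-accumulates. The function name layer_macs and the ffn_mult parameter are hypothetical, and real models such as mobile-BERT deviate from this simplified layout.

```python
def layer_macs(T, D, ffn_mult=4):
    """Approximate multiply-accumulates for one Transformer layer.

    Assumed layout (illustrative only): Q/K/V/output projections,
    a feed-forward block of inner width ffn_mult * D, and
    dot-product attention over a length-T sequence.
    """
    projections = 4 * T * D * D            # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * T * D * D  # two linear maps, D -> ffn_mult*D -> D
    attention = 2 * T * T * D              # QK^T scores plus the weighted sum over V
    return projections + feed_forward, attention


# Sequence lengths and hidden sizes quoted in the tweet.
for T in (128, 512):
    for D in (512, 768, 1024):
        linear, attention = layer_macs(T, D)
        print(f"T={T:4d} D={D:5d} linear/attention = {linear / attention:.1f}")
```

Under these assumptions the ratio of linear to attention cost reduces to 6D/T, so it grows as D increases or T shrinks, which is consistent with the suggestion to treat D as a tunable quantity alongside T.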

Zihang Dai

@ZihangDai

5 years

@srush_nlp @JesseDodge @ssgrn @nlpnoah (1) From the bias-variance trade-off perspective, Transformer has a "weaker" model bias (2) Key success factors: (a) a deep-thin TFM rather than a shallow-fat TFM; (b) *copy* AWD-LSTM regularization (3) After (2), the variance is not high

2

1

15

Zihang Dai

@ZihangDai

5 years

@ZhitingHu Nice work and interesting results. It may be better and easier to use Transformer-XL for generation (github.com/kimiyoung/transformer-xl).

1

13

Zihang Dai

@ZihangDai

5 years

@srush_nlp @JesseDodge @ssgrn @nlpnoah (1) It is mentioned in the paper "Similar to AWD-LSTM (Merity et al., 2017), we apply variational dropout and weight average to Transformer-XL". (2) From AWD-LSTM, it took me less than a week to get under 60 PPL (3) I got the final PPL within 2 weeks with 4 GPUs on my own machine

1

2

12

Zihang Dai

@ZihangDai

5 years

@srush_nlp @JesseDodge @ssgrn @nlpnoah For (2), I only meant the key success factors for **small datasets like PTB**.

1

0

2