Zihang Dai Profile
Zihang Dai

@ZihangDai

Followers: 15,979 · Following: 223 · Media: 1 · Statuses: 25

Working hard @xai

Bay Area
Joined March 2012
@ZihangDai
Zihang Dai
11 months
Haven’t been to a conference in a while. Looking forward to meeting friends on Thursday!
@jimmybajimmyba
Jimmy Ba
11 months
Excited to arrive at NeurIPS later today alongside some of my colleagues. @xai / @grok crew will have a Meet & Greet session on Thursday at 2:30pm local time by the registration desk. Drop by for some fun, giggles, and good roasts!
@ZihangDai
Zihang Dai
4 years
In NLP, the O(TD^2) linear projections in the Transformer often cost more FLOPs than the O(T^2D) attention, since D > T is common. While many efforts focus on reducing the quadratic attention to linear O(TKD), our Funnel-Transformer explores reducing T instead, with clear gains:
[Tweet includes three images.]
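For a concrete sense of the FLOP comparison above, here is a rough back-of-the-envelope sketch (not from the paper): it counts multiply-accumulates in one standard Transformer encoder layer, ignoring constants such as the multiply-add factor of 2 and the softmax/LayerNorm costs, and it assumes the usual 4x FFN expansion.

```python
# Rough per-layer FLOP comparison: linear projections vs. self-attention.
# Illustrative sketch only; constants and minor terms are ignored.

def projection_flops(T, D):
    """Q/K/V/output projections: four (T x D) @ (D x D) matmuls -> O(T * D^2)."""
    return 4 * T * D * D

def ffn_flops(T, D, expansion=4):
    """Two position-wise FFN matmuls with inner size expansion * D -> also O(T * D^2)."""
    return 2 * T * D * (expansion * D)

def attention_flops(T, D):
    """QK^T score computation plus the weighted sum over values -> O(T^2 * D)."""
    return 2 * T * T * D

if __name__ == "__main__":
    for T, D in [(128, 768), (512, 768), (512, 1024)]:
        linear = projection_flops(T, D) + ffn_flops(T, D)
        attn = attention_flops(T, D)
        print(f"T={T:4d} D={D:5d}  linear={linear / 1e9:5.2f} GFLOPs  "
              f"attention={attn / 1e9:5.2f} GFLOPs  ratio={linear / attn:5.1f}x")
```

With D = 768 and T = 128 or 512 (typical BERT-base settings), the linear parts dominate by roughly an order of magnitude, which is the point the tweet makes about reducing T rather than only the T^2 term.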
@ZihangDai
Zihang Dai
19 days
Join us for the fun!
@Yuhu_ai_
Yuhuai (Tony) Wu
19 days
Three components of Reasoning for AI: 1. Foundation (Pre-training) 2. Self-improvement (RL) 3. Test-time compute (planning). @xai will soon have the best foundation in the world - Grok3. Join us to advance reasoning to the next level! 🔥🔥
@ZihangDai
Zihang Dai
5 years
@srush_nlp Thanks for the donation.
@ZihangDai
Zihang Dai
4 years
In practice, one should also consider tuning the hidden dimension D, given its significant effect on the FLOPs. Some facts:
- All datasets in GLUE only require T = 128; SQuAD, RACE, and many RC datasets require T = 512.
- D = 512/768/1024 for MobileBERT/BERT-base/BERT-large.
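To illustrate how strongly D moves the total, the sketch below plugs the T/D combinations named above into the same crude per-layer cost model (linear parts ~ 12·T·D^2 with a 4x FFN expansion, attention ~ 2·T^2·D); the configuration labels are only for illustration, not measurements of the actual models.

```python
# Crude per-layer cost model for the (task T, model D) pairs named in the tweet.
# Hypothetical illustration: real models differ in layer count and other details.

def layer_cost(T, D, ffn_expansion=4):
    linear = (4 + 2 * ffn_expansion) * T * D * D  # Q/K/V/output projections + FFN
    attention = 2 * T * T * D                     # attention scores + weighted sum
    return linear, attention

configs = [
    ("D=512  (MobileBERT-like), GLUE  T=128", 128, 512),
    ("D=768  (BERT-base),       GLUE  T=128", 128, 768),
    ("D=768  (BERT-base),       SQuAD T=512", 512, 768),
    ("D=1024 (BERT-large),      SQuAD T=512", 512, 1024),
]

for name, T, D in configs:
    linear, attention = layer_cost(T, D)
    total = linear + attention
    print(f"{name}  total={total / 1e9:5.2f} GFLOPs  linear share={linear / total:.0%}")
```

At T = 128, going from D = 512 to D = 768 already roughly doubles the per-layer cost, which is why the tweet suggests treating D as a tunable knob alongside T.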
@ZihangDai
Zihang Dai
5 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah (1) From the bias-variance trade-off perspective, Transformer has a "weaker" model bias. (2) Key success factors: (a) a deep-thin TFM rather than a shallow-fat TFM; (b) *copy* AWD-LSTM regularization. (3) After (2), the variance is not high.
@ZihangDai
Zihang Dai
5 years
@ZhitingHu Nice work and interesting results. It may be better and easier to use Transformer-XL for generation ().
@ZihangDai
Zihang Dai
5 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah (1) It is mentioned in the paper: "Similar to AWD-LSTM (Merity et al., 2017), we apply variational dropout and weight average to Transformer-XL". (2) From AWD-LSTM, it took me less than a week to get under 60 PPL. (3) I got the final PPL within 2 weeks with 4 GPUs on my own machine.
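For context on the regularization being discussed, here is a minimal PyTorch sketch of the variational ("locked") dropout idea: one dropout mask is sampled per sequence and reused across all time steps. This is an illustrative re-implementation of the general technique, not the actual AWD-LSTM or Transformer-XL code, and it leaves out the weight-averaging part.

```python
import torch

class VariationalDropout(torch.nn.Module):
    """Variational (locked) dropout: sample one mask per sequence and reuse it
    at every time step, instead of drawing an independent mask per position."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape [batch, time, dim]
        if not self.training or self.p == 0.0:
            return x
        # One Bernoulli mask per (batch, feature), broadcast over the time axis.
        mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1.0 - self.p)
        return x * mask / (1.0 - self.p)

# Example: the same features are dropped at every time step of a sequence.
drop = VariationalDropout(p=0.3)
drop.train()
out = drop(torch.randn(2, 5, 8))
```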
@ZihangDai
Zihang Dai
5 years
@srush_nlp @JesseDodge @ssgrn @nlpnoah For (2), I only meant the key success factors for **small datasets like PTB**.