Naman Goyal
@NamanGoyal21
Followers: 1K | Following: 2K | Media: 2 | Statuses: 195
Research engineer, LLM scaling at GenAI Meta | Worked on: Llama 1, 2, 3, OPT, BlenderBot, XLM-R, BART, RoBERTa
Joined November 2012
Facebook AI Research's sequence modeling library @fairseq has made its Twitter debut. Please follow for the latest updates.
2
10
42
@StasBekman The best GEMM TFLOPS I got across various large matmul sizes for A100 was ~286. For H100 it’s currently about 750-800 for bf16 and 1550-1600 for fp8. I think and hope GEMM performance will improve over time as NVIDIA optimises matmuls for H100 further with newer cuBLAS versions.
2
0
12
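For anyone wanting to reproduce numbers like these, here is a minimal sketch of the kind of GEMM benchmark being described, using plain PyTorch matmuls in bf16. The matrix shapes and iteration counts are illustrative assumptions, not the exact ones behind the figures above, and fp8 would need different kernels/APIs than this sketch covers.

    import time
    import torch

    def gemm_tflops(m, n, k, dtype=torch.bfloat16, iters=50, warmup=10):
        # Allocate operands on the GPU in the requested precision.
        a = torch.randn(m, k, device="cuda", dtype=dtype)
        b = torch.randn(k, n, device="cuda", dtype=dtype)
        for _ in range(warmup):          # let cuBLAS pick and tune its kernel
            a @ b
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        flops = 2 * m * n * k * iters    # 2*M*N*K FLOPs per matmul
        return flops / elapsed / 1e12    # TFLOPS

    # Try a few large, hardware-friendly shapes and keep the best number.
    print(max(gemm_tflops(s, s, s) for s in (4096, 8192, 16384)))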
@jekbradbury I do agree that it’s gonna need a decent amount of engineering work! Though my guess is any team with access to a stable H100 cluster with 400Gbps InfiniBand (or a similar interconnect) should reach there by the end of the year at the latest.
2
1
10
@dome_271 The easiest way would be to disable flatten params and then set lr 0.0 for the params you don’t wanna update. I think after setting flatten params to false, setting requires_grad=False should also work, but I’d have to check that to be sure.
2
0
4
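A minimal sketch of the two options mentioned above, in plain PyTorch; the layer names and the 0.0-lr param-group trick are illustrative, and with FSDP you would additionally need flatten params disabled so frozen and trainable weights don't share one flat buffer.

    import torch

    model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 4))

    # Option 1: keep all params in the optimizer, but give the frozen ones lr=0.0.
    frozen = list(model[0].parameters())
    trainable = list(model[1].parameters())
    optimizer = torch.optim.AdamW(
        [{"params": trainable, "lr": 1e-4},
         {"params": frozen, "lr": 0.0}]
    )

    # Option 2: mark the frozen params so no gradients are computed for them at all.
    for p in model[0].parameters():
        p.requires_grad = False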
@borisdayma @andrew_n_carr We recently noticed that the learnable scale of LN is also not needed, at least up to and beyond the 6.7B model scale.
1
0
4
@giffmana @_arohan_ @achowdhery @arankomatsuzaki And also from one less inter-GPU communication within the tensor-parallel GPUs, which PaLM was doing. Interestingly, we are able to remove the learnable scale of LN as well without losing on PPL, so it's just the normalization that helps.
2
0
4
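For context, the "scale of LN" in the two replies above is LayerNorm's learnable gain. A minimal sketch of what dropping it looks like in PyTorch (a generic illustration, not the exact code used in those experiments; elementwise_affine=False removes both the learnable scale and the offset):

    import torch.nn as nn

    dim = 4096
    # Standard LayerNorm: normalization plus learnable gain and bias.
    ln_with_affine = nn.LayerNorm(dim)
    # Same normalization, with the learnable scale/offset removed.
    ln_no_affine = nn.LayerNorm(dim, elementwise_affine=False)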
@SashaMTL @Jsevillamol I was curious too; the above link seems to show ~90 kg CO2 per hour per passenger, i.e. roughly 700 kg per passenger for the full flight. Assuming 333 passengers, that's ~700 kg * 333 ≈ 233 tons, which is very close to the ~271 tons.
1
0
3
@zhansheng I fine-tuned MLMs (RoBERTa, BART, XLM-R, from 100M to 10B scale) a bunch around that time, but I can’t remember this behavior being specific to fine-tuning. Mainly, big models overall were less stable, and for that I think two things changed: pre-layer-norm and bf16.
0
0
4
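Since pre-layer-norm is named above as one of the two stability changes, here is a minimal sketch of the difference against the original post-LN ordering (a generic transformer sub-block for illustration, not fairseq's actual implementation):

    import torch.nn as nn

    class Block(nn.Module):
        def __init__(self, dim, heads, pre_ln=True):
            super().__init__()
            self.pre_ln = pre_ln
            self.ln = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x):
            if self.pre_ln:
                # Pre-LN: normalize the sublayer input; the residual path stays clean.
                h = self.ln(x)
                y, _ = self.attn(h, h, h)
                return x + y
            # Post-LN (original transformer): normalize after the residual add.
            y, _ = self.attn(x, x, x)
            return self.ln(x + y)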
@OriolVinyalsML I have one request for Flamingo output: I am really curious how Flamingo does on this classic example from Andrej Karpathy's 10-year-old blog post.
1
0
3
@_arohan_ I remember this paper from Google Brain used MoE (GShard / Switch Transformer style configuration) in ViT models to scale up to 15B parameters. It's conditional compute, so maybe not what you meant to ask?
1
0
2
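The "GShard / Switch Transformer style configuration" here means a sparse mixture-of-experts layer with a learned router. A minimal top-1 routing sketch of the conditional-compute idea (hypothetical dimensions, no capacity limits or load-balancing loss):

    import torch
    import torch.nn as nn

    class Top1MoE(nn.Module):
        def __init__(self, dim, num_experts):
            super().__init__()
            self.router = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

        def forward(self, x):                      # x: (tokens, dim)
            scores = self.router(x).softmax(dim=-1)
            gate, idx = scores.max(dim=-1)         # each token picks one expert
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = idx == e                    # only tokens routed to expert e
                if mask.any():
                    out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
            return out

    moe = Top1MoE(dim=64, num_experts=4)
    print(moe(torch.randn(10, 64)).shape)          # torch.Size([10, 64])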
@lyeskhalil @UofT @uoftmie @uoftengineering @IVADO_Qc @polymtl @69alodi @BDilkina Wow!!! Congrats Elias!!!
0
0
1
@StasBekman It’s totally unrelated to divergence, but it’s usually a good idea to make model_dim / num_heads (the per-head dimension) a power of two, not just make model_dim divisible by num_heads. I have seen empirical speedups from that.
2
0
1
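A tiny sketch of that heuristic, purely for illustration: check that the per-head dimension itself is a power of two, not just that model_dim divides evenly.

    def head_dim_is_power_of_two(model_dim, num_heads):
        head_dim = model_dim // num_heads
        # Divisible, and head_dim has exactly one bit set (i.e. is a power of two).
        return model_dim % num_heads == 0 and (head_dim & (head_dim - 1)) == 0

    print(head_dim_is_power_of_two(5120, 40))  # True:  head_dim = 128
    print(head_dim_is_power_of_two(5120, 32))  # False: head_dim = 160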
@StasBekman I don’t know of any publicly available semi-official numbers. Plus I think it might vary a bit with the exact configuration of the server: power capping or not, type of cooling, etc.
1
0
1
@annargrs @myleott @vesko_st @LukeZettlemoyer @omerlevy_ @YinhanL @mandarjoshi_ @danqi_chen Yes, the variance for RTE and MRPC is higher compared to tasks with bigger datasets. E.g., for the last row in the above table, the SD for some tasks is {RTE: 1.57, MRPC: 0.87, MNLI: 0.15, QNLI: 0.096, SST: 0.21} across 5 seeds. We will consider adding SDs in the updated version of the paper.
1
0
1
@Ethan_smith_20 @dome_271 If your frozen params are not at the end but in between the transformer layers, you anyway need to compute dgrad for everything; you'd just be computing wgrad for them unnecessarily, which can make it at most ~1/3 slower (forward, dgrad, and wgrad each cost roughly the same). I agree it's not ideal, but it's an easy thing to try.
0
0
0
@LChoshen @YebHavinga @BramVanroy @YinhanL @thoma_gu @xl_nlp Thanks for the question! Every individual sample instance was always from a single language, but within a batch, each sample could be from a different language.
2
0
1
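A minimal sketch of the sampling scheme being described, with hypothetical corpora: each drawn sample is monolingual, while a batch can mix languages.

    import random

    corpora = {
        "en": ["en sentence 1", "en sentence 2"],
        "fr": ["fr phrase 1", "fr phrase 2"],
        "sw": ["sw sentensi 1", "sw sentensi 2"],
    }

    def make_batch(batch_size):
        batch = []
        for _ in range(batch_size):
            lang = random.choice(list(corpora))          # pick a language per sample
            batch.append(random.choice(corpora[lang]))   # each sample stays monolingual
        return batch

    print(make_batch(4))  # samples within one batch may come from different languages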