Naman Goyal @NamanGoyal21 profile

Naman Goyal

@NamanGoyal21

Followers

1K

Following

2K

Media

2

Statuses

195

Research engineer, LLM scaling at GenAI Meta | Worked on: llama-1,2,3, OPT, blenderbot, XLMR, Bart, Roberta

Joined November 2012

Don't wanna be here? Send us removal request.

Naman Goyal

@NamanGoyal21

2 years

Its crazy that, at 60% Model FLOPS (FP8) Utilization on H100, original GPT3 configuration can be trained in 3 days on 1024 H100s and PaLM on 12 days on 2048 H100s. That's roughly 50x lesser gpu hours for GPT3 paper 3 years back, and 9x lesser for palm released 9 months back.

13

35

380

Naman Goyal

@NamanGoyal21

1 year

One of the unfortunate things GPT4 architecture leak caused is, convincing many smart researchers across various labs that sparse models are the way to achieve GPT4 quality model. Data quality and FLOPS is all that matters is such a simple yet hard paradigm to follow.

8

9

225

Naman Goyal

@NamanGoyal21

1 year

Finished 30/30 radiation therapy sessions today. Past 3-4 months have been one of the most challenging part of my life. Recovery from surgery and radiation therapy was quite physically and mentally challenging. With due respect, Cancer, please stay from me from now on.

23

0

176

Naman Goyal

@NamanGoyal21

5 years

Facebook AI Research's sequence modeling library @fairseq has made it's twitter debut. Please follow for latest updates.

2

10

42

Naman Goyal

@NamanGoyal21

1 year

@StasBekman the best GEMM tflops across various large matmul sizes I got for A100 was ~286. And for H100 it’s currently about 750-800 for bf16 and 1550-1600 for fp8. I think and hope GEMM performance will improve overtime as NVIDIA optimises matmul for h100 more with newer cublas versions.

2

0

12

Naman Goyal

@NamanGoyal21

2 years

@jekbradbury I do agree that its gonna need decent amount of engineering work! Though my guess is any team with access to stable H100 cluster with 400Gbps Infiniband (or similar interconnect should reach there by max end of year.

2

1

10

Naman Goyal

@NamanGoyal21

5 years

@ragtdata @facebookai @ylecun Yes, we are going to release the pretrained models soon.

0

6

Naman Goyal

@NamanGoyal21

1 year

@dome_271 The easiest way would be to disable flatten params and then set lr 0.0 for the params you don’t wanna update. I think after setting flatten params false, setting requries_grad=False also should work, but gotta check this to be sure.

2

0

4

Naman Goyal

@NamanGoyal21

2 years

@borisdayma @andrew_n_carr We recently noticed that the scalar of LN is also not needed at least beyond 6.7B model scale.

1

0

4

Naman Goyal

@NamanGoyal21

2 years

@giffmana @_arohan_ @achowdhery @arankomatsuzaki And also from one less inter GPU communication within tensor parallel gpus, which PaLM was doing. Interestingly, we are able to remove scaler of LN also without losing on PPL, so it's just normalization that helps.

2

0

4

Naman Goyal

@NamanGoyal21

2 years

@_arohan_ Congrats on the release. one quick question though if you dont mind. I am not unable to understand what those tokens mean in the table 1, as flop=6ND doesnt seem to match and the abs value look to be too low. Could there be typo or some misunderstanding on my side? cc: @YiTayML

0

4

Naman Goyal

@NamanGoyal21

3 years

@SashaMTL @Jsevillamol I was curious too, above link seems to show 90 kg CO2 per hour per passenger, assuming 333 passengers. So for a full flight it seems to be ~700 * 333 = ~233 tons, which seem to be very close to the ~271 tons.

1

0

3

Naman Goyal

@NamanGoyal21

1 year

@zhansheng I fine tuned MLMs (Roberta, Bart, xlmr from 100M to 10B scale) bunch around that time, but can’t remember this behavior specific to fine tuning. Mainly big models overall were less stable and for that I think two things changed: pre layer norm and bf16.

0

4

Naman Goyal

@NamanGoyal21

3 years

@StasBekman My guess would be 2, 4 and 3 in that order.

1

0

3

Naman Goyal

@NamanGoyal21

2 years

@arkerr Amazing!!! also curious what are the timelines for FP8 GEMM support in cutlass?.

0

3

Naman Goyal

@NamanGoyal21

2 years

@OriolVinyalsML I have one request for flamingo output, I am really curious how does Flamingo do on this classic example from Andrej Karpathy's 10 years old blog post

1

0

3

Naman Goyal

@NamanGoyal21

2 years

@YiTayML @artetxem Congratulations, Mikel is amazing! Looking forward to really look things!.

0

3

Naman Goyal

@NamanGoyal21

2 years

@_arohan_ I remember this paper ( from Google Brain used MoE (gshard / switch transformer style configuration ) in VIT models to scale up to 15B parameters. It's conditional compute, so maybe not what you meant to ask?.

1

0

2

Naman Goyal

@NamanGoyal21

6 years

@lyeskhalil @UofT @uoftmie @uoftengineering @IVADO_Qc @polymtl @69alodi @BDilkina Wow!!! Congrats Elias!!!.

0

1

Naman Goyal

@NamanGoyal21

3 years

@StasBekman It’s totally unrelated to divergence but usually it’s a good idea to keep model_dim % num_heads a power of two instead of 0. I have seen empirical speed up with that.

2

0

1

Naman Goyal

@NamanGoyal21

1 year

@StasBekman I don’t know of any publicly available semi-official numbers. Plus I think it might vary a bit with exact configuration of the server, power capping or not, type of cooling etc.

1

0

1

Naman Goyal

@NamanGoyal21

6 years

@annargrs @myleott @vesko_st @LukeZettlemoyer @omerlevy_ @YinhanL @mandarjoshi_ @danqi_chen Yes, the variance for RTE and MRPC is higher compared to tasks with bigger datasets. Eg, for last row in the above table, SD for some tasks is: {RTE: 1.57, MRPC: 0.87, MNLI: 0.15, QNLI: 0.096, SST: 0.21} across 5 seeds. We will consider adding SD in the updated version of paper.

1

0

1

Naman Goyal

@NamanGoyal21

2 years

@alex_conneau @OpenAI Congrats Alexis!!.

0

1

Naman Goyal

@NamanGoyal21

1 year

@Ethan_smith_20 @dome_271 if your frozen params are not at the end and in between the transformer layers, you anyways need to compute dgrad for everything. you will be extra calculating wgrad though. which can be max 1/3rd slower. I agree not ideal, but easy thing to try.

0

Naman Goyal

@NamanGoyal21

2 years

@LChoshen @YebHavinga @BramVanroy @YinhanL @thoma_gu @xl_nlp Thanks for the question, every individual sample instance was always from language but within a batch, each sample could be from different languages.

2

0

1

Naman Goyal

@NamanGoyal21

4 years

@stefan_it_ @alex_conneau Looking into it, lets chat on github issue?.

0