Hadi Pouransari Profile
Hadi Pouransari

@HPouransari

Followers: 527
Following: 308
Statuses: 132

ML Research @Apple, PhD @Stanford.

California, USA
Joined July 2019
@HPouransari
Hadi Pouransari
13 hours
RT @NeginRaoof_: Announcing OpenThinker-32B: the best open-data reasoning model distilled from DeepSeek-R1. Our results show that large, ca…
0
108
0
@HPouransari
Hadi Pouransari
14 hours
RT @ArwenBradley: When does composition of diffusion models “work”? Prior work (Du et al., 2023; Liu et al., 2022) has shown that compositi…
0
29
0
@HPouransari
Hadi Pouransari
2 days
RT @PreetumNakkiran: finally managed to sneak my dog into a paper
[image attached]
0
54
0
@HPouransari
Hadi Pouransari
16 days
RT @samira_abnar: 🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, par…
0
53
0
@HPouransari
Hadi Pouransari
22 days
RT @awnihannun: Sparsely activated models like MOEs and Apple silicon + MLX are a great match. - Lots of RAM to hold the entire model in m…
0
30
0
@HPouransari
Hadi Pouransari
22 days
RT @jramapuram: Small update on SigmoidAttn (arXiv incoming). - 1B and 7B LLM results added and stabilized. - Hybrid Norm [on embed dim,…
0
36
0
@HPouransari
Hadi Pouransari
24 days
RT @awnihannun: Wow, DeepSeek R1 Distill Qwen 7B (in 4-bit) nailed the first hard math question I asked it. Thought for ~3200 tokens in ab…
0
169
0
@HPouransari
Hadi Pouransari
1 month
You would do binary decomposition (hence the name of the paper 😃). Say a document's length is 82: you write it as 82 = 2 + 16 + 64 and add each subsequence to the corresponding bucket. In practice, we can ignore too-short buckets (which account for a small fraction of total tokens), like the length-2 chunk in this example, or simply pad them, as you mentioned. (See the sketch below.)
1
0
3
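A minimal Python sketch of this bucketing, assuming a plain token-list interface; the function names and the `min_bucket` cutoff are hypothetical illustrations, not the paper's code:

```python
# A sketch of binary length decomposition for bucketing training sequences.
# Function names and the `min_bucket` cutoff are hypothetical illustrations.

def binary_decompose(length: int) -> list[int]:
    """Return the power-of-two chunk sizes whose sum is `length`.

    e.g. 82 -> [2, 16, 64], since 82 = 0b1010010.
    """
    return [1 << i for i in range(length.bit_length()) if length & (1 << i)]

def bucket_document(tokens: list, buckets: dict, min_bucket: int = 4) -> None:
    """Slice `tokens` into binary chunks and append each to its size bucket.

    Chunks shorter than `min_bucket` are dropped (a small fraction of all
    tokens); padding them instead is the alternative mentioned above.
    """
    start = 0
    for size in reversed(binary_decompose(len(tokens))):  # longest chunk first
        chunk = tokens[start:start + size]
        start += size
        if size >= min_bucket:
            buckets.setdefault(size, []).append(chunk)

# Example: an 82-token document contributes one 64-token and one 16-token
# subsequence; the leftover length-2 piece is dropped (or could be padded).
buckets: dict = {}
bucket_document(list(range(82)), buckets)
print({size: len(chunks) for size, chunks in buckets.items()})  # {64: 1, 16: 1}
```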
@HPouransari
Hadi Pouransari
1 month
Thank you @Grad62304977. "Grow P2" and the other curricula discussed in the paper only affect the order in which sequences of different lengths are seen; (pre)training speed remains the same regardless. What controls the overall speed is the dataset mixture (the total number of sequences of each length); see the "Step Time" column in Table 1 of the paper. We found the curriculum (this ordering) to be an important factor in model performance (see Table 2). (See the sketch below.)
1
0
3
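A hypothetical sketch of this distinction (names and the short-to-long "grow" ordering are my assumptions, not the paper's code): any curriculum is just a permutation of the same multiset of steps, so total work is fixed by the mixture alone. Batching and per-length step-time differences are ignored for brevity.

```python
# Sketch: a curriculum only permutes which length bucket each optimization
# step draws from; the total step count is fixed by the mixture.
import random

def schedule(mixture: dict[int, int], curriculum: str = "shuffle") -> list[int]:
    """Return the sequence length drawn at each step.

    `mixture` maps sequence length -> number of sequences of that length.
    Every curriculum is a permutation of the same multiset of steps, so the
    total step count (and overall training cost) does not change.
    """
    steps = [length for length, count in mixture.items() for _ in range(count)]
    if curriculum == "grow":       # short-to-long ordering ("grow"-style, assumed)
        steps.sort()
    elif curriculum == "shuffle":  # random order
        random.shuffle(steps)
    return steps

mixture = {256: 4, 1024: 2, 4096: 1}
assert len(schedule(mixture, "grow")) == len(schedule(mixture, "shuffle")) == 7
```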
@HPouransari
Hadi Pouransari
1 month
We also include results for training small LMs (160M to 1B parameters) on ~1T tokens, using DCLM data with dataset decomposition:
[image attached]
0
0
2
@HPouransari
Hadi Pouransari
2 months
Check out our full paper, with the amazing team: @PavankumarVasu, @FartashFg, @chunliang_tw, Cem, Nate, Albert, Gokul, James, Peter, and @OncelTuzel. (10/10🧵)
0
1
8