Hadi Pouransari Profile
Hadi Pouransari

@HPouransari

Followers: 527
Following: 308
Statuses: 132

ML Research @Apple, PhD @Stanford.

California, USA
Joined July 2019
@HPouransari
Hadi Pouransari
13 hours
RT @NeginRaoof_: Announcing OpenThinker-32B: the best open-data reasoning model distilled from DeepSeek-R1. Our results show that large, ca…
0
108
0
@HPouransari
Hadi Pouransari
14 hours
RT @ArwenBradley: When does composition of diffusion models “work”? Prior work (Du et al., 2023; Liu et al., 2022) has shown that compositi…
0
29
0
@HPouransari
Hadi Pouransari
2 days
RT @PreetumNakkiran: finally managed to sneak my dog into a paper
[image attached]
0
54
0
@HPouransari
Hadi Pouransari
16 days
RT @samira_abnar: 🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, par…
0
53
0
@HPouransari
Hadi Pouransari
22 days
RT @awnihannun: Sparsely activated models like MOEs and Apple silicon + MLX are a great match. - Lots of RAM to hold the entire model in m…
0
30
0
@HPouransari
Hadi Pouransari
22 days
RT @jramapuram: Small update on SigmoidAttn (arXiv incoming). - 1B and 7B LLM results added and stabilized. - Hybrid Norm [on embed dim,…
0
36
0
@HPouransari
Hadi Pouransari
24 days
RT @awnihannun: Wow, DeepSeek R1 Distill Qwen 7B (in 4-bit) nailed the first hard math question I asked it. Thought for ~3200 tokens in ab…
0
169
0
@HPouransari
Hadi Pouransari
1 month
You would do binary decomposition (hence the name of the paper 😃). Say a document's length is 82: you write it as 82 = 2 + 16 + 64 and add each subsequence to the corresponding bucket. In practice, we can ignore too-short buckets (which account for a small fraction of total tokens), like the length-2 chunk in this example, or simply pad them, as you mentioned. (See the sketch below.)
1
0
3
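A minimal Python sketch of this bucketing, assuming a plain token-list interface; the function names and the `min_bucket` cutoff are hypothetical illustrations, not the paper's code:

```python
# A sketch of binary length decomposition for bucketing training sequences.
# Function names and the `min_bucket` cutoff are hypothetical illustrations.

def binary_decompose(length: int) -> list[int]:
    """Return the power-of-two chunk sizes whose sum is `length`.

    e.g. 82 -> [2, 16, 64], since 82 = 0b1010010.
    """
    return [1 << i for i in range(length.bit_length()) if length & (1 << i)]

def bucket_document(tokens: list, buckets: dict, min_bucket: int = 4) -> None:
    """Slice `tokens` into binary chunks and append each to its size bucket.

    Chunks shorter than `min_bucket` are dropped (a small fraction of all
    tokens); padding them instead is the alternative mentioned above.
    """
    start = 0
    for size in reversed(binary_decompose(len(tokens))):  # longest chunk first
        chunk = tokens[start:start + size]
        start += size
        if size >= min_bucket:
            buckets.setdefault(size, []).append(chunk)

# Example: an 82-token document contributes one 64-token and one 16-token
# subsequence; the leftover length-2 piece is dropped (or could be padded).
buckets: dict = {}
bucket_document(list(range(82)), buckets)
print({size: len(chunks) for size, chunks in buckets.items()})  # {64: 1, 16: 1}
```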
@HPouransari
Hadi Pouransari
1 month
Thank you @Grad62304977. "Grow P2" and the other curricula discussed in the paper only affect the order in which sequences of different lengths are seen; (pre)training speed remains the same regardless. What controls the overall speed is the dataset mixture (the total number of sequences of each length); see the "Step Time" column in Table 1 of the paper. We found the curriculum (this ordering) to be an important factor in model performance (see Table 2). (See the sketch below.)
1
0
3
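A hypothetical sketch of this distinction (names and the short-to-long "grow" ordering are my assumptions, not the paper's code): any curriculum is just a permutation of the same multiset of steps, so total work is fixed by the mixture alone. Batching and per-length step-time differences are ignored for brevity.

```python
# Sketch: a curriculum only permutes which length bucket each optimization
# step draws from; the total step count is fixed by the mixture.
import random

def schedule(mixture: dict[int, int], curriculum: str = "shuffle") -> list[int]:
    """Return the sequence length drawn at each step.

    `mixture` maps sequence length -> number of sequences of that length.
    Every curriculum is a permutation of the same multiset of steps, so the
    total step count (and overall training cost) does not change.
    """
    steps = [length for length, count in mixture.items() for _ in range(count)]
    if curriculum == "grow":       # short-to-long ordering ("grow"-style, assumed)
        steps.sort()
    elif curriculum == "shuffle":  # random order
        random.shuffle(steps)
    return steps

mixture = {256: 4, 1024: 2, 4096: 1}
assert len(schedule(mixture, "grow")) == len(schedule(mixture, "shuffle")) == 7
```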
@HPouransari
Hadi Pouransari
1 month
We also include results for training small LMs (160M to 1B parameters) on ~1T tokens, using DCLM data with dataset decomposition:
[image attached]
0
0
2
@HPouransari
Hadi Pouransari
2 months
Check out our full paper, with the amazing team: @PavankumarVasu, @FartashFg, @chunliang_tw, Cem, Nate, Albert, Gokul, James, Peter, and @OncelTuzel. (10/10🧵)
0
1
8