![Hadi Pouransari Profile](https://pbs.twimg.com/profile_images/1150495495992668161/ayA_DrkA_x96.jpg)
Hadi Pouransari
@HPouransari
Followers: 527 · Following: 308 · Statuses: 132
ML Research @Apple, PhD @Stanford.
California, USA
Joined July 2019
RT @NeginRaoof_: Announcing OpenThinker-32B: the best open-data reasoning model distilled from DeepSeek-R1. Our results show that large, ca…
Replies: 0 · Retweets: 108 · Likes: 0
RT @ArwenBradley: When does composition of diffusion models “work”? Prior work (Du et al., 2023; Liu et al., 2022) has shown that compositi…
Replies: 0 · Retweets: 29 · Likes: 0
RT @samira_abnar: 🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, par…
Replies: 0 · Retweets: 53 · Likes: 0
RT @awnihannun: Sparsely activated models like MOEs and Apple silicon + MLX are a great match. - Lots of RAM to hold the entire model in m…
Replies: 0 · Retweets: 30 · Likes: 0
RT @jramapuram: Small update on SigmoidAttn (arXiV incoming). - 1B and 7B LLM results added and stabilized. - Hybrid Norm [on embed dim,…
Replies: 0 · Retweets: 36 · Likes: 0
RT @awnihannun: Wow, DeepSeek R1 Distill Qwen 7B (in 4-bit) nailed the first hard math question I asked it. Thought for ~3200 tokens in ab…
Replies: 0 · Retweets: 169 · Likes: 0
You would do binary decomposition (hence the name of the paper 😃). Say a document's length is 82: you would write it as 82 = 2 + 16 + 64 and add each subsequence to the corresponding bucket. In practice, we can ignore too-short buckets (which account for a small share of total tokens), like the length-2 piece in this example, or we can simply pad them as you mentioned.
Replies: 1 · Retweets: 0 · Likes: 3
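A minimal sketch of the bucketing described above, not the paper's actual implementation: the function name `binary_decompose` and the `min_bucket` cutoff are illustrative assumptions. A document of length L is split into chunks whose sizes are the powers of two in L's binary expansion, and each chunk is filed under the bucket for its size.

```python
def binary_decompose(tokens, min_bucket=8):
    """Split `tokens` into power-of-two chunks; drop chunks shorter
    than `min_bucket` (they could be padded instead, as noted above)."""
    if not tokens:
        return {}
    buckets = {}  # bucket size -> list of subsequences of that length
    pos = 0
    remaining = len(tokens)
    # Walk the binary expansion from the largest power of two down.
    power = 1 << (remaining.bit_length() - 1)
    while power >= 1 and pos < len(tokens):
        if remaining >= power:
            if power >= min_bucket:  # ignore too-short chunks
                buckets.setdefault(power, []).append(tokens[pos:pos + power])
            pos += power
            remaining -= power
        power >>= 1
    return buckets

# Example: an 82-token document decomposes as 82 = 64 + 16 + 2;
# the length-2 tail falls below min_bucket and is dropped.
doc = list(range(82))
for size, seqs in sorted(binary_decompose(doc).items()):
    print(size, [len(s) for s in seqs])  # 16 [16] / 64 [64]
```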
Thank you @Grad62304977! "Grow P2" and the other curricula discussed in the paper only affect the order in which sequences of different lengths are seen; (pre)training speed remains the same regardless. What controls overall speed is the dataset mixture (the total number of sequences of each length); see the "Step Time" column in Table 1 of the paper. We found the curriculum (the order in which sequences of different lengths are seen) to be an important factor in model performance (see Table 2).
Replies: 1 · Retweets: 0 · Likes: 3
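A toy sketch of the distinction drawn above, with hypothetical bucket lengths and batch counts: the mixture fixes how many batches come from each length bucket (and hence total step time), while a curriculum only permutes the order in which those batches are visited. The "grow"-style ordering here is a simplification of the curricula in the paper.

```python
import random

# Hypothetical dataset mixture: bucket length -> number of batches.
mixture = {256: 40, 1024: 30, 4096: 10}

# One entry per optimization step, labeled by its bucket's sequence length.
steps = [length for length, n in mixture.items() for _ in range(n)]

shuffled = random.sample(steps, len(steps))  # uniform-mixing baseline
grow = sorted(steps)                         # "grow"-style: short -> long

# Same multiset of steps either way, so overall training cost is identical;
# only the order differs, which is what affects model quality.
assert sorted(shuffled) == sorted(grow)
```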
Check out our full paper! With the amazing team: @PavankumarVasu, @FartashFg, @chunliang_tw, Cem, Nate, Albert, Gokul, James, Peter, and @OncelTuzel (10/10🧵)
Replies: 0 · Retweets: 1 · Likes: 8