Yuchen Jin
@Yuchenj_UW
Followers: 30K
Following: 41K
Statuses: 8K
Co-founder & CTO @hyperbolic_labs 🧑‍🍳 fun AI systems. Previously at OctoAI (acquired by @nvidia) building @ApacheTVM, PhD @uwcse 🤖
another Galaxy
Joined November 2016
Outperform GPT-3 with @karpathy's llm.c using just 1/3 of the training tokens ✨

Another day has passed, and I trained GPT-2 (124M) with llm.c for 150B tokens, achieving 35.5% accuracy on HellaSwag. This surpasses the 33.7% accuracy the GPT-3 paper reports after training on 300B tokens. It matched the paper's 33.7% score at only ~95B tokens, i.e. with less than 1/3 of the training tokens used in the GPT-3 paper.

The key reasons:
(1) I tripled the max learning rate, which sped up training; more details in my last tweet. (A sketch of the schedule is included below this tweet.)
(2) I trained the model on @huggingface's FineWeb dataset, which is described as "cleaned and deduplicated English web data from CommonCrawl". The GPT-3 paper, published 4 years ago, also trained primarily on filtered and deduplicated CommonCrawl data, and it discussed its data-cleaning methods. The improvement might come from the better quality of web data available over the past 4 years, or from Hugging Face's better data-cleaning methods.
Apparently today is the 4-year anniversary of GPT-3! Which I am accidentally celebrating by re-training the smallest model in the miniseries right now :). HellaSwag 33.7 (Appendix H); I almost reached this a few steps ago (though this is only 45% of the training done).

I remember when the GPT-3 paper came out quite clearly, because I had to interrupt work and go out for a walk. The realization hit me that an important property of the field had flipped. In ~2011, progress in AI felt constrained primarily by algorithms. We needed better ideas, better modeling, better approaches to make further progress. If you offered me a 10X bigger computer, I'm not sure what I would have even used it for.

The GPT-3 paper showed that there was this thing that would just become better on a large variety of practical tasks, if you only trained a bigger one. Better algorithms become a bonus, not a necessity, for progress in AGI. Possibly not forever and going forward, but at least locally and for the time being, in a very practical sense. Today, if you gave me a 10X bigger computer, I would know exactly what to do with it, and then I'd ask for more.

It's this property of AI that also gets to the heart of why NVIDIA is a 2.8T company today. I'm not sure how others experienced it, but the realization convincingly clicked for me with GPT-3, 4 years ago.
25 replies · 106 reposts · 910 likes
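To make the "tripled max learning rate" concrete: below is a minimal Python sketch of the warmup-plus-cosine learning-rate schedule used in GPT-2/GPT-3-style training, with the max LR scaled 3x. The 6e-4 base value is the GPT-3 paper's setting for its ~125M model; the warmup length, total step count, and decay-to-10% floor are illustrative assumptions, not the actual llm.c run settings.

```python
# Sketch of a warmup + cosine-decay LR schedule (GPT-2/GPT-3 style).
# Values below are illustrative assumptions, not the actual llm.c run config.
import math

base_max_lr = 6e-4          # GPT-3 paper's learning rate for the ~125M model
max_lr = 3 * base_max_lr    # the "tripled" max learning rate from the tweet
min_lr = max_lr * 0.1       # assumption: decay to 10% of max
warmup_steps = 700          # assumption, for illustration only
total_steps = 286_000       # assumption: ~150B tokens at ~0.5M tokens per step

def lr_at(step: int) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

# Example: LR at the start, at the end of warmup, and at the final step.
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Scaling max_lr scales the whole schedule proportionally; the speedup claim above rests on the observation that, at this model size, training remains stable at the higher rate.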
@yuejiedeli Whisper is actually an impressive speech recognition model, but it's from 2022, so many people who got into AI recently have never heard of it
1 reply · 0 reposts · 7 likes
@Chiragjoshi_12 dude, I really hope so. DeepSeek R1 is almost at o1's level; I don't know why they don't open-source it
1 reply · 0 reposts · 16 likes