GanjinZero Profile Banner
Zheng Yuan Profile
Zheng Yuan

@GanjinZero

Followers
870
Following
2K
Media
30
Statuses
556

NLP Researcher. Author of RRHF, RFT and MATH-Qwen. Focus on Medical & Formal & Informal Math & Alignment in LLMs. Prev @Alibaba_Qwen, PhD at @Tsinghua_Uni

Joined August 2013
@GanjinZero
Zheng Yuan
10 months
I tested the MATH 500 test set on GPT-4-0409-turbo. Compared to the previous GPT-4-turbo, the chain-of-thought answer style does not change much while the performance improves a lot (especially on the hardest Level 5 problems).
Tweet media one
18
35
306
@GanjinZero
Zheng Yuan
1 year
I tested gpt-4-0125 on the MATH test set (it does not output \boxed, which is hard to parse, so I only graded 71 problems). The accuracy is 54/71 = 76%, stronger than the first GPT-4 (42.5%) and about the same as PRM best-of-1860 reranking with last year's GPT-4 (78.2%). There are two notable features of gpt-4-0125's output. 🧵
4
20
109
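For context on the parsing issue mentioned above: MATH graders typically extract the final answer from a \boxed{...} span. Below is a minimal sketch of that extraction and the resulting accuracy computation, assuming outputs that do wrap the answer in \boxed{...}; the function names are illustrative, not from any particular evaluation harness.

```python
def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} span, handling nested braces."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    chars = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        chars.append(ch)
        i += 1
    return "".join(chars).strip()


def boxed_accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy after extracting \\boxed answers from both sides."""
    correct = sum(
        extract_boxed(p) is not None and extract_boxed(p) == extract_boxed(r)
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Toy check: 54 correct out of the 71 graded problems is roughly 76%.
print(extract_boxed(r"The answer is \boxed{\frac{1}{2}}"))  # \frac{1}{2}
print(f"{54 / 71:.1%}")                                     # 76.1%
```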
@GanjinZero
Zheng Yuan
11 months
Claude 3 and Gemini show that losses improve significantly on code with extremely long context. This suggests code depends on context more than prose does. Maybe we will see repo-level code generation this year using such long contexts.
Tweet media one
Tweet media two
4
19
100
@GanjinZero
Zheng Yuan
10 months
This paper shows strong performance with selective language modeling (just mask out the loss on some unhelpful tokens during pretraining!)
Tweet media one
1
14
90
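A minimal PyTorch sketch of the idea behind selective language modeling as described above: compute the next-token loss only over a selected subset of tokens. The selection mask here is random, standing in for the paper's actual token-scoring rule.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(logits: torch.Tensor,
                      labels: torch.Tensor,
                      keep_mask: torch.Tensor) -> torch.Tensor:
    """Next-token loss averaged only over tokens marked 1 in keep_mask.

    logits:    (batch, seq_len, vocab)
    labels:    (batch, seq_len) token ids
    keep_mask: (batch, seq_len), 1 = train on this token, 0 = skip it
    """
    # Shift so position t predicts token t+1, as in standard causal LM training.
    logits = logits[:, :-1, :]
    labels = labels[:, 1:]
    mask = keep_mask[:, 1:].float()

    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)

    # Average only over the selected ("useful") tokens.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage with random tensors; in the paper's setting the mask would come
# from a reference model scoring which tokens are worth training on.
B, T, V = 2, 8, 100
loss = selective_lm_loss(torch.randn(B, T, V),
                         torch.randint(0, V, (B, T)),
                         torch.randint(0, 2, (B, T)))
print(loss.item())
```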
@GanjinZero
Zheng Yuan
1 year
ChatGLM3-6B, Qwen-72B, Skywork-13B, Yi-34B, Deepseek-coder-33B, Baichuan-200k, Lingowhale-8B. Sooo many pretrained models this week.
2
10
82
@GanjinZero
Zheng Yuan
1 year
Can a small LM help with both decoding acceleration and quality? We introduce **Speculative Contrastive Decoding**, an easy technique that improves decoding speed and quality by making the most of the distribution from a small LM. Arxiv: [1/2]
Tweet media one
4
11
65
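A toy NumPy sketch of the general idea: a small LM drafts a few tokens, the large LM verifies them, and rejected positions are resampled from a distribution that contrasts the large LM against the small LM. The stand-in models below return random logits, and the acceptance rule is the generic speculative-sampling one, not necessarily the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def small_lm(prefix):   # stand-in amateur model: logits over the vocab
    return rng.normal(size=VOCAB)

def large_lm(prefix):   # stand-in expert model (would batch-score all drafts in practice)
    return rng.normal(size=VOCAB)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sc_decode_step(prefix, gamma=4, alpha=0.5):
    """One speculative-contrastive step: draft gamma tokens with the small LM,
    verify with the large LM, and resample rejections from a contrastive
    distribution that down-weights what the small LM already finds likely."""
    draft, ctx = [], list(prefix)
    for _ in range(gamma):
        tok = int(np.argmax(small_lm(ctx)))          # greedy draft token
        draft.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in draft:
        p_small = softmax(small_lm(ctx))
        p_large = softmax(large_lm(ctx))
        # Accept the draft token with probability min(1, p_large / p_small).
        if rng.random() < min(1.0, p_large[tok] / p_small[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Contrastive resampling on rejection: p_large - alpha * p_small.
            contrast = np.maximum(p_large - alpha * p_small, 1e-9)
            accepted.append(int(rng.choice(VOCAB, p=contrast / contrast.sum())))
            break
    return accepted

print(sc_decode_step(prefix=[1, 2, 3]))
```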
@GanjinZero
Zheng Yuan
1 year
10^3! Got lots of citations last year. Don't know if I can get to 10^4 in my life.
Tweet media one
12
0
51
@GanjinZero
Zheng Yuan
1 year
Too many LLMs. How about using them all based on their expertise? We introduce **Zooter**, a reward-guided query routing method. ✅ Comparable performance to reward-model ranking over multiple models. ✅ Much less computational overhead. Arxiv: [1/2]
Tweet media one
1
10
49
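A minimal sketch of reward-guided query routing in the spirit of Zooter: a small encoder scores a query against each candidate LLM and dispatches it to the top-scoring one. The encoder and routing head below are toy stand-ins; in Zooter the routing head is trained by distilling normalized reward-model scores, which is only hinted at in the comments.

```python
import numpy as np

CANDIDATE_LLMS = ["math-13b", "code-13b", "chat-13b"]  # hypothetical candidates
EMB_DIM = 16

rng = np.random.default_rng(0)
W = rng.normal(size=(len(CANDIDATE_LLMS), EMB_DIM))  # routing head (trained in practice)

def encode(query: str) -> np.ndarray:
    """Stand-in for a small encoder transformer: a hashed bag of characters."""
    v = np.zeros(EMB_DIM)
    for ch in query.lower():
        v[ord(ch) % EMB_DIM] += 1.0
    return v / max(np.linalg.norm(v), 1e-9)

def route(query: str) -> str:
    """Pick the candidate LLM with the highest routing probability.
    In Zooter, these probabilities are trained to match (distill) normalized
    reward-model scores of each candidate's answer to the query."""
    scores = W @ encode(query)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return CANDIDATE_LLMS[int(np.argmax(probs))]

print(route("Prove that the sum of two even numbers is even."))
```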
@GanjinZero
Zheng Yuan
1 year
📢 Check out our latest paper - 🏷️#INSTAG: INSTRUCTION TAGGING FOR ANALYZING SUPERVISED FINE-TUNING OF LARGE LANGUAGE MODELS! 🔍 We propose 🏷️#INSTAG, an open-set, fine-grained tagger for analyzing SFT datasets. 🔖 We obtain 6.6K tags describing comprehensive user queries.
Tweet media one
1
14
45
@GanjinZero
Zheng Yuan
1 year
🔥 Check out our paper on math reasoning in LLMs. Augmenting math problems is useful during math SFT, and we find: ✅ Augmented data is helpful in-domain and has similar efficiency to human-written data. ❌ Augmented data helps little on OOD.
Tweet media one
3
13
40
@GanjinZero
Zheng Yuan
1 year
This is god damn strong.
@_akhaliq
AK
1 year
DeepSeekMath. Pushing the Limits of Mathematical Reasoning in Open Language Models. paper page: DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques,
Tweet media one
2
2
33
@GanjinZero
Zheng Yuan
1 year
@_lewtun I am one of the authors of Qwen; I am sure there is no test set leakage in the math-related benchmarks.
6
1
25
@GanjinZero
Zheng Yuan
1 year
I will be at NeurIPS 2023 in person on 12-15 Dec and have a poster about RRHF (rank responses to align human preferences) at Poster Session 3 on 13 Dec (. I will also attend the math & instruction-following workshops. Feel free to come talk!
0
4
27
@GanjinZero
Zheng Yuan
2 years
Thanks for tweeting our paper!
@_akhaliq
AK
2 years
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. paper page: Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is
Tweet media one
0
1
23
@GanjinZero
Zheng Yuan
1 year
The performance is much stronger than every other model's. Although its name is still gpt-4, it has evolved a lot, like other open-source LLMs.
2
1
25
@GanjinZero
Zheng Yuan
11 months
The most interesting paper I read this year.
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
11 months
Genie: Generative Interactive Environments. abs: project website: This paper from Google DeepMind introduces an 11B foundation world model called Genie, trained on unlabelled Internet videos of 2d Platformer games. Genie has three
Tweet media one
0
6
21
@GanjinZero
Zheng Yuan
10 months
However, it also makes very foolish mistakes like this
Tweet media one
2
0
17
@GanjinZero
Zheng Yuan
1 year
(2) It will try to find its own misunderstandings and fix them. This does not require further interaction between the user and GPT.
Tweet media one
1
1
19
@GanjinZero
Zheng Yuan
11 months
Scaling is all u need! Very similar to our observation in previous math sft scaling law and
@_akhaliq
AK
11 months
Common 7B Language Models Already Possess Strong Math Capabilities. Mathematical capabilities were previously believed to emerge in common language models only at a very large scale or require extensive math-related pre-training. This paper shows that the LLaMA-2 7B model
Tweet media one
1
2
17
@GanjinZero
Zheng Yuan
10 months
@EmilianoFratic2 The most reliable evaluation is new math contest problems every year.
1
0
12
@GanjinZero
Zheng Yuan
1 year
I built the math-related parts of Qwen. Trying to approach Minerva!
@Yampeleg
Yam Peleg
1 year
Qwen-14B (Alibaba). The most powerful open-source model for its size. And the longest trained: 3T tokens. Comes in 5 different versions: Base, Chat, Code, Math and Vision. (And is even trained for tool usage!) Opinion: You should consider it as your new "go-to". Paper:
Tweet media one
0
0
14
@GanjinZero
Zheng Yuan
1 year
Glad to be accepted by ICLR. Great job. @KemingLu612 @yiguyuan20.
@GanjinZero
Zheng Yuan
1 year
📢 Check out our latest paper - 🏷️#INSTAG: INSTRUCTION TAGGING FOR ANALYZING SUPERVISED FINE-TUNING OF LARGE LANGUAGE MODELS! 🔍 We propose 🏷️#INSTAG, an open-set, fine-grained tagger for analyzing SFT datasets. 🔖 We obtain 6.6K tags describing comprehensive user queries.
Tweet media one
1
1
13
@GanjinZero
Zheng Yuan
1 year
(1) It will list a step-by-step plan and calculate it step by step.
Tweet media one
1
0
12
@GanjinZero
Zheng Yuan
1 year
If we could improve GSM8K from 30 to 80 on 7B models last year, maybe we can also improve MATH from 50 to 80 this year.
@GanjinZero
Zheng Yuan
1 year
This is god damn strong.
2
1
10
@GanjinZero
Zheng Yuan
1 year
#322 now!
Tweet media one
0
1
11
@GanjinZero
Zheng Yuan
1 year
@_lewtun I will try to do it with my paper instead of qwen.
1
0
8
@GanjinZero
Zheng Yuan
9 months
@oran_ge That is exactly the approach of this paper.
1
0
10
@GanjinZero
Zheng Yuan
1 year
Yesterday I also saw a math reasoning paper working on iteratively generating new data for boosting. I believe using the reward-modeling ability inside the model itself is a scalable way to improve.
@jaseweston
Jason Weston
1 year
🚨 New paper! 🚨 Self-Rewarding LMs. - LM itself provides its own rewards on its own generations via LLM-as-a-Judge during Iterative DPO. - Reward modeling ability improves during training rather than staying fixed. Opens the door to superhuman feedback? 🧵 (1/5)
Tweet media one
1
1
7
@GanjinZero
Zheng Yuan
1 year
Find this guy @KemingLu612 at the NeurIPS instruction-following workshop; he does a lot on Qwen alignment, InsTag (, and Zooter (.
@JustinLin610
Junyang Lin
1 year
If you are at NOLA, feel free to chat with Luka😏
Tweet media one
0
0
8
@GanjinZero
Zheng Yuan
10 months
This is the real thing AI should do.
@AnthropicAI
Anthropic
10 months
From accelerating drug discovery to enabling personalized medicine, global healthcare organizations are turning to Claude for solutions to some of their biggest challenges.
Tweet media one
0
0
8
@GanjinZero
Zheng Yuan
1 year
Let’s do this.
0
1
8
@GanjinZero
Zheng Yuan
9 months
not-clever-than-me
Tweet media one
@iScienceLuvr
Tanishq Mathew Abraham, Ph.D.
9 months
It seems like OpenAI might actually be testing two models!. "im-a-good-gpt2-chatbot". "im-also-a-good-gpt2-chatbot"
Tweet media one
0
1
6
@GanjinZero
Zheng Yuan
1 year
@yuntiandeng @KpprasaA @rolandalong @paul_smolensky @vishrav @pmphlt Our paper shows the scaling law of augmented dataset size vs. performance on GSM8K.
1
1
6
@GanjinZero
Zheng Yuan
1 year
Congratulations @kakakbibibi on another work from us investigating SFT data. We study data scaling curves for code, math, and general abilities, and how data composition influences each. We propose a dual-stage SFT to maintain math and code ability while keeping good general ability.
@kakakbibibi
kabi
1 year
👏👏Excited to share our paper:. 🧐How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition. 📎 🤩From a scaling view, we focus on the data composition between mathematics, coding, and general abilities in SFT stage.
Tweet media one
1
3
6
@GanjinZero
Zheng Yuan
11 months
Zooter is accepted by @naaclmeeting. Congrats to @KemingLu612 @yiguyuan20.
@GanjinZero
Zheng Yuan
1 year
Too many LLMs. How about using them all based on their expertise? We introduce **Zooter**, a reward-guided query routing method. ✅ Comparable performance to reward-model ranking over multiple models. ✅ Much less computational overhead. Arxiv: [1/2]
Tweet media one
0
1
6
@GanjinZero
Zheng Yuan
1 year
@WizardLM_AI Congrats.
1
0
5
@GanjinZero
Zheng Yuan
1 year
📈 We use tags to define complexity and diversity, and we find that complex and diverse SFT datasets lead to better performance! 🎯 We use #INSTAG as a data selector to choose 6K samples for SFT. Our fine-tuned TagLM-13B outperforms Vicuna-13B on MT-Bench.
Tweet media one
1
1
5
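A hedged sketch of tag-based data selection as described above, under the simplifying assumption that complexity is the number of tags on a query and diversity is coverage of previously unseen tags; this is an illustration, not the paper's exact procedure.

```python
def select_sft_samples(tagged_data, budget=6000):
    """tagged_data: list of (sample_id, set_of_tags) pairs.

    Greedy selection: prefer more complex samples (more tags), and only keep
    a sample if it contributes at least one tag not covered yet (diversity).
    """
    # Complexity first: sort by tag count, descending.
    ranked = sorted(tagged_data, key=lambda x: len(x[1]), reverse=True)

    selected, covered = [], set()
    for sample_id, tags in ranked:
        if len(selected) >= budget:
            break
        if tags - covered:              # adds at least one new tag
            selected.append(sample_id)
            covered |= tags
    return selected

# Toy usage with hypothetical tagged queries.
data = [("q1", {"algebra", "word-problem"}),
        ("q2", {"algebra"}),
        ("q3", {"coding", "python", "recursion"})]
print(select_sft_samples(data, budget=2))   # ['q3', 'q1']
```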
@GanjinZero
Zheng Yuan
1 year
@iScienceLuvr add ?skip=0&show=1000.
1
0
5
@GanjinZero
Zheng Yuan
1 year
LLMs seem to just imitate CoT instead of really thinking.
@WenhuChen
Wenhu Chen
1 year
Some initial insights and I might be wrong. Obviously, LLMs are learning math in a different way than human beings. Humans tend to learn from textbooks and generalize better than LLMs. It seems to me that LLMs do need way more training data to actually understand math.
0
0
5
@GanjinZero
Zheng Yuan
1 year
Zooter is a small encoder transformer that routes a query to an expert LLM. 🧪 We select six 13B models as candidates and experiment on different tasks. 👍 Zooter outperforms the best single model on average and ranks first on 44% of tasks, even surpassing RM ranking. [2/2]
Tweet media one
0
0
5
@GanjinZero
Zheng Yuan
1 year
Really love this work!
@sybilhyz
Peiyi Wang
1 year
🔥 Excited to share our latest work: Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations. With Math-Shepherd, Mistral-7B fine-tuned on MetaMATH achieves accuracy rates of 89.1% and 43.5% on GSM8K and MATH, respectively. Paper:
1
0
5
@GanjinZero
Zheng Yuan
11 months
It includes GSM8K RFT as an additional training dataset 🤗
@BigCodeProject
BigCode
11 months
Introducing: StarCoder2 and The Stack v2 ⭐️. StarCoder2 is trained with a 16k token context and repo-level information for 4T+ tokens. All built on The Stack v2 - the largest code dataset with 900B+ tokens. All code, data and models are fully open!.
Tweet media one
0
0
5
@GanjinZero
Zheng Yuan
11 months
Vision LLMs will generate new tokens for LLM pretraining.
@Francis_YAO_
Yao Fu
11 months
Finally, the top 1 vision feature is transcribing the text in an image into text -- you know that there are so many high-quality textbooks that are not yet digitalized, many of them are simply scans. So you guess where the data for training the next generation model will be
Tweet media one
0
1
5
@GanjinZero
Zheng Yuan
1 year
@labloke11 Stated at the Alibaba Apsara Conference.
1
0
4
@GanjinZero
Zheng Yuan
1 year
@WizardLM_AI It is crazy with only 7B.
0
0
4
@GanjinZero
Zheng Yuan
1 year
Paper link:
0
0
4
@GanjinZero
Zheng Yuan
1 year
Thank you for tweeting our Zooter! Zooter distills supervision from reward models for query routing and runs inference with little computational overhead.
@_akhaliq
AK
1 year
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models. paper page: The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that
Tweet media one
0
1
4
@GanjinZero
Zheng Yuan
1 year
@abacaj Qwen-7b underperforms Minerva-8b in math, so it is not too good to be true.
0
0
4
@GanjinZero
Zheng Yuan
1 year
Through our experiment, we find that the acceleration comes from decoding the “easy” tokens while the quality benefits from the contrastive elimination of systematic erroneous tendencies in “hard” tokens. [2/2]
Tweet media one
0
0
4
@GanjinZero
Zheng Yuan
1 year
@_lewtun Check to see how we balance math and other abilities.
1
0
4
@GanjinZero
Zheng Yuan
10 months
@timnott_it Fully mastering math is too hard for any human and any LLM.
0
0
3
@GanjinZero
Zheng Yuan
1 year
The reason is simple: the distribution of MATH queries is very different from GSM8K and augmented GSM8K. This tells us we would need to augment **every** benchmark to improve all downstream performance, or pre-train a better model, since augmenting one benchmark may not help another.
Tweet media one
0
0
4
@GanjinZero
Zheng Yuan
1 year
🔗 Check out our models and code here: 🔗 Try our online tagger here:
2
0
4
@GanjinZero
Zheng Yuan
1 year
Handsome.
@marksaroufim
Mark Saroufim
1 year
@jeremyphoward @rasbt @Tim_Dettmers @sourab_m Now @KemingLu612 is telling us how Qwen was built the winning A100 base model
Tweet media one
0
1
4
@GanjinZero
Zheng Yuan
1 year
@Yampeleg Our InsTag paper first defines SFT data complexity based on the number of query tags.
0
0
4
@GanjinZero
Zheng Yuan
11 months
They use InsTag and DMT to build their SFT data 👀 @KemingLu612 @kakakbibibi.
@arankomatsuzaki
Aran Komatsuzaki
11 months
just released the paper on Yi models.
Tweet media one
1
0
4
@GanjinZero
Zheng Yuan
1 year
Cites a bunch of our papers ^_^
@omarsar0
elvis
1 year
Data Management For LLMs. Provides an overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs. It covers different aspects of data management strategy design: data quantity, data quality, domain/task composition, and
Tweet media one
0
0
4
@GanjinZero
Zheng Yuan
10 months
@StringChaos They are really strong.
0
0
3
@GanjinZero
Zheng Yuan
1 year
Math is a good domain for researching how synthetic data can be used for LLM.
@dwarkesh_sp
Dwarkesh Patel
1 year
Would an AI that can win gold in the International Math Olympiad be capable of automating most jobs?. I say yes. @3blue1brown says no. Full episode out tomorrow:. "Math lends itself to synthetic data in the ways that a lot of other domains don't. You could have it produce a lot
0
2
4
@GanjinZero
Zheng Yuan
10 months
@teortaxesTex What is this visualization tool?
1
0
4
@GanjinZero
Zheng Yuan
1 year
So happy to find math reasoning SFT improves so fast.
@stefan_fee
Pengfei Liu
1 year
The potential of SFT is still not fully unlocked!!!!! Without using tools, without continued pre-training on a math corpus, without RLHF, ONLY SFT, we achieve SOTA across open-source LLMs (no external tool use) on the GSM8K (83.62) and MATH (28.26) datasets:
Tweet media one
0
0
3
@GanjinZero
Zheng Yuan
1 year
Now we have an arXiv version with a BibTeX entry lol.
@_akhaliq
AK
1 year
Qwen Technical Report. paper page: Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen,
Tweet media one
0
0
4
@GanjinZero
Zheng Yuan
1 year
0
0
2
@GanjinZero
Zheng Yuan
10 months
Let’s see how strong the math can be.
@nikunjhanda
Nikunj Handa
10 months
1. Vision + function calling. 2. Dec 2023 cutoff. 3. *Huge* improvements across the board in our evals (particularly in math!)
0
0
3
@GanjinZero
Zheng Yuan
11 months
respect!
@elonmusk
Elon Musk
11 months
This week, @xAI will open source Grok.
0
0
3
@GanjinZero
Zheng Yuan
1 year
0
0
3
@GanjinZero
Zheng Yuan
1 year
@StringChaos @WenhuChen Totally agree. I think the most effective tokens are SFT tokens and the least effective are pretraining tokens when scaling parameters. It is very interesting to know how synthetic/RL tokens scale with parameters. Also very interested to know why formal domains are easier.
2
0
3
@GanjinZero
Zheng Yuan
1 year
@xlr8harder gpt-5 grokked.
0
0
3
@GanjinZero
Zheng Yuan
1 year
@keirp1 Their Math-Shepherd paper is a very good starting point for building a PRM. I really like that paper.
0
0
3
@GanjinZero
Zheng Yuan
1 year
@_lewtun We cannot tell more about training data now.
1
0
3
@GanjinZero
Zheng Yuan
10 months
@doomslide @iammaestro04 @teortaxesTex @QuintinPope5 Translation is so hard but worth the effort. Lean and natural language reason at different granularities. So many obvious things need to be proved by Lean tactics.
1
0
3
@GanjinZero
Zheng Yuan
1 year
When I saw AlphaGo as an undergraduate, I felt the same shock.
0
0
2
@GanjinZero
Zheng Yuan
1 year
With scaling law predictions.
@teortaxesTex
Teortaxes▶️ (DeepSeek🐳 Cheerleader since 2023)
1 year
@generatorman_ai Better data engineering. Scraping textbooks. Codex as a distinct target. More serious attitude to getting to a commercially viable product. Not even a moment's hesitation about it being more than an experiment.
0
0
3
@GanjinZero
Zheng Yuan
10 months
@teortaxesTex They only open sourced chatglm3-6b.
1
0
3
@GanjinZero
Zheng Yuan
1 year
RRHF accepted by @NeurIPSConf.
@GanjinZero
Zheng Yuan
2 years
We just released the weights of our RRHF-trained Wombat-7B and Wombat-7B-GPT4 on Github and Huggingface.
0
0
2
@GanjinZero
Zheng Yuan
1 year
Meanwhile, we find such augmentation has little benefit on the MATH dataset.
Tweet media one
1
0
3
@GanjinZero
Zheng Yuan
10 months
respect.
@CohereForAI
Cohere For AI
10 months
Announcing C4AI Command R+ open weights, a state-of-the-art 104B LLM with RAG, tooling and multilingual in 10 languages. This release builds on our 35B and is a part of our commitment to make AI breakthroughs accessible to the research community. 🎉.
Tweet media one
0
0
3
@GanjinZero
Zheng Yuan
11 months
beast.
@DrJimFan
Jim Fan
11 months
Blackwell, the new beast in town. > DGX Grace-Blackwell GB200: exceeding 1 Exaflop compute in a single rack. > Put numbers in perspective: the first DGX that Jensen delivered to OpenAI was 0.17 Petaflops. > GPT-4-1.8T parameters can finish training in 90 days on 2000 Blackwells.
Tweet media one
Tweet media two
Tweet media three
0
0
3
@GanjinZero
Zheng Yuan
11 months
@abacaj Depends on tasks. 10k for style, and 1m for reasoning.
0
0
3
@GanjinZero
Zheng Yuan
1 year
Want to see someone successfully using PPO (with a reward model) to improve math reasoning.
@Yampeleg
Yam Peleg
1 year
I got a crazy theory about RLHF that I would like to debate. No nice way to put it: I am not sure RLHF was used for training GPT-3.5 and GPT-4. Please change my mind. Arguments: Supervised learning can go much farther than anyone thought it could. RLHF was never.
1
0
1
@GanjinZero
Zheng Yuan
1 year
@KemingLu612 @yiguyuan20 Another problem is that long data means long responses, which means better alignment performance.
0
0
2
@GanjinZero
Zheng Yuan
1 year
Start reading….
@giffmana
Lucas Beyer (bl16)
1 year
ICLR submissions are online: Looks like there's: ~700 with diffusion in it, less than 100 with nerf, ~900 LLM, ~100 chatgpt (8 bard, 16 claude), vs ~170 llama (yay), ~200 clip (but not "clipping"), ~200 NLP, ~750 vision (!?)
Tweet media one
0
0
2
@GanjinZero
Zheng Yuan
1 year
@abacaj Try using Zooter for ensembling (.
0
0
2
@GanjinZero
Zheng Yuan
1 year
umm.
@Teknium1
Teknium (e/λ)
1 year
FYI the new code diffusion model paper by some people at Microsoft claims ChatGPT-3.5-turbo is 20B params.
Tweet media one
2
0
2
@GanjinZero
Zheng Yuan
11 months
@rosstaylor90 If people are comparing aligned models, I think it is fair to use expert iteration.
1
0
2
@GanjinZero
Zheng Yuan
11 months
@AlbertQJiang 学 (learning) = tuning, 思 (thinking) = inference.
0
0
2
@GanjinZero
Zheng Yuan
1 year
@altryne @labloke11 no idea.
0
0
2
@GanjinZero
Zheng Yuan
1 year
@Francis_YAO_ @xai @MistralAI @deepseek_ai Mistral is CoT and DeepSeek-Coder is PoT on MATH.
0
0
2
@GanjinZero
Zheng Yuan
1 year
@sytelus @Francis_YAO_ @_akhaliq 1. r_ij denotes the j selected paths after rejection sampling with k=100; usually around 5 paths remain. 2. Exactly.
0
0
2
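A minimal sketch of the rejection-sampling (RFT) data collection this notation refers to: sample k reasoning paths per question, keep only those with the correct final answer, and deduplicate. `sample_reasoning_path` is a hypothetical stand-in for querying the SFT model, and the deduplication here is by raw string, whereas the paper deduplicates by the set of equations.

```python
import random

def sample_reasoning_path(question: str) -> tuple[str, str]:
    """Hypothetical stand-in for sampling one chain-of-thought from the SFT
    model; returns (reasoning_text, final_answer)."""
    answer = random.choice(["42", "41", "43"])
    return f"step-by-step reasoning ending in {answer}", answer

def rejection_sample(question: str, gold_answer: str, k: int = 100) -> list[str]:
    """Sample k paths, reject those with a wrong final answer, and deduplicate.
    The surviving paths play the role of r_ij; typically only a handful of
    distinct correct paths remain out of the k samples."""
    kept = {}
    for _ in range(k):
        reasoning, answer = sample_reasoning_path(question)
        if answer == gold_answer:
            kept[reasoning] = True      # dedupe identical reasoning strings
    return list(kept)

paths = rejection_sample("What is 6 * 7?", gold_answer="42", k=100)
print(len(paths), "distinct correct paths kept")
```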
@GanjinZero
Zheng Yuan
1 year
@solvay_1927 Both improved, I will get more details later.
0
0
2
@GanjinZero
Zheng Yuan
1 year
@huybery awesome work!
0
0
2
@GanjinZero
Zheng Yuan
1 year
Looks like pipimi
Tweet media one
@ChatGPTapp
ChatGPT
1 year
i see the chat volume is down this evening . wishing you all a Happy Valentine's Day 😊
Tweet media one
0
0
2
@GanjinZero
Zheng Yuan
11 months
looks like diffusion decoding.
@matthen2
Matt Henderson
11 months
hello world in python. using a genetic algorithm
0
0
2
@GanjinZero
Zheng Yuan
1 year
@keirp1 @xai Maybe we need a benchmark updated yearly to relieve overfitting.
0
0
2
@GanjinZero
Zheng Yuan
1 year
We obtain a new open-source SOTA on GSM8K by evolving queries from GSM8K, which lets LLaMA2-7B reach 68.4 accuracy.
Tweet media one
1
0
2
@GanjinZero
Zheng Yuan
1 year
@polynoamial @OpenAI My plan for improving MATH is to iteratively improve the policy model and reward model with RL.
0
0
1
@GanjinZero
Zheng Yuan
10 months
@abacaj These GPUs do not only fine-tune models but also fine-tune us.
0
0
2
@GanjinZero
Zheng Yuan
1 year
@labloke11 Not released yet.
1
0
2
@GanjinZero
Zheng Yuan
11 months
This problem also appears in the test set of the MATH benchmark. It's time for LLMs to solve this problem now.
@zebulgar
delian
11 months
When I was in middle school I qualified for Nationals at MathCounts. and I remember distinctly watching @ScottWu46 (CEO of Cognition), absolutely destroy in the Countdown round. That was when I realized I was very very good at math, but I was not Scott
0
0
1