Jinjie Ni

@NiJinjie

Followers: 520
Following: 1K
Statuses: 389

AI Researcher | LLM / VLM / Any2Any Foundation Models | 🔀 MixEval Series

Singapore
Joined April 2020
@NiJinjie
Jinjie Ni
8 months
How to get ⚔️Chatbot Arena⚔️ model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6)? Maybe simply mix the classic benchmarks. 🚀

Introducing MixEval, a new 🥇gold-standard🥇 LLM evaluation paradigm standing on the shoulders of giants (the classic benchmarks).

🕶️LLM Benchmark Mixture: We mine comprehensive and well-distributed 🌎real-world user queries from the web and match them with similar queries from off-the-shelf 💯ground-truth-based benchmarks.

🤔Why use MixEval?
(1) 🎯 Accurate model ranking (0.96 correlation with Chatbot Arena)
(2) ⚡️ Fast, cheap, and reproducible execution, requiring only 6% of the time and cost of MMLU
(3) 🌊 Dynamic benchmarking enabled by a low-effort and stable updating mechanism
(4) 🏔️ Challenging question set (GPT-4o, the top model on the MixEval leaderboard, achieves 64.7% accuracy)
(5) 🌌 Comprehensive and highly impartial query distribution, as it is deeply grounded in real-world user queries
(6) ⚖️ Fair grading process without preference bias, ensured by its ground-truth-based nature

❌ What's wrong with current LLM evaluation?
(1) ❓Query Bias: evaluation queries fall short of comprehensiveness or appropriate distribution
a) ground-truth-based benchmarks
b) LLM-judged benchmarks
(2) 👨‍⚖️Grading Bias: the grading process involves significant bias or error
a) LLM-judged benchmarks
(3) 🔬Generalization Bias: models overfit the evaluation data (contamination)
a) ground-truth-based benchmarks
b) LLM-judged benchmarks

🤔 Are there any current benchmarks that are not so biased?
☑️ Yes. Large-scale user-facing benchmarks, e.g., ⚔️Chatbot Arena⚔️, solve (1) query bias by collecting a large number of real-world user queries, (2) grading bias by collecting a large number of real-world user preferences, and (3) generalization bias by continuously receiving fresh queries. But they are prohibitively 💰expensive (around $2936 for a single model, see the figure below), ⌚slow, and 🚫irreproducible!

✅MixEval addresses all these issues. It is not only highly unbiased in query, grading, and generalization, but also fast, cheap, and reproducible.

📊We provide extensive meta-evaluation and insights for MixEval and existing LLM benchmarks in our paper. 🔥We hope this will deepen the community's understanding of LLM evaluation and guide future research directions!

🏆Our dynamic leaderboard is now available at:

🚀Join us in revolutionizing LLM evaluation! Test your model on MixEval and see where you stand on our dynamic leaderboard. 🌊 We will update the data points on a monthly basis. 🚀 Moving forward, we'll continuously add new benchmarks to our pool as they are released. This will refine our mixtures and enhance dynamism at a higher level.

This work was done by @NiJinjie, @XueFz, @xiangyue96, @yuntiandeng, Mahir Shah, Kabir Jain, @gneubig, and @YangYou1991. Kudos to the team!

We also sincerely thank @Francis_YAO_ @gblazex @zhansheng @_jasonwei @p_nawrot @soldni @guanzhi_wang @deepanwayx @BoLi68567011 @JunhaoZHANG19 @99Solaris @ZangweiZheng @zian_andy_zheng @KevinQHLin @WenhuChen @billyuchenlin and colleagues from NUS HPC-AI Lab & CMU NeuLab for insightful discussions and pointers!
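[Note: a minimal sketch of the benchmark-mixture idea described above, i.e., embed mined web queries and retrieve the most similar questions from existing ground-truth benchmarks. The encoder choice, the toy data, and the nearest-neighbor matching rule are illustrative assumptions, not MixEval's actual pipeline; see the paper for that.]

```python
# Sketch: match real-world web queries to their nearest neighbors in
# existing ground-truth benchmarks via embedding similarity, so the
# resulting mixture mirrors the real-world query distribution while
# keeping verifiable answers. Data and encoder are illustrative only.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

web_queries = ["how do I remove a stuck bolt", "explain photosynthesis simply"]
benchmark_pool = [
    {"question": "Describe the process of photosynthesis.", "source": "MMLU"},
    {"question": "What tool loosens a seized bolt?", "source": "TriviaQA"},
]

web_emb = encoder.encode(web_queries, normalize_embeddings=True)
pool_emb = encoder.encode([b["question"] for b in benchmark_pool],
                          normalize_embeddings=True)

# Cosine similarity (dot product of normalized vectors); keep the closest
# benchmark question for each mined web query.
sims = web_emb @ pool_emb.T
for i, q in enumerate(web_queries):
    j = int(np.argmax(sims[i]))
    print(f"{q!r} -> {benchmark_pool[j]['source']}: "
          f"{benchmark_pool[j]['question']!r} (sim={sims[i, j]:.2f})")
```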
30
63
240
@NiJinjie
Jinjie Ni
3 days
@junxian_he @zzlccc Oh yeah... DeepSeek-R1 shows a continuously increasing response length, but that's not the case here with Qwen2.5-Math-1.5B. It seems the length/frequency hypothesis might not be universal. It would be really interesting to dig deeper into this!
0
0
5
@NiJinjie
Jinjie Ni
3 days
RT @duchao0726: Sharing interesting findings in R1-Zero-like training.
0
1
0
@NiJinjie
Jinjie Ni
3 days
RT @TianyuPang1: 😯There may not be 𝗔𝗵𝗮 𝗠𝗼𝗺𝗲𝗻𝘁 in R1-Zero-like training! We observe (superficial) self-reflection patterns in base models a…
0
28
0
@NiJinjie
Jinjie Ni
4 days
I think reflection count/length and reflection effectiveness are different things. This figure doesn't show their correlation with performance? However, from this thread: we may guess that effective reasoning comes from comparatively concise reflections (fewer tokens per reflection) and a controlled number of reflections? (A rough way to measure the two separately is sketched after the quoted tweet below.)
@zzlccc
Zichen Liu
4 days
7/8 Moreover, we find response length may not be a good indicator of self-reflections, because the two seem uncorrelated during R1-Zero-like training.
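[Note: one rough, hypothetical way to measure reflection count and tokens-per-reflection separately, per the distinction drawn above. The marker list and whitespace tokenization are assumptions for illustration, not anything from the cited work.]

```python
# Separate "how often a model reflects" from "how concise each reflection is".
import re

# Illustrative markers only; real self-reflection detection would need care.
REFLECTION_MARKERS = re.compile(
    r"\b(wait|let me (re-?check|verify|reconsider)|hmm|on second thought)\b",
    re.IGNORECASE,
)

def reflection_stats(response: str) -> dict:
    """Count reflection markers and approximate tokens per reflection."""
    n_reflections = len(REFLECTION_MARKERS.findall(response))
    n_tokens = len(response.split())  # crude whitespace token count
    return {
        "reflections": n_reflections,
        "tokens": n_tokens,
        "tokens_per_reflection": (
            n_tokens / n_reflections if n_reflections else float("inf")
        ),
    }

print(reflection_stats(
    "The answer is 12. Wait, let me verify: 3 * 4 = 12. Hmm, confirmed."
))
```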
1
0
3
@NiJinjie
Jinjie Ni
4 days
@zzlccc This figure should be put in the first thread; it's so informative and interesting! Congrats on the great work!
0
0
1
@NiJinjie
Jinjie Ni
4 days
RT @zzlccc: 🚨There May Not be Aha Moment in R1-Zero-like Training: A common belief about the recent R1-Zero-like t…
0
67
0
@NiJinjie
Jinjie Ni
4 days
RT @xiangyue96: Demystifying Long CoT Reasoning in LLMs Reasoning models like R1 / O1 / O3 have gained massive atte…
0
191
0
@NiJinjie
Jinjie Ni
5 days
RT @zhang_muru: Running your model on multiple GPUs but often found the speed not satisfiable? We introduce Ladder-residual, a parallelism-…
0
52
0
@NiJinjie
Jinjie Ni
5 days
RT @AlexGDimakis: Discovered a very interesting thing about DeepSeek-R1 and all reasoning models: The wrong answers are much longer while t…
0
216
0
@NiJinjie
Jinjie Ni
6 days
@sivil_taram Oh I see, I missed that the authors use "n-gram" embeddings to augment the input embeddings instead of directly enlarging the vocab table; that makes the vocab-head FLOPs independent of the input-embedding scaling.
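[Note: a back-of-the-envelope sketch of the point above, with made-up illustrative numbers. The vocab head is a hidden-to-vocab matmul, so its per-token cost depends only on the output vocab size; input-side n-gram embeddings are lookups that never touch the head.]

```python
# Vocab (LM) head costs ~2 * d_model * V FLOPs per token, so it depends only
# on the output vocabulary size V, not on how many extra n-gram embedding
# rows feed the input side. All numbers are illustrative assumptions.
d_model = 4096               # hidden size
V = 32_000                   # output vocab size (unchanged)
ngram_entries = 10_000_000   # extra n-gram embedding rows, input side only

head_flops_per_token = 2 * d_model * V   # matmul: hidden -> vocab logits
# Growing `ngram_entries` adds parameters/memory (table lookups) but adds no
# vocab-head compute, since V and d_model are untouched.
print(f"vocab-head FLOPs/token: {head_flops_per_token:,}")  # fixed, ~262M
```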
1
0
1
@NiJinjie
Jinjie Ni
7 days
RT @arankomatsuzaki: Scalable-Softmax Is Superior for Attention - Proposes SSMax to process longer context length more effectively - Signi…
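[Note: my reading of the linked paper is that SSMax scales the attention logits by s·log(n) before the softmax, where n is the number of entries attended over and s is a learnable scalar, so the distribution does not flatten as context length grows. The sketch below reflects that reading under those assumptions; it is not the paper's reference implementation.]

```python
# Sketch of Scalable-Softmax (SSMax) as described above: softmax with logits
# scaled by s * log(n). Treat as a reading note, not the official code.
import torch

def ssmax(logits: torch.Tensor, s: float = 1.0) -> torch.Tensor:
    n = logits.shape[-1]  # number of entries attended over (context length)
    return torch.softmax(s * torch.log(torch.tensor(float(n))) * logits, dim=-1)

# For bounded logits, plain softmax flattens as n grows; SSMax's log(n)
# factor keeps the top entry from fading into the noise.
x = torch.tensor([2.0, 1.0, 0.5] + [0.0] * 997)
print(torch.softmax(x, -1).max().item(), ssmax(x).max().item())
```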
0
38
0
@NiJinjie
Jinjie Ni
7 days
RT @andrew_n_carr: If, during the RL phase, you interrupt thinking and append "wait," to the reasoning traces you bend the cost curve and g…
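[Note: a minimal sketch of the trick this retweet describes, assuming a hypothetical `generate_until` decoding helper and a `</think>`-style end-of-reasoning marker; both are stand-ins, not any specific library's API.]

```python
# Suppress the end-of-thinking marker and append "wait," to force the model
# to keep reasoning, in the spirit of the interrupt-and-append trick above.
END_OF_THINK = "</think>"

def generate_until(prompt: str) -> str:
    """Hypothetical one-shot decoder; returns text ending at END_OF_THINK."""
    raise NotImplementedError  # plug in your inference stack here

def extended_reasoning(prompt: str, max_interrupts: int = 2) -> str:
    trace = generate_until(prompt)
    for _ in range(max_interrupts):
        # Strip the end marker and nudge the model to keep thinking.
        trace = trace.removesuffix(END_OF_THINK) + " wait,"
        trace = generate_until(prompt + trace)
    return trace
```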
0
56
0
@NiJinjie
Jinjie Ni
8 days
RT @hwchung27: Happy to share Deep Research, our new agent model! One notable characteristic of Deep Research is its extreme patience. I t…
0
58
0
@NiJinjie
Jinjie Ni
8 days
RT @_philschmid: Paper: Github: Model & Data:
0
11
0
@NiJinjie
Jinjie Ni
8 days
RT @Yang_zy223: Excited to share that our paper on automated scientific discovery has been accepted to #ICLR2025! In brief, 1. It shows t…
0
10
0
@NiJinjie
Jinjie Ni
10 days
RT @xiangyue96: Introducing Critique Fine-Tuning (CFT): a more effective SFT method for enhancing LLMs' reasoning abilities. 📄 Paper: https…
0
72
0
@NiJinjie
Jinjie Ni
10 days
RT @TianyuPang1: 🤔Can 𝐂𝐡𝐚𝐭𝐛𝐨𝐭 𝐀𝐫𝐞𝐧𝐚 be manipulated? 🚀Our paper shows that You Can Improve Model Rankings on Chatbot Arena by Vote Rigging!…
0
20
0
@NiJinjie
Jinjie Ni
13 days
RT @p_nawrot: DMC release is finally complete. I'm still very enthusiastic about the concept of future models equipped with dynamic and sel…
0
2
0
@NiJinjie
Jinjie Ni
15 days
RT @zzlccc: Exploration is an important topic in traditional RL. How does it affect online RL for LLM reasoning, like o1/r1? The most comm…
0
14
0