![Ning Ding Profile](https://pbs.twimg.com/profile_images/1864897938482434048/DtdFRbQD_x96.jpg)

**Ning Ding** (@stingning) · 1K Followers · 660 Following · 221 Statuses
RT @JiaLi52524397: 🚀 NuminaMath 1.5 is here! 🚀 900k+ high-quality competition math problems with CoT solutions, new problem metadata, manua…
RT @yang_zonghan: Timely reminder for me to be grateful of how fortunate I am to have started my research journey from NLP, and what an hon…
RT @lifan__yuan: 1/ PRIME is alive on arXiv💡! Building on our blog, we've added extensive experiments exploring: - Implicit PRM design ch…
RT @stingning: 📜 We are releasing the PRIME paper: Let's be clear at first, dense rewards are not dead. And they…
RT @PhysInHistory: Euler's identity combines these five numbers in a simple and elegant equation. Despite each of the constants representin…
RT @lindsayttsq: We improve clinical relevance through ⭐️Medical specialty coverage: MedXpertQA includes questions from 20+ exams of medica…
"Reasoning" encompasses much more than just mathematics and coding.
📈 How far are leading models from mastering realistic medical tasks? MedXpertQA, our new text & multimodal medical benchmark, reveals existing gaps in model abilities. Compared with rapidly saturating benchmarks like MedQA, we raise the bar with harder questions and a sharper focus on medical reasoning.

📌 Percentage scores on our Text subset:
- o3-mini: 37.30
- R1: 37.76 - the clear frontrunner among open-source models
- o1: 44.67 - highest performance, but still much room for improvement!

Preprint: Data files will be released shortly at: Key insights in 🧵
RT @lindsayttsq: 📈How far are leading models from mastering realistic medical tasks? MedXpertQA, our new text & multimodal medical benchmar…
In 2023, I noticed that DeepSeek posted job ads recruiting a group of "Data Virtuosos." The requirements said they were looking for people with broad knowledge, proficient in literature, history, culture, science, anime, film, and more, quick-witted and full of imagination, to help DeepSeek build its own data moat. The recruitment targeted people in mainland China. I don't know how this recruitment ultimately turned out, but it likely played a part. Incidentally, I also think DeepSeek's general reasoning ability in English is excellent.