![Ziru Chen Profile](https://pbs.twimg.com/profile_images/1403984201704116226/bZbcmq4G_x96.jpg)
Ziru Chen
@RonZiruChen
Followers: 492 · Following: 401 · Statuses: 115
Ron | チン シジョ. Ph.D. student @osunlp. Researching #NLProc & #ConvAI. “Cogito, ergo sum.”
Joined January 2017
🎉ScienceAgentBench is accepted at #ICLR2025! 🚀 Ready to step beyond ML R&D? Test your agents on real-world, data-driven R&D tasks across diverse scientific disciplines. 🔬 👇 Resources and previous posts below:
Containerized evaluation update:
🚀ScienceAgentBench evaluation is now containerized! Inspired by SWE-Bench, we leverage Docker for task isolation, enabling multi-threaded execution and slashing evaluation time to under 30 minutes. Plus, evaluate your agents with just one bash command! Great work done by @YifeiLiPKU and @BotaoYu24 @osunlp!
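The pattern described above (isolating each task's evaluation in its own container and running many tasks in parallel) can be sketched roughly as follows. This is an illustrative sketch only, not the benchmark's actual interface: the function names, task IDs, and the stand-in runner are hypothetical, and a real runner would shell out to `docker run` instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

def evaluate_tasks(task_ids: List[str],
                   run_one: Callable[[str], bool],
                   max_workers: int = 8) -> Dict[str, bool]:
    """Evaluate each task in parallel, one isolated run per task.

    `run_one` is expected to launch an isolated container for a single
    task (e.g. via subprocess.run(["docker", "run", "--rm", image, tid]))
    and return whether that task's generated program passed evaluation.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with task_ids.
        results = pool.map(run_one, task_ids)
    return dict(zip(task_ids, results))

# Hypothetical stand-in runner used here so the sketch is self-contained;
# it simply pretends that one task fails.
def fake_runner(task_id: str) -> bool:
    return task_id != "task_03"

scores = evaluate_tasks(["task_01", "task_02", "task_03"], fake_runner)
```

Because each task runs in its own container, a crashing or misbehaving program cannot corrupt other tasks' environments, which is what makes this kind of multi-threaded fan-out safe.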
RT @hhsun1: Very excited to learn that our 2023 paper, "G2Retro as a two-step graph generative models for retrosynthesis prediction (https:…
Hi there, nice work on systematically comparing AI agents' R&D capabilities with expert human performance! In practice, agents can collaborate with human experts by quickly drafting a reasonably good program, which the experts can then further improve. This is closely related to our recent work, ScienceAgentBench: with 102 real-world R&D tasks for data-driven scientific discovery, we show the potential of agents to boost scientific productivity by generating program drafts within 10 minutes, while it can take human experts hours of effort to write the same programs. It would be great if you could discuss our benchmark in your paper as related work on "LLMs for science". Thanks!
RT @yugu_nlp: ❓Wondering how to scale inference-time compute with advanced planning for language agents? 🙋‍♂️Short answer: Using your LLM…
RT @DavidJAlba94: ScienceAgentBench ⚗️ by @RonZiruChen et al. A benchmark for evaluating language agents for data-driven scientific discov…
RT @BotaoYu24: 🤔 Can LLMs with tools always outperform those without? Perhaps not... 🚀 In our new work, we introduce ChemAgent, an enhance…
@mbodhisattwa @ysu_nlp Hi @mbodhisattwa, thanks again for your explanation! We've updated our manuscript accordingly:
🔍 Through a case study, we hypothesize that the low-level plans generated for each task during OpenAI o1's reasoning process are essential for writing correct programs, and we identify three strategies OpenAI o1 might have been trained to use:
(1) Reiterate the task goal and requirements from the input prompt
(2) Sketch an implementation plan in a chain-of-thought style
(3) Refine the plan in-place with adjustments or improvements
This suggests that inference-time compute is not just about scaling the number of tokens generated; it's also important that LLMs are trained to use the right strategy! 📄Check out our updated preprint for more details: (3/3)
RT @gneubig: When people ask me "why build agents for software development", my standard response is "if we can agents can write software w…
RT @ysu_nlp: People into agents, let me pitch something to you: 🌟 An agent that works across every platform (web, desktop & mobile) 🌟 Visu…