Zaid Khan Profile

Zaid Khan (@codezakh)
Followers: 427 · Following: 812 · Statuses: 319

@uncnlp with @mohitban47 working on grounded reasoning + multimodal agents // currently @allen_ai formerly @neclabsamerica // bs+ms CompE @northeastern

Boston, USA · Joined June 2023
Zaid Khan (@codezakh) · 22 hours
DataEnvGym will be a ⭐️ Spotlight ⭐️ at #ICLR2025 (Top 5%)! DataEnvGym is a testbed for RL-style data-generation agents + teaching environments to automate post-training: the process of improving a model on diverse, open-ended tasks based on automatically discovered model skills/weaknesses. We will keep expanding DataEnvGym with more tasks and agents/policies; it now covers 4 domains (multimodal reasoning, math reasoning, coding, and tool use) plus a leaderboard.
Zaid Khan (@codezakh) · 4 months
Can we automate the process of generating data to improve a model on diverse, open-ended tasks, based on automatically discovered model weaknesses? Introducing DataEnvGym, a testbed for data-generation agents + teaching environments.
The loop: environment trains/evaluates the student model ➡️ environment discovers skills/errors and gives feedback to the agent ➡️ agent generates updated training data to address weaknesses ➡️ iterate.
Key idea: frame data generation + model improvement as an RL-style sequential decision-making task, where states encode student errors, the policy decides which data to generate, and the reward is the performance of the student model.
We provide several modular environments + teaching agents that can improve models on VQA/math/programming, and a leaderboard benchmarking these agents. We welcome more entries to our leaderboard! Thread 🧵👇 (1/9)
[image]
5 replies · 26 retweets · 84 likes
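The RL-style framing in the tweet above (state = discovered student weaknesses, action = which data to generate, reward = student performance) can be sketched as a toy loop. Everything below is illustrative: the class names, the skill-based toy student, and the numbers are my assumptions, not DataEnvGym's actual API.

```python
# Toy sketch of the RL-style data-generation loop described above.
# state = discovered student weaknesses, action = data to generate,
# reward = student performance. Names are illustrative, not DataEnvGym's API.

SKILLS = ["algebra", "geometry", "counting"]

class StudentModel:
    """Toy 'student' whose per-skill accuracy improves with targeted data."""
    def __init__(self):
        self.accuracy = {s: 0.2 for s in SKILLS}

    def train(self, batch):
        for skill in batch:
            self.accuracy[skill] = min(1.0, self.accuracy[skill] + 0.1)

    def evaluate(self):
        return sum(self.accuracy.values()) / len(self.accuracy)

class TeachingEnvironment:
    """Trains/evaluates the student and reports its weakest skills (the state)."""
    def __init__(self, student):
        self.student = student

    def state(self):
        # "Discovered weaknesses": skills sorted by ascending accuracy.
        return sorted(SKILLS, key=lambda s: self.student.accuracy[s])

    def step(self, batch):
        self.student.train(batch)
        return self.state(), self.student.evaluate()  # next state, reward

class DataGenerationAgent:
    """Policy: generate data targeting the weakest reported skill."""
    def act(self, state, batch_size=4):
        return [state[0]] * batch_size

env = TeachingEnvironment(StudentModel())
agent = DataGenerationAgent()
state = env.state()
for _ in range(5):
    batch = agent.act(state)
    state, reward = env.step(batch)
print(f"final avg accuracy: {reward:.2f}")
```

Each iteration the agent targets the weakest skill the environment reports; a real policy would emit actual training examples rather than skill labels, and the reward would come from held-out evaluation.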
Zaid Khan (@codezakh) · 17 hours
@ryanmart3n Thanks Ryan! ❤️
0 replies · 0 retweets · 0 likes
Zaid Khan (@codezakh) · 18 hours
RT @EliasEskin: 🎉 Excited that DataEnvGym, our work on adaptive data generation agents, has been selected as a #ICLR2025 @iclr_conf Spotli…
0 replies · 4 retweets · 0 likes
Zaid Khan (@codezakh) · 19 hours
@NeginRaoof_ Thanks Negin! ❤️
0 replies · 0 retweets · 1 like
Zaid Khan (@codezakh) · 19 hours
@trungthvu Thanks Trung! ❤️
0 replies · 0 retweets · 0 likes
Zaid Khan (@codezakh) · 19 hours
RT @trungthvu: Awesome work by our OpenThinker team to create the best open-data 32B reasoning model! Our model closely matches or beats…
0 replies · 2 retweets · 0 likes
Zaid Khan (@codezakh) · 19 hours
RT @ryanmart3n: 🧠 OpenThinker-32B ⭐ Beating DeepSeek-R1-Distill-Qwen-32B, a closed data model, on MATH500 and GPQA-Diamond ⭐ Best performa…
0 replies · 4 retweets · 0 likes
Zaid Khan (@codezakh) · 19 hours
RT @etash_guha: 📈We have closed the gap from DeepSeek-R1-32B with OpenThinker-32B, our new Open-Data Reasoning Model! Not only does this mo…
0 replies · 6 retweets · 0 likes
Zaid Khan (@codezakh) · 19 hours
RT @NeginRaoof_: Announcing OpenThinker-32B: the best open-data reasoning model distilled from DeepSeek-R1. Our results show that large, ca…
0 replies · 112 retweets · 0 likes
Zaid Khan (@codezakh) · 22 hours
🙏 Big thanks to my coauthors @EliasEskin @jmin__cho @mohitban47 @uncnlp @unccs
More details: Paper: Project Page: Code: HuggingFace:
0 replies · 3 retweets · 8 likes
Zaid Khan (@codezakh) · 2 days
RT @hanlin_hl: Happy to share that “Ctrl-Adapter” is selected for ✨ Oral ✨ presentation (top 1.8%) at #ICLR2025! 🌟 Whenever new stronger d…
0 replies · 26 retweets · 0 likes
Zaid Khan (@codezakh) · 2 days
RT @rohanpaul_ai: The challenge lies in effectively debugging faulty code produced by LLMs due to the scarcity of unit tests that can pinpo…
0 replies · 8 retweets · 0 likes
Zaid Khan (@codezakh) · 5 days
RT @HuaxiuYaoML: 🚀 We introduce MJ-Bench-Video, a comprehensive fine-grained video preference benchmark, and MJ-Video, a powerful MoE-based…
0 replies · 30 retweets · 0 likes
Zaid Khan (@codezakh) · 8 days
RT @mohitban47: 🚨 Check out "UTGen & UTDebug" for learning to automatically generate unit tests (i.e., discovering inputs which break your…
0 replies · 18 retweets · 0 likes
Zaid Khan (@codezakh) · 9 days
RT @ArchikiPrasad: FYI: The debugging benchmarks we developed in our work to evaluate unit test generators & debuggers are now available on…
0 replies · 11 retweets · 0 likes
Zaid Khan (@codezakh) · 9 days
Testing is a critical part of software engineering: what if we could automatically discover inputs that break your code? This is a hard task even for frontier models (GPT-4o, DeepSeekV3). We show how to train SLMs (Qwen2.5-7B + Llama3.1-8B) to generate unit tests that break code and are useful for debugging! Lots of interesting follow-ups are possible here; check it out! 🧵👇
Archiki Prasad (@ArchikiPrasad) · 9 days
🚨 Excited to share: "Learning to Generate Unit Tests for Automated Debugging" 🚨, which introduces ✨UTGen and UTDebug✨ for teaching LLMs to generate unit tests (UTs) and to debug code from generated tests. UTGen + UTDebug improve LLM-based code debugging by addressing 3 key questions:
1⃣ What are desirable properties of unit test generators? (A: high output accuracy and a high rate of uncovering errors)
2⃣ How good are models at 0-shot unit test generation? (A: they are not great) ... so how do we improve LLMs' UT generation abilities? (A: by bootstrapping from code-generation data via UTGen)
3⃣ How can we use potentially noisy feedback from generated tests for debugging? (A: via test-time scaling and validation + backtracking in UTDebug)
🧵👇
[image]
1 reply · 6 retweets · 16 likes
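The third point above (using noisy generated-test feedback via validation + backtracking) can be illustrated with a toy sketch: a candidate fix is accepted only if it raises the pass rate on the generated tests, and otherwise we backtrack to the previous best code. The function names and lambda "patches" below are hypothetical, not the actual UTGen/UTDebug implementation.

```python
# Toy sketch of the validate-and-backtrack idea: generated unit tests may be
# noisy, so a proposed fix is accepted only if it passes more generated tests
# than the current code; otherwise we backtrack. Names are illustrative.

def pass_rate(code_fn, tests):
    """Fraction of (input, expected) unit tests that code_fn satisfies."""
    passed = 0
    for inp, expected in tests:
        try:
            if code_fn(inp) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed / len(tests)

def debug_with_backtracking(code_fn, candidate_fixes, tests):
    """Greedily accept candidate fixes that improve the generated-test pass
    rate; revert (backtrack) when a candidate does not help."""
    best, best_rate = code_fn, pass_rate(code_fn, tests)
    for fix in candidate_fixes:       # "test-time scaling": many sampled fixes
        rate = pass_rate(fix, tests)
        if rate > best_rate:          # validation step
            best, best_rate = fix, rate
        # else: backtrack, i.e. keep the previous best code
    return best, best_rate

# Buggy code: should square its input but doubles it instead.
buggy = lambda x: x * 2
fixes = [lambda x: x + 2, lambda x: x ** 2]  # sampled candidate patches
tests = [(2, 4), (3, 9), (4, 16)]            # generated unit tests (noiseless here)

fixed, rate = debug_with_backtracking(buggy, fixes, tests)
print(rate)  # pass rate of the accepted fix
```

With noisy tests (some wrong expected outputs), the same accept-only-on-improvement rule limits how far a bad test can drag the repair, which is the intuition behind the validation + backtracking answer above.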
Zaid Khan (@codezakh) · 9 days
RT @cyjustinchen: Introducing ✨UTGen & UTDebug✨ for improving code debugging tasks w/ strong pass@1 gains. Unit tests help both humans and…
0 replies · 8 retweets · 0 likes
Zaid Khan (@codezakh) · 9 days
RT @EliasEskin: 🚨 Excited to announce UTGen and UTDebug, where we first learn to generate unit tests and then apply them to debugging gener…
0 replies · 10 retweets · 0 likes
Zaid Khan (@codezakh) · 9 days
RT @ArchikiPrasad: 🚨 Excited to share: "Learning to Generate Unit Tests for Automated Debugging" 🚨 which introduces ✨UTGen and UTDebug✨ for…
0 replies · 58 retweets · 0 likes
Zaid Khan (@codezakh) · 9 days
RT @EliasEskin: 🎉 Pleased that several projects from my postdoc @uncnlp @unccs that I'm excited about have been accepted to #ICLR2025 and…
0 replies · 21 retweets · 0 likes