Xingyao Wang Profile
Xingyao Wang

@xingyaow_

Followers: 2,200 · Following: 1,005 · Media: 52 · Statuses: 226

Co-founder @allhands_ai, building OpenDevin | PhD candidate @IllinoisCDS | BS @UMichCSE ('22) | Ex Intern @GoogleAI @Microsoft

Champaign–Urbana, IL
Joined April 2019
Pinned Tweet
@xingyaow_
Xingyao Wang
2 months
Software is a powerful tool, enabling human developers to interact with the world in complex & profound ways. What if we could use software as a tool to create similar versatile AI agents? Meet OpenDevin: an open platform for AI software developers as generalist agents. 🧵 1/
5 replies · 55 reposts · 198 likes
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
3 replies · 91 reposts · 406 likes
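The tweet above names CodeAct's central idea: the agent's action space is executable Python. Below is a minimal sketch of that loop; the `<execute>` tag and helper names are illustrative, not the released CodeAct implementation.

```python
# Minimal sketch of code-as-action: the LLM emits a Python snippet as its
# action; we execute it and return the output as the next observation.
# The <execute> tag and helper names are illustrative, not CodeAct's API.
import contextlib
import io
import re

def extract_code(llm_output: str) -> str | None:
    """Pull the code action out of the model's reply, if any."""
    match = re.search(r"<execute>(.*?)</execute>", llm_output, re.DOTALL)
    return match.group(1) if match else None

def execute_action(code: str, namespace: dict) -> str:
    """Run the code action; capture stdout or the error as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # run only inside a sandbox in practice!
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue()

namespace: dict = {}
code = extract_code("<execute>x = 2 + 2\nprint(x)</execute>")
print(execute_action(code, namespace))  # "4" is fed back to the LLM
```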
@xingyaow_
Xingyao Wang
5 months
Introducing OpenDevin CodeAct 1.0 - a new state-of-the-art open coding agent! It achieves a 21% unassisted resolve rate on SWE-Bench Lite, a 17% relative improvement over the previous SOTA set by SWE-Agent. Check out our blog or the thread 🧵 for more details:
5 replies · 56 reposts · 238 likes
@xingyaow_
Xingyao Wang
1 year
Can pretrained language models (LMs) go beyond learning from labels and scalar rewards? Introducing LeTI, a new LM finetuning paradigm that explores LMs' potential to learn from textual interactions & feedback, allowing LMs to understand not just if they were wrong, but why. 🧵1/
2 replies · 36 reposts · 188 likes
@xingyaow_
Xingyao Wang
1 year
We often interact with Large Language Models (LLMs) like ChatGPT in multi-turn dialogues, yet we predominantly evaluate them with single-turn benchmarks. Bridging this gap, we introduce MINT, a new benchmark tailored for LLMs' multi-turn interactions. 🧵
3 replies · 36 reposts · 167 likes
@xingyaow_
Xingyao Wang
29 days
Excited to share that @allhands_ai has raised $5M -- and it's finally time to announce a new chapter in my life: I'm taking a leave from my PhD to focus full-time on All Hands AI. Let's push open-source agents forward together, in the open!
@allhands_ai
All Hands AI
29 days
We are proud to announce that All Hands has raised $5M to build the world’s best software development agents, and do it in the open 🙌 Thank you to @MenloVentures and our wonderful slate of investors for believing in the mission!
(quoted tweet) 3 replies · 6 reposts · 59 likes
7 replies · 9 reposts · 131 likes
@xingyaow_
Xingyao Wang
5 months
I finally managed to integrate (most of) CodeAct into OpenDevin 🥳. Now, it can work end-to-end on model training (well - very simple linear regression 😉). It is somewhat buggy - but I'm excited that we may have a fully open-sourced AI software engineer/data scientist in the near future.
2 replies · 16 reposts · 123 likes
@xingyaow_
Xingyao Wang
2 years
Thrilled to share that I'll be joining @IllinoisCS to pursue my PhD this fall with the amazing @elgreco_winter! I'm incredibly grateful to my mentors, friends, and everyone along the way!
12 replies · 4 reposts · 85 likes
@xingyaow_
Xingyao Wang
5 months
Happy to share that CodeAct has been accepted to #ICML2024! 🥳 I will be in Vienna next week for #ICLR2024, where I will present CodeAct at the LLM Agents workshop (Spotlight presentation #5, 2pm).
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
(quoted tweet) 3 replies · 91 reposts · 406 likes
1 reply · 13 reposts · 72 likes
@xingyaow_
Xingyao Wang
6 months
Kinda excited to see Llama-3 70B Instruct perform, out of the box, very close to GPT-4 (0613) on MINT's multi-turn interaction tasks, especially on the math subset. 🤔 I wonder where these multi-turn perf. improvements come from: SFT, DPO, or PPO?
3 replies · 7 reposts · 69 likes
@xingyaow_
Xingyao Wang
3 years
Do you like reaction gifs? Do you wish NLP tools could treat gifs like language? Well, come hear about our EMNLP Findings paper (w/ @david__jurgens) on building a gif-based dialog system! a 🧵 #EMNLP2021
3 replies · 7 reposts · 63 likes
@xingyaow_
Xingyao Wang
2 years
Code LLMs like Codex are good at solving coding exercises, but is that all? Our observation: the output of structured prediction problems in NLP can be rewritten as code, so we can repurpose Code LLMs for the task! New preprint with @ZoeyLi20 @elgreco_winter, a 🧵
@hengjinlp
Heng Ji
2 years
This is probably one of the most bizarre and exciting papers from my group, by my brand new PhD student @xingyaow_, in collaboration with the amazing @ZoeyLi20, about using code generation for structured prediction from natural language:
(quoted tweet) 0 replies · 9 reposts · 63 likes
3 replies · 6 reposts · 41 likes
@xingyaow_
Xingyao Wang
6 months
MINT's multi-turn interaction leaderboard just got a huge update 🥳! We've included results from 20+ additional LLMs and created two subsets specifically for measuring multi-turn performance in code & math. Check the leaderboard here:
0 replies · 6 reposts · 37 likes
@xingyaow_
Xingyao Wang
9 months
Happy to share that MINT has been accepted to #ICLR2024! See y’all in Vienna!
@xingyaow_
Xingyao Wang
1 year
We often interact with Large Language Models (LLMs) like ChatGPT in multi-turn dialogues, yet we predominantly evaluate them with single-turn benchmarks. Bridging this gap, we introduce MINT, a new benchmark tailored for LLMs' multi-turn interactions. 🧵
(quoted tweet) 3 replies · 36 reposts · 167 likes
2 replies · 5 reposts · 36 likes
@xingyaow_
Xingyao Wang
6 months
Check out our newly released Eurus (model) and UltraInteract (data)! UltraInteract collects trees of multi-turn interactions with preference pairs that support both SFT and preference learning for challenging reasoning problems!
@lifan__yuan
Lifan Yuan
6 months
Introducing 🚀Eurus, a suite of state-of-the-art LLM reasoning generalists powered by a new member of the Ultra-Series, UltraInteract🎉! Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning, as shown by comprehensive benchmarking across 12 tests (mostly OOD) covering five tasks!
(quoted tweet) 7 replies · 63 reposts · 318 likes
0 replies · 6 reposts · 25 likes
@xingyaow_
Xingyao Wang
8 months
Beyond the {framework, data, model}, we've created a fully functional chat interface at . Huge thanks to the open-source community, including @huggingface for chat-ui, @ProjectJupyter for code executor, and many more for making this interface possible! 9/
2 replies · 5 reposts · 22 likes
@xingyaow_
Xingyao Wang
8 days
I've been thinking about one related thing for a while: Why do humans need a division of labor (i.e., a multi-agent system)? My hypothesis is that humans have intelligence limited by our biological brains -- one person can't know everything and do everything well. However, with
@gneubig
Graham Neubig
8 days
New blog: "Don't Sleep on Single-Agent Systems" Multi-agent systems are all the rage, but sometimes one agent is all you need! (and simpler, more maintainable, etc.) I also discuss design considerations for building versatile, powerful single agents.
(quoted tweet) 6 replies · 61 reposts · 358 likes
7 replies · 6 reposts · 47 likes
@xingyaow_
Xingyao Wang
7 months
@_akhaliq Happy to see people caring about multi-turn interaction and code as action! A while ago, we found that using code as action (CodeAct) for LLM agents improves performance; We released a 7k multi-turn dataset, models, and demo for CodeActAgent :)
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
(quoted tweet) 3 replies · 91 reposts · 406 likes
0 replies · 2 reposts · 19 likes
@xingyaow_
Xingyao Wang
8 months
Deeply integrated with a Python interpreter and libraries, CodeActAgent (Mistral-7B) can execute code actions, revise prior actions (e.g., self-debugging) or emit new actions upon new observations in multi-turn interactions. Complete example at: 8/
1 reply · 2 reposts · 16 likes
@xingyaow_
Xingyao Wang
5 months
The agent is based on CodeAct, a framework that consolidates LLM agents’ actions into a unified code action space.
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
(quoted tweet) 3 replies · 91 reposts · 406 likes
1 reply · 0 reposts · 14 likes
@xingyaow_
Xingyao Wang
8 months
The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with human users using natural language. 5/
1 reply · 2 reposts · 13 likes
@xingyaow_
Xingyao Wang
2 months
While OpenDevin agents may not achieve top performance in every category, they are designed with generality in mind. The same CodeActAgent shows competitive performance across software development, web browsing, and misc tasks, even when compared to specialists. 10/
1 reply · 0 reposts · 11 likes
@xingyaow_
Xingyao Wang
8 months
To this end, we collect an instruction-tuning dataset CodeActInstruct, consisting of 7k high-quality multi-turn interactions using CodeAct. 6/
1 reply · 1 repost · 10 likes
@xingyaow_
Xingyao Wang
1 year
LeTI focuses on code generation tasks where models produce code from natural language instructions. This allows us to acquire automatic textual feedback in a natural and scalable way: error messages and stack traces from a Python interpreter. 2/
1 reply · 0 reposts · 9 likes
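To make the feedback mechanism in the tweet above concrete: collecting textual feedback can be as simple as running the generated program and capturing the traceback. A sketch under that assumption (not the paper's exact harness):

```python
# Sketch: execute an LM-generated program and keep the error message /
# stack trace as textual feedback (illustrative, not LeTI's exact harness).
import traceback

def get_textual_feedback(program: str) -> tuple[bool, str]:
    """Return (passed, feedback) for a generated program."""
    try:
        exec(program, {})
        return True, "The program executed successfully."
    except Exception:
        return False, traceback.format_exc()

passed, feedback = get_textual_feedback("print(undefined_name)")
# passed == False; feedback contains the NameError traceback the LM can
# learn from, i.e., not just *that* it was wrong but *why*.
```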
@xingyaow_
Xingyao Wang
1 year
LeTI emulates the iterative learning process of human developers: learning by interacting with a programming environment to write, execute, and debug code, gradually becoming better at avoiding similar mistakes. LeTI enables LMs to undergo a similar cycle of improvement. 🔄💻 3/
1 reply · 0 reposts · 9 likes
@xingyaow_
Xingyao Wang
8 months
CodeAct stands out by (1) leveraging existing LLMs' pre-training on code data for cost-effective adoption, (2) inherently supporting complex operations through control and data flow, and (3) using extensive software packages for an expanded action space and automated feedback. 3/
2 replies · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
When trained and tested on the MBPP dataset (w/o post-processing heuristics), LeTI substantially improved the performance of base LMs, producing 63.2% more executable code for the 2B LM in 6 iterations. It achieves this without requiring any ground-truth outputs for training! 5/
2 replies · 0 reposts · 8 likes
@xingyaow_
Xingyao Wang
8 months
Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark M3ToolEval shows that CodeAct outperforms widely used alternatives like Text and JSON, achieving up to a 20% higher success rate. Please check our paper for a detailed analysis. 4/
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
Special thanks to my amazing collaborators and mentors @haopeng_nlp @Reyhaneh @elgreco_winter on this research! 9/
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
2️⃣ To assess the ability to leverage natural language feedback, we measure the performance gain when LLMs receive natural language feedback from GPT-4 (compare the above figure with and without the red dotted box).
1 reply · 0 reposts · 7 likes
@xingyaow_
Xingyao Wang
5 months
The conceptual idea is illustrated in the figure. At each turn, the agent can:
Converse: communicate with humans in natural language to ask for clarification, confirmation, etc.
CodeAct: choose to perform the task by executing code, including Linux bash commands or Python code.
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
8 months
Why CodeAct? Most existing LLM agents are limited by generating actions in JSON or text formats, constraining them to a narrow action space (e.g., pre-defined tools) with less flexibility (e.g., cannot compose multiple tools together). 2/
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
LeTI iteratively fine-tunes the LM on a concatenation of textual feedback, natural language instructions, and LM-generated programs. Prepended to this text, a binary reward token is used to differentiate correct and buggy solutions. 4/
1 reply · 0 reposts · 7 likes
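Assembling one such training example might look like the sketch below; the reward-token strings are illustrative, not the paper's actual special tokens.

```python
# Sketch of one LeTI fine-tuning example: a binary reward token prepended
# to the concatenation of feedback, instruction, and generated program.
# Token strings are illustrative, not the paper's actual vocabulary.
GOOD, BAD = "<|good|>", "<|bad|>"

def build_training_text(instruction: str, program: str,
                        feedback: str, passed: bool) -> str:
    reward_token = GOOD if passed else BAD
    return f"{reward_token}\n{feedback}\n{instruction}\n{program}"

example = build_training_text(
    instruction="Write a function that adds two numbers.",
    program="def add(a, b):\n    return a + b",
    feedback="The program executed successfully.",
    passed=True,
)
```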
@xingyaow_
Xingyao Wang
8 months
We finetune CodeActAgent from Llama2 and Mistral (7B) on a mixture of CodeActInstruct and general conversations. We show that CodeActInstruct can be used with existing conversation data to improve models in agent-oriented tasks without compromising their general capability. 7/
1 reply · 1 repost · 7 likes
@xingyaow_
Xingyao Wang
7 months
@JustinLin610 @MetaGPT_ Happy to contribute! We released our CodeAct framework, interface, instruction-tuning data, and model earlier -- hope it can be useful for OpenDevin :-)
@xingyaow_
Xingyao Wang
8 months
Beyond the {framework, data, model}, we've created a fully functional chat interface at . Huge thanks to the open-source community, including @huggingface for chat-ui, @ProjectJupyter for code executor, and many more for making this interface possible! 9/
(quoted tweet) 2 replies · 5 reposts · 22 likes
0 replies · 1 repost · 7 likes
@xingyaow_
Xingyao Wang
1 year
Furthermore, we find RLHF hurts LLM-tool multi-turn interaction on the LLaMA-2 series. However, it's unclear whether RLHF is problematic overall or whether it only hurts when applied to single-turn data (the case of LLaMA-2).
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
2 years
Special shout out to @david__jurgens, Joyce Chai, @DanielMRomero, and @radamihalcea. Thanks for taking a chance on me and helping me become the person I am today!
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
2 months
OpenDevin’s architecture consists of 3 main components: (1) an agent that produces actions, (2) a runtime that executes actions and generates observations, and (3) an event stream that connects the two. 3/
1 reply · 0 reposts · 6 likes
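A rough sketch of how those three components fit together (simplified; class and method names are illustrative, not OpenDevin's real API):

```python
# Simplified sketch of the agent / runtime / event-stream triangle.
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str     # "action" or "observation"
    content: str

@dataclass
class EventStream:
    events: list[Event] = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)

def run(agent, runtime, stream: EventStream, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        action = agent.step(stream.events)             # (1) agent produces an action
        stream.add(Event("action", action))
        observation = runtime.execute(action)          # (2) runtime executes it
        stream.add(Event("observation", observation))  # (3) the stream connects the two
```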
@xingyaow_
Xingyao Wang
2 months
OpenDevin is a platform to build generalist AI agents that interact with the world similarly to human software developers: writing code, interacting with a command line, and browsing the web. 2/
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
5 months
In less than 2 months, we built a rough architecture with an agenthub that can support a variety of agent implementations: e.g., we have SWE-agent implemented, and other agents (e.g., CodeActAgent) can use a plugin that allows them to use the awesome bash tools from SWE-agent!
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
1 year
Despite being trained on MBPP problems, LeTI demonstrated comparable or even better performance on unseen code generation problems from the HumanEval dataset. 6/
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
6 months
@arankomatsuzaki @WenhuChen Sharing my two cents: I actually feel both long context & RAG + short context models can eventually work equally well if we devote enough resources to develop them. But they likely require different types of resources - an essential trade-off in developing advanced LLMs: - RAG
0 replies · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
5 months
We spent a lot of effort building a capable docker sandbox that the agent can SSH into and do all sorts of crazy stuff 😇
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
1 year
To solve a problem, the LLM can (1) use external tools via Python programs ('Execute' in the figure) and/or (2) collect natural language feedback to refine its solutions (Red dotted box in the figure); the feedback is provided by GPT-4, aiming to simulate a human user.
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
And here's the cherry on top: with a solution evaluator that provides feedback for any solution, LeTI can extend to natural language tasks! We demonstrated this adaptability by successfully applying LeTI to event argument extraction. 8/
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
5 months
If you want to try out OpenDevin CodeAct 1.0 on your own projects, it’s easy! CodeAct 1.0 is now the default agent in OpenDevin v0.5, which you can download and use today:
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
Similar to the case of LLM-tool interaction, we find that SIFT and RLHF hurt models' ability to leverage feedback. The results on CodeLLaMA (except 7B), LLaMA-2, and Lemur-v1 show that SIFT/RLHF models all have lower Δfeedback and success rates compared to their base variants.
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
❗Surprisingly, Vicuna-v1.5 13B (trained on ShareGPT) performs worse than 7B! It produces escaped underscores "\_" that hurt performance (more severe on 13B). We can trace "\_" in ~15% of 94k ShareGPT data! A similar issue was noted in CodeLLaMA-Instruct; see paper for details.
1 reply · 0 reposts · 5 likes
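The tracing described above is straightforward to reproduce. A sketch, assuming the ShareGPT dump is a JSON list of conversations with "value" fields (the schema here is an assumption):

```python
# Sketch: count how many ShareGPT conversations contain an escaped
# underscore. Assumes a JSON list of {"conversations": [{"value": ...}]};
# the actual dump's schema may differ.
import json

with open("sharegpt.json") as f:
    data = json.load(f)

affected = sum(
    any("\\_" in turn["value"] for turn in conv["conversations"])
    for conv in data
)
print(f"{affected / len(data):.1%} of conversations contain an escaped '_'")
```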
@xingyaow_
Xingyao Wang
2 years
Special thanks also go to @jed_yang @bryant1410 for being a great mentor, and @skychwang @jed_yang @ziqiao_ma for huge help with my application 😉!
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
2 months
OpenDevin thrives thanks to our incredible community: 170+ contributors, 1,300+ contributions, and 28k+ stargazers. We're deeply grateful to everyone who's been part of this journey!🚀 11/
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
We find that task-solving ability could be orthogonal to feedback-providing ability: higher task-solving performance does not necessarily translate to better feedback-providing capability and vice versa.
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
MINT mirrors real-world user-LLM-tool interactions to evaluate two key LLM multi-turn capabilities: 1️⃣ Tool-augmented problem-solving 2️⃣ Ability to leverage natural language feedback
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
📊 Our Findings on Tool-augmented Task-Solving capabilities of LLMs We find all open-source models (only 4 are visualized) fall behind most commercial closed-source models in both success rate at k=5 and improvement rate (slope).
1 reply · 1 repost · 5 likes
@xingyaow_
Xingyao Wang
1 year
📊 Findings on LLMs' Ability to Leverage Natural Language Feedback We find no significant difference between open- and closed-source models in Δfeedback (performance gain due to feedback).
1 reply · 0 reposts · 5 likes
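Written out, the Δfeedback metric above is just a difference of success rates (a sketch of the definition, not MINT's evaluation harness):

```python
# Δfeedback: performance gain attributable to natural language feedback,
# i.e., success rate with feedback minus success rate without it.
def delta_feedback(success_with_feedback: float,
                   success_without_feedback: float) -> float:
    return success_with_feedback - success_without_feedback

print(delta_feedback(0.42, 0.35))  # ≈0.07, i.e., +7 points from feedback
```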
@xingyaow_
Xingyao Wang
1 year
1️⃣ For tool-augmented task-solving, we analyze how performance improves as the number of interaction turns with tools increases, all without language feedback (refer to the above figure without the red dotted box).
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
An interesting observation is that LeTI, by leveraging textual feedback, shows superior performance and sample efficiency compared to models that only use binary feedback. It achieves the same MBPP performance with fewer than half of the gradient steps on a 2B model! 7/
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
2 months
Implementing an agent is pretty straightforward: you simply need to define the logic that converts a list of prior actions and observations into the next action to take! We've created a hub of community-contributed agents, including a generalist CodeActAgent and a browsing agent.
1 reply · 0 reposts · 4 likes
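That "history in, next action out" contract can be sketched as a tiny interface (names are illustrative, not OpenDevin's exact classes):

```python
# Sketch of the agent contract: history of events in, next action out.
from abc import ABC, abstractmethod

class Agent(ABC):
    @abstractmethod
    def step(self, history: list[dict]) -> str:
        """Convert prior actions/observations into the next action."""

class EchoAgent(Agent):
    """Toy agent: replies with the most recent observation."""
    def step(self, history: list[dict]) -> str:
        observations = [e for e in history if e["kind"] == "observation"]
        return observations[-1]["content"] if observations else "noop"
```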
@xingyaow_
Xingyao Wang
6 months
@OfirPress We observed something pretty similar (especially on the Claude series) in our prior work MINT. Interesting to see the bias towards single-turn capability (i.e., RAG SWE-Bench setup) still exists on Claude-3 🤔
@xingyaow_
Xingyao Wang
1 year
We often interact with Large Language Models (LLMs) like ChatGPT in multi-turn dialogues, yet we predominantly evaluate them with single-turn benchmarks. Bridging this gap, we introduce MINT, a new benchmark tailored for LLMs' multi-turn interactions. 🧵
(quoted tweet) 3 replies · 36 reposts · 167 likes
0 replies · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
2 years
Looking at some example predictions: Code4Struct can leverage implicit commonsense knowledge in LLMs to infer arguments not present in the text (e.g., United States, Court) in a zero-shot setting.
2 replies · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
📊 Findings on LLMs' Ability to *Provide* Natural Language Feedback We can assess LLMs' effectiveness as feedback providers by using different LLMs to provide feedback to a fixed LLM (gpt-3.5-turbo-0613).
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
We are working on dockerizing and integrating SWE-Bench into OpenDevin -- in the ideal world, you can just define an agent with the OpenDevin agent abstraction (e.g., implement a step function mapping state to action) and have everything evaluated by OpenDevin for you.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
🛠️We evaluate 20 LLMs, where 4 are closed- and 16 are open-source. We cover different sizes and three training techniques: 1️⃣pre-trained model (Base) 2️⃣supervised instruction-finetuning (SIFT) 3️⃣reinforcement learning from human feedback (RLHF).
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
The current version of the harness is available here: Feel free to subscribe to our PR for the latest update:
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
Comment below and let us know what you hope to use OpenDevin for (e.g., web development, training ML models, data analysis, etc.) - it helps us create a better roadmap!
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
2 months
These actions are so powerful by themselves that you can have an agent interact with them directly (CodeActAgent) or use them to create tools (e.g., use Python to write a calculator function) and have an LLM call them (e.g., via function calling). 6/
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
SIFT on multi-turn data can potentially be helpful. Vicuna-v1.5 (7B), which is a SIFT variant of LLaMA-2 trained on ShareGPT conversations (most are multi-turn), exhibits stronger performance compared to LLaMA-2 (Base and RLHF).
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
For example, GPT-3.5 excelled in task-solving but struggled with self-feedback. On the other hand, CodeLLaMA-34B-Instruct (SIFT), despite performing the poorest in task-solving (-19% difference vs. GPT-3.5), can still provide feedback that improves the stronger GPT-3.5.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
We repurpose a diverse set of established datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
🚀 We hope MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation has been less accessible compared to commercial LLMs.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
6 months
@OfirPress Thanks for the great work!! Any idea why? Would love to see any error analysis comparing these two models in the paper :) Could prompting be related to this? i.e., most prompts/instructions optimized for GPT-4 to follow instructions might not work well for Claude; this issue
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
Huge shout out to @BowenLi2121 for his amazing work on standardizing & containerizing SWE-Bench ❤️
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
2 months
Executing these actions is challenging! To ensure arbitrary code execution does not blow up your laptop 💥, we use a docker sandbox to execute all bash commands and IPython code and stream execution results back to the agent. 8/
2 replies · 0 reposts · 3 likes
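The shape of that sandbox is roughly the following (a sketch using the docker-py SDK; OpenDevin's real sandbox is stateful and considerably more involved):

```python
# Sketch: run an agent's bash command in a throwaway Docker container and
# return the output. Uses the docker-py SDK; OpenDevin's sandbox is stateful
# and more elaborate (SSH sessions, mounted workspaces, etc.).
import docker

client = docker.from_env()

def run_in_sandbox(command: str) -> str:
    output = client.containers.run(
        image="python:3.11-slim",
        command=["bash", "-lc", command],
        remove=True,            # discard the container afterwards
        network_disabled=True,  # limit the blast radius of arbitrary code
    )
    return output.decode()

print(run_in_sandbox("echo hello from the sandbox"))
```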
@xingyaow_
Xingyao Wang
5 months
@casper_hansen_ Great idea!! That’s something we are actually thinking about doing — but we just got too much stuff on our priority list now 😂 hopefully we can do this in the next 3 months to greatly improve evals
2 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
5 months
Always excited to chat about code, AI agents, OpenDevin, learning from interactions, LLMs, and more. Looking forward to meeting with familiar and new faces at ICLR! 😀
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
5 months
Stay tuned! - or even better - join our community and contribute your ideas & code to the future of open AI agents 😉 👀 Get started by looking at good first issues:
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
2 months
Inspired by CodeAct, OpenDevin connects an agent with the environment through a core set of programming-language-based general actions: (1) run arbitrary bash commands in a stateful SSH session, (2) execute Python code in an interactive Jupyter environment, and (3) browse the web.
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
5 months
To reproduce the demo video, install from our main branch following , select CodeActAgent as the agent and gpt-4-turbo-2024-04-09 as the model. We will try to iron out a few details and include this in the next release so you can use it easily with one
1 reply · 1 repost · 3 likes
@xingyaow_
Xingyao Wang
5 months
We implemented a few essential actions: (1) bash, (2) IPython execution (as you see in the video), and (3) browser (WIP). Check the complete list of actions here:
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
1 year
@arankomatsuzaki Thank you for introducing LeTI! For a deeper dive into our research, check out our Twitter thread below. 👇
@xingyaow_
Xingyao Wang
1 year
Can pretrained language models (LMs) go beyond learning from labels and scalar rewards? Introducing LeTI, a new LM finetuning paradigm that explores LMs' potential to learn from textual interactions & feedback, allowing LMs to understand not just if they were wrong, but why. 🧵1/
(quoted tweet) 2 replies · 36 reposts · 188 likes
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
2 months
We packed several commonly used tools (mainly for code editing) into an agent skill library (a Python package). Because they are just software, we write unit tests to ensure these tools stay reliable and useful as we evolve and improve the framework. 7/
1 reply · 0 reposts · 3 likes
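Since skills are ordinary functions, they can be tested like any other code. A sketch with a hypothetical editing skill (not taken from the actual skill library):

```python
# Hypothetical editing skill plus a pytest-style unit test; illustrative,
# not code from the real agent skill package.
def replace_line(path: str, line_no: int, new_text: str) -> None:
    """Replace one line (1-indexed) of a file -- a typical editing skill."""
    with open(path) as f:
        lines = f.readlines()
    lines[line_no - 1] = new_text + "\n"
    with open(path, "w") as f:
        f.writelines(lines)

def test_replace_line(tmp_path):
    target = tmp_path / "demo.txt"
    target.write_text("a\nb\nc\n")
    replace_line(str(target), 2, "B")
    assert target.read_text() == "a\nB\nc\n"
```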
@xingyaow_
Xingyao Wang
2 years
Pretraining on code could also grant the language model the ability to handle relations better, as shown by the superior performance of Codex + text prompt (7.6% absolute F1 better on Arg-C) compared to GPT-3 + text prompt.
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
2 months
As AI agents tackle increasingly complex problems, their evaluation has also become challenging, especially for generalist agents. To track our progress toward a generalist, we integrated 15 benchmarks covering software engineering, web browsing, and miscellaneous tasks.
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
1 year
@sivil_taram Thanks Qian! I'm excited and very impressed to see the open-source community quickly catch up with these commercial models in multi-turn interaction. 😁
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
6 months
@_akhaliq Thanks @_akhaliq for sharing our work! Feel free to check out our thread for a quick overview of the work :)!
@lifan__yuan
Lifan Yuan
6 months
Introducing 🚀Eurus, a suite of state-of-the-art LLM reasoning generalists powered by a new member of the Ultra-Series, UltraInteract🎉! Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning, as shown by comprehensive benchmarking across 12 tests (mostly OOD) covering five tasks!
(quoted tweet) 7 replies · 63 reposts · 318 likes
0 replies · 1 repost · 3 likes
@xingyaow_
Xingyao Wang
3 years
We disclosed to the Imgur community that the account was actually doing this gif reply experiment in a general-public science write-up (part of their Science Week), and the reaction was very positive. Not all online experiments go poorly 😅
1 reply · 0 reposts · 1 like
@xingyaow_
Xingyao Wang
2 years
We showcase our proposed Code4Struct on the Event Argument Extraction (EAE) task, which aims to extract event structures from texts. Given an event definition and a sentence, we prompt Codex to generate code to instantiate the given event class (e.g., Transport).
1 reply · 0 reposts · 2 likes
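To make the idea concrete, here is what a Code4Struct-style target might look like (the class, roles, and example sentence are invented for illustration; the paper's ontology classes differ in detail):

```python
# Illustrative only: event types become Python classes, and extraction is
# performed by asking the code LLM to instantiate the class for a sentence.
from dataclasses import dataclass, field

@dataclass
class Transport:
    """Movement event: an agent moves an artifact between places."""
    agent: list[str] = field(default_factory=list)
    artifact: list[str] = field(default_factory=list)
    origin: list[str] = field(default_factory=list)
    destination: list[str] = field(default_factory=list)

# For a sentence like "Kelly flew the hostages from Baghdad to Amman",
# the model would be prompted to emit:
transport_event = Transport(
    agent=["Kelly"],
    artifact=["the hostages"],
    origin=["Baghdad"],
    destination=["Amman"],
)
```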
@xingyaow_
Xingyao Wang
5 months
We are also working on a new simplified evaluation harness for testing coding agents on SWE-Bench, which we hope will be easy to use for agent developers and researchers, facilitating comprehensive evaluation and comparison.
1 reply · 0 reposts · 2 likes
@xingyaow_
Xingyao Wang
3 years
Our solution (Pepe the King Prawn) goes further and uses an OSCAR encoder that fuses image regions, extracted object types, and the gif caption into a single representation to compare with a message’s embedding from a RoBERTa model (plus fancy training stuff) to pick a reply.
1 reply · 0 reposts · 1 like