Xingyao Wang Profile
Xingyao Wang

@xingyaow_

Followers: 2,200 · Following: 1,005 · Media: 52 · Statuses: 226

Co-founder @allhands_ai, building OpenDevin | PhD candidate @IllinoisCDS | BS @UMichCSE ('22) | Ex Intern @GoogleAI @Microsoft

Champaign–Urbana, IL
Joined April 2019
Pinned Tweet
@xingyaow_
Xingyao Wang
2 months
Software is a powerful tool, enabling human developers to interact with the world in complex & profound ways. What if we could use software as a tool to create similar versatile AI agents? Meet OpenDevin: an open platform for AI software developers as generalist agents. 🧵 1/
5 replies · 55 reposts · 198 likes
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
3 replies · 91 reposts · 406 likes
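The tweet above names CodeAct's central idea: the agent's action space is executable Python. Below is a minimal sketch of that loop; the `<execute>` tag and helper names are illustrative, not the released CodeAct implementation.

```python
# Minimal sketch of code-as-action: the LLM emits a Python snippet as its
# action; we execute it and return the output as the next observation.
# The <execute> tag and helper names are illustrative, not CodeAct's API.
import contextlib
import io
import re

def extract_code(llm_output: str) -> str | None:
    """Pull the code action out of the model's reply, if any."""
    match = re.search(r"<execute>(.*?)</execute>", llm_output, re.DOTALL)
    return match.group(1) if match else None

def execute_action(code: str, namespace: dict) -> str:
    """Run the code action; capture stdout or the error as the observation."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # run only inside a sandbox in practice!
    except Exception as exc:
        return f"Error: {exc!r}"
    return buffer.getvalue()

namespace: dict = {}
code = extract_code("<execute>x = 2 + 2\nprint(x)</execute>")
print(execute_action(code, namespace))  # "4" is fed back to the LLM
```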
@xingyaow_
Xingyao Wang
5 months
Introducing OpenDevin CodeAct 1.0 - a new state-of-the-art open coding agent! It achieves a 21% unassisted resolve rate on SWE-Bench Lite, a 17% relative improvement over the previous SOTA set by SWE-Agent. Check out our blog or the thread 🧵 for more details:
5 replies · 56 reposts · 238 likes
@xingyaow_
Xingyao Wang
1 year
Can pretrained language models (LMs) go beyond learning from labels and scalar rewards? Introducing LeTI, a new LM finetuning paradigm that explores LMs' potential to learn from textual interactions & feedback, allowing LMs to understand not just if they were wrong, but why. 🧵1/
2 replies · 36 reposts · 188 likes
@xingyaow_
Xingyao Wang
1 year
We often interact with Large Language Models (LLMs) like ChatGPT in multi-turn dialogues, yet we predominantly evaluate them with single-turn benchmarks. Bridging this gap, we introduce MINT, a new benchmark tailored for LLMs' multi-turn interactions. 🧵
3 replies · 36 reposts · 167 likes
@xingyaow_
Xingyao Wang
29 days
Excited to share that @allhands_ai has raised $5M -- and it's finally time to announce a new chapter in my life: I'm taking a leave from my PhD to focus full-time on All Hands AI. Let's push open-source agents forward together, in the open!
@allhands_ai
All Hands AI
29 days
We are proud to announce that All Hands has raised $5M to build the world’s best software development agents, and do it in the open 🙌 Thank you to @MenloVentures and our wonderful slate of investors for believing in the mission!
(quoted tweet) 3 replies · 6 reposts · 59 likes
7 replies · 9 reposts · 131 likes
@xingyaow_
Xingyao Wang
5 months
I finally managed to integrate (most of) CodeAct into OpenDevin 🥳. Now, it can work end-to-end on model training (well - very simple linear regression 😉). It is somewhat buggy - but I'm excited that we may have a fully open-sourced AI software engineer/data scientist in the near future.
2 replies · 16 reposts · 123 likes
@xingyaow_
Xingyao Wang
2 years
Thrilled to share that I'll be joining @IllinoisCS to pursue my PhD this fall with the amazing @elgreco_winter! I'm incredibly grateful to my mentors, friends, and everyone along the way!
12 replies · 4 reposts · 85 likes
@xingyaow_
Xingyao Wang
5 months
Happy to share that CodeAct has been accepted to #ICML2024! 🥳 I will be in Vienna next week for #ICLR2024, where I will present CodeAct at the LLM Agents workshop (Spotlight presentation #5, 2pm).
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
(quoted tweet) 3 replies · 91 reposts · 406 likes
1 reply · 13 reposts · 72 likes
@xingyaow_
Xingyao Wang
6 months
Kinda excited to see Llama-3 70B Instruct perform, out of the box, very close to GPT-4 (0613) on MINT's multi-turn interaction tasks, especially on the math subset. 🤔 I wonder where these multi-turn perf. improvements come from: SFT, DPO, or PPO?
3 replies · 7 reposts · 69 likes
@xingyaow_
Xingyao Wang
3 years
Do you like reaction gifs? Do you wish NLP tools could treat gifs like language? Well, come hear about our EMNLP Findings paper (w/ @david__jurgens) on building a gif-based dialog system! a 🧵 #EMNLP2021
3 replies · 7 reposts · 63 likes
@xingyaow_
Xingyao Wang
2 years
Code LLMs like Codex are good at solving coding exercises, but is that all? Our observation: the output of structured prediction problems in NLP can be rewritten as code, so we can repurpose Code LLMs for the task! New preprint with @ZoeyLi20 @elgreco_winter, a 🧵
@hengjinlp
Heng Ji
2 years
This is probably one of the most bizarre and exciting papers from my group, by my brand new PhD student @xingyaow_, in collaboration with the amazing @ZoeyLi20, about using code generation for structured prediction from natural language:
(quoted tweet) 0 replies · 9 reposts · 63 likes
3 replies · 6 reposts · 41 likes
@xingyaow_
Xingyao Wang
6 months
MINT's multi-turn interaction leaderboard just got a huge update 🥳! We've included results from 20+ additional LLMs and created two subsets specifically for measuring multi-turn performance in code & math. Check the leaderboard here:
0 replies · 6 reposts · 37 likes
@xingyaow_
Xingyao Wang
9 months
Happy to share that MINT has been accepted to #ICLR2024! See y’all in Vienna!
@xingyaow_
Xingyao Wang
1 year
We often interact with Large Language Models (LLMs) like ChatGPT in multi-turn dialogues, yet we predominantly evaluate them with single-turn benchmarks. Bridging this gap, we introduce MINT, a new benchmark tailored for LLMs' multi-turn interactions. 🧵
(quoted tweet) 3 replies · 36 reposts · 167 likes
2 replies · 5 reposts · 36 likes
@xingyaow_
Xingyao Wang
6 months
Check out our newly released Eurus (model) and UltraInteract (data)! UltraInteract collects trees of multi-turn interactions with preference pairs that support both SFT and preference learning for challenging reasoning problems!
@lifan__yuan
Lifan Yuan
6 months
Introducing 🚀Eurus, a suite of state-of-the-art LLM reasoning generalists powered by a new member of the Ultra-Series, UltraInteract🎉! Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning, as shown by comprehensive benchmarking across 12 tests (mostly OOD) covering five tasks!
(quoted tweet) 7 replies · 63 reposts · 318 likes
0 replies · 6 reposts · 25 likes
@xingyaow_
Xingyao Wang
8 months
Beyond the {framework, data, model}, we've created a fully functional chat interface at . Huge thanks to the open-source community, including @huggingface for chat-ui, @ProjectJupyter for code executor, and many more for making this interface possible! 9/
2 replies · 5 reposts · 22 likes
@xingyaow_
Xingyao Wang
8 days
I've been thinking about one related thing for a while: Why do humans need a division of labor (i.e., a multi-agent system)? My hypothesis is that humans have intelligence limited by our biological brains -- one person can't know everything and do everything well. However, with
@gneubig
Graham Neubig
8 days
New blog: "Don't Sleep on Single-Agent Systems" Multi-agent systems are all the rage, but sometimes one agent is all you need! (and simpler, more maintainable, etc.) I also discuss design considerations for building versatile, powerful single agents.
(quoted tweet) 6 replies · 61 reposts · 358 likes
7 replies · 6 reposts · 47 likes
@xingyaow_
Xingyao Wang
7 months
@_akhaliq Happy to see people caring about multi-turn interaction and code as action! A while ago, we found that using code as action (CodeAct) for LLM agents improves performance; We released a 7k multi-turn dataset, models, and demo for CodeActAgent :)
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
(quoted tweet) 3 replies · 91 reposts · 406 likes
0 replies · 2 reposts · 19 likes
@xingyaow_
Xingyao Wang
8 months
Deeply integrated with a Python interpreter and libraries, CodeActAgent (Mistral-7B) can execute code actions, revise prior actions (e.g., self-debugging) or emit new actions upon new observations in multi-turn interactions. Complete example at: 8/
1 reply · 2 reposts · 16 likes
@xingyaow_
Xingyao Wang
5 months
The agent is based on CodeAct, a framework that consolidates LLM agents’ actions into a unified code action space.
@xingyaow_
Xingyao Wang
8 months
Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents. 🧵1/
(quoted tweet) 3 replies · 91 reposts · 406 likes
1 reply · 0 reposts · 14 likes
@xingyaow_
Xingyao Wang
8 months
The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with human users using natural language. 5/
1 reply · 2 reposts · 13 likes
@xingyaow_
Xingyao Wang
2 months
While OpenDevin agents may not achieve top performance in every category, they are designed with generality in mind. The same CodeActAgent shows competitive performance across software development, web browsing, and misc tasks, even when compared to specialists. 10/
1 reply · 0 reposts · 11 likes
@xingyaow_
Xingyao Wang
8 months
To this end, we collect an instruction-tuning dataset CodeActInstruct, consisting of 7k high-quality multi-turn interactions using CodeAct. 6/
1 reply · 1 repost · 10 likes
@xingyaow_
Xingyao Wang
1 year
LeTI focuses on code generation tasks where models produce code from natural language instructions. This allows us to acquire automatic textual feedback in a natural and scalable way: error messages and stack traces from a Python interpreter. 2/
1 reply · 0 reposts · 9 likes
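To make the feedback mechanism in the tweet above concrete: collecting textual feedback can be as simple as running the generated program and capturing the traceback. A sketch under that assumption (not the paper's exact harness):

```python
# Sketch: execute an LM-generated program and keep the error message /
# stack trace as textual feedback (illustrative, not LeTI's exact harness).
import traceback

def get_textual_feedback(program: str) -> tuple[bool, str]:
    """Return (passed, feedback) for a generated program."""
    try:
        exec(program, {})
        return True, "The program executed successfully."
    except Exception:
        return False, traceback.format_exc()

passed, feedback = get_textual_feedback("print(undefined_name)")
# passed == False; feedback contains the NameError traceback the LM can
# learn from, i.e., not just *that* it was wrong but *why*.
```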
@xingyaow_
Xingyao Wang
1 year
LeTI emulates the iterative learning process of human developers: learning by interacting with a programming environment to write, execute, and debug code, gradually becoming better at avoiding similar mistakes. LeTI enables LMs to undergo a similar cycle of improvement. 🔄💻 3/
1 reply · 0 reposts · 9 likes
@xingyaow_
Xingyao Wang
8 months
CodeAct stands out by (1) leveraging existing LLMs' pre-training on code data for cost-effective adoption, (2) inherently supporting complex operations through control and data flow, and (3) using extensive software packages for an expanded action space and automated feedback. 3/
2 replies · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
When trained and tested on the MBPP dataset (w/o post-processing heuristics), LeTI substantially improved the performance of base LMs, producing 63.2% more executable code for the 2B LM in 6 iterations. It achieves this without requiring any ground-truth outputs for training! 5/
2 replies · 0 reposts · 8 likes
@xingyaow_
Xingyao Wang
8 months
Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark M3ToolEval shows that CodeAct outperforms widely used alternatives like Text and JSON, achieving up to a 20% higher success rate. Please check our paper for a detailed analysis. 4/
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
Special thanks to my amazing collaborators and mentors @haopeng_nlp @Reyhaneh @elgreco_winter on this research! 9/
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
2️⃣ To assess the ability to leverage natural language feedback, we measure the performance gain when LLMs receive natural language feedback from GPT-4 (compare the above figure with and without the red dotted box).
1 reply · 0 reposts · 7 likes
@xingyaow_
Xingyao Wang
5 months
The conceptual idea is illustrated in the figure. At each turn, the agent can:
Converse: communicate with humans in natural language to ask for clarification, confirmation, etc.
CodeAct: choose to perform the task by executing code, including Linux bash commands or Python code.
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
8 months
Why CodeAct? Most existing LLM agents are limited by generating actions in JSON or text formats, constraining them to a narrow action space (e.g., pre-defined tools) with less flexibility (e.g., cannot compose multiple tools together). 2/
1 reply · 1 repost · 8 likes
@xingyaow_
Xingyao Wang
1 year
LeTI iteratively fine-tunes the LM on a concatenation of textual feedback, natural language instructions, and LM-generated programs. Prepended to this text, a binary reward token is used to differentiate correct and buggy solutions. 4/
1 reply · 0 reposts · 7 likes
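Assembling one such training example might look like the sketch below; the reward-token strings are illustrative, not the paper's actual special tokens.

```python
# Sketch of one LeTI fine-tuning example: a binary reward token prepended
# to the concatenation of feedback, instruction, and generated program.
# Token strings are illustrative, not the paper's actual vocabulary.
GOOD, BAD = "<|good|>", "<|bad|>"

def build_training_text(instruction: str, program: str,
                        feedback: str, passed: bool) -> str:
    reward_token = GOOD if passed else BAD
    return f"{reward_token}\n{feedback}\n{instruction}\n{program}"

example = build_training_text(
    instruction="Write a function that adds two numbers.",
    program="def add(a, b):\n    return a + b",
    feedback="The program executed successfully.",
    passed=True,
)
```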
@xingyaow_
Xingyao Wang
8 months
We finetune CodeActAgent from Llama2 and Mistral (7B) on a mixture of CodeActInstruct and general conversations. We show that CodeActInstruct can be used with existing conversation data to improve models in agent-oriented tasks without compromising their general capability. 7/
1 reply · 1 repost · 7 likes
@xingyaow_
Xingyao Wang
7 months
@JustinLin610 @MetaGPT_ Happy to contribute! We released our CodeAct framework, interface, instruction-tuning data, and model earlier -- hope it can be useful for OpenDevin :-)
@xingyaow_
Xingyao Wang
8 months
Beyond the {framework, data, model}, we've created a fully functional chat interface at . Huge thanks to the open-source community, including @huggingface for chat-ui, @ProjectJupyter for code executor, and many more for making this interface possible! 9/
(quoted tweet) 2 replies · 5 reposts · 22 likes
0 replies · 1 repost · 7 likes
@xingyaow_
Xingyao Wang
1 year
Furthermore, we find RLHF hurts LLM-tool multi-turn interaction on the LLaMA-2 series. However, it's unclear whether RLHF is problematic overall or whether it only hurts when applied to single-turn data (the case of LLaMA-2).
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
2 years
Special shout out to @david__jurgens, Joyce Chai, @DanielMRomero, and @radamihalcea. Thanks for taking a chance on me and helping me become the person I am today!
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
2 months
OpenDevin’s architecture consists of 3 main components: (1) an agent that produces actions, (2) a runtime that executes actions and generates observations, and (3) an event stream that connects the two. 3/
1 reply · 0 reposts · 6 likes
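A rough sketch of how those three components fit together (simplified; class and method names are illustrative, not OpenDevin's real API):

```python
# Simplified sketch of the agent / runtime / event-stream triangle.
from dataclasses import dataclass, field

@dataclass
class Event:
    kind: str     # "action" or "observation"
    content: str

@dataclass
class EventStream:
    events: list[Event] = field(default_factory=list)

    def add(self, event: Event) -> None:
        self.events.append(event)

def run(agent, runtime, stream: EventStream, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        action = agent.step(stream.events)             # (1) agent produces an action
        stream.add(Event("action", action))
        observation = runtime.execute(action)          # (2) runtime executes it
        stream.add(Event("observation", observation))  # (3) the stream connects the two
```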
@xingyaow_
Xingyao Wang
2 months
OpenDevin is a platform to build generalist AI agents that interact with the world similarly to human software developers: writing code, interacting with a command line, and browsing the web. 2/
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
5 months
In less than 2 months, we built a rough architecture with an agenthub that can support a variety of agent implementations: e.g., we have SWE-agent implemented, and other agents (e.g., CodeActAgent) can use a plugin that allows them to use the awesome bash tools from SWE-agent!
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
1 year
Despite being trained on MBPP problems, LeTI demonstrated comparable or even better performance on unseen code generation problems from the HumanEval dataset. 6/
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
6 months
@arankomatsuzaki @WenhuChen Sharing my two cents: I actually feel both long context & RAG + short context models can eventually work equally well if we devote enough resources to develop them. But they likely require different types of resources - an essential trade-off in developing advanced LLMs: - RAG
0 replies · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
5 months
We spent a lot of effort building a capable docker sandbox that the agent can SSH into and do all sorts of crazy stuff 😇
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
1 year
To solve a problem, the LLM can (1) use external tools via Python programs ('Execute' in the figure) and/or (2) collect natural language feedback to refine its solutions (Red dotted box in the figure); the feedback is provided by GPT-4, aiming to simulate a human user.
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
And here's the cherry on top: with a solution evaluator that provides feedback for any solution, LeTI can extend to natural language tasks! We demonstrated this adaptability by successfully applying LeTI to event argument extraction. 8/
1 reply · 0 reposts · 6 likes
@xingyaow_
Xingyao Wang
5 months
If you want to try out OpenDevin CodeAct 1.0 on your own projects, it’s easy! CodeAct 1.0 is now the default agent in OpenDevin v0.5, which you can download and use today:
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
Similar to the case of LLM-tool interaction, we find that SIFT and RLHF hurt models' ability to leverage feedback. The results on CodeLLaMA (except 7B), LLaMA-2, and Lemur-v1 show that SIFT/RLHF models all have lower Δfeedback and success rates compared to their base variants.
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
❗Surprisingly, Vicuna-v1.5 13B (trained on ShareGPT) performs worse than 7B! It produces escaped underscores "\_" that hurt performance (more severe on 13B). We can trace "\_" in ~15% of 94k ShareGPT data! A similar issue was noted in CodeLLaMA-Instruct; see paper for details.
1 reply · 0 reposts · 5 likes
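The tracing described above is straightforward to reproduce. A sketch, assuming the ShareGPT dump is a JSON list of conversations with "value" fields (the schema here is an assumption):

```python
# Sketch: count how many ShareGPT conversations contain an escaped
# underscore. Assumes a JSON list of {"conversations": [{"value": ...}]};
# the actual dump's schema may differ.
import json

with open("sharegpt.json") as f:
    data = json.load(f)

affected = sum(
    any("\\_" in turn["value"] for turn in conv["conversations"])
    for conv in data
)
print(f"{affected / len(data):.1%} of conversations contain an escaped '_'")
```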
@xingyaow_
Xingyao Wang
2 years
Special thanks also go to @jed_yang @bryant1410 for being a great mentor, and @skychwang @jed_yang @ziqiao_ma for huge help with my application 😉!
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
2 months
OpenDevin thrives thanks to our incredible community: 170+ contributors, 1,300+ contributions, and 28k+ stargazers. We're deeply grateful to everyone who's been part of this journey!🚀 11/
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
We find that task-solving ability could be orthogonal to feedback-providing ability: higher task-solving performance does not necessarily translate to better feedback-providing capability and vice versa.
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
MINT mirrors real-world user-LLM-tool interactions to evaluate two key LLM multi-turn capabilities: 1️⃣ Tool-augmented problem-solving 2️⃣ Ability to leverage natural language feedback
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
📊 Our Findings on Tool-augmented Task-Solving capabilities of LLMs We find all open-source models (only 4 are visualized) fall behind most commercial closed-source models in both success rate at k=5 and improvement rate (slope).
1 reply · 1 repost · 5 likes
@xingyaow_
Xingyao Wang
1 year
📊 Findings on LLMs' Ability to Leverage Natural Language Feedback We find no significant difference between open- and closed-source models in Δfeedback (performance gain due to feedback).
1 reply · 0 reposts · 5 likes
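Written out, the Δfeedback metric above is just a difference of success rates (a sketch of the definition, not MINT's evaluation harness):

```python
# Δfeedback: performance gain attributable to natural language feedback,
# i.e., success rate with feedback minus success rate without it.
def delta_feedback(success_with_feedback: float,
                   success_without_feedback: float) -> float:
    return success_with_feedback - success_without_feedback

print(delta_feedback(0.42, 0.35))  # ≈0.07, i.e., +7 points from feedback
```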
@xingyaow_
Xingyao Wang
1 year
1️⃣ For tool-augmented task-solving, we analyze how performance improves as the number of interaction turns with tools increases, all without language feedback (refer to the above figure without the red dotted box).
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
1 year
An interesting observation is that LeTI, by leveraging textual feedback, shows superior performance and sample efficiency compared to models that only use binary feedback. It achieves the same MBPP performance with fewer than half of the gradient steps on a 2B model! 7/
1 reply · 0 reposts · 5 likes
@xingyaow_
Xingyao Wang
2 months
Implementing an agent is pretty straightforward: you simply need to define the logic that converts a list of prior actions and observations into the next action to take! We've created a hub of community-contributed agents, including a generalist CodeActAgent and a browsing agent.
1 reply · 0 reposts · 4 likes
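That "history in, next action out" contract can be sketched as a tiny interface (names are illustrative, not OpenDevin's exact classes):

```python
# Sketch of the agent contract: history of events in, next action out.
from abc import ABC, abstractmethod

class Agent(ABC):
    @abstractmethod
    def step(self, history: list[dict]) -> str:
        """Convert prior actions/observations into the next action."""

class EchoAgent(Agent):
    """Toy agent: replies with the most recent observation."""
    def step(self, history: list[dict]) -> str:
        observations = [e for e in history if e["kind"] == "observation"]
        return observations[-1]["content"] if observations else "noop"
```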
@xingyaow_
Xingyao Wang
6 months
@OfirPress We observed something pretty similar (especially on the Claude series) in our prior work MINT. Interesting to see the bias towards single-turn capability (i.e., RAG SWE-Bench setup) still exists on Claude-3 🤔
@xingyaow_
Xingyao Wang
1 year
We often interact with Large Language Models (LLMs) like ChatGPT in multi-turn dialogues, yet we predominantly evaluate them with single-turn benchmarks. Bridging this gap, we introduce MINT, a new benchmark tailored for LLMs' multi-turn interactions. 🧵
(quoted tweet) 3 replies · 36 reposts · 167 likes
0 replies · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
2 years
Looking at some example predictions: Code4Struct can leverage implicit commonsense knowledge in LLMs to infer arguments not present in the text (e.g., United States, Court) in a zero-shot setting.
2 replies · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
📊 Findings on LLMs' Ability to *Provide* Natural Language Feedback We can assess LLMs' effectiveness as feedback providers by using different LLMs to provide feedback to a fixed LLM (gpt-3.5-turbo-0613).
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
We are working on dockerizing and integrating SWE-Bench into OpenDevin -- in the ideal world, you can just define an agent with the OpenDevin agent abstraction (e.g., implement a step function mapping state to action) and have everything evaluated by OpenDevin for you.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
🛠️We evaluate 20 LLMs, where 4 are closed- and 16 are open-source. We cover different sizes and three training techniques: 1️⃣pre-trained model (Base) 2️⃣supervised instruction-finetuning (SIFT) 3️⃣reinforcement learning from human feedback (RLHF).
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
The current version of the harness is available here: Feel free to subscribe to our PR for the latest update:
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
Comment below and let us know what you hope to use OpenDevin for (e.g., web development, training ML models, data analysis, etc.) - it helps us create a better roadmap!
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
2 months
These actions are so powerful by themselves that you can have an agent interact with them directly (CodeActAgent) or use them to create tools (e.g., use Python to write a calculator function) and have an LLM call them (e.g., via function calling). 6/
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
SIFT on multi-turn data can potentially be helpful. Vicuna-v1.5 (7B), which is a SIFT variant of LLaMA-2 trained on ShareGPT conversations (most are multi-turn), exhibits stronger performance compared to LLaMA-2 (Base and RLHF).
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
For example, GPT-3.5 excelled in task-solving but struggled with self-feedback. On the other hand, CodeLLaMA-34B-Instruct (SIFT), despite performing the poorest in task-solving (-19% difference vs. GPT-3.5), can still provide feedback that improves the stronger GPT-3.5.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
We repurpose a diverse set of established datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
1 year
🚀 We hope MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation has been less accessible compared to commercial LLMs.
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
6 months
@OfirPress Thanks for the great work!! Any idea why? Would love to see any error analysis comparing these two models in the paper :) Could prompting be related to this? i.e., most prompts/instructions optimized for GPT-4 to follow instructions might not work well for Claude; this issue
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
5 months
Huge shout out to @BowenLi2121 for his amazing work on standardizing & containerizing SWE-Bench ❤️
1 reply · 0 reposts · 4 likes
@xingyaow_
Xingyao Wang
2 months
Executing these actions is challenging! To ensure arbitrary code execution does not blow up your laptop 💥, we use a docker sandbox to execute all bash commands and IPython code and stream execution results back to the agent. 8/
2 replies · 0 reposts · 3 likes
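The shape of that sandbox is roughly the following (a sketch using the docker-py SDK; OpenDevin's real sandbox is stateful and considerably more involved):

```python
# Sketch: run an agent's bash command in a throwaway Docker container and
# return the output. Uses the docker-py SDK; OpenDevin's sandbox is stateful
# and more elaborate (SSH sessions, mounted workspaces, etc.).
import docker

client = docker.from_env()

def run_in_sandbox(command: str) -> str:
    output = client.containers.run(
        image="python:3.11-slim",
        command=["bash", "-lc", command],
        remove=True,            # discard the container afterwards
        network_disabled=True,  # limit the blast radius of arbitrary code
    )
    return output.decode()

print(run_in_sandbox("echo hello from the sandbox"))
```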
@xingyaow_
Xingyao Wang
5 months
@casper_hansen_ Great idea!! That’s something we are actually thinking about doing — but we just got too much stuff on our priority list now 😂 hopefully we can do this in the next 3 months to greatly improve evals
2 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
5 months
Always excited to chat about code, AI agents, OpenDevin, learning from interactions, LLMs, and more. Looking forward to meeting with familiar and new faces at ICLR! 😀
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
5 months
Stay tuned! - or even better - join our community and contribute your ideas & code to the future of open AI agents 😉 👀 Get started by looking at good first issues:
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
2 months
Inspired by CodeAct, OpenDevin connects an agent with the environment through a core set of programming-language-based general actions: (1) run arbitrary bash commands in a stateful SSH session, (2) execute Python code in an interactive Jupyter environment, and (3) browse the web.
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
5 months
To reproduce the demo video, install from our main branch following , select CodeActAgent as the agent and gpt-4-turbo-2024-04-09 as the model. We will try to iron out a few details and include this in the next release so you can use it easily with one
1 reply · 1 repost · 3 likes
@xingyaow_
Xingyao Wang
5 months
We implemented a few essential actions: (1) bash, (2) IPython execution (as you see in the video), and (3) browser (WIP). Check the complete list of actions here:
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
1 year
@arankomatsuzaki Thank you for introducing LeTI! For a deeper dive into our research, check out our Twitter thread below. 👇
@xingyaow_
Xingyao Wang
1 year
Can pretrained language models (LMs) go beyond learning from labels and scalar rewards? Introducing LeTI, a new LM finetuning paradigm that explores LMs' potential to learn from textual interactions & feedback, allowing LMs to understand not just if they were wrong, but why. 🧵1/
(quoted tweet) 2 replies · 36 reposts · 188 likes
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
2 months
We packed several commonly used tools (mainly for code editing) into an agent skill library (a Python package). Because they are just software, we write unit tests to ensure these tools stay reliable and useful as we evolve and improve the framework. 7/
1 reply · 0 reposts · 3 likes
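Since skills are ordinary functions, they can be tested like any other code. A sketch with a hypothetical editing skill (not taken from the actual skill library):

```python
# Hypothetical editing skill plus a pytest-style unit test; illustrative,
# not code from the real agent skill package.
def replace_line(path: str, line_no: int, new_text: str) -> None:
    """Replace one line (1-indexed) of a file -- a typical editing skill."""
    with open(path) as f:
        lines = f.readlines()
    lines[line_no - 1] = new_text + "\n"
    with open(path, "w") as f:
        f.writelines(lines)

def test_replace_line(tmp_path):
    target = tmp_path / "demo.txt"
    target.write_text("a\nb\nc\n")
    replace_line(str(target), 2, "B")
    assert target.read_text() == "a\nB\nc\n"
```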
@xingyaow_
Xingyao Wang
2 years
Pretraining on code could also grant the language model the ability to handle relations better, as shown by the superior performance of Codex + text prompt (7.6% absolute F1 better on Arg-C) compared to GPT-3 + text prompt.
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
2 months
As AI agents tackle increasingly complex problems, their evaluation has also become challenging, especially for generalist agents. To track our progress toward a generalist, we integrated 15 benchmarks covering software engineering, web browsing, and miscellaneous tasks.
1 reply · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
1 year
@sivil_taram Thanks Qian! I'm excited and very impressed to see the open-source community quickly catch up with these commercial models in multi-turn interaction. 😁
0 replies · 0 reposts · 3 likes
@xingyaow_
Xingyao Wang
6 months
@_akhaliq Thanks @_akhaliq for sharing our work! Feel free to check out our thread for a quick overview of the work :)!
@lifan__yuan
Lifan Yuan
6 months
Introducing 🚀Eurus, a suite of state-of-the-art LLM reasoning generalists powered by a new member of the Ultra-Series, UltraInteract🎉! Notably, Eurus-70B beats GPT-3.5 Turbo in reasoning, as shown by comprehensive benchmarking across 12 tests (mostly OOD) covering five tasks!
(quoted tweet) 7 replies · 63 reposts · 318 likes
0 replies · 1 repost · 3 likes
@xingyaow_
Xingyao Wang
3 years
We disclosed to the Imgur community that the account was actually doing this gif reply experiment in a general-public science write-up (part of their Science Week), and the reaction was very positive. Not all online experiments go poorly 😅
1 reply · 0 reposts · 1 like
@xingyaow_
Xingyao Wang
2 years
We showcase our proposed Code4Struct on the Event Argument Extraction (EAE) task, which aims to extract event structures from texts. Given an event definition and a sentence, we prompt Codex to generate code to instantiate the given event class (e.g., Transport).
1 reply · 0 reposts · 2 likes
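To make the idea concrete, here is what a Code4Struct-style target might look like (the class, roles, and example sentence are invented for illustration; the paper's ontology classes differ in detail):

```python
# Illustrative only: event types become Python classes, and extraction is
# performed by asking the code LLM to instantiate the class for a sentence.
from dataclasses import dataclass, field

@dataclass
class Transport:
    """Movement event: an agent moves an artifact between places."""
    agent: list[str] = field(default_factory=list)
    artifact: list[str] = field(default_factory=list)
    origin: list[str] = field(default_factory=list)
    destination: list[str] = field(default_factory=list)

# For a sentence like "Kelly flew the hostages from Baghdad to Amman",
# the model would be prompted to emit:
transport_event = Transport(
    agent=["Kelly"],
    artifact=["the hostages"],
    origin=["Baghdad"],
    destination=["Amman"],
)
```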
@xingyaow_
Xingyao Wang
5 months
We are also working on a new simplified evaluation harness for testing coding agents on SWE-Bench, which we hope will be easy to use for agent developers and researchers, facilitating comprehensive evaluation and comparison.
1 reply · 0 reposts · 2 likes
@xingyaow_
Xingyao Wang
3 years
Our solution (Pepe the King Prawn) goes further and uses an OSCAR encoder that fuses image regions, extracted object types, and the gif caption into a single representation to compare with a message’s embedding from a RoBERTa model (plus fancy training stuff) to pick a reply.
1 reply · 0 reposts · 1 like