wzhao_nlp Profile Banner
Wenting Zhao Profile
Wenting Zhao

@wzhao_nlp

Followers
2K
Following
201
Media
22
Statuses
297

PhD student @cornell_tech NLP + AI opinions are my own

NYC
Joined June 2013
@wzhao_nlp
Wenting Zhao
5 months
Introducing the commit0 interactive environment for coding agents. Challenge: generate Python libraries from scratch. Commit0 is designed with interactivity, dependencies, and specifications as first-class considerations. We include a benchmark with 50+ challenging libraries.
13
58
364
@wzhao_nlp
Wenting Zhao
1 year
Hi friends, for a limited time, our WildChat project will provide free access to GPT-4 Turbo. Any data we collect during this period will be given back to the community 😊. Paper: Thanks @ai2_mosaic for the generous support.
5
21
115
@wzhao_nlp
Wenting Zhao
1 year
I'm happy to share that WildChat has been accepted as a spotlight paper at #ICLR2024! Since its release, the dataset has been able to support so much research, such as multi-turn conversation evaluation, cultural analysis, etc. We will release more data soon, stay tuned 💙.
@yuntiandeng
Yuntian Deng
1 year
WildChat dataset is out @ai2_mosaic 🚀 Explore 650K user-ChatGPT interactions in the wild 🔗 A huge shoutout to the team @wzhao_nlp @xiangrenNLP @jmhessel @clairecardie @YejinChoinka. Fun fact: The ChatGPT/GPT-4 chatbot often thought it was GPT-3 🤣
Tweet media one
2
12
106
@wzhao_nlp
Wenting Zhao
9 months
WildChat-1M is now the #1 trending dataset on HF 🙀
Tweet media one
14
17
98
@wzhao_nlp
Wenting Zhao
3 months
I'm at #EMNLP2024 this week presenting our work on reformulating unanswerable questions (Nov 12, 16:00-17:30). These days, I think about how to use formal tools and harder evals to get LMs closer to intelligence. I'm also on the faculty job market for 2024-2025! Please come say hi!
Tweet media one
4
12
97
@wzhao_nlp
Wenting Zhao
6 months
Can you really train a smart LLM without copyrighted material? There has been hope that a small LM + retrieval might circumvent data requirements. We think this approach is a bit of a mirage: it only improves performance on simple tasks and hurts reasoning capabilities.
6
12
89
@wzhao_nlp
Wenting Zhao
2 months
Eval platforms like Chatbot Arena attract users to provide preference votes. But what are these users' incentives? Are they apathetic, or are they adversarial, just aiming to inflate their model's rankings? We show that 10% adversarial votes change the model rankings by a lot!
Tweet media one
3
18
89
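As a toy illustration of the mechanism in that tweet (my own numbers and function name, not the paper's experiment), a simple vote-count ranking shows how a 10% adversarial bloc can flip a close pairwise ranking:

```python
def rankings(n_votes, honest_a_share, adversarial_frac):
    """Toy two-model preference poll (illustrative, not the paper's setup).
    Honest voters prefer model A with rate honest_a_share; adversarial
    voters always vote for model B to inflate its ranking."""
    honest_votes = n_votes * (1 - adversarial_frac)
    a = honest_votes * honest_a_share
    b = honest_votes * (1 - honest_a_share) + n_votes * adversarial_frac
    # rank by vote count, highest first
    return sorted([("A", a), ("B", b)], key=lambda kv: -kv[1])

print(rankings(10_000, 0.54, 0.00))  # honest only: A leads 5400 to 4600
print(rankings(10_000, 0.54, 0.10))  # 10% adversarial: B overtakes A, 5140 to 4860
```

With an honest margin of 54/46, a 10% bloc that always votes for B is enough to reverse the order, which is the qualitative effect the tweet describes.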
@wzhao_nlp
Wenting Zhao
1 month
📣 Announcing VerifAI: AI Verification in the Wild, a workshop at #ICLR2025. VerifAI will gather researchers to explore topics at the intersection of genAI/trustworthyML and verification: @celine_ylee @theo_olausson @ameeshsh @wellecks @taoyds
Tweet media one
0
23
86
@wzhao_nlp
Wenting Zhao
9 months
Thanks for featuring WildChat! I'm really excited about our new release, which includes geographic locations, hashed IP addresses, and HTTP request headers. Thanks @yuntiandeng for being the driving force behind all of this! Dataset can be downloaded at
@_akhaliq
AK
9 months
WildChat. 1M ChatGPT Interaction Logs in the Wild. Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this
Tweet media one
4
12
74
@wzhao_nlp
Wenting Zhao
7 months
Reviewing for ARR feels so crazy to me 🥲
1. Two papers in my batch propose exactly the same benchmark.
2. Three days after the deadline, I'm still the first one to submit my reviews.
3. All the papers are weak accepts; excitement is low, but there are no major flaws to point out.
5
3
59
@wzhao_nlp
Wenting Zhao
2 months
Super excited to see WildChat has powered so many new research directions 🥰 (Although I am a little annoyed that Anthropic made cooler figures for WildChat than I did.)
Tweet media one
@AnthropicAI
Anthropic
2 months
New Anthropic research: How are people using AI systems in the real world? We present a new system, Clio, that automatically identifies trends in Claude usage across the world.
Tweet media one
1
1
59
@wzhao_nlp
Wenting Zhao
2 months
Commit0 💻 Leaderboard Update 🔥
OpenHands @allhands_ai solves two Python libraries entirely from scratch by using interactive test feedback 🚀
The best agent on Commit0 (split: all) jumps from 6% to 15%!
3
6
54
@wzhao_nlp
Wenting Zhao
1 year
shout-out to my amazing mentors @alsuhr @xiang_lorraine. I would do another PhD if they were my advisors 🫶
@allen_ai
Ai2
1 year
We're delighted to extend hearty congratulations to the AI2 Outstanding Intern of the Year recipients for 2023! Henry Herzog, Wenting Zhao (@wzhao_nlp), and Yue Yang (@YueYangAI) made exceptional contributions to the team. Learn more about our winners:
Tweet media one
5
1
49
@wzhao_nlp
Wenting Zhao
10 months
Wow! Llama 3 absolutely blows my mind! The retrieval component works so well, and the information is up to date. Meta must have incorporated retrieval in the training to get this level of performance. @yuntiandeng Sorry for picking on you lol
Tweet media one
4
3
54
@wzhao_nlp
Wenting Zhao
6 months
How often do SoTA LLMs generate incorrect information on user-queried topics? To find out, we had LLMs produce information on entities extracted from WildChat. We think that accurate knowledge of these entities is necessary for correctly answering related user queries.
2
6
52
@wzhao_nlp
Wenting Zhao
2 years
Heading to #ACL2023 🚀 My collaborators @megamor2 @billyuchenlin @michiyasunaga @aman_madaan @taoyds and I will be presenting a cutting-edge tutorial on Complex Reasoning in Natural Language, diving into recent methods for accurate, robust & trustworthy reasoning systems 🤖 1/2
2
11
49
@wzhao_nlp
Wenting Zhao
4 months
I'll be in Philly for COLM 🚀 These days, I think about reasoning, evaluation, the role of scaling, and commit0! If you're also interested in these topics, let's chat:
1
3
44
@wzhao_nlp
Wenting Zhao
5 months
It took me a good five years to realize these tips. I wish I had read this in my first year.
@lateinteraction
Omar Khattab
5 months
🔗 Thoughts on Research Impact in AI. Grad students often ask: how do I do research that makes a difference in the current, crowded AI space? This blogpost summarizes my perspective in six guidelines for making research impact via open-source artifacts. Link below.
Tweet media one
0
1
42
@wzhao_nlp
Wenting Zhao
3 months
Sasha's talk makes informed speculations, but I want to give a version that is all my baseless spicy takes on superintelligence reasoning 😌
@srush_nlp
Sasha Rush
3 months
Working Talk: Speculations on Test-Time Scaling. Slides: (A lot of smart folks are thinking about this topic, so I would love any feedback.) Public Lecture (tomorrow):
Tweet media one
Tweet media two
Tweet media three
Tweet media four
3
2
40
@wzhao_nlp
Wenting Zhao
2 months
+1, this is absolutely essential, and I am happy to see we are finally thinking about infra to support this. Loading and offloading model weights has been a huge bottleneck (like the code below, which I wrote for a self-training project).
Tweet media one
@natolambert
Nathan Lambert
2 months
Very real discussions going down on how to improve open source infra for RLHF / post train. Randomly bumped into @johnschulman2 in the comments. Good time to speak up if you're interested. Thanks folks @vllm_project for being proactive here too. TLDR quickly syncing the weights
Tweet media one
1
1
40
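The bottleneck being discussed is the weight hand-off between the training and generation phases on the same node. A minimal toy sketch of the two patterns (my own illustration with hypothetical helper names, not any specific framework's API):

```python
import io
import pickle

def offload_reload(state_dict):
    """Slow path: serialize the weights out and load them back, a toy
    stand-in for checkpointing between the generation phase and the
    training phase when both run on the same node."""
    buf = io.BytesIO()
    pickle.dump(state_dict, buf)
    buf.seek(0)
    return pickle.load(buf)

def sync_in_place(trainer_weights, generator_weights):
    """Fast path: keep both copies resident and copy updated weights
    directly into the generation engine, skipping the round-trip."""
    generator_weights.update(trainer_weights)
    return generator_weights

trainer = {"layer0.weight": [0.1, 0.2], "layer0.bias": [0.0]}
generator = {"layer0.weight": [0.0, 0.0], "layer0.bias": [0.0]}

reloaded = offload_reload(trainer)          # pays the serialization cost
synced = sync_in_place(trainer, generator)  # pure in-memory copy
```

The real costs involve GPU memory and multi-GB tensors rather than pickled dicts, but the trade-off is the same shape: repeated serialize/deserialize round-trips versus keeping both copies resident and syncing in place.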
@wzhao_nlp
Wenting Zhao
5 months
Join us in developing commit0! We believe commit0 will push the limits of transformer models and language agents.
Install: pip install commit0
Doc:
GitHub:
Leaderboard + analysis:
1
3
34
@wzhao_nlp
Wenting Zhao
10 months
Looking at how users interact with chatbots in the wild is so fun. I could probably generate 100 research ideas just by browsing these real-world conversations 🔥 In case you're also curious: (Fixed the previous link, haha, didn't expect so much interest!)
2
3
30
@wzhao_nlp
Wenting Zhao
1 year
I am presenting a few cool projects at #EMNLP2023!!!
1. "Symbolic Planning and Code Generation for Grounded Dialogue" (SPC), led by @justintchiu. He has the best vision for human-robot optimal communication, and he's currently on the job market! He's my no. 1 go-to collaborator 😍
1
4
30
@wzhao_nlp
Wenting Zhao
1 year
The NLP community has created many instruction fine-tuning datasets by imagining what the use cases of chatbots might be, but what do users really ask them to do? Check out WildChat to find out 😍
@yuntiandeng
Yuntian Deng
1 year
WildChat dataset is out @ai2_mosaic 🚀 Explore 650K user-ChatGPT interactions in the wild 🔗 A huge shoutout to the team @wzhao_nlp @xiangrenNLP @jmhessel @clairecardie @YejinChoinka. Fun fact: The ChatGPT/GPT-4 chatbot often thought it was GPT-3 🤣
Tweet media one
0
2
29
@wzhao_nlp
Wenting Zhao
4 months
holy ++ 🩷
@m2saxon
Michael Saxon (on academic job market 🫡)
4 months
We were brought to a holy site
Tweet media one
Tweet media two
1
0
28
@wzhao_nlp
Wenting Zhao
5 months
We release a new AI coding benchmark where models are asked to produce Python libraries from scratch, given a specification document and unit tests. commit0 contains 50+ popular Python libraries, from deprecated to networkx.
Tweet media one
2
0
26
@wzhao_nlp
Wenting Zhao
2 years
I will also be presenting our paper (w/ @justintchiu @clairecardie @srush_nlp) "Abductive Commonsense Reasoning Exploiting Mutually Exclusive Explanations", diving into unsupervised abductive reasoning that leverages the mutual exclusivity of explanations. Eager to chat 💡
0
2
24
@wzhao_nlp
Wenting Zhao
10 months
Personally, I believe in data, but it would be more fun if I could be proven wrong by architecture, optimization, and other learning algorithm researchers.
@mattshumer_
Matt Shumer
10 months
The dataset is everything. Great read:
Tweet media one
2
0
20
@wzhao_nlp
Wenting Zhao
5 months
Thanks to @modal_labs for their generous support of our project. Their service enables us to auto-scale to run any number of unit tests simultaneously, hence our interactivity.
@wzhao_nlp
Wenting Zhao
5 months
Introducing the commit0 interactive environment for coding agents. Challenge: generate Python libraries from scratch. Commit0 is designed with interactivity, dependencies, and specifications as first-class considerations. We include a benchmark with 50+ challenging libraries.
0
4
22
@wzhao_nlp
Wenting Zhao
5 months
Commit0 provides signal through unit test execution feedback, including both error messages and locations. It also offers comprehensive linting feedback including type checking. Finally, it keeps track of agents' coding progress to allow future backtracking.
Tweet media one
1
0
20
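As a rough sketch of what "error messages and locations" as feedback can look like, here is a hypothetical parser of my own (not commit0's actual implementation) that turns pytest-style failure lines into structured (file, line, message) signals an agent could act on:

```python
import re

def parse_pytest_failures(output):
    """Toy illustration only: extract (file, line, message) triples from
    pytest-style failure lines such as
    'tests/test_core.py:42: AssertionError: bad value'."""
    pattern = re.compile(r"^(?P<file>\S+\.py):(?P<line>\d+): (?P<msg>.+)$")
    failures = []
    for raw in output.splitlines():
        m = pattern.match(raw.strip())
        if m:
            failures.append((m["file"], int(m["line"]), m["msg"]))
    return failures

sample = """
tests/test_core.py:42: AssertionError: bad value
tests/test_io.py:7: TypeError: expected str
"""
print(parse_pytest_failures(sample))
```

Structured triples like these are what make the feedback actionable: the agent knows not just that a test failed, but which file and line to revisit.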
@wzhao_nlp
Wenting Zhao
5 months
Thanks to my awesome collaborators: @nanjiangwill.@celine_ylee @justintchiu @clairecardie @srush_nlp. Nan is an amazing undergrad who has brilliant ideas and gets them executed! He'll be an amazing PhD student. Recruit him to join your lab 😍.
1
2
19
@wzhao_nlp
Wenting Zhao
9 months
@tianle_cai Spam is such a great part of web text.
1
1
19
@wzhao_nlp
Wenting Zhao
1 year
Awww, thanks Jack! ♥️ I'm at EMNLP, happy to chat about reasoning, evaluation, imitation learning, and anything about graduate school 😍
@jxmnop
jack morris
1 year
here are two awesome researchers you should follow: @WeijiaShi2 at UW and @wzhao_nlp at Cornell!! some of their recent work:
weijia shi (@WeijiaShi2):
- built INSTRUCTOR, the embedding model that lots of startups / companies use
- proposed a more.
0
0
18
@wzhao_nlp
Wenting Zhao
6 months
In our new paper, we evaluate kNN-LMs with a few state-of-the-art base models + different datastores. We find that kNN-LMs do better on simple tasks like sentiment classification and entailment but perform worse on multi-hop and math reasoning tasks.
Tweet media one
1
2
16
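For readers unfamiliar with the setup: a kNN-LM interpolates the base LM's next-token distribution with a distribution built from nearest neighbors retrieved from a datastore. A minimal sketch in the style of Khandelwal et al. (my own toy numbers and function name, not the paper's code):

```python
import math

def knn_lm_prob(p_lm, neighbors, vocab, lam=0.25, temperature=1.0):
    """Toy kNN-LM interpolation. p_lm: dict token -> base LM probability.
    neighbors: (token, distance) pairs retrieved from the datastore.
    Returns lam * p_knn + (1 - lam) * p_lm over the vocabulary."""
    # softmax over negative distances gives the kNN distribution
    weights = [math.exp(-d / temperature) for _, d in neighbors]
    z = sum(weights)
    p_knn = {t: 0.0 for t in vocab}
    for (tok, _), w in zip(neighbors, weights):
        p_knn[tok] += w / z
    return {t: lam * p_knn[t] + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}

vocab = ["paris", "london"]
p_lm = {"paris": 0.6, "london": 0.4}
neighbors = [("paris", 1.0), ("paris", 2.0), ("london", 5.0)]
p = knn_lm_prob(p_lm, neighbors, vocab)
```

The interpolation sharpens predictions when close neighbors agree, which helps pattern-matching tasks; the finding in the tweet is that this does not carry over to multi-hop or math reasoning.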
@wzhao_nlp
Wenting Zhao
2 months
Big updates on the commit0 leaderboard are coming! Spoiler alert: OpenHands is really awesome 🚀
@allhands_ai
All Hands AI
2 months
Introducing OpenHands 0.14.3 🙌 Exciting updates:
1. One-button click to push to GitHub.
2. Incorporation of the commit0 benchmark for building apps from scratch (thanks @wzhao_nlp -- benchmark results soon!).
3. Many other smaller robustness improvements
Tweet media one
0
2
17
@wzhao_nlp
Wenting Zhao
5 months
Commit0 is a challenging benchmark. First, the specification given to the model is often hundreds of pages. The specification also contains both text and figures; while understanding the figures is not strictly necessary, humans use them to obtain a holistic view of the library.
Tweet media one
1
0
16
@wzhao_nlp
Wenting Zhao
3 months
Commit0 will release interactive testing (per unit test), linting, and coverage feedback for SWE-bench in the coming weeks. And we are going fully open-source! What other features would you like? 🤔
Tweet media one
0
2
15
@wzhao_nlp
Wenting Zhao
5 months
On commit0 Lite (an easy split of 10 libraries out of 56), our baseline agent that leverages unit test feedback can occasionally pass a portion of the unit tests. However, the overall pass rate is only ~30%. On commit0 all, the overall pass rate drops close to 0%.
Tweet media one
1
0
15
@wzhao_nlp
Wenting Zhao
5 months
The last challenge is the actual library generation. Large Python libraries are quite long and contain up to thousands of modules. The files also have import dependencies: generating some modules requires identifying which other modules to condition on.
1
0
14
@wzhao_nlp
Wenting Zhao
3 months
Dear @allhands_ai developers: we would like to get OpenHands to do library generation. Would you suggest just kicking off the agent with the prompt "can you complete all function implementations", or explicitly instructing the agent to fill in file by file? @xingyaow_ @gneubig
3
0
14
@wzhao_nlp
Wenting Zhao
18 days
tbh we haven't been able to use any of the models launched by Google due to rate limits, and they never get back to us about it. we were able to run DeepSeek, Mistral, o1, etc. and observe reliable improvements over past models. Is it only us who find Google models so hard to use?
@_akhaliq
AK
18 days
Google ships.
4
1
14
@wzhao_nlp
Wenting Zhao
6 months
Also, researchers focus exclusively on perplexity as an indicator of downstream performance. Does this hold in the non-parametric world? We question this assumption and show that although kNN-LMs achieve better in-domain perplexity, this doesn't translate into better reasoning.
2
1
12
@wzhao_nlp
Wenting Zhao
9 months
try out GPT-4o for free now.
@yuntiandeng
Yuntian Deng
9 months
GPT-4o is so fast! 🚀 We've updated the free chatbot behind WildChat to use GPT-4o. Try it out here:
0
1
12
@wzhao_nlp
Wenting Zhao
3 months
Xi is an amazing mentor! So knowledgeable and caring.
@xiye_nlp
Xi Ye
3 months
🔔 I'm recruiting multiple fully funded MSc/PhD students @UAlberta for Fall 2025! Join my lab working on NLP, especially reasoning and interpretability (see my website for more details about my research). Apply by December 15!
0
0
12
@wzhao_nlp
Wenting Zhao
9 months
🔥
@yuntiandeng
Yuntian Deng
9 months
If you're at #ICLR, come to our poster on WildChat tomorrow afternoon! We collected 1 million user-ChatGPT conversations with geographic info. Presented by @wzhao_nlp and me.
📅 Time: Thursday, May 9, 4:30 PM - 6:30 PM CEST
📍 Location: Halle B #239
🔗
Tweet media one
1
0
11
@wzhao_nlp
Wenting Zhao
6 months
Paper: Code: This work was led by our intern @GengShangyi. Congrats on the great work!
1
3
11
@wzhao_nlp
Wenting Zhao
1 year
Come meet and chat with your fellow reasoning researchers ☺️.
@YiyuanLi1
Yiyuan Li @Summer
1 year
Overwhelmed by the massive body of work on reasoning but still feeling a lack of nutrition? @wzhao_nlp and I are hosting a BoF session on "Recipes in Building Language Reasoners" at #EMNLP2023 in Singapore @emnlpmeeting. 1/3
Tweet media one
0
1
10
@wzhao_nlp
Wenting Zhao
2 months
For those who're interested in what's happening after WildChat, I think this is a super cool project 😊.
@ShayneRedford
Shayne Longpre
2 months
Interested in how LLMs are really used?. We are starting a research project to find out! In collaboration w/ @sarahookr @AnkaReuel @ahmetustun89 @niloofar_mire and others. We are looking for two junior researchers to join us. Apply by Dec 15th!.
0
0
10
@wzhao_nlp
Wenting Zhao
5 months
this tangible, actionable roadmap for the open human feedback ecosystem is my favorite outcome from this project. now is the perfect time to work on these problems.
@LChoshen
Leshem Choshen πŸ€–πŸ€—
5 months
Then, we hone in on 6 crucial areas to develop open human feedback ecosystems: incentives to contribute, reducing contribution effort, getting expert and diverse feedback, ongoing dynamic feedback, and privacy and legal issues.
Tweet media one
2
0
9
@wzhao_nlp
Wenting Zhao
2 months
Is your agent good at coding? Show it on commit0! 😈
Install: pip install commit0
Doc:
GitHub:
Leaderboard (includes detailed analysis and generated code):
Preprint:
0
2
9
@wzhao_nlp
Wenting Zhao
6 months
Congrats on the new position! I really love working with Joe. I've always learned so much from our collaboration. As a mentor, he is also so knowledgeable and so caring ☺️
@jzhou_jz
Jiawei (Joe) Zhou
6 months
Thanks Sasha! And yes, I am moving to Stony Brook University this fall. Lmk if you are interested in doing a PhD or research with a strong technical background and interests in the crazy NLP and related (multimodal, ML, etc.) areas.
1
0
9
@wzhao_nlp
Wenting Zhao
3 months
0
0
8
@wzhao_nlp
Wenting Zhao
5 months
@yoavartzi I vote for COLM-0. lol.
0
0
7
@wzhao_nlp
Wenting Zhao
6 months
(2) Small LLMs with retrieval hallucinate more than large LLMs. So, empirically, retrieval helps to an extent, but the room for improvement is large. The most prominent source of RAG errors was genuine generation errors by the base models, even when provided with correct documents.
1
0
7
@wzhao_nlp
Wenting Zhao
6 months
Our findings: (1) LLMs hallucinate more on entities without associated Wikipedia pages than on those with available Wikipedia pages.
Tweet media one
1
2
7
@wzhao_nlp
Wenting Zhao
7 months
such fun papers to read.
@jacobandreas
Jacob Andreas
7 months
Hi ICML! We're presenting papers this week:
- understanding how LMs learn new langs in-context
- agents for automated interpretability research:
- calibrating LMs by marginalizing over missing context:
0
0
7
@wzhao_nlp
Wenting Zhao
1 year
@ABosselut I don't know about others, but our second Workshop on Natural Language Reasoning and Structured Explanations will be held at ACL'24 (shameless self-promotion 🤣)
0
0
7
@wzhao_nlp
Wenting Zhao
9 months
@ericzelikman we spent so much time thinking about STaR 🤗
@justintchiu
Justin T Chiu
9 months
wrote a short blog post on the latent variable model behind STaR:
0
0
6
@wzhao_nlp
Wenting Zhao
4 days
@gneubig This terminology really bugs me, and we argue a lot when writing papers. To me *S*FT implies FT on human annotations.
0
0
6
@wzhao_nlp
Wenting Zhao
2 months
Getting high-quality annotations for free-text generations has always been tricky. Open community platforms like Chatbot Arena provide large-scale, rich data, making now a perfect time to study this topic! Check out our new preprint:
1
1
6
@wzhao_nlp
Wenting Zhao
1 year
love LLMs as productivity tools.
@karpathy
Andrej Karpathy
1 year
ChatGPT "Advanced Data Analysis" (which doesn't really have anything to do with data specifically) is an awesome tool for creating diagrams. I could probably code these diagrams myself, but it's soo much better to just sit back, and iterate in English. In this example, I was
Tweet media one
Tweet media two
0
0
5
@wzhao_nlp
Wenting Zhao
3 months
@allhands_ai @xingyaow_ @gneubig our current agent design wraps aider to do each of the following steps. What I am unsure about is whether we should just replace aider with openhands, or whether you are doing this in an entirely different way?
Tweet media one
1
0
5
@wzhao_nlp
Wenting Zhao
1 year
Thanks to all my awesome collaborators: @justintchiu @saujasv @dan_fried @derek @derekchen14 and my most supportive, helpful, visionary advisors: @clairecardie @srush_nlp.
1
0
5
@wzhao_nlp
Wenting Zhao
1 year
@shauryr This is a standard student F1 visa :)
0
0
5
@wzhao_nlp
Wenting Zhao
5 months
@Samhanknr Even with memorization, and even though this data is included in pretraining, models do not do well :)
1
0
4
@wzhao_nlp
Wenting Zhao
1 year
2. "HOP, UNION, GENERATE (HUG): Explainable Multi-hop Reasoning without Rationale Supervision". RAG is used for retrieval-augmented generation, but it retrieves documents independently. By explicitly modeling the multi-hop relationship between documents, HUG beats RAG!
1
0
5
@wzhao_nlp
Wenting Zhao
6 months
@WilliamWangNLP I can echo this feeling. I felt very much excluded from the community after going over the slides.
0
0
4
@wzhao_nlp
Wenting Zhao
1 year
3. "MiniChain: A Small Library for Large LMs". @srush_nlp made a really lightweight and easy-to-use library that lets people quickly prototype software with LLMs in the loop. @justintchiu uses minichain to build his SPC dialogue system 🚀
1
0
4
@wzhao_nlp
Wenting Zhao
5 months
@srush_nlp @paul_cal Using modal for simple script execution is very cheap. Running unit tests for our 50+ repos only costs 1 dollar; compared to LM API calls, that is a tiny fraction. Modal also gives $30 of free credit every month, which already goes a long way.
1
0
4
@wzhao_nlp
Wenting Zhao
6 months
Here are example entities that we extracted from WildChat. And wow! I'm truly fascinated by the breadth of topics users talk about.
Tweet media one
1
0
4
@wzhao_nlp
Wenting Zhao
8 months
@sarahwiegreffe @jeremyphoward @aryaman2020 you're my favorite researcher in mech interp 😄
2
0
2
@wzhao_nlp
Wenting Zhao
7 months
check it out! from now on, I don't need to memorize all the numbers, cuz I have the language and tools to calculate them 🤗
@srush_nlp
Sasha Rush
7 months
Tutorial: Street Fighting Transformers 🥷 A bottleneck for young CS researchers is internalizing the core scale and unit calculations of LLMs. This video is about learning to think in a "physics" style. (It's also probably helpful for job interviews.)
0
0
4
@wzhao_nlp
Wenting Zhao
1 year
Please also check out @jxmnop's paper, "Text Embeddings Reveal (Almost) As Much as Text". It totally changed my beliefs about embeddings: you can exactly reconstruct the text from them. The experiments in this paper blow my mind 🥳
0
0
3
@wzhao_nlp
Wenting Zhao
1 year
@srush_nlp Have you seen this one? "transformer decoders with a logarithmic number of decoding steps (w.r.t. the input length) push the limits of standard transformers only slightly, while a linear number of decoding steps adds a clear new ability. ".
1
0
4
@wzhao_nlp
Wenting Zhao
10 months
I guess what I really believe in is human intelligence. We can be more data efficient, and we can design models with better inductive biases.
0
0
3
@wzhao_nlp
Wenting Zhao
5 months
Also thanks to @astral_sh for their amazing developer tools. uv enables fast builds of all the Python libraries, and ruff generates the lint feedback.
@wzhao_nlp
Wenting Zhao
5 months
Introducing the commit0 interactive environment for coding agents. Challenge: generate Python libraries from scratch. Commit0 is designed with interactivity, dependencies, and specifications as first-class considerations. We include a benchmark with 50+ challenging libraries.
0
1
4
@wzhao_nlp
Wenting Zhao
7 months
@yoavartzi @sewon__min this paper is also a fun read.
@Kangwook_Lee
Kangwook Lee
11 months
🧵 Let me explain why the early ascent phenomenon occurs 🔥 We must first understand that in-context learning exhibits two distinct modes. When given samples from a novel task, the model actually learns the pattern from the examples. We call this mode the "task learning" mode.
Tweet media one
0
0
4
@wzhao_nlp
Wenting Zhao
5 months
@gazorp5 The reference_commit has the reference implementation that can pass all unit tests.
1
0
4
@wzhao_nlp
Wenting Zhao
7 months
@NathanYan2012 see tutorial 5.
1
0
4
@wzhao_nlp
Wenting Zhao
6 months
(3) LLMs exhibit varying hallucination rates across different domains, with higher rates in the people and finance domains, and lower rates in geographic and computing-related domains.
1
0
4
@wzhao_nlp
Wenting Zhao
14 days
This is mind-blowing.
@junxian_he
Junxian He
14 days
We replicated the DeepSeek-R1-Zero and DeepSeek-R1 training on 7B model with only 8K examples, the results are surprisingly strong. πŸš€ Starting from Qwen2.5-Math-7B (base model), we perform RL on it directly. No SFT, no reward model, just 8K MATH examples for verification, the
Tweet media one
Tweet media two
0
0
3
@wzhao_nlp
Wenting Zhao
10 months
@jxmnop Every time when I get stuck I talk to @justintchiu.
0
0
3
@wzhao_nlp
Wenting Zhao
2 months
@ScottWu46 Will Devin consider showing its coding capabilities on commit0? It is really good at differentiating agents' capabilities! SWE-agent got 2.98% while OpenHands got 15% on Commit0 (split: all).
@wzhao_nlp
Wenting Zhao
2 months
Commit0 πŸ’» Leaderboard Update πŸ”₯.OpenHands @allhands_ai solves two Python libraries entirely from scratch by using interactive test feedback πŸš€.The best agent on Commit0 (split: all) jumps from 6% to 15%!
0
0
3
@wzhao_nlp
Wenting Zhao
2 months
mine is a different issue. My learning node and generation node are the same node, so loading and offloading new weights takes a lot of time.
3
0
3
@wzhao_nlp
Wenting Zhao
5 days
@JerryWeiAI @AnthropicAI Congrats, and thanks for using WildChat πŸ’™.
0
0
3
@wzhao_nlp
Wenting Zhao
1 year
@jxmnop @Teknium1 @geepytee i don't know for sure, obviously 😅 but from our recent work, discriminative models seem to be doing better at identifying bad requests.
0
0
3
@wzhao_nlp
Wenting Zhao
26 days
@sea_snell What should the N be if we want to see the benefit of PRM?.
1
0
3
@wzhao_nlp
Wenting Zhao
1 year
@jxmnop lol i know how you feel, i can probably earn a million $ by managing people's wandb accounts to reduce their stress.
0
0
3
@wzhao_nlp
Wenting Zhao
2 months
@OfficialLoganK Give us enough rate limit to evaluate commit0 🥹
1
0
3
@wzhao_nlp
Wenting Zhao
1 year
@yanaiela Can you also include the projects that were killed and didn't even make it to the submission stage? 🤔
1
0
3
@wzhao_nlp
Wenting Zhao
1 year
@random_walker @maria_antoniak @aryaman2020 @lmsysorg We actually have evidence that users regularly use these external APIs, given that otherwise they would have to pay out of their own pockets to use GPT-4. For example, an editor used our service to draft emails a few times.
0
0
2
@wzhao_nlp
Wenting Zhao
1 year
@NirantK @arankomatsuzaki Thanks for your interest. We will release the dataset and model once we get internal clearance. Stay tuned 😊.
0
0
3
@wzhao_nlp
Wenting Zhao
6 months
@yoavartzi @srush_nlp sorry, this is Sasha out-tweeting Sasha 😂😂😂
1
0
3
@wzhao_nlp
Wenting Zhao
1 year
@niloofar_mire Please bug me a lot if you have questions ❤️
0
0
3
@wzhao_nlp
Wenting Zhao
1 year
@ReviewAcl @koustavagoswami @soldni We received a meta-review that contains factual errors. It also only summarizes the pre-rebuttal reviews and misses corrections and acknowledgments the reviewers made during the rebuttal. The way it lists bullet points also seems ChatGPT-generated.
1
0
3
@wzhao_nlp
Wenting Zhao
6 months
Paper: Dataset:
0
0
3
@wzhao_nlp
Wenting Zhao
2 months
@hbXNov Thanks for the pointer! The blog looks really cool.
0
0
3
@wzhao_nlp
Wenting Zhao
5 months
This is super exciting!
@jxmnop
jack morris
5 months
sneak preview 🍿 of our new embedding model: cde-small-v1. cde-small-v1 is the text embedding model that we (@srush_nlp and i) have been working on at Cornell for about a year. tested the model yesterday on MTEB, the text embeddings benchmark; turns out we have state-of-the-art
Tweet media one
Tweet media two
Tweet media three
0
0
3
@wzhao_nlp
Wenting Zhao
1 year
@jmhessel oh nooo!! hope you feel better soon 🩹
0
0
2