Xiang Yue Profile
Xiang Yue

@xiangyue96

Followers
2,284
Following
526
Media
50
Statuses
332

Postdoc @LTIatCMU. PhD from Ohio State @osunlp. Author of MMMU, MAmmoTH. Training & evaluating foundation models. Previously @MSFTResearch. Opinions are my own.

Pittsburgh, PA
Joined August 2021
Pinned Tweet
@xiangyue96
Xiang Yue
7 months
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 🧐 Highlights of the MMMU benchmark: > 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks >
@xiangyue96
Xiang Yue
1 month
Thanks to @_akhaliq for sharing our work! Paper: Key takeaways: 1) Transformers can learn to implicitly reason, but only through extended training far beyond overfitting, a phenomenon known as grokking. 2) Transformers exhibit different levels of
@_akhaliq
AK
1 month
Grokked Transformers are Implicit Reasoners A Mechanistic Journey to the Edge of Generalization We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two
@xiangyue96
Xiang Yue
2 months
🚀 Scaling 10M naturally existing instructions from the Web! 🌐 Introducing 🦣MAmmoTH2 (), a family of open-source models that unlocks unprecedented performance gains in reasoning tasks by leveraging two key insights: 1️⃣ Scaling is still all you need, even
@xiangyue96
Xiang Yue
4 months
🌟 Big thanks for making StarCoder 2 open-source! 🚀 We've swiftly finetuned it on our Code-Feedback instruction dataset, the dataset behind OpenCodeInterpreter. 📈 HumanEval Scores are boosted ~30%. 3B Model: from 31.7 to 67.1! 7B Model: from 35.4 to 75.6! 🛠️ CodeFeedback has
@_philschmid
Philipp Schmid
4 months
Introducing StarCoder 2 ⭐️ The most complete open Code-LLM 🤖 StarCoder 2 is the next iteration of StarCoder and comes in 3 sizes, trained on 600+ programming languages and over 4 trillion tokens from Stack v2. It outperforms StarCoder 1 by a margin and has the best overall performance
@xiangyue96
Xiang Yue
5 months
🌟With precise execution & human feedback, a 7B code model hits 90% accuracy on HumanEval! 🚀 Introducing OpenCodeInterpreter: A family of open-source code systems for generating, executing, & refining code.🔄 🤖 Traditional open-source models often fall short in execution
@GeZhang86038849
Ge Zhang
5 months
[1/n] 🚀 Excited to share our latest work on OpenCodeInterpreter! With a blend of execution results and human feedback, we've achieved significant advancements in code generation. Here are the key points: ✨ Introducing OpenCodeInterpreter - a leap in iterative code refinement.
@xiangyue96
Xiang Yue
10 months
Introducing 🦣MAmmoTH: The BEST open-source #LLMs for math NOW! 🦣Outperforms SOTA on 9 math reasoning datasets, with accuracy gains of 13-29% across all scales. 🦣 is tuned on our 260K #MathInstruct dataset, including hybrid CoT & PoT rationales. #NLProc
@xiangyue96
Xiang Yue
4 months
🎉 Thrilled to announce the launch of our OpenCodeInterpreter demo, now live on @huggingface space! Check it out here: For those interested in running it locally, we've got you covered with a guide available on our GitHub (~1K stars! Trending NOW!):
@xiangyue96
Xiang Yue
5 months
🌟With precise execution & human feedback, a 7B code model hits 90% accuracy on HumanEval! 🚀 Introducing OpenCodeInterpreter: A family of open-source code systems for generating, executing, & refining code.🔄 🤖 Traditional open-source models often fall short in execution
@xiangyue96
Xiang Yue
3 months
🚀Introducing VisualWebBench: A Comprehensive Benchmark for Multimodal Web Page Understanding and Grounding. 🤔What's this all about? Why this benchmark? > Back in Nov 2023, when we released MMMU (), a comprehensive multimodal
@jeepliu1212
Junpeng Liu
3 months
(1/8)🚀We introduce VisualWebBench, a multimodal benchmark designed to assess the understanding and grounding capabilities of MLLMs in web scenarios. It encompasses seven tasks and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains.
@xiangyue96
Xiang Yue
1 month
🎉 Introducing MixEval📊: A fast, cheap, and effective #LLMs benchmark that combines and reweights existing benchmarks. Instead of creating a "new" benchmark from scratch with costly annotations, we found a smart reweighting strategy that can "refresh"
@NiJinjie
Jinjie Ni
1 month
How to get ⚔️Chatbot Arena⚔️ model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6)? Maybe simply mix the classic benchmarks. 🚀 Introducing MixEval, a new 🥇gold-standard🥇 LLM evaluation paradigm standing on the shoulder of giants (classic benchmarks).
@xiangyue96
Xiang Yue
3 months
Long-context LLMs Struggle with Long In-context Learning! 🤯 We developed LongICLBench to rigorously test LLMs on extreme classification tasks with increasing complexity. We meticulously selected six datasets with a label range spanning 28 to 174 classes covering different input
@arankomatsuzaki
Aran Komatsuzaki
3 months
Long-context LLMs Struggle with Long In-context Learning Suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences.
@xiangyue96
Xiang Yue
5 months
🤖How far are we from achieving Expert AGI? We included human experts' performance on the MMMU benchmark (). 📊 The best-performing human experts achieved an accuracy of 88.6 while the best-performing model @GoogleDeepMind Gemini Ultra just scored 59.4,
@xiangyue96
Xiang Yue
7 months
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 🧐 Highlights of the MMMU benchmark: > 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks >
@xiangyue96
Xiang Yue
22 days
🥰Attending #CVPR2024 and presenting 🏆Award Candidate Paper MMMU! DM is open, drop me a message if you'd like to chat about #multimodal , #LLMs , #evaluation or #GenAI in general! TUE 18 JUN 3:30pm: New frontiers for zero-shot Image Captioning Evaluation (NICE) THU 20 JUN
@xiangyue96
Xiang Yue
7 months
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 🧐 Highlights of the MMMU benchmark: > 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks >
@xiangyue96
Xiang Yue
3 months
Excited to announce that MMMU has been selected as an Oral presentation at #CVPR2024 (90 orals in total, 0.8%)! Congrats to all the collaborators, and see you in Seattle! It will be my first time attending a CV conference. So excited! 😃
@xiangyue96
Xiang Yue
7 months
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 🧐 Highlights of the MMMU benchmark: > 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks >
@xiangyue96
Xiang Yue
7 months
I'm in Singapore🇸🇬 for #EMNLP2023 ! Do not hesitate to ping me for a coffee chat! My recent work covers a range of topics in #LLMs & #LMMs : - Multi-modal training & eval - (Math) Reasoning (generalization/robustness) - Attribution of #LLMs Check more details below👇 - MMMU: A
@xiangyue96
Xiang Yue
6 months
🌟 Exciting news: 🦣MAmmoTH was accepted as a spotlight (5%) at #ICLR2024 ! A huge shoutout to our amazing team! We're now exploring more training dynamics of #LLMs for math reasoning and uncovering fascinating insights. Perhaps a sequel of the work? MAmmoTH-2 :)? Stay tuned!!
@xiangyue96
Xiang Yue
10 months
Introducing 🦣MAmmoTH: The BEST open-source #LLMs for math NOW! 🦣Outperforms SOTA on 9 math reasoning datasets, with accuracy gains of 13-29% across all scales. 🦣 is tuned on our 260K #MathInstruct dataset, including hybrid CoT & PoT rationales. #NLProc
@xiangyue96
Xiang Yue
7 months
🚀 Update alert! 🎉 We had an updated version of our MMMU paper: . 🔍 Added: Gemini ( @GoogleDeepMind ), Qwen-VL-PLUS ( @JustinLin610 @wananxy1 ), SPHINX ( @lupantech ). ✨ Revised: mPLUG-Owl2's results ( @xuhaiya2483846 ) based on author's prompt. 🔧 Fixed:
@xiangyue96
Xiang Yue
7 months
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. 🧐 Highlights of the MMMU benchmark: > 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks >
@xiangyue96
Xiang Yue
3 months
After receiving community feedback, we added @GoogleDeepMind Gemini 1.5 Pro's results. 👇 Gemini 1.5 Pro's vision ability was significantly improved compared to 1.0 Pro and matched GPT-4's performance on our VisualWebBench! 🏆 Its action prediction (e.g., predicting what would
@xiangyue96
Xiang Yue
3 months
🚀Introducing VisualWebBench: A Comprehensive Benchmark for Multimodal Web Page Understanding and Grounding. 🤔What's this all about? Why this benchmark? > Back in Nov 2023, when we released MMMU (), a comprehensive multimodal
@xiangyue96
Xiang Yue
7 months
Amazing achievement! Congratulations on reaching the new state-of-the-art with a 62.4% score for Gemini on our newly-released MMMU benchmark. Gemini's multimodal perception and reasoning capabilities are truly impressive!😱
@JeffDean
Jeff Dean (@🏡)
7 months
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks,
@xiangyue96
Xiang Yue
4 months
How does a baby learn to navigate the world around them? 🚶‍♂️👶 Through exploration and learning from each little stumble and triumph. The ETO framework applies this very essence of human learning to AI, emphasizing the importance of both success and failure in developing better AI
@arankomatsuzaki
Aran Komatsuzaki
4 months
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents Presents an exploration-based trajectory optimization approach, which consistently surpasses baseline performance by a large margin repo: abs:
@xiangyue96
Xiang Yue
8 months
**Large high-quality Pre-trained Data is All you Need** Great performance of Yi-34B by @01AI_Yi on @huggingface #LLM leaderboard. Highlights from the website: > Pre-trained on 3T high-quality tokens from scratch > Support up to 200K context window
@01AI_Yi
Yi-01.AI
8 months
Our team at @01AI_Yi is very proud to introduce the release of the Yi-34B model, now on top of the @huggingface pretrained LLM leaderboard! A Yi-6B is also available. Welcome to give it a try and build fantastic projects!
@xiangyue96
Xiang Yue
7 months
As the first author of the MMMU benchmark, I want to emphasize that our team didn't grant early access to anyone outside our team. Our goal in creating this benchmark is to ensure a fair and accessible evaluation platform for the entire community. Consistent with this aim, we
@ysu_nlp
Yu Su
7 months
Hi @emilymbender , I'm one of the lead authors of MMMU. I can certify that 1) Google didn't fund this work, and 2) Google didn't have early access. They really like the benchmark after our release and worked very hard to get the results. It doesn't take that long to eval on a
@xiangyue96
Xiang Yue
3 months
🤯Here are the #Llama3 -8B base model results on the selected reasoning benchmarks. Short conclusions: - Mistral 7B Base (?) < Llama 3 8B Base (15T tokens) ~= Gemma 7B Base (6T tokens). Maybe we do not need that many tokens? - Llama-3 instruction tuning does a great job of
@xiangyue96
Xiang Yue
3 months
Oh, I just noticed that the strong code and math reasoning performance of #Llama3 is reported based on their instruction-tuned version, which means that the model might have been trained on GSM8K or MATH (augmented) training sets. 😅
@xiangyue96
Xiang Yue
3 months
😱Is pure text pre-training coming to an end? A Thread on 🦙 Llama 3's report 🧵: 1. Tokenizer 🔍 Llama 3's tokenizer boasts a 128k vocab size and yields 15% fewer tokens than Llama 2, enabling more efficient and accurate tokenization. 2. Model
@xiangyue96
Xiang Yue
6 months
🥳It is great to work with @Francis_YAO_ (most credits go to Yao!!!) to dive deep into data influence on training long-context models. 🤠 TL;DR: A practical technical blog on training long context models, covering data scale/mixture, training setup (e.g., positional encoding,
@Francis_YAO_
Yao Fu
6 months
Although there is abundant work studying long-context LLMs, most of it talks about architecture / positional encoding; almost none of the existing papers talks about data. In this work, we take a close look at data influence on context scaling
@xiangyue96
Xiang Yue
1 year
📢 New preprint alert! Check out our latest research on 🌟automatic evaluation of attribution by #LLMs , i.e., verifying whether the generated statement is supported by the cited reference. [1/N] #NLP #NLProc
@xiangyue96
Xiang Yue
4 months
@MistralAI just released their v0.2 Base😱. @WenhuChen and I quickly evaluated a few benchmarks using the OpenCompass evaluation package. It seems that the capability dropped a little bit on nearly all the benchmarks I tested. 🤔
@marvinvonhagen
Marvin von Hagen
4 months
Mistral just announced at @SHACK15sf that they will release a new model today: Mistral 7B v0.2 Base Model - 32k instead of 8k context window - Rope Theta = 1e6 - No sliding window
@xiangyue96
Xiang Yue
9 months
Our #EMNLP2023 work also reveals this phenomenon. We find that despite being able to generate correct step-by-step solutions in the beginning, LLMs cannot maintain their belief in truth when challenged by often-time absurdly invalid arguments.
@stockthoughts81
Uncovering Value
9 months
They give examples where removing oracles or improving initial prompts eliminates/reduces any benefit from self-correction. Great representative diagram h/t @xiangyue96 (6/n)
@xiangyue96
Xiang Yue
1 year
🎓 Just successfully defended my Ph.D. dissertation! 🥳📚 It's been a challenging and rewarding journey, but I made it! Grateful for the support of my advisor @hhsun1 and all the members in @osunlp . I'll attend ACL in Toronto next week. Feel free to DM me if you'd like to chat!
@xiangyue96
Xiang Yue
1 year
lol!! I'll do my real PhD defense tmr and came across this simulator tonight. The funny thing is that the simulated duration is exactly the same as in my real case. Is that a good indicator I'll pass my defense tmr? 🤩🤩
@xiangyue96
Xiang Yue
19 days
Congratulations! BioCLIP won the Best Student Paper at #CVPR2024 ! Sam, Luke @luke_ch_song and Yu @ysu_nlp are attending the conference, find them for a chat!
@samstevens6860
Sam Stevens
20 days
Excited to be at CVPR presenting BioCLIP! DM me if you want to chat about computer vision for animals, multimodal foundation models, or AI for science!
@xiangyue96
Xiang Yue
2 years
📢📢Want to share your textual data outside your org or team but worry about privacy leakage? Check out our new preprint "Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe"() 👇👇👇
@xiangyue96
Xiang Yue
7 months
Benchmark results:
@xiangyue96
Xiang Yue
4 months
Hey! Come join us to make OpenDevin happen! Please kindly fill out the form and we will reach out shortly! 🙌
@huybery
Binyuan Hui
4 months
🚀 The enthusiasm for OpenDevin has exceeded our expectations! We've got an initial roadmap and a bunch of great guys working on it. 🫡 Even @gneubig has completed a front-end prototype in a very short time!! It's all fantastic and you can fill out the form below to join us.
@xiangyue96
Xiang Yue
2 years
#acl2022nlp 📢📢 Want to know how to better leverage synthetic QA data for domain adaptation? Check out our ACL22 work: 👇👇 I will present our work in the two virtual poster sessions: VPS2/VPS4: Question Answering, 14:00 EST May 24/25. Feel free to stop by!
@xiangyue96
Xiang Yue
1 year
I'm in Toronto and attending #ACL2023 . I'll present the following work. My general research interests are building safe and responsible LMs. Some recent topics are privacy, hallucination, attribution, robustness, etc. Feel free to ping me if you'd like to chat on related topics🥳
@ysu_nlp
Yu Su
1 year
3) @xiangyue96 Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe. Session 7 - Wed July 12, 11:00-12:30, Room: Frontenac Ballroom and Queen’s Quay.
@xiangyue96
Xiang Yue
1 year
lol!! I'll do my real PhD defense tmr and came across this simulator tonight. The funny thing is that the simulated duration is exactly the same as in my real case. Is that a good indicator I'll pass my defense tmr? 🤩🤩
@thegautamkamath
Gautam Kamath
1 year
A realistic simulation of what it's like to be a grad student? I was only clicking buttons but the frequent rejection was triggering 🙃 It took me just under 7 years to graduate, can you do better?
@xiangyue96
Xiang Yue
4 months
Check out the comprehensive large multi-modal model evaluation framework by LMMs-Eval team👇! And thanks for including our MMMU benchmark!
@BoLi68567011
Li Bo
4 months
Accelerating the Development of Large Multimodal Models with LMMs-Eval Repo: Blog: We are offering a one-command evaluation API for fast and thorough evaluation of LMMs over 39 datasets (and growing).
@xiangyue96
Xiang Yue
5 months
🧩The discussion surrounding MoE has been a vibrant topic in our community. However, the open-source community's efforts to replicate and explore this process have been notably sparse. Thanks to the invaluable contributions of @XueFz , our open-source community now has a gateway
@XueFz
Fuzhao Xue on the job market!
5 months
(1/5)🚀 Our OpenMoE Paper is out! 📄 Including: 🔍ALL Checkpoints 📊 In-depth MoE routing analysis 🤯Learning from mistakes & solutions Three important findings: (1) Context-Independent Specialization; (2) Early Routing Learning; (3) Drop-towards-the-End. Paper Link:
@xiangyue96
Xiang Yue
1 year
Honored to receive the Exemplary Graduate Researcher Award. Immensely grateful to my advisor @hhsun1 and the entire team @osunlp for their support! This award fuels my pursuit of safe, responsible #LLMs and their interdisciplinary uses, such as in privacy and healthcare. ☺️🥳
@hhsun1
Huan Sun (OSU) at CVPR'24
1 year
HUGE congratulations to Xiang Yue @xiangyue96 @osunlp on the very competitive Exemplary Graduate Student Researcher Award (1 out of 21 nominees across the College)! His work studies privacy-preserving #LLMs , attributions by LLMs, etc. #NLProc #proudadvisor
@xiangyue96
Xiang Yue
5 months
It was great to work with @Francis_YAO_ on this! Two important takeaways from my side: > The ability to retrieve information in the long context is already acquired during pretraining, even for models pre-trained on shorter sequences (e.g., 4096). A lightweight continual
@Francis_YAO_
Yao Fu
5 months
Frontier models all have at least 100k context length, Gemini 1.5 has even 1m context. What about research and open source? Introducing Long Context Data Engineering, a data driven method achieving the first 128k context open source model matching GPT4-level Needle in a
@xiangyue96
Xiang Yue
10 months
Kudos to the team @WenhuChen (co-lead) Xingwei, @GeZhang86038849 , @Francis_YAO_ , Wenhao, @hhsun1 , @ysu_nlp ! Resources: Project: Paper: Github: Dataset: Models:
@xiangyue96
Xiang Yue
1 month
Can we really trust VLMs in critical areas like medical image diagnosis? No! The adversarial probing exposes major flaws in top models like GPT-4V and Gemini Pro. Top models sometimes perform worse than random guessing on diagnostic questions. Domain-specific models like
@xwang_lk
Xin Eric Wang
1 month
Can we really trust AI in critical areas like medical image diagnosis? No, and they are even worse than random. Our latest study, "Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA," uncovers the stark limitations of
@xiangyue96
Xiang Yue
6 months
Our community studied intelligent assistants for many years. One major drawback of these old systems is the lack of generalizability. The most exciting prospect of SeeAct is to show the potential for these agents to act as a universal interface for information and services. By
@ysu_nlp
Yu Su
6 months
Generalist web agents may get here sooner than we thought---introducing SeeAct, a multimodal web agent built on GPT-4V(ision). What's this all about? > Back in June 2023, when we released Mind2Web () and envisioned generalist web agent, a language agent
@xiangyue96
Xiang Yue
6 months
Finally, great to see it's happened! @aclmeeting has ended the anonymity period for ACL submissions
@AlhamFikri
Alham Fikri Aji
6 months
No more anonymity period for ACL submissions @aclmeeting 🎉 For those working towards January's anonymity, let's get some sleep.
@xiangyue96
Xiang Yue
2 years
#acl2022nlp 📢📢 Interested in open-domain question answering and LM pretraining? Want to know how to better pretrain open-domain QA models or dense retrievers in an unsupervised/self-supervised way? Check out our ACL22 work: 👇👇
@xiangyue96
Xiang Yue
9 months
#LLMs struggle a lot in defending the real truth and can be easily misled by the user’s (invalid) arguments and critiques. Their strong reasoning capability is actually very fragile and may stem from prompters knowing the answer, akin to the horse #CleverHans . 👇👇 #EMNLP2023
@BoshiWang2
Boshi Wang
9 months
Are LLMs reasoning based on deep understandings of truth and logic? Can LLMs hold & defend their own "reasoning"? Our #EMNLP23 findings paper () explores testing LLMs' reasoning by engaging them in a debate that probes deeper into their understanding.
@xiangyue96
Xiang Yue
7 months
Great summary! Thanks for mentioning our MathInstruct dataset. MathInstruct is a meticulously curated math instruction tuning dataset. It is compiled from 13 math rationale datasets, featuring a hybrid use of chain-of-thought (CoT) and program-of-thought
@clefourrier
Clémentine Fourrier 🍊
7 months
2023 has been incredible for open releases, so I made a ✨year review in Open LLMs ✨ It was lots of fun coming back through all that came out, and it's insane how much the field soared thanks to the community & openness! Summary of each section: 🧵
@xiangyue96
Xiang Yue
1 month
4) Our findings suggest that data distribution (not just size) is critical for grokking, and cross-layer memory sharing in the transformer architecture could improve systematic generalization.
@xiangyue96
Xiang Yue
4 months
Thanks for sharing our work. Detailed analysis will be posted soon :)
@_akhaliq
AK
4 months
StructLM Towards Building Generalist Models for Structured Knowledge Grounding Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their
@xiangyue96
Xiang Yue
1 month
This was the third work I have collaborated on with @BoshiWang2 . He is a brilliant and hard-working guy, and I enjoyed every discussion with Boshi. This project actually started from a discussion when we attended #ACL2023 last year. Still remember we were talking
@xiangyue96
Xiang Yue
1 year
Congratulations to the amazing team at @osunlp on the 8 papers accepted to #ACL2023 🎉👏 So proud of being part of this talented group and one of the authors on this list! 🙌 #NLP #research
@ysu_nlp
Yu Su
1 year
🎉 Thrilled to share @osunlp has 8 papers accepted to #ACL2023 (out of 11 subs), and 3 of the papers received best paper nomination by reviewers. We don't normally submit this many papers and grateful it turns out well 🥰 Equally proud of the papers that didn't get in this time!
@xiangyue96
Xiang Yue
1 month
Glad to see this is finally out! Congrats @bernaaaljg @ysu_nlp ! HippoRAG draws inspiration from the hippocampal memory indexing theory of human long-term memory. Check this out if you are working on #RAG , multi-hop #reasoning , or related topics!
@bernaaaljg
Bernal Jiménez
1 month
📣📣 Super proud to present the most exciting project of my PhD so far: “HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models”. HippoRAG, as the title suggests, is a brain-inspired RAG framework that enables LLMs to effectively and efficiently
@xiangyue96
Xiang Yue
3 months
Oh, I just noticed that the strong code and math reasoning performance of #Llama3 is reported based on their instruction-tuned version, which means that the model might have been trained on GSM8K or MATH (augmented) training sets. 😅
@xiangyue96
Xiang Yue
3 months
😱Is pure text pre-training coming to an end? A Thread on 🦙 Llama 3's report 🧵: 1. Tokenizer 🔍 Llama 3's tokenizer boasts a 128k vocab size and yields 15% fewer tokens than Llama 2, enabling more efficient and accurate tokenization. 2. Model
@xiangyue96
Xiang Yue
1 month
3) Deep analysis into the model's internals reveals the gradual formation of generalizing circuits during grokking. The configuration of these circuits explains the variations in systematicity across tasks.
@xiangyue96
Xiang Yue
7 months
@DrewHawkswood Our work has actually been “peer-reviewed” by the whole community in the past week🤣. We received many comments from X/Twitter, HF, GitHub, emails, etc. We are very grateful for this feedback from the community and have incorporated it into the revisions.
@xiangyue96
Xiang Yue
4 months
@teortaxesTex We actually fine-tuned a 15B version but the results look pretty weird. It might be due to a transformers version issue. We are still debugging this. It should be resolved very soon. Will definitely release a 15B version. :)
@xiangyue96
Xiang Yue
7 months
This is really impressive! Now new SOTA on the MMMU benchmark! Our test set evaluation is also available at :
@JeffDean
Jeff Dean (@🏡)
7 months
MMMU is a brand new benchmark () that was released just last week, with ~11,500 examples requiring image understanding, college-level subject knowledge and deliberate reasoning. We decided it would be fun to try the Gemini models on this benchmark to see
@xiangyue96
Xiang Yue
7 months
Check out the new Math models fine tuned from our previous MathInstruct dataset used in 🦣 MAmmoTH
@WenhuChen
Wenhu Chen
7 months
Looking for the best open-source (small) Math model? I'm happy to release MAmmoTH-7B-Mistral (), which achieves 40% on MATH and 52% on MMLU-Math. Nothing fancy, I just fine-tuned Mistral-7B on our previous MathInstruct dataset ().
@xiangyue96
Xiang Yue
9 months
@jefffhj @GoogleDeepMind Congratulations Jie! We got similar observations recently. This is exactly prompter knowing the answer :)
@xiangyue96
Xiang Yue
1 month
5) The power of parametric memory for complex reasoning: a grokked transformer achieves near-perfect accuracy on a challenging task where #GPT4 & #Gemini -Pro with non-parametric memory fail badly.
@xiangyue96
Xiang Yue
10 months
🥳We train and evaluate 50+ models and baselines (500+ experiments). We compile two giant result tables covering nearly all the LLMs we can find in the math reasoning field. 🧐
@xiangyue96
Xiang Yue
2 months
@OpenAI #GPT4o has greatly improved its #reasoning abilities across text and #multimodal contexts. It now achieves an accuracy of 🤯🤯69.1% on our #MMMU benchmark, close to the performance of lower-performing human experts (76.2%)! "With GPT-4o, we trained a single new model
@xiangyue96
Xiang Yue
5 months
🤖How far are we from achieving Expert AGI? We included human experts' performance on the MMMU benchmark (). 📊 The best-performing human experts achieved an accuracy of 88.6 while the best-performing model @GoogleDeepMind Gemini Ultra just scored 59.4,
@xiangyue96
Xiang Yue
1 year
🎉We're thrilled to share that our paper👇 has been accepted by #ACL2023NLP ! Our method fine-tunes LMs with DP to generate useful text while providing strong privacy protection. Check out our preprint for more details on this promising path to mitigating #privacy concerns in NLP
@xiangyue96
Xiang Yue
2 years
📢📢Want to share your textual data outside your org or team but worry about privacy leakage? Check out our new preprint "Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe"() 👇👇👇
@xiangyue96
Xiang Yue
10 months
A natural question to ask: Why 🦣 #MAmmoTH is so powerful😱? We investigate how the two major characteristics of #MathInstruct influence the performance of 🦣. Main takeaway: Diverse data sources and hybrid CoT & PoT training lead to substantial gains, making 🦣 math generalists.
@xiangyue96
Xiang Yue
2 months
@arankomatsuzaki Thanks for sharing our work! More information is in the following thread👇
@xiangyue96
Xiang Yue
2 months
🚀 Scaling 10M naturally existing instructions from the Web! 🌐 Introducing 🦣MAmmoTH2 (), a family of open-source models that unlocks unprecedented performance gains in reasoning tasks by leveraging two key insights: 1️⃣ Scaling is still all you need, even
@xiangyue96
Xiang Yue
4 months
To learn more about OpenCodeInterpreter 👇
@xiangyue96
Xiang Yue
5 months
🌟With precise execution & human feedback, a 7B code model hits 90% accuracy on HumanEval! 🚀 Introducing OpenCodeInterpreter: A family of open-source code systems for generating, executing, & refining code.🔄 🤖 Traditional open-source models often fall short in execution
@xiangyue96
Xiang Yue
10 months
🚀Our instruction-tuning dataset #MathInstruct is compiled from 13 math datasets, 6 of which have rationales newly curated by us. What sets #MathInstruct apart? 1️⃣Broad coverage of different math fields and complexity levels 2️⃣Hybrid CoT & PoT rationales
@xiangyue96
Xiang Yue
7 months
🧐Both LLMs and humans hallucinate. It’s a form of creation. The issue arises in contexts where factuality is crucial. We expect LLMs not to ‘dream’ then. It’s not the hallucination that’s the problem, but the context in which it occurs. Creativity is a gift, but factuality and
@karpathy
Andrej Karpathy
7 months
# On the "hallucination problem" I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines. We direct their dreams with prompts. The prompts start the dream, and based on the
757
3K
15K
0
0
10
@xiangyue96
Xiang Yue
2 months
Our incredible team's efforts made this project possible. 🙌 @GeZhang86038849 is at ICLR'24 in Vienna and will be presenting MAmmoTH1 (ICLR'24 Spotlight) and MAmmoTH2. 🎙️ If you're at the conference and would like to learn more about either project, feel free
@xiangyue96
Xiang Yue
10 months
Introducing 🦣MAmmoTH: The BEST open-source #LLMs for math NOW! 🦣Outperforms SOTA on 9 math reasoning datasets, with accuracy gains of 13-29% across all scales. 🦣 is tuned on our 260K #MathInstruct dataset, including hybrid CoT & PoT rationales. #NLProc
Tweet media one
2
63
274
2
1
9
@xiangyue96
Xiang Yue
5 months
How do different VLMs perform in the wild? Great to see that Vision Arena, led by @yujielu_10 and @billyuchenlin, has been released! Kudos to the team!
@billyuchenlin
Bill Yuchen Lin 🤖
5 months
Introducing Vision Arena! Inspired by the awesome Chatbot Arena, we built a web demo on @huggingface for testing Vision LMs (GPT-4V, Gemini, Llava, Qwen-VL, etc.). You can easily test two VLMs side by side and vote! It’s still a work-in-progress. Feedback is welcome! 🔗
34
110
533
0
1
9
@xiangyue96
Xiang Yue
7 months
Huan is an amazing advisor. Feel free to ping me if you want to learn more about PhD life in OSU NLP Group!😉
@hhsun1
Huan Sun (OSU) at CVPR'24
7 months
Hiring multi Ph.D. students this cycle in areas: #LLM train/eval, trustworthiness of LLMs incl. privacy & safety, LLM for biomedicine/chemistry. see below for representative work. I won't be #EMNLP23 , but pls talk to my (former) students there @xiangyue96 @BoshiWang2 @RonZiruChen
5
23
92
0
0
9
@xiangyue96
Xiang Yue
7 months
Great to see that this project has been successfully shipped!! > BioCLIP is the first large-scale multimodal model for general biology questions related to images. > BioCLIP utilizes a wide range of biological images, including plants, animals, and fungi. > Trained on the
@ysu_nlp
Yu Su
7 months
Introducing BioCLIP: A Vision Foundation Model for the Tree of Life A foundation model that strongly generalizes on the tree of life (2M+ species), outperforming OpenAI CLIP by 18% in zero-shot classification, and supports open-ended classification over
9
94
440
0
3
8
@xiangyue96
Xiang Yue
3 years
New #ACL2021 Findings paper: "Differential Privacy for Text Analytics via Natural Text Sanitization". Privacy is often overlooked in NLP. We address it at the root: directly producing sanitized text documents based on differential privacy. [1/3]
Tweet media one
2
1
7
@xiangyue96
Xiang Yue
5 months
@_akhaliq Thanks for sharing our work! Resources: Website: Paper: Github: Huggingface (models & data):
0
0
6
@xiangyue96
Xiang Yue
1 year
We hope our testbed, modeling methodology, and insights will help lay the foundation for future studies on this important problem. Our code, models, and datasets are available at: . Joint work with @BoshiWang2 @DrogoKhal4 @RonZiruChen @ysu_nlp @hhsun1 [6/N]
0
0
6
@xiangyue96
Xiang Yue
3 months
Check out @TianleLI123 's thread for more details👇
@TianleLI123
Tianle LI
3 months
[1/n] 👉👉👉 Check out our latest work exploring the behaviour of SoTA long-context LLMs when confronted with long in-context learning: “Long-context LLMs Struggle with Long In-context Learning”. We created 🐍LongICLBench🐍 to conduct comprehensive
Tweet media one
2
13
63
0
0
6
@xiangyue96
Xiang Yue
1 year
📚Most of the attributed LLMs (e.g., generative search engines) rely on humans to verify the attribution, which is costly. Our research explores two automatic evaluation methods: prompting LLMs and fine-tuning smaller LMs on repurposed data from related tasks. [2/N]
Tweet media one
1
0
5
@xiangyue96
Xiang Yue
9 months
Great to see that by combining our 🦣**MAmmoTH MathInstruct** dataset with other open-source datasets, the Mistral 7B model achieves very impressive performance on GSM8K and MATH.
@akjindal53244
Ashvini Jindal
9 months
Excited to announce release of 𝗔𝗿𝗶𝘁𝗵𝗺𝗼-𝗠𝗶𝘀𝘁𝗿𝗮𝗹-𝟳𝗕 model that outperforms existing 7B and 13B state-of-the-art mathematical reasoning models by a huge margin on both GSM8K and MATH datasets.
Tweet media one
Tweet media two
6
23
107
0
0
5
@xiangyue96
Xiang Yue
1 year
Citation (or attribution) is definitely a crucial component in LLMs for enhancing the verifiability and trustworthiness of the generated statements. Agreed that a comprehensive citation mechanism should account for both non-parametric and parametric knowledge. Good position paper!
@jefffhj
Jie Huang
1 year
New position paper! 🔥 We position "citation" as the key to building responsible and accountable large language models - enhancing content transparency and verifiability, while mitigating IP and ethical dilemmas in LLMs such as #ChatGPT . 👉 🧵⬇️
Tweet media one
3
23
95
0
1
5
@xiangyue96
Xiang Yue
2 months
Came across a fun math query: "use 4 4s to form an expression equals to 4." I tested it across various models and nearly all models fail.😅 ❌ #GPT4o #GPT4 #Claude3 Opus ❌ #Gemini #Llama3 #Yi -1.5-34B-Chat (succeeded after a few attempts) ❌ #Qwen -110B-Chat
Tweet media one
0
0
4
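For reference, the puzzle does have simple solutions. A quick sketch for checking candidate expressions programmatically; the candidate expressions below are my own examples, not outputs from the models in the thread:

```python
# Check candidate solutions to the "use four 4s to make 4" puzzle:
# an expression is valid if it contains exactly four 4s and
# evaluates to 4. Candidates are illustrative examples.
candidates = [
    "4 + 4 * (4 - 4)",
    "(4 - 4) * 4 + 4",
    "4 * 4 / 4 / 4 * 4",  # uses five 4s, so it should be rejected
]

def uses_four_fours(expr: str) -> bool:
    return expr.count("4") == 4

for expr in candidates:
    valid = uses_four_fours(expr) and eval(expr) == 4
    print(expr, "->", "valid" if valid else "invalid")
```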
@xiangyue96
Xiang Yue
7 months
@TheSeaMouse Right. As most of the questions in MMMU rely heavily on images, GPT-4 Text only often says something like "Without images, I cannot answer this question". We also tried some prompting strategies to force GPT-4 Text to output the most likely answer. The results are pretty low as
0
0
5
@xiangyue96
Xiang Yue
9 months
@akjindal53244 Thanks for the great work Ashvini! Would you mind adding MAmmoTH 7B performance in the leaderboard? And could you please acknowledge MathInstruct dataset in the introduction and recommend citing MAmmoTH and MetaMath papers if people use the data and model in the repo?
1
0
5
@xiangyue96
Xiang Yue
1 year
@YangsiboHuang Very interesting work! I agree that heuristic methods like PII removal would be useful against targeted attacks. But for untargeted attacks, we definitely need DP. Our previous work in ACL'21 and ACL'23 could be extended to this scenario:
1
0
5
@xiangyue96
Xiang Yue
5 months
@airesearch12 @Yampeleg Thanks! We currently only support Python for the execution. But the model can generate other languages as well. We are considering extending this execution and refinement to other popular languages. Any suggestions and comments are very welcome.
3
0
5
@xiangyue96
Xiang Yue
1 year
📊 Our results show promising signals in the automatic evaluation: fine-tuned smaller LMs (e.g., FLAN-T5-Large) can even outperform LLMs like ChatGPT. However, automatic evaluation of attribution still remains challenging. [4/N]
Tweet media one
1
0
4
@xiangyue96
Xiang Yue
8 months
😀I like the model's name Yi. "Y" resembles the Chinese character "人" (which means "human") rotated 180 degrees. "i" is taken from "Ai". So "Yi" means Human + Ai ("以人为本" in Chinese). It's a perfect visual mash-up! The GIF was taken from the website
1
0
4
@xiangyue96
Xiang Yue
9 months
Very impressive results! Congratulations! Many people, including me, are wondering how the pre-training data size and mixture contribute to the improvement 🤔️
@AlbertQJiang
Albert Jiang
9 months
Mistral 7B paper is up on arxiv. The authorship order is alphabetical. Please cite with author = {Mistral AI} 🙂
Tweet media one
19
182
1K
0
0
4
@xiangyue96
Xiang Yue
3 months
Thanks for this PR! It is so great to see this happen: Devin submitted a PR to OpenDevin and it got approved :-)
@gneubig
Graham Neubig
3 months
Thanks to Devin for the contribution to OpenDevin! It's great to see that even AI programmers believe in the power of open source 😃
7
17
202
0
0
4
@xiangyue96
Xiang Yue
7 months
@hu_yifei Thanks! There is little chance we included such examples. Our questions are mostly from college textbooks and exams. But I definitely agree that such a GUI-understanding scenario (e.g., a web or mobile agent) will be another important application for evaluating LMMs.
0
0
3
@xiangyue96
Xiang Yue
1 year
🧩Most evaluation failures stem from three factors: 1) insensitivity to fine-grained information comparison such as numerical values, 2) overlooking contextual cues in the reference, and 3) failures in performing symbolic operations such as verifying set relationships [5/N]
Tweet media one
1
0
4
@xiangyue96
Xiang Yue
5 months
@Maxwell_Nye Congrats on the outstanding achievements of Fuyu-Heavy on our MMMU benchmark! We would be delighted to feature Fuyu-Heavy's detailed results on our leaderboard, as we do for Gemini. This could provide more visibility for the model: .
0
0
4
@xiangyue96
Xiang Yue
1 month
@_philschmid @lmsysorg Thanks for sharing our work!
0
0
4
@xiangyue96
Xiang Yue
7 months
@JeffDean Thank you Jeff!
0
0
3
@xiangyue96
Xiang Yue
5 months
@avion2323 @Yampeleg We'll release a demo on huggingface space very soon. Stay tuned!
1
0
3
@xiangyue96
Xiang Yue
1 month
@juniorro16 Y axis is the model's accuracy on our MixEval-Hard benchmark.
0
0
3