Postdoc
@LTIatCMU
. PhD from Ohio State
@osunlp
. Author of MMMU, MAmmoTH. Training & evaluating foundation models. Previously
@MSFTResearch
. Opinions are my own.
🚀 Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.
🧐 Highlights of the MMMU benchmark:
> 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks
>
Thank
@_akhaliq
for sharing our work!
Paper:
Key takeaways:
1) Transformers can learn to implicitly reason, but only through extended training far beyond overfitting, a phenomenon known as grokking.
2) Transformers exhibit different levels of
Grokked Transformers are Implicit Reasoners
A Mechanistic Journey to the Edge of Generalization
We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two
🚀 Scaling 10M naturally existing instructions from the Web! 🌐
Introducing 🦣MAmmoTH2 (), a family of open-source models that unlocks unprecedented performance gains in reasoning tasks by leveraging two key insights:
1️⃣ Scaling is still all you need, even
🌟 Big thanks for making StarCoder 2 open-source! 🚀 We've swiftly finetuned it on our Code-Feedback instruction dataset, the dataset behind OpenCodeInterpreter. 📈 HumanEval Scores are boosted ~30%.
3B Model: from 31.7 to 67.1!
7B Model: from 35.4 to 75.6!
🛠️ CodeFeedback has
Introducing StarCoder 2 ⭐️ The most complete open Code-LLM 🤖 StarCoder 2 is the next iteration of StarCoder and comes in 3 sizes, trained on 600+ programming languages and over 4 trillion tokens from Stack v2. It outperforms StarCoder 1 by a large margin and has the best overall performance
🌟With precise execution & human feedback, a 7B code model hits 90% accuracy on HumanEval!
🚀 Introducing OpenCodeInterpreter: A family of open-source code systems for generating, executing, & refining code.🔄
🤖 Traditional open-source models often fall short in execution
[1/n]
🚀 Excited to share our latest work on OpenCodeInterpreter! With a blend of execution results and human feedback, we've achieved significant advancements in code generation. Here are the key points:
✨ Introducing OpenCodeInterpreter - a leap in iterative code refinement.
Introducing 🦣MAmmoTH: The BEST open-source
#LLMs
for math NOW! 🦣Outperforms SOTA on 9 math reasoning datasets, with accuracy gains of 13-29% across all scales. 🦣 is tuned on our 260K
#MathInstruct
dataset, including hybrid CoT & PoT rationales.
#NLProc
🎉 Thrilled to announce the launch of our OpenCodeInterpreter demo, now live on
@huggingface
space! Check it out here:
For those interested in running it locally, we've got you covered with a guide available on our GitHub (~1K stars! Trending NOW!):
🚀Introducing VisualWebBench: A Comprehensive Benchmark for Multimodal Web Page Understanding and Grounding.
🤔What's this all about? Why this benchmark?
> Back in Nov 2023, when we released MMMU (), a comprehensive multimodal
(1/8)🚀We introduce VisualWebBench, a multimodal benchmark designed to assess the understanding and grounding capabilities of MLLMs in web scenarios. It encompasses seven tasks and comprises 1.5K human-curated instances from 139 real websites, covering 87 sub-domains.
🎉 Introducing MixEval📊: A fast, cheap, and effective
#LLMs
benchmark that combines and reweights existing benchmarks.
Instead of creating a "new" benchmark from scratch with costly annotations, we found a smart reweighting strategy that can "refresh"
How to get ⚔️Chatbot Arena⚔️ model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6)?
Maybe simply mix the classic benchmarks.
🚀 Introducing MixEval, a new 🥇gold-standard🥇 LLM evaluation paradigm standing on the shoulders of giants (classic benchmarks).
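For intuition, the reweighting idea can be sketched in a few lines (the scores, ratings, and plain least-squares fit below are my own toy illustration, not MixEval's actual mixture or weights): fit weights over classic benchmark scores so the cheap weighted mix tracks Arena-style ratings.

```python
import numpy as np

# Made-up scores of four models on three classic benchmarks (rows = models A-D).
bench = np.array([
    [0.86, 0.52, 0.74],
    [0.78, 0.44, 0.69],
    [0.90, 0.61, 0.80],
    [0.70, 0.35, 0.58],
])
# Made-up Arena-style ratings for the same four models.
arena = np.array([1250.0, 1180.0, 1290.0, 1100.0])

# Reweighting: find w so the benchmark mix bench @ w approximates the ratings.
w, *_ = np.linalg.lstsq(bench, arena, rcond=None)
mixed = bench @ w

# Once fitted, ranking by the mixed score stands in for the costly Arena ranking.
print(mixed.round(1))
```

With any reasonably monotone data like this, the ordering of `mixed` matches the Arena ordering, which is the whole point: new models only need the cheap benchmark runs.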
Long-context LLMs Struggle with Long In-context Learning! 🤯
We developed LongICLBench to rigorously test LLMs on extreme classification tasks with increasing complexity. We meticulously selected six datasets with a label range spanning 28 to 174 classes covering different input
Long-context LLMs Struggle with Long In-context Learning
Suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences.
🤖How far are we from achieving Expert AGI? We included human experts' performance on the MMMU benchmark ().
📊 The best-performing human experts achieved an accuracy of 88.6 while the best-performing model
@GoogleDeepMind
Gemini Ultra just scored 59.4,
🥰Attending
#CVPR2024
and presenting 🏆Award Candidate Paper MMMU!
DM is open, drop me a message if you'd like to chat about
#multimodal
,
#LLMs
,
#evaluation
or
#GenAI
in general!
TUE 18 JUN 3:30pm:
New frontiers for zero-shot Image Captioning Evaluation (NICE)
THU 20 JUN
Excited to announce that MMMU has been selected as an Oral presentation at
#CVPR2024
(90 orals in total, 0.8%)! Congrats to all the collaborators, and see you in Seattle! It will be my first time attending a CV conference. So excited! 😃
I'm in Singapore🇸🇬 for
#EMNLP2023
! Do not hesitate to ping me for a coffee chat! My recent work covers a range of topics in
#LLMs
&
#LMMs
:
- Multi-modal training & eval
- (Math) Reasoning (generalization/robustness)
- Attribution of
#LLMs
Check more details below👇
- MMMU: A
🌟 Exciting news: 🦣MAmmoTH was accepted as a spotlight (5%) at
#ICLR2024
! A huge shoutout to our amazing team!
We're now exploring more training dynamics of
#LLMs
for math reasoning and uncovering fascinating insights. Perhaps a sequel of the work? MAmmoTH-2 :)?
Stay tuned!!
After receiving community feedback, we added
@GoogleDeepMind
Gemini 1.5 Pro's results. 👇 Gemini 1.5 Pro's vision ability was significantly improved compared to 1.0 Pro and matched GPT-4's performance on our VisualWebBench! 🏆 Its action prediction (e.g., predicting what would
Amazing achievement! Congratulations on reaching the new state-of-the-art with a 62.4% score for Gemini on our newly-released MMMU benchmark. Gemini's multimodal perception and reasoning capabilities are truly impressive!😱
I’m very excited to share our work on Gemini today! Gemini is a family of multimodal models that demonstrate really strong capabilities across the image, audio, video, and text domains. Our most-capable model, Gemini Ultra, advances the state of the art in 30 of 32 benchmarks,
How does a baby learn to navigate the world around them? 🚶♂️👶 Through exploration and learning from each little stumble and triumph. The ETO framework applies this very essence of human learning to AI, emphasizing the importance of both success and failure in developing better AI
Trial and Error: Exploration-Based Trajectory Optimization for LLM Agents
Presents an exploration-based trajectory optimization approach, which consistently surpasses baseline performance by a large margin
repo:
abs:
**Large high-quality Pre-trained Data is All you Need**
Great performance of Yi-34B by
@01AI_Yi
on
@huggingface
#LLM
leaderboard.
Highlights from the website:
> Pre-trained on 3T high-quality tokens from scratch
> Support up to 200K context window
Our team at
@01AI_Yi
is very proud to introduce the release of Yi-34B model now on top of
@huggingface
pretrained LLM leaderboard! Also a Yi-6B available. Welcome to give a try and build fantastic projects!
As the first author of the MMMU benchmark, I want to emphasize that our team didn't grant early access to anyone outside our team. Our goal in creating this benchmark is to ensure a fair and accessible evaluation platform for the entire community. Consistent with this aim, we
Hi
@emilymbender
, I'm one of the lead authors of MMMU. I can certify that 1) Google didn't fund this work, and 2) Google didn't have early access. They really like the benchmark after our release and worked very hard to get the results. It doesn't take that long to eval on a
🤯Here are the
#Llama3
-8B base model results on the selected reasoning benchmarks.
Short Conclusions:
- Mistral 7B Base (?) < Llama 3 8B Base (15T tokens) ~= Gemma 7B Base (6T tokens). Maybe we do not need that many tokens?
- Llama-3 instruction tuning does a great job of
Oh, I just noticed that the strong code and math reasoning performance of
#Llama3
is reported based on their instruction-tuned version, which means that the model might have been trained on GSM8K or MATH (augmented) training sets. 😅
😱Is pure text pre-training coming to an end?
A Thread on 🦙 Llama 3's report 🧵:
1. Tokenizer 🔍
Llama 3's tokenizer boasts a 128k vocab size and yields 15% fewer tokens than Llama 2, enabling more efficient and accurate tokenization.
2. Model
🥳It is great to work with
@Francis_YAO_
(most credits go with Yao!!!) to dive deep into data influence on training long context models.
🤠 TL;DR: A practical technical blog on training long context models, covering data scale/mixture, training setup (e.g., positional encoding,
Although there is abundant work studying long-context LLMs, most of it talks about architecture / positional encoding; almost none of the existing papers talk about data.
In this work, we take a close look at data influence on context scaling
📢 New preprint alert! Check out our latest research on 🌟automatic evaluation of attribution by
#LLMs
, i.e., verifying whether the generated statement is supported by the cited reference. [1/N]
#NLP
#NLProc
@MistralAI
just released their v0.2 Base😱.
@WenhuChen
and I quickly evaluated a few benchmarks using the OpenCompass evaluation package. It seems that the capability dropped a little bit on nearly all the benchmarks I tested. 🤔
Mistral just announced at
@SHACK15sf
that they will release a new model today:
Mistral 7B v0.2 Base Model
- 32k instead of 8k context window
- Rope Theta = 1e6
- No sliding window
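For context on the Rope Theta line (a sketch of the standard RoPE formula, not Mistral's internals): the base θ sets the rotary frequencies, and raising it from the common 1e4 to 1e6 stretches the longest wavelength far beyond the new 32k window, which is one standard recipe for long-context models.

```python
import math

def max_wavelength(theta: float, dim: int = 128) -> float:
    """Longest RoPE wavelength in positions for head dimension `dim`.
    RoPE uses frequencies theta**(-2i/dim); the slowest one, at
    i = dim/2 - 1, gives wavelength 2*pi*theta**((dim-2)/dim)."""
    return 2 * math.pi * theta ** ((dim - 2) / dim)

# Common default base vs the announced v0.2 base value.
print(max_wavelength(1e4))  # ~5.4e4 positions
print(max_wavelength(1e6))  # ~5.1e6 positions
```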
Our
#EMNLP2023
work also reveals this phenomenon. We find that despite being able to generate correct step-by-step solutions in the beginning, LLMs cannot maintain their belief in the truth when challenged by oftentimes absurdly invalid arguments.
They give examples where removing oracles or improving initial prompts eliminates/reduces any benefit from self-correction.
Great representative diagram h/t
@xiangyue96
(6/n)
🎓 Just successfully defended my Ph.D. dissertation! 🥳📚 It's been a challenging and rewarding journey, but I made it! Grateful for the support of my advisor
@hhsun1
and all the members in
@osunlp
. I'll attend ACL in Toronto next week. Feel free to DM me if you'd like to chat!
lol!! I'll do my real PhD defense tmr and came across this simulator tonight. The funny thing is that the simulated duration is exactly the same as my real case. Is that some good indicator I'll pass my defense tmr? 🤩🤩
Congratulations! BioCLIP won the Best Student Paper at
#CVPR2024
! Sam, Luke
@luke_ch_song
and Yu
@ysu_nlp
are attending the conference, find them for a chat!
Excited to be at CVPR presenting BioCLIP! DM me if you want to chat about computer vision for animals, multimodal foundation models, or AI for science!
📢📢Want to share your textual data outside your org or team but worry about privacy leakage? Check out our new preprint "Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe"() 👇👇👇
🚀 The enthusiasm for OpenDevin has exceeded our expectations! We've got an initial roadmap and a bunch of great guys working on it. 🫡 Even
@gneubig
has completed a front-end prototype in a very short time!! It's all fantastic and you can fill out the form below to join us.
#acl2022nlp
📢📢 Want to know how to better leverage synthetic QA data for domain adaptation? Check out our ACL22 work: 👇👇
I will present our work in the two virtual poster sessions:
VPS2/VPS4: Question Answering, 14:00 EST May 24/25. Feel free to stop by!
I'm in Toronto and attending
#ACL2023
. I'll present the following work. My general research interests are building safe and responsible LMs. Some recent topics are privacy, hallucination, attribution, robustness, etc. Feel free to ping me if you'd like to chat on related topics🥳
3)
@xiangyue96
Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe. Session 7 - Wed July 12, 11:00-12:30, Room: Frontenac Ballroom and Queen’s Quay.
A realistic simulation of what it's like to be a grad student? I was only clicking buttons but the frequent rejection was triggering 🙃
It took me just under 7 years to graduate, can you do better?
Accelerating the Development of Large Multimodal Models with LMMs-Eval
Repo:
Blog:
We are offering a one-command evaluation API for fast and thorough evaluation of LMMs over 39 datasets (and growing).
🧩The discussion surrounding MoE has been a vibrant topic in our community. However, the open-source community's efforts to replicate and explore this process have been notably sparse. Thanks to the invaluable contributions of
@XueFz
, our open-source community now has a gateway
(1/5)🚀 Our OpenMoE Paper is out! 📄 Including:
🔍ALL Checkpoints
📊 In-depth MoE routing analysis
🤯Learning from mistakes & solutions
Three important findings:
(1) Context-Independent Specialization;
(2) Early Routing Learning;
(3) Drop-towards-the-End.
Paper Link:
Honored to receive the Exemplary Graduate Researcher Award. Immensely grateful to my advisor
@hhsun1
and the entire team
@osunlp
for their support! This award fuels my pursuit of safe, responsible
#LLMs
and their interdisciplinary uses, such as in privacy and healthcare. ☺️🥳
HUGE congratulations to Xiang Yue
@xiangyue96
@osunlp
on the very competitive Exemplary Graduate Student Researcher Award (1 out of 21 nominees across the College)! His work studies privacy-preserving
#LLMs
, attributions by LLMs, etc.
#NLProc
#proudadvisor
It was great to work with
@Francis_YAO_
on this! Two important takeaways from my side:
> The ability to retrieve information in the long context is already acquired during pretraining, even for models pre-trained on shorter sequences (e.g., 4096). A lightweight continual
Frontier models all have at least 100k context length, Gemini 1.5 has even 1m context. What about research and open source?
Introducing Long Context Data Engineering, a data driven method achieving the first 128k context open source model matching GPT4-level Needle in a
Can we really trust VLMs in critical areas like medical image diagnosis? No! The adversarial probing exposes major flaws in top models like GPT-4V and Gemini Pro. Top models sometimes perform worse than random guessing on diagnostic questions. Domain-specific models like
Can we really trust AI in critical areas like medical image diagnosis? No, and they are even worse than random. Our latest study, "Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA," uncovers the stark limitations of
Our community studied intelligent assistants for many years. One major drawback of these old systems is the lack of generalizability.
The most exciting prospect of SeeAct is to show the potential for these agents to act as a universal interface for information and services. By
Generalist web agents may get here sooner than we thought---introducing SeeAct, a multimodal web agent built on GPT-4V(ision).
What's this all about?
> Back in June 2023, when we released Mind2Web () and envisioned generalist web agent, a language agent
#acl2022nlp
📢📢 Interested in open-domain question answering and LM pretraining? Want to know how to better pretrain open-domain QA models or dense retrievers in an unsupervised/self-supervised way? Check out our ACL22 work: 👇👇
#LLMs
struggle a lot in defending the real truth and can be easily misled by the user’s (invalid) arguments and critiques. Their strong reasoning capability is actually very fragile and may stem from prompters knowing the answer, akin to the horse
#CleverHans
. 👇👇
#EMNLP2023
Are LLMs reasoning based on deep understandings of truth and logic? Can LLMs hold & defend their own "reasoning"?
Our
#EMNLP23
findings paper () explores testing LLMs' reasoning by engaging them in a debate that probes deeper into their understanding.
Great summary! Thanks for mentioning our MathInstruct dataset.
MathInstruct is a meticulously curated math instruction tuning dataset. It is compiled from 13 math rationale datasets, featuring a hybrid use of chain-of-thought (CoT) and program-of-thought
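To make the hybrid concrete (a made-up example in the two styles, not an actual MathInstruct instance): a chain-of-thought (CoT) rationale reasons in natural language, while a program-of-thought (PoT) rationale writes the same computation as a short executable program.

```python
question = "A store sells pens at $3 each. How much do 7 pens cost after a $2 discount?"

# CoT-style rationale: natural-language reasoning ending in the answer.
cot = "7 pens cost 7 * 3 = 21 dollars; after the $2 discount, 21 - 2 = 19."

# PoT-style rationale: the reasoning expressed as a runnable program.
def pot() -> int:
    price, count, discount = 3, 7, 2
    return price * count - discount

print(pot())  # 19
```

PoT offloads the arithmetic to an interpreter, which is why mixing the two styles helps on computation-heavy problems.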
2023 has been incredible for open releases, so I made a ✨year review in Open LLMs ✨
It was lots of fun coming back through all that came out, and it's insane how much the field soared thanks to the community & openness!
Summary of each section: 🧵
4) Our findings suggest that data distribution (not just size) is critical for grokking, and cross-layer memory sharing in the transformer architecture could improve systematic generalization.
StructLM
Towards Building Generalist Models for Structured Knowledge Grounding
Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their
This was the third work I have collaborated on with
@BoshiWang2
. He is a brilliant and hard-working guy. I enjoyed every discussion with Boshi. This project actually started from a discussion when we attended
#ACL2023
last year. I still remember we were talking
Congratulations to the amazing team at
@osunlp
on the 8 papers accepted to
#ACL2023
🎉👏 So proud of being part of this talented group and one of the authors on this list! 🙌
#NLP
#research
🎉 Thrilled to share
@osunlp
has 8 papers accepted to
#ACL2023
(out of 11 subs), and 3 of the papers received best paper nominations from reviewers. We don't normally submit this many papers and are grateful it turned out well 🥰 Equally proud of the papers that didn't get in this time!
Glad to see this is finally out! Congrats
@bernaaaljg
@ysu_nlp
! HippoRAG draws inspiration from the hippocampal memory indexing theory of human long-term memory. Check this out if you are working on
#RAG
, multi-hop
#reasoning
, or related topics!
📣📣 Super proud to present the most exciting project of my PhD so far: “HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models”.
HippoRAG, as the title suggests, is a brain-inspired RAG framework that enables LLMs to effectively and efficiently
3) Deep analysis into the model's internals reveals the gradual formation of generalizing circuits during grokking. The configuration of these circuits explains the variations in systematicity across tasks.
@DrewHawkswood
Our work has actually been “peer-reviewed” by the whole community in the past week🤣. We received many comments from X/Twitter, HF, GitHub, emails, etc. We are very grateful for this feedback from the community and will incorporate it into the revisions.
@teortaxesTex
We actually fine-tuned a 15B version but the results look pretty weird. It might be due to a transformers version issue. We are still debugging this. It should be resolved very soon. Will definitely release a 15B version. :)
MMMU is a brand new benchmark () that was released just last week, with ~11,500 examples requiring image understanding, college-level subject knowledge and deliberate reasoning. We decided it would be fun to try the Gemini models on this benchmark to see
Looking for the best open-source (small) Math model?
I'm happy to release MAmmoTH-7B-Mistral (), which achieves 40% on MATH and 52% on MMLU-Math. Nothing fancy, I just fine-tuned Mistral-7B on our previous MathInstruct dataset ().
5) The power of parametric memory for complex reasoning: a grokked transformer achieves near-perfect accuracy on a challenging task where
#GPT4
&
#Gemini
-Pro with non-parametric memory fail badly.
🥳We train and evaluate 50+ models and baselines (500+ experiments). We compile two giant result tables covering nearly all the LLMs we can find in the math reasoning field. 🧐
@OpenAI
#GPT4o
has greatly improved its
#reasoning
abilities across text and
#multimodal
contexts. It now achieves an accuracy of 🤯🤯69.1% on our
#MMMU
benchmark, close to the performance of lower-performing human experts (76.2%)!
"With GPT-4o, we trained a single new model
🎉We're thrilled to share that our paper👇 has been accepted by
#ACL2023NLP
! Our method fine-tunes LMs with DP to generate useful text while providing strong privacy protection. Check out our preprint for more details on this promising path to mitigating
#privacy
concerns in NLP
A natural question to ask: why is 🦣
#MAmmoTH
so powerful😱? We investigate how the two major characteristics of
#MathInstruct
influence the performance of 🦣. Main takeaway: Diverse data sources and hybrid CoT & PoT training lead to substantial gains, making 🦣 math generalists.
🚀Our instruction-tuning dataset
#MathInstruct
is compiled from 13 math datasets, 6 of which have rationales newly curated by us. What sets
#MathInstruct
apart?
1️⃣Broad coverage of different math fields and complexity levels
2️⃣Hybrid CoT & PoT rationales
🧐Both LLMs and humans hallucinate. It’s a form of creation. The issue arises in contexts where factuality is crucial. We expect LLMs not to ‘dream’ then.
It’s not the hallucination that’s the problem, but the context in which it occurs. Creativity is a gift, but factuality and
# On the "hallucination problem"
I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
We direct their dreams with prompts. The prompts start the dream, and based on the
Our incredible team's efforts made this project possible. 🙌
@GeZhang86038849
is at ICLR'24 in Vienna and will be presenting MAmmoTH1 (ICLR'24 Spotlight) and MAmmoTH2. 🎙️
If you're at the conference and would like to learn more about both projects, feel free
Introducing Vision Arena! Inspired by the awesome Chatbot Arena, we built a web demo on
@huggingface
for testing Vision LMs (GPT-4V, Gemini, Llava, Qwen-VL, etc.). You can easily test two VLMs side by side and vote! It's still a work in progress. Feedback is welcome!
🔗
Hiring multiple Ph.D. students this cycle in areas:
#LLM
train/eval, trustworthiness of LLMs incl. privacy & safety, LLM for biomedicine/chemistry. See below for representative work. I won't be at
#EMNLP23
, but pls talk to my (former) students there
@xiangyue96
@BoshiWang2
@RonZiruChen
Great to see that this project has been successfully shipped!!
> BioCLIP is the first large-scale multimodal model for general biology questions related to images.
> BioCLIP utilizes a wide range of biological images, including plants, animals, and fungi.
> Trained on the
Introducing BioCLIP: A Vision Foundation Model for the Tree of Life
A foundation model that strongly generalizes on the tree of life (2M+ species), outperforming OpenAI CLIP by 18% in zero-shot classification, and supports open-ended classification over
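As background on the zero-shot number (a generic CLIP-style sketch with made-up 4-d embeddings, not BioCLIP's code or data): zero-shot classification picks the class whose text embedding is most cosine-similar to the image embedding.

```python
import numpy as np

def zero_shot_classify(image_emb, class_embs, class_names):
    """Return the class whose text embedding is most cosine-similar to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarities after normalization
    return class_names[int(np.argmax(sims))]

# Made-up embeddings purely for illustration.
image = np.array([0.9, 0.1, 0.0, 0.4])
classes = np.array([
    [1.0, 0.0, 0.0, 0.5],   # "plant"
    [0.0, 1.0, 0.0, 0.0],   # "animal"
    [0.0, 0.0, 1.0, 0.0],   # "fungus"
])
print(zero_shot_classify(image, classes, ["plant", "animal", "fungus"]))  # plant
```

Because the class list is just text, the same model can classify over any label set, including ones never seen in training, which is what "open-ended classification" refers to.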
New
#ACL2021
Findings paper:
"Differential Privacy for Text Analytics via Natural Text Sanitization".
The privacy issue is always overlooked in NLP. We address privacy from the root: directly producing sanitized text documents based on differential privacy. [1/3]
We hope our testbed, modeling methodology, and insights will help lay the foundation for future studies on this important problem. Our code, models, and datasets are available at: . Joint work with
@BoshiWang2
@DrogoKhal4
@RonZiruChen
@ysu_nlp
@hhsun1
[6/N]
[1/n]
👉👉👉
Check out our latest work exploring the behaviour of SoTA long-context LLMs when confronted with long in-context learning: “Long-context LLMs Struggle with Long In-context Learning”
We created 🐍LongICLBench🐍 to conduct comprehensive
📚Most of the attributed LLMs (e.g., generative search engines) rely on humans to verify the attribution, which is costly. Our research explores two automatic evaluation methods: prompting LLMs and fine-tuning smaller LMs on repurposed data from related tasks. [2/N]
Great to see that by combining our 🦣**MAmmoTH MathInstruct** dataset with other open-source datasets, the Mistral 7B model can obtain very impressive performance on GSM8K and MATH.
Excited to announce release of 𝗔𝗿𝗶𝘁𝗵𝗺𝗼-𝗠𝗶𝘀𝘁𝗿𝗮𝗹-𝟳𝗕 model that outperforms existing 7B and 13B state-of-the-art mathematical reasoning models by a huge margin on both GSM8K and MATH datasets.
Citation (or attribution) is definitely a crucial component in LLMs to enhance the verifiability and trustworthiness of the generated statement. Agreed that a comprehensive citation mechanism should account for both non-parametric and parametric knowledge. Good position paper!
New position paper! 🔥
We position "citation" as the key to building responsible and accountable large language models
- enhancing content transparency and verifiability, while mitigating IP and ethical dilemmas in LLMs such as
#ChatGPT
.
👉
🧵⬇️
Came across a fun math query: "use 4 4s to form an expression that equals 4." I tested it across various models and nearly all of them fail.😅
❌
#GPT4o
✅
#GPT4
❌
#Claude3
Opus
❌
#Gemini
❌
#Llama3
✅
#Yi
-1.5-34B-Chat (succeeded after a few attempts) ❌
#Qwen
-110B-Chat
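The puzzle itself is easy to settle by brute force (my own sketch, limited to + - * / with parentheses; unrelated to any of the models above): enumerate every way to combine four 4s and check whether 4 is reachable, e.g. via (4 - 4) * 4 + 4.

```python
from itertools import product

def results(nums):
    """All values reachable by combining nums (in order) with + - * / and
    any parenthesization. With identical numbers, order doesn't matter."""
    if len(nums) == 1:
        return {nums[0]}
    vals = set()
    for i in range(1, len(nums)):
        for a, b in product(results(nums[:i]), results(nums[i:])):
            vals.update({a + b, a - b, a * b})
            if b != 0:
                vals.add(a / b)  # skip division by zero
    return vals

# Can four 4s make 4?  Yes: (4 - 4) * 4 + 4.
print(any(abs(v - 4) < 1e-9 for v in results((4.0, 4.0, 4.0, 4.0))))  # True
```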
@TheSeaMouse
Right. As most of the questions in MMMU rely heavily on images, GPT-4 Text-only often says something like "Without images, I cannot answer this question". We also tried some prompting strategies to force GPT-4 Text to output the most likely answer. The results are pretty low as
@akjindal53244
Thanks for the great work Ashvini! Would you mind adding MAmmoTH 7B performance in the leaderboard? And could you please acknowledge MathInstruct dataset in the introduction and recommend citing MAmmoTH and MetaMath papers if people use the data and model in the repo?
@YangsiboHuang
Very interesting work! I agree that heuristic methods like PII removal would be useful against targeted attacks. But for untargeted attacks, we definitely need DP. Our previous work in ACL'21 and ACL'23 could be extended to this scenario:
@airesearch12
@Yampeleg
Thanks! We currently only support Python for the execution. But the model can generate other languages as well. We are considering extending this execution and refinement to other popular languages. Any suggestions and comments are very welcome.
📊 Our results show promising signals in the automatic evaluation: fine-tuned smaller LMs (e.g., FLAN-T5-Large) can even outperform LLMs like ChatGPT. However, automatic evaluation of attribution still remains challenging. [4/N]
😀I like the model's name Yi. "Y" resembles the Chinese character "人" (which means "human") rotated 180 degrees. "i" is taken from "AI". So "Yi" means Human + AI ("以人为本" in Chinese, i.e., "people first"). It's a perfect visual mash-up! The GIF was taken from the website
Very impressive results! Congratulations! Many people including me are wondering how the pre-training data size and mixture contribute to the improvement🤔️
@hu_yifei
Thanks! There is little chance we included such examples. Our questions are mostly from college textbooks and exams. But I definitely agree that such a GUI understanding scenario (e.g., web or mobile agents) will be another important application of LMMs' evaluation.
@jefffhj
Great work! Our ACL2023 work demonstrates that fine-tuning LMs with differential privacy is a good treatment for private information (e.g., PII) leakage by LMs.
🧩Most evaluation failures stem from three factors: 1) insensitivity to fine-grained information comparison, such as numerical values; 2) overlooking contextual cues in the reference; and 3) failures in performing symbolic operations, such as verifying set relationships. [5/N]
@Maxwell_Nye
Congrats on the outstanding achievements of Fuyu-Heavy in our MMMU benchmark! We would be delighted to feature Fuyu-Heavy's detailed results like Gemini on our leaderboard. This could provide more visibility for the model: .