I am thrilled to announce that I will be joining the Department of Computer Science
@unccs
at UNC-Chapel Hill
@UNC
starting in Fall 2023.
📢I am actively seeking highly motivated students for Ph.D. positions (Spring/Fall 2024), postdoc positions, and interns. RT appreciated.
I'm on the academic job market this year!
I develop machine learning methods that are robust and adaptable to distribution shifts and open/non-stationary environments, w/ interdisciplinary applications in healthcare+drug discovery, transportation, and education. RT appreciated.
📢Multimodal Large Language Models (MLLMs) often generate hallucinatory responses that disregard actual visual input information. We attribute this issue to the lack of alignment between different modalities and demonstrate that we can address it by improving alignment with
📢Excited to share our approach called Calibrated Self-Rewarding Vision Language Models (CSR)🌟! With no need for labeled data, a VLM can get stronger by itself with visual constraints. Discover how CSR enhances VLMs through self-improvement with visual constraints:
🚨 Unveiling GPT-4V(ision)'s mind! We're breaking down how even the brightest Visual Language Models get it wrong!
With our new 'Bingo' benchmark, we shed light on the two common types of hallucinations in GPT-4V(ision): bias and interference.
Led by
@cuichenhang
@AiYiyangZ
Excited to announce the Workshop on Reliable and Responsible Foundation Models at
@iclr_conf
2024 (hybrid workshop).
We welcome submissions! Please consider submitting your work here: (deadline: Feb 3, 2024, AOE)
Hope to see you in Vienna or
🚀 Can we directly rectify hallucination in Large Vision-Language Models (LVLMs)?
🛠 We introduce a hallucination revisor named LURE that mitigates hallucination in LVLMs, achieving over a 23% improvement!
Nice work, Yiyang Zhou &
@cuichenhang
🌟NEW Paper Alert 🌟
👩⚖️MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? ()
🧐Also wondering which judge model best provides feedback for your diffusion models?
We evaluate multimodal judges in providing
🚨New Work Alert: CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models!
We delve into the trustworthiness of Med-LVLMs across 5 key dimensions: trustfulness, fairness, safety, privacy, & robustness. With 41K Q&A pairs, spanning 16 image
Super-excited to share our new work about out-of-distribution robustness ().
We propose a simple mixup-based method to learn invariant functions via selective augmentation.
🚨Excited to share our work on the seamlessness between Policy Models (PM) and Reward Models (RM) in RLHF🌟!
Motivation: Improving PM and RM separately doesn't translate to better RLHF outcomes.
Solution: We introduce SEAM, a concept that quantifies the distribution shift
#RLHF
is taking the spotlight. We usually focus on boosting reward and policy models to enhance RLHF outcomes. Our paper dives into the interactions between PM and RM from a data-centric perspective, revealing that their seamlessness is crucial to RLHF outcomes.
🎉Please join us in welcoming this year's new faculty cohort! From algorithms, security, machine learning, to graphics and computational optics, these faculty will maintain the standard of excellence at
@UNCCS
!
➡️
@UNC
@unccollege
@UNCResearch
@UNCSDSS
📢The reasoning ability of multimodal LLMs has been widely evaluated on single, static images, which is far from enough. Introducing '🎞️ Mementos': our new benchmark that pushes multimodal LLMs to understand and reason about behavior over image sequences.
Findings:
#GPT4V
and
#Gemini
🎬 Just like Nolan's 'Memento' rewrote storytelling, we're reshaping AI! Introducing '🎞️ Mementos': our benchmark pushing AI to understand sequences of images, not just stills. A real game-changer in AI's narrative.
#AIStorytelling
#Multimodal
#LLMs
#GenAI
We need more reviewers for the Workshop on Reliable and Responsible Foundation Models at
@iclr_conf
. If you are interested, please fill out the nomination form.
I'll be at
#CVPR2024
from June 16th to 22nd, looking forward to catching up with old friends and making new ones. In addition, I have 2-3 PhD openings next year.
Feel free to DM me to grab a ☕️ and chat about research and PhD opportunities if you're around!
📢Workshop on Reliable and Responsible Foundation Models will happen today (8:50am - 5:00pm). Join us at
#ICLR2024
room Halle A 3 for a wonderful lineup of speakers, along with 63 amazing posters and 4 contributed talks! Schedule: .
Thanks a lot, Mohit! If you are interested in joining my lab, kindly complete the application form and send an email to huaxiu.recruiting@gmail.com.
- Ph.D. student and intern application form:
- Postdoc application form:
🎉🥳 Excited to have
@HuaxiuYaoML
joining us (from
@StanfordAILab
) very soon this August 2023! Welcome to the
@unc
@unccs
@uncnlp
family, Prof. Huaxiu!! Looking forward to many awesome collaborations😀
PS. Students applying for spring/fall2024 PhD admissions, take note below 👇
📢We are organizing the ICML 2024 Foundation Models in the Wild Workshop. Submissions on perspectives, pitfalls, and paths forward for foundation models in any downstream real-world scenarios are very welcome! 🔥
See u in Vienna!
Excited to announce the Workshop on Foundation Models in the Wild at
@icmlconf
2024 (hybrid workshop).
We welcome submissions! Please consider submitting your work here: (deadline: May 31, 2024, AOE) Hope to see you in Vienna or virtually in July,
🚀 New Paper Alert!
#AIChallenge
: How do we prevent "hallucination snowballing" in Large Language Models (LLMs)? And, how can we use verification results to enhance trustworthiness in AI-generated text? These are critical issues in
#LLMs
.
🧠 Introducing EVER (Real-time
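Conceptually (and only as a hedged illustration, since the tweet above is truncated), EVER-style generation verifies each sentence as it is produced, so that an early error cannot snowball into later ones. Below is a minimal, runnable toy of that generate-verify-rectify loop; the toy model, verifier, and rectifier are made-up stand-ins, not the paper's components.

```python
# A minimal, runnable sketch of the generate-verify-rectify idea behind EVER:
# emit one sentence at a time, check it against trusted evidence, and fix it
# before it can contaminate later sentences. The toy "model" and "evidence"
# below are illustrative stand-ins, not the paper's actual components.

EVIDENCE = {"Saturn": "Saturn is a gas giant with prominent rings."}

def toy_generate(context, step):
    # Stand-in for sampling the next sentence from an LLM.
    drafts = ["Saturn is a rocky planet.", "It orbits the Sun."]
    return drafts[step] if step < len(drafts) else None

def toy_verify(sentence):
    # Stand-in verifier: flag sentences contradicting stored evidence.
    if "Saturn" in sentence and "rocky" in sentence:
        return False, EVIDENCE["Saturn"]
    return True, None

def toy_rectify(sentence, evidence):
    # Stand-in rectifier: replace the flawed sentence with the evidence.
    return evidence

def generate_with_verification(prompt, max_sentences=5):
    context, output = prompt, []
    for step in range(max_sentences):
        sentence = toy_generate(context, step)
        if sentence is None:
            break
        ok, evidence = toy_verify(sentence)
        if not ok:
            sentence = toy_rectify(sentence, evidence)  # fix before it snowballs
        output.append(sentence)
        context += " " + sentence
    return " ".join(output)

print(generate_with_verification("Tell me about Saturn."))
```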
I'll be at
#ICLR2024
in Vienna🇦🇹 next week (from May 7th to 12th), looking forward to catching up with old friends and making new ones. Feel free to DM me to grab a ☕️ and chat if you're around!
I will be at
#NeurIPS22
next week. I am on the academic job market and work on building machine learning models that are unbiased, widely generalizable, and easily adaptable to distribution shifts.
DM me to grab a ☕️ and chat if you are around!
Meta-learning methods typically sample meta-training tasks uniformly at random, treating every task as equally important.
However, some tasks can be detrimental because they are noisy or imbalanced.
In
#NeurIPS2021
, we propose a neural task scheduler (ATS) to adaptively select training tasks
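As a rough illustration of the scheduling idea (not the paper's actual ATS network), one can replace uniform task sampling with score-weighted sampling so that noisy or unhelpful tasks are drawn less often; the per-task scores below are random stand-ins.

```python
# A minimal sketch of adaptive task scheduling: instead of sampling
# meta-training tasks uniformly, sample them in proportion to a score
# that down-weights noisy or unhelpful tasks.
import numpy as np

rng = np.random.default_rng(0)
num_tasks = 100
# Illustrative per-task scores, e.g. loss decrease on a clean validation
# set; here random numbers serve as a stand-in.
task_scores = rng.uniform(size=num_tasks)

def sample_task_batch(scores, batch_size=8, temperature=1.0):
    """Softmax-weighted sampling over tasks; low-score (e.g. noisy)
    tasks are drawn less often than under uniform sampling."""
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(scores), size=batch_size, replace=False, p=probs)

print(sample_task_batch(task_scores))
```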
🎉Our paper “Meta-learning with Fewer Tasks through Task Interpolation” has been accepted by
@iclr_conf
. A simple solution that improves the generalization of meta-learning by densifying the task distribution; it works particularly well when you do not have a large number of training tasks.
Meta-learning methods need a large set of training tasks. We introduce a simple regularizer that helps, especially when you don’t have a lot of tasks.
Meta-Learning with Fewer Tasks through Task Interpolation
Paper:
with
@HuaxiuYaoML
,
@zlj11112222
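A minimal sketch of the task-interpolation idea: build a new "task" by mixup-style interpolation between two sampled tasks, densifying the task distribution. The shapes and Beta prior are illustrative choices, not the paper's exact recipe.

```python
# Create a new task by interpolating the features and labels of two
# sampled tasks with a Beta-distributed mixing ratio.
import numpy as np

rng = np.random.default_rng(0)

def interpolate_tasks(task_a, task_b, alpha=0.5):
    """Mix features and labels of two tasks with a Beta-sampled ratio."""
    lam = rng.beta(alpha, alpha)
    x = lam * task_a["x"] + (1 - lam) * task_b["x"]
    y = lam * task_a["y"] + (1 - lam) * task_b["y"]
    return {"x": x, "y": y}

task_a = {"x": rng.normal(size=(5, 3)), "y": rng.normal(size=(5, 1))}
task_b = {"x": rng.normal(size=(5, 3)), "y": rng.normal(size=(5, 1))}
new_task = interpolate_tasks(task_a, task_b)
print(new_task["x"].shape, new_task["y"].shape)
```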
I will attend
#KDD2023
from next Mon (Aug 7) - Wed (Aug 9).
📢My group
@unccs
will have multiple Ph.D. (fall 2024)/remote intern positions.
☕️DM me if you are interested in discussing
#foundationmodels
,
#AISafety
,
#MedicalAI
, or Ph.D. applications.
Excited to share our new work on Healthcare AI: can we assist doctors in recommending personalized newly approved medications to patients ()?
Nice work led by Zhenbang Wu, and collab w. the amazing
@james_y_zou
@chelseabfinn
@jimeng
and others.
🚀 Preference fine-tuning has shown immense power in boosting factuality in LLMs! Our straightforward strategy slashes factual errors by a whopping ~50% in LLama 1 & 2.
Fine-tuning Language Models for Factuality
paper page:
The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone
If you're at
#ICML2022
and interested in out-of-distribution generalization, please come to our talk (Wed 20 July 1:15 pm ET, Room 318 - 320) and poster (Wed 20 July 6:30pm, Hall E
#321
)!
Neural nets are brittle under domain shift & subpop shift.
We introduce a simple mixup-based method that selectively interpolates datapts to encourage domain-invariance
ICML 22 paper:
w/
@HuaxiuYaoML
Yu Wang
@zlj11112222
@liang_weixin
@james_y_zou
(1/3)
👇Our
#EMNLP2023
work suggests that RLHF-LLMs verbalize probabilities that are significantly better calibrated than the model's conditional probabilities, thus yielding well-calibrated confidence estimates.
LLMs fine-tuned with RLHF are known to be poorly calibrated.
We found that they can actually be quite good at *verbalizing* their confidence.
Led by
@kattian_
and
@ericmitchellai
, at
#EMNLP2023
this week.
Paper:
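For readers who want to check the calibration of verbalized confidences themselves, here is a short, runnable sketch using expected calibration error (ECE); the confidence/correctness arrays are toy data, not results from the paper.

```python
# Measure whether verbalized confidences are calibrated via expected
# calibration error (ECE): bin predictions by stated confidence and
# compare each bin's average confidence with its empirical accuracy.
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    conf, correct = np.asarray(conf), np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy example: probabilities the model *verbalized* ("I'm 90% sure")
# versus whether its answers were actually correct.
verbalized = [0.9, 0.8, 0.6, 0.95, 0.5, 0.7]
is_correct = [1, 1, 0, 1, 1, 0]
print(f"ECE: {expected_calibration_error(verbalized, is_correct):.3f}")
```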
Thanks
@james_y_zou
for the unreserved support! I work on building
#MachineLearning
models that are unbiased, widely generalizable, and easily adaptable to in-the-wild shifts. DM me if you think I would be a good fit for your department!
@HuaxiuYaoML
is a super postdoc at
@StanfordAILab
He has done much interesting work on meta-learning, data augmentation, and OOD learning to make
#ML
more reliable.
🔥Though I've seen bad reviewers and ACs on my other papers, LISA was luckily accepted to
@icmlconf
. An extremely simple model with super-cool results for tackling distribution shifts. Nice collab w/
@__YuWang__
,
@chelseabfinn
, and others🎉.
ArXiv👇 and code:
We systematically evaluate the safety and robustness of Vision LLMs, including adversarial attacks and OOD generalization.
👇See detailed takeaways in Haoqin’s thread.
Nice collaboration with
@cihangxie
’s team!
Vision LLMs like LLaVA and GPT4V, are good at handling regular vision-language tasks, but are they really robust and safe😈
With our new VLLM safety benchmark, we shed light on two types of safety evaluations for existing VLLMs: OOD situations and adversarial attacks. 🧵👇
We are excited to announce an open rank faculty hiring initiative for several positions to bolster research and innovation in the emerging field of
#DataScience
.
🎓100% SDSS position:
🎓All faculty positions:
We are thrilled to announce that James Zou has been promoted to DBDS Associate Professor with tenure. A hearty congratulations and thanks to his endless hard work, revolutionary research and constant contributions to our department.
#EMNLP2021
Today (Nov 7th), I will present KGML () for few-shot text classification, where a knowledge graph is used to bridge the gap between training and test tasks, in Oral Session 4A at 12:45 - 2:15 pm PST and the Poster Session at 3:00 - 5:00 pm PST.
(3/7) Factual bias: GPT-4V(ision) gets tripped up by counterfactual images, sticking to 'common sense' instead of what's actually in the image. When Saturn is missing from a solar-system photo, it still calls out Saturn (right example).
If you are working on meta-learning and struggling with the overfitting issue, you should use MetaMix presented at
#ICML2021
. It is a simple task augmentation method to improve generalization in meta-learning.
Paper:
Code:
Excited to announce 1st Pre-training Workshop at
@icmlconf
(hybrid workshop).
We welcome submissions! Please consider submitting your work here: (deadline: May 22, 2022, AOE)
Hope to see you in person in July, stay tuned for more info.
(8/8) Our theory further shows that, for linear and monotonically non-linear models, C-Mixup improves generalization in regression.
#NeurIPS2022
Paper:
Code: (coming soon)
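A minimal sketch of the C-Mixup sampling rule: pick a mixing partner with probability proportional to a Gaussian kernel on label distance, then apply standard mixup. The bandwidth and Beta parameters are illustrative choices, not the paper's tuned values.

```python
# C-Mixup-style pairing for regression: partners with closer labels are
# sampled with higher probability, then the pair is mixed as in mixup.
import numpy as np

rng = np.random.default_rng(0)

def c_mixup_pair(X, y, i, sigma=1.0, alpha=2.0):
    """Mix example i with a label-similar partner j."""
    d2 = (y - y[i]) ** 2
    p = np.exp(-d2 / (2 * sigma**2))
    p[i] = 0.0                      # don't mix an example with itself
    p /= p.sum()
    j = rng.choice(len(y), p=p)     # closer labels -> higher probability
    lam = rng.beta(alpha, alpha)
    return lam * X[i] + (1 - lam) * X[j], lam * y[i] + (1 - lam) * y[j]

X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
x_mix, y_mix = c_mixup_pair(X, y, i=0)
print(x_mix.shape, y_mix)
```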
(6/7) Are there any mitigations? Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges.
😍I'm super excited to announce my next journey! After a great time at KAIST, I'll be working as a Postdoctoral Research Associate at UNC Chapel Hill (
@UNC
) this fall, working with Prof. Mohit Bansal (
@mohitban47
) and faculty+students in the awesome
@uncnlp
and
@unccs
groups!
1/3
If you don't have sufficient tasks in meta-learning, come and check our
#ICLR2022
oral "Meta-learning with Fewer Task through Task Interpolation" for an extremely simple solution. Nice collab with
@chelseabfinn
,
@zlj11112222
Oral: Wed 27, 9:30am PT.
Poster: Tue 26, 10:30am PT
(7/7) Bias and interference aren't just GPT-4V(ision) problems - LLaVA and Bard have them too. Our study shows these 'hallucination' issues are widespread in cutting-edge visual-language models.
If you are interested in joining my lab, kindly complete the application form and send an email to huaxiu.recruiting@gmail.com.
- Ph.D. student and intern application form:
- Postdoc application form:
(4/7) Image-to-Image Interference: Composite images lead to confusion! GPT-4V(ision) finds it tough to tell apart combined images with visually similar elements, even when each individual image is simple for humans.
⚡️Excited to share our new
@NatureMedicine
paper where we used Twitter to build a vision-language foundation
#AI
for
#pathology
We curated >100K public Twitter threads w/ medical images+text to create PLIP for semantic search and 0-shot pred.
All our
📢 Any great papers that were rejected by ICML or that you plan to submit to NeurIPS? You can also submit these interesting works to our Pre-training workshop
@icmlconf
. 7 days left. more details 👇
I would like to express my sincere appreciation to my advisors (
@chelseabfinn
,
@JessieLzh
), collaborators (
@james_y_zou
,
@ericxing
,
@jimeng
, etc.), friends, and family for their invaluable support throughout this journey. I eagerly look forward to embarking on the next chapter🎉
APPLY: TENURE-TRACK/TENURE/TEACHING FACULTY. Research areas include but are not limited to: ML, NLP, vision, graphics, systems, bioinformatics, security, medical imaging, robotics & AR/VR. Join our team - committed to research, teaching, and collaboration.
➡️
Pre-training workshop will happen tomorrow. Join us this Saturday at
#ICML2022
room Hall F for a wonderful lineup of speakers and panelists, along with 48 amazing posters and 3 contributed talks!
Schedule:
[8/N] Feedback scale and format: further study reveals that
(1) the feedback performance of closed-source VLMs is almost invariant to scale (0-5 or 0-10) and format (numerical or Likert scale),
(2) while open-source VLMs perform significantly better when providing the
[6/N] Main result: we find that:
(1) Closed-source VLM judges generally provide more accurate feedback across all perspectives, with GPT4o performing the best.
(2) Smaller-sized CLIP-based scoring models can provide better feedback regarding text-image alignment and image
[2/N] Most previous studies have homed in on just one aspect of trustworthiness, such as diagnostic accuracy. What's largely missing is a comprehensive, standardized evaluation of Med-LVLMs across multiple critical dimensions, including safety, fairness, and privacy.
[7/N] End-to-end human evaluation (left): we select the six most capable judges and individually fine-tune a base SDv-1.5 model with their feedback via DPO. Human evaluation results on the final aligned model are generally consistent with the automatic metric, while we also find
[7/N] 👉Privacy.
(1) Unlike general LVLMs, Med-LVLMs often lack defenses against queries seeking private info, failing to refuse such content.
(2) Though Med-LVLMs may generate responses resembling private info, these are typically fabricated and not real disclosures.
(3)
[5/N] 👉Fairness. We've uncovered significant performance disparities across demographic groups, categorized by age, gender, and race.
Age-wise, the best performance is seen in the 40-60 group, with a drop in accuracy for the elderly due to uneven data.
Gender disparities are
[2/N] While some benchmarks have studied multimodal foundation models as a generator, only a few study their evaluative capability as a judge. MJ-Bench comprehensively and exclusively studies AI feedback for text-to-image generation around four key alignment objectives.
🔬Our solution stems from rigorous statistical analysis pinpointing the causes of hallucination in LVLMs. We've identified three pivotal factors:
- Object co-occurrence
- Object uncertainty
- Position of the object in generated descriptions
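As a hedged illustration only (not LURE's actual estimator), these three factors could be combined into a per-object hallucination-risk score that flags objects for revision; the weights and inputs below are invented for the example.

```python
# A toy risk score over the three factors named above; all inputs are
# assumed normalized to [0, 1], and the weights are made up.
def hallucination_risk(cooccur, uncertainty, position, w=(0.4, 0.4, 0.2)):
    """cooccur: how often the object co-occurs with described objects,
    uncertainty: the model's token-level uncertainty for the object,
    position: relative position in the description (later = riskier)."""
    return w[0] * cooccur + w[1] * uncertainty + w[2] * position

objects = {
    "dog":     dict(cooccur=0.2, uncertainty=0.1, position=0.1),
    "frisbee": dict(cooccur=0.9, uncertainty=0.7, position=0.8),
}
for name, f in objects.items():
    risk = hallucination_risk(**f)
    flag = "REVISE" if risk > 0.5 else "keep"
    print(f"{name}: risk={risk:.2f} -> {flag}")
```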
Our method is easy to implement and well-suited to domain shifts and subpopulation shifts.
The results are strong across nine benchmarks covering domain shifts (left figure) and subpopulation shifts (right figure)
[3/N] Dataset: We propose a high-quality preference dataset () structured around four key dimensions: text-image alignment, safety, image quality and artifacts, bias and fairness. Notably, each perspective is further decomposed into multiple sub-categories
The 5th edition of the Meta-Learning Workshop (
#MetaLearn2021
) is taking place on Workshop Monday (13th Dec.) @
#NeurIPS2021
, and the CfP is now out! Please submit your up-to-8-page research papers by Sept. 17th; more details at .
[6/N] 👉Safety.
(1) Under "jailbreaking" attacks, accuracy drops for all models.
(2) All models slightly increase in toxicity under toxic prompts, but LLaVA-Med uniquely shows strong resistance.
(3) However, its overly conservative tuning leads LLaVA-Med to be too cautious,
[5/N] We study a comprehensive set of multimodal judges, including 6 smaller-sized CLIP-based scoring models, 11 popular open-source VLMs, and 4 of the most capable closed-source VLMs. Meanwhile, we are updating the leaderboard () to include more recent models.
Motivated by mixup, our method encourages learning invariant functions and cancels out domain-related information by
(1) interpolating samples with the same label but different domains;
(2) interpolating samples with the same domain but different labels.
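A minimal sketch of these two selective-interpolation strategies; the dataset fields and Beta prior are illustrative, not the exact implementation.

```python
# Selective mixup: interpolate an example with a partner chosen by rule
# (1) same label, different domain, or (2) same domain, different label.
import numpy as np

rng = np.random.default_rng(0)

def selective_pair(X, y, d, i, intra_label=True, alpha=2.0):
    """Interpolate example i with a selectively chosen partner:
    intra-label: same label y, different domain d (cancels domain info);
    intra-domain: same domain d, different label y."""
    if intra_label:
        candidates = np.where((y == y[i]) & (d != d[i]))[0]
    else:
        candidates = np.where((d == d[i]) & (y != y[i]))[0]
    if len(candidates) == 0:
        return X[i], y[i]                 # no valid partner; keep as-is
    j = rng.choice(candidates)
    lam = rng.beta(alpha, alpha)
    return lam * X[i] + (1 - lam) * X[j], lam * y[i] + (1 - lam) * y[j]

X = rng.normal(size=(16, 4))
y = rng.integers(0, 2, size=16)           # binary labels
d = rng.integers(0, 3, size=16)           # three domains
x_mix, y_mix = selective_pair(X, y, d, i=0)
print(x_mix.shape, y_mix)
```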
[2/N] Challenge: Lack of alignment across modalities in Large VLMs (LVLMs) ➡️ LVLM Hallucination
Solution (preference optimization):
- Constructing preferences with additional models / human annotators ➡️ the constructed data may not fully reflect the preferences of the target LVLM 🙁
-
[4/N] Iterative Optimization Pipeline: 1⃣Sentence-level beam search combined with a calibrated self-rewarding strategy for preference data curation. 2⃣Preference optimization.🔄
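A hedged sketch of the calibrated-reward idea: combine the LVLM's own score for a candidate sentence with a CLIP image-text similarity so the visual signal constrains self-rewarding, then rank beam-search candidates into preferred/rejected pairs for preference optimization. The helper and mixing weight are illustrative stand-ins, not CSR's exact formulation.

```python
# Calibrate a self-reward with a visual constraint, then rank candidates.
def calibrated_reward(self_score, clip_score, lam=0.5):
    """self_score: the LVLM's own score for a candidate sentence,
    clip_score: CLIP similarity between the sentence and the image;
    both assumed normalized to [0, 1]. lam is an illustrative weight."""
    return lam * self_score + (1 - lam) * clip_score

# Toy candidates from beam search, as (self_score, clip_score) pairs;
# top/bottom candidates can serve as preferred/rejected preference data.
candidates = {
    "A dog catches a frisbee.": (0.8, 0.9),
    "A dog rides a bicycle.":   (0.7, 0.2),
}
ranked = sorted(candidates, key=lambda s: calibrated_reward(*candidates[s]),
                reverse=True)
print("preferred:", ranked[0])
print("rejected: ", ranked[-1])
```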
[3/N] Based on medical vision-language and image classification datasets, CARES includes roughly 18K images paired with 41K QA items, covering 16 medical imaging modalities and 27 anatomical regions across various question types.
[5/N] 👉Takeaway 1: CSR requires only the image data, the LVLM, and a CLIP model. The LVLM can achieve self-improvement through CSR, particularly compared with the standard self-rewarding paradigm.
🧵[4/4] 🌟 We've conducted a human evaluation of concepts flagged as extrinsic hallucinations by ChatGPT. These flags are mostly accurate, enhancing transparency and trustworthiness in text generation. A step forward for
#TrustworthyAI
! 🤖🔍
I was shocked to know that Dr. Jian Sun, my former colleague of the MSRA Visual Computing Group, has passed away. We will miss him dearly. May his soul rest in peace.
[4/N] 👉Trustfulness. Key findings: (1) These models often face 'factuality hallucination,' with error rates above 50% on our VQA benchmark, particularly for open-ended questions and less common modalities/regions. (2) Their performance in estimating uncertainty is also