Xingyu Fu Profile
Xingyu Fu

@XingyuFu2

Followers: 705 · Following: 405 · Media: 15 · Statuses: 68

PhD student at UPenn @cogcomp | Focused on Vision+Language Multimodal learning | Previous: B.S. @UIUC | ⛳️😺

Joined September 2020
Pinned Tweet
@XingyuFu2
Xingyu Fu
5 months
Can GPT-4V and Gemini-Pro perceive the world the way humans do? 🤔 Can they solve the vision tasks that humans can in the blink of an eye? 😉 tldr; NO, they are far worse than us 💁🏻‍♀️ Introducing BLINK👁 , a novel benchmark that studies visual perception
[image attached]
@_akhaliq
AK
5 months
BLINK: Multimodal Large Language Models Can See but Not Perceive. We introduce Blink, a new benchmark for multimodal large language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans
[image attached]
4 replies · 91 retweets · 374 likes
8 replies · 128 retweets · 403 likes
@XingyuFu2
Xingyu Fu
2 months
[Quoted tweet: the pinned BLINK announcement above]
6 replies · 18 retweets · 160 likes
@XingyuFu2
Xingyu Fu
3 months
Can Text-to-Image models understand common sense? 🤔 Can they generate images that fit everyday common sense? 🤔 tldr; NO, they are far less intelligent than us 💁🏻‍♀️ Introducing Commonsense-T2I 💡 , a novel evaluation and benchmark designed to measure
[image attached]
6 replies · 39 retweets · 127 likes
@XingyuFu2
Xingyu Fu
2 months
🔥 Check out MuirBench! 🚀 Robust multi-image understanding across 12 tasks 🤔 GPT-4o and Gemini Pro are worse than humans. More details in
[image attached]
@fwang_nlp
Fei Wang
2 months
Can GPT-4o and Gemini-Pro handle 𝐦𝐮𝐥𝐭𝐢𝐩𝐥𝐞 𝐢𝐦𝐚𝐠𝐞𝐬? Introducing MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding. 🌐 Explore here: 📄 Paper: 📊 Data:
[image attached]
2 replies · 42 retweets · 94 likes
1 reply · 21 retweets · 70 likes
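For readers who want to try this kind of multi-image setting themselves, here is a minimal sketch of sending several images plus one question to GPT-4o in a single request via the OpenAI chat completions API. The image URLs and the question are placeholders, and this is not the MuirBench evaluation code.

```python
# Minimal sketch of a multi-image query (the setting MuirBench evaluates):
# several images plus one question in a single request. Not MuirBench's
# official eval code; the URLs and the question are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

image_urls = [
    "https://example.com/frame1.jpg",  # placeholder
    "https://example.com/frame2.jpg",  # placeholder
]

content = [{"type": "text",
            "text": "Which object appears in every image? Answer with one word."}]
content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```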
@XingyuFu2
Xingyu Fu
2 months
🔥 Commonsense-T2I is accepted to the first @COLM_conf with review scores of 8/8/7/7 😼 🎉 Congratulations to the team!! @yujielu_10 @muyuhe @WilliamWangNLP @DanRothNLP Excited to see you in the beautiful Philly 😊 Do current text-to-image models align with the everyday real
[Quoted tweet: the Commonsense-T2I announcement above]
2 replies · 12 retweets · 58 likes
@XingyuFu2
Xingyu Fu
3 months
🔥Error Examples from DALL-E 3 👀More Visualizations: (3/n)
[image attached]
1 reply · 2 retweets · 13 likes
@XingyuFu2
Xingyu Fu
5 months
🔥Highlights of the BLINK benchmark: 👩🏻‍🏫14 vision tasks, ranging from low-level perception to high-level visual reasoning 📚3.8K multiple-choice questions and 7.9K images carefully derived from various datasets and sources 📄Enable the study of various visual prompts to enhance
[image attached]
1 reply · 0 retweets · 14 likes
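As a rough illustration of how a multiple-choice benchmark like BLINK is typically scored: generate one answer letter per question and compute accuracy against the gold letter. The dataset/config identifiers, field names, and the stubbed query_model below are assumptions for illustration, not the official BLINK evaluation code.

```python
# Sketch of scoring a multimodal LLM on a BLINK-style multiple-choice task.
# Dataset id/config and field names are assumed for illustration; consult the
# official BLINK release for the real data layout and eval script.
from datasets import load_dataset

def query_model(image, question, choices):
    # Stub: replace with a real multimodal-LLM call that returns "A"/"B"/"C"/"D".
    return "A"

def evaluate(config="Relative_Depth", split="val"):
    ds = load_dataset("BLINK-Benchmark/BLINK", config, split=split)  # assumed id/config
    correct = 0
    for ex in ds:
        pred = query_model(ex["image_1"], ex["question"], ex["choices"])  # assumed fields
        gold = ex["answer"].strip("() ")  # normalize e.g. "(A)" -> "A"
        correct += int(pred.strip("() ") == gold)
    return correct / len(ds)

# print(evaluate())
```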
@XingyuFu2
Xingyu Fu
3 months
Thanks @ashkamath20 for testing the Gemini 1.5 Pro model on BLINK. We're happy to see the large improvements from GPT-4o and Gemini 1.5! @huyushi98 and I are presenting BLINK at #CVPR2024. Come chat with us! We will give a talk on: 📅Jun 17 🕙10 am CinW workshop 📍Arch 3B Poster
@huyushi98
Yushi Hu
3 months
Glad that GPT-4 and Gemini have come so far on BLINK! GPT-4V, 4-turbo, 4o: 51.1 ➡️ 54.6 ➡️ 60.0 📈 Gemini 1.0 Pro, 1.0 Ultra, 1.5 Pro: 45.1 ➡️ 51.7 ➡️ 61.4 📈 @XingyuFu2 and I will present BLINK at #CVPR2024. Find us if you want to chat! Jun 17 10 am, 2:30-3:30 pm CinW
[image attached]
0 replies · 7 retweets · 23 likes
0 replies · 3 retweets · 14 likes
@XingyuFu2
Xingyu Fu
3 months
🔥Highlights of the Commonsense-T2I benchmark: 📚Pairwise text prompts with minimal token changes ⚙️Rigorous automatic evaluation with descriptions of the expected outputs ❗️Even DALL-E 3 only achieves below 50% accuracy (2/n)
[image attached]
1 reply · 3 retweets · 12 likes
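A minimal sketch of the pairwise scoring idea described above, assuming the evaluation is pair-level (both minimally different prompts must yield images matching their expected-output descriptions). The field names and the generate/judge callables are illustrative assumptions, not the released evaluation code.

```python
# Pairwise scoring sketch in the spirit of Commonsense-T2I: a pair is correct
# only if BOTH of its minimally different prompts produce images that match
# their expected-output descriptions. Field names are assumptions.
from typing import Callable, Sequence

def pairwise_accuracy(
    pairs: Sequence[dict],
    generate: Callable[[str], str],     # text-to-image call; returns an image path
    judge: Callable[[str, str], bool],  # e.g. a multimodal-LLM check of image vs. description
) -> float:
    correct = 0
    for pair in pairs:
        ok1 = judge(generate(pair["prompt1"]), pair["expected1"])
        ok2 = judge(generate(pair["prompt2"]), pair["expected2"])
        correct += int(ok1 and ok2)
    return correct / len(pairs)
```

If scoring is pair-level as sketched here, it is stricter than per-prompt accuracy, which is consistent with even DALL-E 3 landing below 50%.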
@XingyuFu2
Xingyu Fu
3 months
🔥Results and Takeaways: 💻<🧠 The Commonsense-T2I benchmark poses a great challenge to existing T2I models: 🥇DALL-E 3: 49% 🥈Playground v2.5: 26% 🥉Stable Diffusion XL: 25% 🤖GPT-revised prompts cannot solve the problem! Check out more details and visualizations of T2I model outputs
[image attached]
1 reply · 3 retweets · 11 likes
@XingyuFu2
Xingyu Fu
5 months
🔥Results and Takeaways: 💻<🧠 The BLINK benchmark poses a great challenge to existing multimodal LLMs: 🥇Human: 96% 🥈GPT-4V: 51% 🥉Gemini Pro: 45% 4️⃣Claude Opus: 43% 5️⃣Random guess: 38% ⭕️ Visual prompting can have a big impact on multimodal LLM performance: circle sizes and
[image attached]
3 replies · 1 retweet · 11 likes
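The "visual prompting" mentioned above refers to marking up the image itself (for example with a circle) before querying the model. A minimal sketch with Pillow follows; the file path, circle position, radius, color, and line width are placeholders, and BLINK's own experiments sweep such choices rather than fix them.

```python
# Sketch of a circle visual prompt: draw a circle on the image before sending
# it to a multimodal LLM. Path and circle parameters are placeholders.
from PIL import Image, ImageDraw

def add_circle_prompt(path, cx, cy, radius=40, color="red", width=4):
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.ellipse((cx - radius, cy - radius, cx + radius, cy + radius),
                 outline=color, width=width)
    return img

# Example usage (placeholder file):
# add_circle_prompt("example.jpg", cx=320, cy=180, radius=30).save("example_marked.jpg")
```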
@XingyuFu2
Xingyu Fu
5 months
😺 This work is done with my amazing collaborators: @huyushi98 @BangzhengL @AnnieFeng6 @Haoyu_Wang_97 @Xudong_Lin_AI @DanRothNLP @nlpnoah @weichiuma @RanjayKrishna YOU ARE THE BEST!!! 😎🔥 (5/n)
1 reply · 0 retweets · 10 likes
@XingyuFu2
Xingyu Fu
5 months
😊 The data of BLINK👁️ is curated from a wide range of sources and datasets. We thank the authors for making them available. 😊 Specifically, we would like to shout out to 1. HPatches by Vassileios Balntas, @LencKarel , Andrea Vedaldi, and Krystian Mikolajczyk (
0 replies · 0 retweets · 8 likes
@XingyuFu2
Xingyu Fu
5 months
🔥Comparison with previous benchmarks: ⭕️ Diverse visual prompting: Besides text prompts, we support various visual prompts, such as points, boxes, and masks. 👓Beyond recognition: We study a wide range of tasks beyond visual recognition. 🧑Visual commonsense: Our tasks
[image attached]
1 reply · 0 retweets · 8 likes
@XingyuFu2
Xingyu Fu
3 months
😺 This work is done with my amazing collaborators: @yujielu_10, Muyu He, @WilliamWangNLP @DanRothNLP YOU ARE THE BEST!!! 😎🔥 (n/n)
3 replies · 1 retweet · 8 likes
@XingyuFu2
Xingyu Fu
2 months
Shoutout to its sister benchmark BLINK, which claims that multimodal LLMs are bad at visual-perception-focused tasks.
0 replies · 0 retweets · 4 likes
@XingyuFu2
Xingyu Fu
5 months
@Yossi_Dahan_ @penn_nlp @uwnlp @cogcomp @ai2_allennlp Yeah, I agree with you! That’s a great question. I believe more high-quality data is needed, and also more supervision beyond image-caption mapping, e.g. feedback from specialist models, as we discussed in the paper 😺
1 reply · 0 retweets · 3 likes
@XingyuFu2
Xingyu Fu
5 months
@DongfuJiang @penn_nlp @uwnlp @cogcomp @ai2_allennlp 😺 The eval code on the validation set is already out at and we'll have the test-set evaluation ready this weekend. I'll keep you posted! I'm excited to see the Mantis results!
[image attached]
0 replies · 0 retweets · 2 likes
@XingyuFu2
Xingyu Fu
3 months
@keviv9 @penn_nlp @ucsbNLP Thanks Prof. Gupta lol 😄
0 replies · 0 retweets · 1 like
@XingyuFu2
Xingyu Fu
5 months
@XueFz Sure, this task evaluates the ability of multimodal LLMs to engage in graphical reasoning. There are mainly two kinds of problems: 1. 3D imagination, e.g. linking a 2D paper to its 3D form after folding it; 2. Pattern following, e.g. finding the image that follows the same pattern or
0 replies · 0 retweets · 1 like
@XingyuFu2
Xingyu Fu
2 years
🌟 Can large multi-modal models reason about the time and location given a single image? Check out our #acl2022nlp paper, “There’s a Time and Place for Reasoning Beyond the Image”, for more details! Paper available at @cogcomp (1/N)
[image attached]
3 replies · 0 retweets · 1 like
@XingyuFu2
Xingyu Fu
2 months
@ZhiHuangPhD @Penn @PennPathLabMed @UPennDBEI Congratulations Prof. Huang! Welcome to Philly 😃
0 replies · 0 retweets · 1 like
@XingyuFu2
Xingyu Fu
2 years
@cogcomp (3/N) We show that there exists a 70% gap between a state-of-the-art CLIP model and human performance, motivating higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge.
[image attached]
0 replies · 0 retweets · 1 like
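For context on what a CLIP baseline for this kind of task can look like, here is a hedged sketch: score the image against candidate location (or time) captions with CLIP and take the highest-scoring one. The checkpoint, caption template, and candidate list are illustrative choices, not necessarily the paper's exact setup.

```python
# Sketch of a CLIP zero-shot baseline for image-to-location reasoning:
# rank candidate location captions by similarity to the image.
# Checkpoint, caption template, and candidates are illustrative.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def rank_locations(image, candidates):
    texts = [f"A photo taken in {c}." for c in candidates]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return sorted(zip(candidates, probs.tolist()), key=lambda kv: -kv[1])

# Example usage (placeholder image path):
# from PIL import Image
# print(rank_locations(Image.open("photo.jpg"), ["Paris", "Tokyo", "New York City"]))
```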