Seungone Kim Profile Banner
Seungone Kim Profile
Seungone Kim

@seungonekim

Followers: 1,116 · Following: 826 · Media: 38 · Statuses: 539

Ph.D. student @LTIatCMU working on (V)LM Evaluation & Systems that Improve with (Human) Feedback | Prev: @kaist_ai @yonsei_u @NAVER_AI_Lab @LG_AI_Research

Pittsburgh, PA
Joined November 2021
Pinned Tweet
@seungonekim
Seungone Kim
18 days
🤔How can we systematically assess an LM's proficiency in a specific capability without using summary measures like helpfulness or simple proxy tasks like multiple-choice QA? Introducing the ✨BiGGen Bench, a benchmark that directly evaluates nine core capabilities of LMs.
Tweet media one
6
54
182
@seungonekim
Seungone Kim
9 months
Excited to present 🔥Prometheus, a fully open-source evaluator LM that is on par with GPT-4 evaluation when the “appropriate” reference materials are appended! * Could generalize to customized score rubrics * Shows high correlation with both human evaluators & GPT-4 evaluation
9
51
347
@seungonekim
Seungone Kim
3 months
🔥I will be joining @CarnegieMellon @LTIatCMU this upcoming Fall, working with @gneubig and @wellecks on evaluating LLMs & improving them with (human) feedback! Can't wait to explore what lies ahead during my Ph.D. journey☺️
31
7
352
@seungonekim
Seungone Kim
1 year
🚨 New Instruction Data Alert! We introduce 🌟CoT Collection🌟, an instruction dataset including 52 times more CoT rationales and 177 times more tasks compared to previously available CoT datasets.
8
53
250
@seungonekim
Seungone Kim
2 months
#NLProc Introducing 🔥Prometheus 2, an open-source LM specialized in evaluating other language models. ✅Supports both direct assessment & pairwise ranking. ✅ Improved evaluation capabilities compared to its predecessor. ✅Can assess based on user-defined evaluation criteria.
Tweet media one
3
41
161
@seungonekim
Seungone Kim
6 months
🤔How could you evaluate whether your Vision Language Model (VLM) is closely reaching the capabilities of GPT-4V? We’re excited to present 🔥Prometheus-Vision, the first open-source VLM specialized for evaluating other VLMs based on fine-grained scoring criteria, with co-lead
Tweet media one
3
43
148
@seungonekim
Seungone Kim
9 months
Super excited to share that our CoT Collection work has been accepted at #EMNLP2023 ! If you want to make your LM better at expressing Chain-of-Thoughts, take a look at our work🙂 See you in Singapore!! 🇸🇬🇸🇬
@seungonekim
Seungone Kim
1 year
🚨 New Instruction Data Alert! We introduce 🌟CoT Collection🌟, an instruction dataset including 52 times more CoT rationales and 177 times more tasks compared to previously available CoT datasets.
8
53
250
2
16
57
@seungonekim
Seungone Kim
6 months
Super excited to share that Prometheus is accepted at ICLR! See you all in Vienna 🇦🇹 Also, check out our recent work on Prometheus-Vision expanding to the multi-modal space! It's the first open-source VLM that evaluates other VLMs:
@seungonekim
Seungone Kim
9 months
Excited to present 🔥Prometheus, a fully open-source evaluator LM that is on par with GPT-4 evaluation when the “appropriate” reference materials are appended! * Could generalize to customized score rubrics * Shows high correlation with both human evaluators & GPT-4 evaluation
9
51
347
1
5
56
@seungonekim
Seungone Kim
2 months
I'll be presenting Prometheus @iclr_conf on May 10th (Friday), 10:45 AM - 12:45 PM at Halle B. Let's talk if you're interested in LLM Evals 🙂
@seungonekim
Seungone Kim
9 months
Excited to present 🔥Prometheus, a fully open-source evaluator LM that is on par with GPT-4 evaluation when the “appropriate” reference materials are appended! * Could generalize to customized score rubrics * Shows high correlation with both human evaluators & GPT-4 evaluation
9
51
347
1
2
48
@seungonekim
Seungone Kim
1 year
I'm going to present our Expert Language Model paper at ICML 2023! Come to Exhibit Hall 1 (7.25 2:00PM - 3:30PM) if you're interested in either Instruction Tuning or Expert LMs!
@jang_yoel
Joel Jang
1 year
Scaling 📈 the total # of tasks during instruction tuning has been known to unlock new abilities in LMs. However, we find that an LM trained on a single task outperforms an LM trained on 300+ tasks on unseen tasks 🤯 📝: 1/8
Tweet media one
2
62
305
1
5
47
@seungonekim
Seungone Kim
2 months
🤖Distilling from stronger models is effective at enhancing the reasoning capabilities of LLMs, as shown in Orca, WizardMath, Meta-Math, and Mammoth. 🔍In this work, we ask whether LLMs can "self-improve" their reasoning capabilities! Check out the post for more info😃
@ronalhwang
Hyeonbin Hwang
2 months
🚨 New LLM Reasoning Paper 🚨 Q. How can LLMs self-improve their reasoning ability? ⇒ Introducing Self-Explore⛰️🧭, a training method specifically designed to help LLMs avoid reasoning pits by learning from their own outputs! [1/N]
Tweet media one
8
55
293
0
4
46
@seungonekim
Seungone Kim
9 months
@_akhaliq
AK
9 months
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models paper page: Recently, using a powerful proprietary Large Language Model (LLM) (e.g., GPT-4) as an evaluator for long-form responses has become the de facto standard. However, for
Tweet media one
3
51
229
1
3
44
@seungonekim
Seungone Kim
8 months
💡Evaluating the high-level capabilities (e.g., Helpfulness, CoT abilities, Presence of ToM) of LLMs is crucial, instead of simply measuring performance on a single domain/task. In our recent work, we check whether LLMs could process multiple instructions at once. Check it out!
@gson_AI
arlo_son
8 months
🔥New Paper Alert 🔥 🤔 LLMs are typically prompted to follow a SINGLE instruction per inference call. BUT, can LLMs also follow TWO or MORE instructions at once? 📢 Findings: Smaller Models can’t! But, GPT-4 shows up to 17.4% better performance even with fewer inference calls 🤯
Tweet media one
3
8
39
0
5
37
@seungonekim
Seungone Kim
5 months
#NLProc 🤗We propose LangBridge, a scalable method to enable LMs to solve multilingual reasoning tasks (e.g., math, code) "without any multilingual data"! 🔑The key ingredient is using mT5's encoder and aligning it with an arbitrary LLM! ➡Check out the post for more information!
@dongkeun_yoon
Dongkeun Yoon
5 months
❗New multilingual paper❗ 🤔LMs good at reasoning are mostly English-centric (MetaMath, Orca 2, etc). 😃Let’s adapt them to solve multilingual tasks. BUT without using multilingual data! We present LangBridge, a zero-shot approach to adapt LMs for multilingual reasoning.
Tweet media one
6
56
215
1
6
35
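As I read the LangBridge idea from the thread above, a multilingual encoder (mT5) is bridged to a reasoning-specialized LLM through a learned projection, so the LLM consumes the encoder's representations rather than raw multilingual text. Below is a minimal sketch of that wiring under my own assumptions; the backbones (mt5-small, gpt2) and the single linear alignment layer are illustrative stand-ins, not the paper's implementation.

```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM, MT5EncoderModel

class EncoderBridge(nn.Module):
    """Project mT5 encoder states into an LLM's embedding space (illustrative sketch)."""
    def __init__(self, encoder_name="google/mt5-small", lm_name="gpt2"):
        super().__init__()
        self.enc_tok = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = MT5EncoderModel.from_pretrained(encoder_name)
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        # Trainable alignment layer: encoder hidden size -> LLM embedding size.
        self.proj = nn.Linear(self.encoder.config.d_model, self.lm.config.hidden_size)

    def forward(self, multilingual_text: str):
        enc_inputs = self.enc_tok(multilingual_text, return_tensors="pt")
        enc_states = self.encoder(**enc_inputs).last_hidden_state   # (1, T, d_enc)
        soft_prompt = self.proj(enc_states)                         # (1, T, d_lm)
        # The LLM attends over the projected states as if they were token embeddings.
        return self.lm(inputs_embeds=soft_prompt)

bridge = EncoderBridge()
out = bridge("2 더하기 3은 얼마인가요?")  # a non-English question
print(out.logits.shape)
```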
@seungonekim
Seungone Kim
7 months
Thank you for the shout out @jerryjliu0 @llama_index !! Great to see an awesome blog post & some additional examples that weren't in the paper😊
@llama_index
LlamaIndex 🦙
7 months
GPT-4 is a popular choice for LLM evals, but it’s closed-source, subject to version changes, and super expensive 💸 We’re excited to feature Prometheus by @seungonekim et al., a fully-open source 13B LLM that is fine-tuned to be on par with GPT-4 eval capabilities 🔥
Tweet media one
9
87
470
0
5
32
@seungonekim
Seungone Kim
7 months
🤗Interested in how we could holistically evaluate LLMs? I'll be presenting 2 papers at #NeurIPS2023 @itif_workshop ! Come visit😃 👉Room 220-222, Dec 15th, 1-2PM (Poster), 5-6PM (Oral) 🧪Flask: (led by @SeonghyeonYe ) 🔥Prometheus:
@seungonekim
Seungone Kim
9 months
Excited to present 🔥Prometheus, a fully open-source evaluator LM that is on par with GPT-4 evaluation when the “appropriate” reference materials are appended! * Could generalize to customized score rubrics * Shows high correlation with both human evaluators & GPT-4 evaluation
9
51
347
1
7
30
@seungonekim
Seungone Kim
9 months
🤔 How can you evaluate whether your LLM is humorous or not? Among various versions during development, how can you track whether your LLM is inspiring while being culturally sensitive? Current evaluation resources (e.g., MMLU, Big Bench, AlpacaFarm) are confined to generic,
Tweet media one
3
8
28
@seungonekim
Seungone Kim
7 months
⭐️If you're interested in inducing Chain-of-Thought capabilities in smaller models across a variety of tasks, come visit our poster at #EMNLP2023 ! ➡️ Poster Session 5, Dec 9th, 11AM Also happy to chat about synthetic data and NLG evaluation as well😀
@seungonekim
Seungone Kim
1 year
🚨 New Instruction Data Alert! We introduce 🌟CoT Collection🌟, an instruction dataset including 52 times more CoT rationales and 177 times more tasks compared to previously available CoT datasets.
8
53
250
1
1
26
@seungonekim
Seungone Kim
3 months
#NLProc 🧐We show that LLMs can write & simulate executing pseudocode to improve reasoning. Compared to Zero-shot CoT or PoT/PAL, our Think-and-Execute significantly improves performance on BBH by understanding the "logic" behind the task. Check out the post for more details!
@hyungjoochae
Hyungjoo Chae
3 months
📢 New LLM Reasoning Paper Excited to present 🧠Think-and-⏩Execute, a method that (1) generates a pseudocode describing the key logic of a problem & (2) executes the pseudocode with an LLM as if it is a compiler (e.g., printing out intermediate variables in between). 📈
Tweet media one
7
55
238
0
2
23
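The thread above describes a two-phase prompting scheme: first have the LM write task-level pseudocode, then have it "execute" that pseudocode step by step as if it were a compiler. A minimal sketch of that control flow, with `call_lm` standing in for whatever LM backend you use; the prompt wording is illustrative, not the paper's templates.

```python
def call_lm(prompt: str) -> str:
    """Stand-in for an actual LM call (API client or local model)."""
    raise NotImplementedError("plug in your own LM backend here")

def think_and_execute(task_description: str, instance: str) -> str:
    # Phase 1 ("Think"): ask the LM for pseudocode capturing the task's underlying logic.
    pseudocode = call_lm(
        f"Write concise pseudocode that solves the following task:\n{task_description}"
    )
    # Phase 2 ("Execute"): ask the LM to simulate running that pseudocode on one instance,
    # printing intermediate variables along the way, like an interpreter trace.
    return call_lm(
        "Simulate executing the pseudocode below on the given input, "
        "showing intermediate variable values, then state the final answer.\n\n"
        f"Pseudocode:\n{pseudocode}\n\nInput:\n{instance}"
    )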
@seungonekim
Seungone Kim
2 years
My first first-authored paper has been accepted at COLING 2022! The title of the paper is "Mind the Gap! Injecting Commonsense Knowledge for Abstractive Dialogue Summarization". (1/N)
Tweet media one
3
3
21
@seungonekim
Seungone Kim
5 months
@Teknium1 I've experienced this as well while training a smaller model with CoT augmented from GPT models. Even though the eval loss went up, a smaller model trained for multiple epochs (>=3) definitely generated better rationales. Also, since it is too
2
2
21
@seungonekim
Seungone Kim
9 months
To learn more about our work, please check out our paper, website, code, model, and dataset! 📚 🌐  🖥 🔥 💬
1
5
21
@seungonekim
Seungone Kim
30 days
🤔In most cases, we use a fixed system message, "As a helpful assistant [...]". Is this optimal? 💡We find that by incorporating diverse system messages in post-training, LLMs gain adherence to these messages, a key component for personalized alignment. ➡️Check out the post!
@sylee_ai
Seongyun Lee
30 days
🚨 New LLM personalization/alignment paper 🚨 🤔 How can we obtain personalizable LLMs without explicitly re-training reward models/LLMs for each user? ✔ We introduce a new zero-shot alignment method to control LLM responses via the system message 🚀
Tweet media one
3
53
206
0
2
15
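The point above is about post-training with varied system messages rather than one fixed "As a helpful assistant [...]" string. A small sketch of how an example with a non-default, persona-specific system message can be serialized with a chat template; the model name and messages are illustrative, not from the paper.

```python
from transformers import AutoTokenizer

# Illustrative chat model; any model with a chat template that supports a system role works.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    # Vary the system message per user to encode a persona/preference,
    # instead of a single fixed "You are a helpful assistant."
    {"role": "system", "content": "You are a concise tutor who always answers with bullet points."},
    {"role": "user", "content": "Explain gradient descent."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```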
@seungonekim
Seungone Kim
8 months
#NLPaperAlert 😴 Aren't you tired of the monotonous way ChatGPT responds? 💡Infuse your preferences to personalize the way your LLM responds, based on our new alignment method 🥣Personalized Soups🥣 ✅ Great work led by @jang_yoel , check it out!!
@jang_yoel
Joel Jang
8 months
🎯 Tired of one-size-fits-all AI chatter? ChatGPT tends to generate verbose & overly informative responses. This is because the current RLHF pipeline only allows aligning LLMs to the general preferences of the population. However, in the real world, people may have multiple,
Tweet media one
2
67
298
0
1
15
@seungonekim
Seungone Kim
1 year
Check out our recent work on evaluating the capabilities of LLMs based on multiple fine-grained criteria! It suggests where we currently are & where we should head ✨✨Thanks to @SeonghyeonYe & @Doe_Young_Kim for leading this amazing work🙌
@SeonghyeonYe
Seonghyeon Ye
1 year
Are open-sourced LLMs really good? 👀 We introduce FLASK🧪, a fine-grained evaluation based on skill sets! Even SOTA open-sourced LLMs such as LLaMA2 Chat 70B lag behind proprietary LLMs for some abilities. 🤯 Paper: Demo:
Tweet media one
10
106
453
0
1
14
@seungonekim
Seungone Kim
9 months
I'd like to thank all of my co-authors and supervisor for helping me with this project! It was a great experience working with you all, appreciate it😃 @jshin491 Yejin Cho @jang_yoel @ShayneRedford @hwaran_lee @oodgnas Seongjin Shin @SungdongKim4 @j6mes @seo_minjoon
1
0
14
@seungonekim
Seungone Kim
18 days
✏️Based on this motivation, we have constructed a generation benchmark that encompasses 9 capabilities, 77 tasks, and 765 instances, each with its own instance-specific evaluation criteria. 🤖We evaluate 103 frontier LLMs including pre-trained LMs, post-trained LMs, and
Tweet media one
1
0
15
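Each benchmark instance carries its own instance-specific evaluation criteria rather than a single generic rubric. A hypothetical illustration of what one such record could look like; the field names and contents are my guesses for illustration, not the released schema.

```python
# Hypothetical shape of a single benchmark instance with instance-specific criteria.
instance = {
    "capability": "reasoning",
    "task": "math_word_problem",
    "input": "A train travels 60 km in 45 minutes. What is its average speed in km/h?",
    "reference_answer": "80 km/h",
    "score_rubric": {
        "criteria": "Does the response compute the average speed with correct unit conversion?",
        "score_5": "Correct setup, converts 45 minutes to 0.75 hours, and reaches 80 km/h.",
        "score_1": "No meaningful attempt at unit conversion or an unrelated answer.",
    },
}
```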
@seungonekim
Seungone Kim
9 months
We also show that Prometheus could function as a reward model when tested with human preference datasets. Prometheus obtains high accuracy compared to SOTA reward models, showing the possibility of using it as a universal reward model.
Tweet media one
1
1
12
@seungonekim
Seungone Kim
5 months
@aparnadhinak @OpenAI @ArizePhoenix @arizeai Hello @aparnadhinak , great analysis! I really enjoyed reading it. I have one question: I think the scoring decisions could be a lot more precise if you prompt it with (1) mapping an explanation for all the scores instead of just 0,20,40,60,80,100%, (2) making the model generate
3
0
11
@seungonekim
Seungone Kim
4 months
#nlproc As a native Korean, the translated MMLU seemed super awkward to me since it requires US-related knowledge & its expressions are not fluent (even if you use DeepL or GPT-4). Super timely work from @gson_AI who made a Korean version of MMLU without any translated instances!
@gson_AI
arlo_son
4 months
🌟 KMMLU 🌟This benchmark replicates the methodology that produced MMLU, but using examinations common in Korea. We manually annotate a subset of the questions as to whether they require Korea-specific knowledge and also designate a KMMLU-Hard subset that current models find
Tweet media one
Tweet media two
1
5
22
1
2
12
@seungonekim
Seungone Kim
9 months
🔍 Why is it important to build an open-source alternative for GPT-4 evaluation? Solely relying on proprietary LLM evaluators confers the following disadvantages: 1/ Closed-source Nature: The proprietary nature of LLMs brings transparency concerns as internal workings are not
1
0
12
@seungonekim
Seungone Kim
4 months
@jphme @dvilasuero @natolambert My take is that without a reference answer, it's basically asking the evaluator/judge model to (1) solve the problem internally through its forward pass and also (2) evaluate the response at the same time. It's twice the workload, which is evidently harder!
2
1
11
@seungonekim
Seungone Kim
18 days
Lastly, I’d like to thank our coauthors for their hard work in annotating/verifying the dataset and for their valuable advice! @scott_sjy @JiYongCho1 @ShayneRedford @chaechaek1214 @dongkeun_yoon @gson_AI @joyejin195315 @shafayat_sheikh @jinheonbaek @suehpark @ronalhwang
0
0
13
@seungonekim
Seungone Kim
1 year
Recent works showed that CoT fine-tuning enables smaller LMs to solve novel tasks more effectively. But to date, only 9 CoT datasets are available, which FLAN-T5 used for training, namely AQuA, Creak, ECQA, eSNLI, GSM8K, QASC, QED, SenseMaking, and StrategyQA.
1
0
10
@seungonekim
Seungone Kim
2 months
☺️ I'd like to thank all my amazing co-authors for their valuable comments & advice throughout the project! @scott_sjy @ShayneRedford @billyuchenlin @jshin491 @wellecks @gneubig Moontae Lee @Kyungjae__Lee @seo_minjoon
1
0
9
@seungonekim
Seungone Kim
18 days
Check out our paper for more information! 📚 Links to potentially useful resources: 📰 Dataset: 📊 Evaluation Results of 103 LMs: 🖨 Zeno Visualization of Outputs/Feedback:
2
0
11
@seungonekim
Seungone Kim
9 months
💡Specifically, we append a Score Rubric and a Reference Answer during both training/inference! 1/ Score Rubric: Provides information about the pivotal aspects essential for addressing the given instruction. Without it, the evaluator LM should inherently know what details should be
Tweet media one
1
3
8
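The tweet above describes appending a score rubric and a reference answer to the evaluator's input. Below is a minimal sketch of what such a prompt assembly might look like; the template wording and section headers are my own illustration, not the paper's verbatim format.

```python
# Illustrative evaluator-style prompt: the response to grade is paired with a
# task-specific score rubric and a reference (score-5) answer.
def build_eval_prompt(instruction: str, response: str, rubric: str, reference_answer: str) -> str:
    return (
        "You are a fair evaluator language model.\n\n"
        f"###Task instruction:\n{instruction}\n\n"
        f"###Response to evaluate:\n{response}\n\n"
        f"###Reference answer (score 5):\n{reference_answer}\n\n"
        f"###Score rubric:\n{rubric}\n\n"
        "Write detailed feedback, then end with 'Score: <1-5>'."
    )

prompt = build_eval_prompt(
    instruction="Explain why the sky is blue to a 10-year-old.",
    response="The sky is blue because sunlight scatters off air molecules...",
    rubric="Is the answer scientifically accurate while staying accessible to a child?",
    reference_answer="Sunlight contains many colors; tiny air molecules scatter blue light the most...",
)
print(prompt)
```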
@seungonekim
Seungone Kim
2 months
[So how did we obtain a unified evaluator LM that could function in both formats?] 😧 Initially, we trained an evaluator LM jointly on direct assessment and pairwise ranking formats. => However, in most cases, it performed worse than evaluator LMs trained on each format
1
2
9
@seungonekim
Seungone Kim
4 months
@jphme @dvilasuero @natolambert We're currently training mistral/mixtral with some additional Prometheus data & new techniques. I think we could have a preprint by the end of the month! I'll definitely include an experiment on how the model behaves when there's no reference:)
2
1
9
@seungonekim
Seungone Kim
18 days
🧮 Take this math problem for example, used to measure reasoning capabilities. ➡️ Is it really reliable to trust a score obtained by prompting GPT-4 to assess the “helpfulness” of a rationale when we are trying to measure the reasoning capabilities of LLMs? ➡️ Asking to grade
Tweet media one
1
0
11
@seungonekim
Seungone Kim
9 months
🤔 But how is it possible to use a 7B & 13B sized LM as an evaluator if it isn’t as good as GPT-4 in the first place? ⇒ The main obstacle to obtaining a language model specialized in evaluation is that it needs to know the important aspects tailored to the instruction and
1
0
7
@seungonekim
Seungone Kim
5 months
@Teknium1 Here's the dataset: @openchatdev used it as their training data to induce evaluation capabilities in their recent models, but I haven't heard whether it had a positive effect. Would love to see if training on it would eventually lead to a self-improving
1
0
6
@seungonekim
Seungone Kim
18 days
🔍 Among the 78,795 judgments (103 LMs * 765 prompts), we use a subset to measure the correlation between human judgments and evaluator LM judgments. ➡️ GPT-4-Turbo-2024-04-09 achieves the highest correlation on average (0.623). ➡️ Majority voting with multiple evaluator LMs
Tweet media one
3
0
9
@seungonekim
Seungone Kim
2 months
@_akhaliq
AK
2 months
Prometheus 2 An Open Source Language Model Specialized in Evaluating Other Language Models Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability
Tweet media one
4
58
271
0
0
7
@seungonekim
Seungone Kim
1 year
To resolve this issue, we augment CoT rationales using LLMs. Using the FLAN Collection as a source, we use ICL to make LLMs generate high-quality CoT rationales across 1,060 NLP tasks.
Tweet media one
1
0
7
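The augmentation step above prompts an LLM with a few in-context demonstrations and asks it to produce a rationale for a new instance whose gold answer is already known. A rough sketch of how such a prompt could be assembled; the demonstration format and wording are illustrative, not the released pipeline.

```python
def build_rationale_prompt(demonstrations, instruction, answer) -> str:
    """demonstrations: list of (instruction, answer, rationale) triples used as ICL examples."""
    parts = []
    for demo_instr, demo_ans, demo_rat in demonstrations:
        parts.append(f"Instruction: {demo_instr}\nAnswer: {demo_ans}\nRationale: {demo_rat}\n")
    # Ask for a rationale that leads to the known gold answer of the new instance.
    parts.append(f"Instruction: {instruction}\nAnswer: {answer}\nRationale:")
    return "\n".join(parts)
```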
@seungonekim
Seungone Kim
18 days
🧩 One of the biggest challenges in evaluating LLMs is the difficulty in quantifying the quality of a model’s response with respect to specific capabilities of interest, such as reasoning, planning, and grounding. ➡️ When LLMs serve as judges for free-form responses, the lack of
Tweet media one
2
0
9
@seungonekim
Seungone Kim
18 days
🆚 Based on this result, we plot the overall performance trends of different LMs and analyze how the performance of each capability scales with respect to model parameter size. ➡️ (Finding 1) Scalability of pre-trained LM parameter sizes contributes to qualitative improvements
Tweet media one
Tweet media two
Tweet media three
1
0
9
@seungonekim
Seungone Kim
6 months
Last but not least, I’d like to thank our wonderful team for accomplishing this project! @sylee_ai @suehpark @GeewookKim @seo_minjoon
0
0
6
@seungonekim
Seungone Kim
9 months
Surprisingly, when measuring the Pearson correlation with human evaluators on 45 customized score rubrics, Prometheus achieves 0.897, which is on par with GPT-4 (0.882), and greatly outperforms GPT-3.5-Turbo (0.392).
Tweet media one
1
1
6
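The numbers above (0.897, 0.882, 0.392) are Pearson correlations between evaluator scores and human scores. A toy illustration of how such a coefficient is computed; the score lists below are made-up, not the paper's data.

```python
from scipy.stats import pearsonr

# Made-up example scores for the same set of responses (1-5 scale); not the paper's data.
human_scores     = [5, 3, 4, 2, 5, 1, 4, 3]
evaluator_scores = [5, 3, 5, 2, 4, 1, 4, 2]

r, p_value = pearsonr(human_scores, evaluator_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```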
@seungonekim
Seungone Kim
1 year
I'd also like to talk if you're interested in Chain of Thought Fine-tuning, Zero-shot Generalization and Few-shot adaptation! Come visit :)
@seungonekim
Seungone Kim
1 year
🚨 New Instruction Data Alert! We introduce 🌟CoT Collection🌟, an instruction dataset including 52 times more CoT rationales and 177 times more tasks compared to previously available CoT datasets.
8
53
250
0
2
6
@seungonekim
Seungone Kim
9 months
As instruction-tuned models are developed at a rapid pace, we expect more future research on open-source evaluator LLMs. Some remaining questions are: 1/ What additional reference materials could we use in addition to the reference answer & score rubric? Does including
1
0
6
@seungonekim
Seungone Kim
18 days
➡️ (Finding 3) The performance gap between larger pre-trained LMs and post-trained LMs narrows, while it persists in smaller LMs. When examining each capability, the gap is more pronounced in refinement, reasoning, grounding, and planning, suggesting that post-training impacts
Tweet media one
Tweet media two
1
0
7
@seungonekim
Seungone Kim
2 months
To learn more about our work, please check out our paper, code, models, and datasets! 📄 💻 🤗
1
0
5
@seungonekim
Seungone Kim
9 months
To train 🔥Prometheus, we construct a new dataset called the 💬Feedback Collection. Compared to previous feedback datasets that only evaluate on helpfulness/harmlessness, the 💬Feedback Collection includes 1K customized score rubrics (e.g., Is the answer professional and formal
1
0
5
@seungonekim
Seungone Kim
2 months
Also, I'll present Prometheus-Vision on May 11th (Saturday), 13:00 - 14:00 at ME-FoMo workshop!
@seungonekim
Seungone Kim
6 months
🤔How could you evaluate whether your Vision Language Model (VLM) is closely reaching the capabilities of GPT-4V? We’re excited to present 🔥Prometheus-Vision, the first open-source VLM specialized for evaluating other VLMs based on fine-grained scoring criteria, with co-lead
Tweet media one
3
43
148
0
0
5
@seungonekim
Seungone Kim
2 months
📅 Last year, we introduced Prometheus 1, one of the first evaluator LMs that showed high scoring correlations with both humans and GPT-4 in direct assessment formats. (Please read the tagged thread if you haven't already!) 📈 Since then, many stronger
@seungonekim
Seungone Kim
9 months
Excited to present 🔥Prometheus, a fully open-source evaluator LM that is on par with GPT-4 evaluation when the “appropriate” reference materials are appended! * Could generalize to customized score rubrics * Shows high correlation with both human evaluators & GPT-4 evaluation
9
51
347
1
0
5
@seungonekim
Seungone Kim
7 months
Our CoT Collection dataset & CoT-T5 models are available at Huggingface! Link:
0
1
5
@seungonekim
Seungone Kim
1 year
So how could one use this dataset? To explore the benefits of CoT fine-tuning on a massive amount of instruction data, we continually train FLAN-T5 with the CoT Collection, resulting in our model C2F2. Then, we test our model in zero-shot and few-shot settings.
Tweet media one
1
0
4
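Continual CoT fine-tuning of FLAN-T5 is a standard seq2seq setup where the target is a rationale followed by the answer. A compressed sketch of how one training pair might be tokenized; the "[ANSWER]" delimiter and model size are illustrative choices, not necessarily those of the paper.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")  # illustrative size

source = "Q: If I have 3 apples and eat one, how many are left? Let's think step by step."
target = "Starting with 3 apples and eating 1 leaves 3 - 1 = 2. [ANSWER] 2"

# Tokenize the source as encoder input and the rationale+answer as decoder labels.
batch = tok(source, text_target=target, return_tensors="pt")
print(batch["input_ids"].shape, batch["labels"].shape)
```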
@seungonekim
Seungone Kim
2 months
🤔 The two most conventional methods for evaluating LLMs are direct assessment (e.g., MT-Bench) and pairwise ranking (e.g., AlpacaEval). 🫡 Since Prometheus 1 did not support "pairwise ranking", one of our motivations was to train a flexible, unified evaluator LM that could
Tweet media one
1
0
4
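Direct assessment scores one response against a rubric, while pairwise ranking picks the better of two responses, so a unified evaluator has to handle two input formats. Two illustrative templates to make the distinction concrete; these are not the actual Prometheus 2 prompts.

```python
# Illustrative direct-assessment format: one response, one rubric, one score.
DIRECT_ASSESSMENT = (
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Rubric: {rubric}\n"
    "Give feedback, then output 'Score: <1-5>'."
)

# Illustrative pairwise-ranking format: two responses, pick the better one.
PAIRWISE_RANKING = (
    "Instruction: {instruction}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Rubric: {rubric}\n"
    "Give feedback, then output 'Better: A' or 'Better: B'."
)
```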
@seungonekim
Seungone Kim
1 year
On the Big Bench Hard (BBH) benchmark, we observe +4.34% and +2.44% improvements at the 3B and 11B model scales, respectively. Also, naively training T5 on the CoT Collection works effectively, considering that the CoT Collection is 8 times smaller than the FLAN Collection!
Tweet media one
1
0
4
@seungonekim
Seungone Kim
6 months
Specifically, during both training & inference, Prometheus-Vision receives 5 input components: 1⃣ An Instruction 2⃣ An Image 3⃣ A Response to Evaluate 4⃣ A Customized Score Rubric 5⃣ A Reference Answer (Score 5) Prometheus-Vision is trained to generate 2 components: 1⃣ A
Tweet media one
1
0
4
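The five input components listed above (with the output components truncated in the tweet) map naturally onto a small record type. The sketch below only illustrates that interface; the field names are hypothetical, and the 1-5 score scale is my assumption carried over from the text-only Prometheus setup.

```python
from dataclasses import dataclass

@dataclass
class VLMEvalInput:
    instruction: str        # 1. the instruction given to the evaluated VLM
    image_path: str         # 2. the image the instruction refers to
    response: str           # 3. the response to evaluate
    score_rubric: str       # 4. customized, fine-grained scoring criteria
    reference_answer: str   # 5. a reference answer that would earn the top score

@dataclass
class VLMEvalOutput:
    feedback: str           # verbal assessment of the response
    score: int              # integer score (1-5 scale assumed here)
```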
@seungonekim
Seungone Kim
2 months
We evaluated Prometheus 2 on 4 direct assessment benchmarks: Vicuna-Bench, MT-Bench, FLASK, and Feedback Bench (our in-domain test set). Only the Prometheus 2 models show Pearson correlations above 0.5 among the open source LMs regardless of the reference evaluator (listed
Tweet media one
1
0
4
@seungonekim
Seungone Kim
2 months
We also evaluated Prometheus 2 on 4 pairwise ranking benchmarks: HHH Alignment, MT Bench Human Judgments, Auto-J Eval, and Preference Bench (our in-domain test set). Considering that (1) PairRM was trained on HH-RLHF (similar to HHH Alignment) and (2) Auto-J Eval is the
Tweet media one
1
0
4
@seungonekim
Seungone Kim
9 months
When assessing the quality of the generated feedback in a pairwise comparison setting (i.e., Which feedback is better at criticizing & assessing the given response), human evaluators preferred Prometheus’s feedback over GPT-4 with a 58.62% win rate and 79.57% over GPT-3.5-Turbo.
Tweet media one
1
2
4
@seungonekim
Seungone Kim
1 year
Also, we show that C2F2 could adapt to new tasks with few instances. In a 64-shot setting, C2F2 + LoRA outperforms FLAN-T5 + full fine-tuning by +2.97% and +2.37% across 4 legal & medical datasets while updating 2,352x fewer parameters.
Tweet media one
1
0
4
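The few-shot result above compares full fine-tuning against LoRA, which only updates small low-rank adapter matrices. A minimal sketch of attaching LoRA adapters to a FLAN-T5 checkpoint with the `peft` library; the rank, alpha, and target modules are illustrative hyperparameters, not the paper's exact settings.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")  # illustrative size

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,                        # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # T5 attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```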
@seungonekim
Seungone Kim
6 months
💡As in our previous work (Flask & Prometheus), we find that appending a score rubric & reference answer during both training/inference is effective in obtaining a good evaluator VLM! We first construct a multi-modal feedback dataset called the Perception Collection. ✨Includes
Tweet media one
1
1
3
@seungonekim
Seungone Kim
6 months
@seungonekim
Seungone Kim
6 months
🤔How could you evaluate whether your Vision Language Model (VLM) is closely reaching the capabilities of GPT-4V? We’re excited to present 🔥Prometheus-Vision, the first open-source VLM specialized for evaluating other VLMs based on fine-grained scoring criteria, with co-lead
Tweet media one
3
43
148
0
0
3
@seungonekim
Seungone Kim
6 months
Lastly, we highlight the potential of our in-domain test set called the Perception Bench. 👎In the widely used LLaVA-Bench, LLaVA-RLHF shows only a marginal difference of 0.14 points with GPT-4V since the questions/instructions are relatively simple. 👍In our Perception-Bench,
Tweet media one
1
1
3
@seungonekim
Seungone Kim
5 months
@aparnadhinak @OpenAI @ArizePhoenix @arizeai Here's what the approach I mentioned would look like:
Tweet media one
0
0
3
@seungonekim
Seungone Kim
7 months
@ShayneRedford @natolambert Hello Nathan, in our recent preprint, we made 100K synthetic data & the trained model functioned as a good evaluator/critique model on custom criteria even compared to GPT-4! I would be glad for further discussion if you're interested😃
@seungonekim
Seungone Kim
9 months
Excited to present 🔥Prometheus, a fully open-source evaluator LM that is on par with GPT-4 evaluation when the “appropriate” reference materials are appended! * Could generalize to customized score rubrics * Shows high correlation with both human evaluators & GPT-4 evaluation
9
51
347
1
0
2
@seungonekim
Seungone Kim
2 months
@ZhiruoW Congrats 🥳 I wish I could have been at that workshop
0
0
3
@seungonekim
Seungone Kim
9 months
When measuring the Pearson correlation with GPT-4 evaluation on 1,222 customized score rubrics, Prometheus obtains a higher correlation compared to GPT-3.5-Turbo & Llama-2-Chat (70B), bolstering its capability as an evaluator LM.
Tweet media one
1
2
3
@seungonekim
Seungone Kim
6 months
Next, we measure the correlation with human evaluators on 45 instances across 3 Visual Instruction Following Benchmarks. 😃Prometheus-Vision performs on par with GPT-4 and GPT-4V on the LLaVA-Bench and Perception-Bench (in-domain test set). 💡Yet, as mentioned above, it shows
Tweet media one
1
1
3
@seungonekim
Seungone Kim
6 months
🏞 The solution to this problem is directly using “VLM-as-a-judge”! ➡️VLM evaluators are flexible enough to assess based on any customized scoring criteria and don’t require a captioning model to pass the image to the evaluator. 💣 Currently, the ONLY possible option to do this is using GPT-4V!
1
1
3
@seungonekim
Seungone Kim
6 months
Notably, since our evaluation is in an absolute grading setting, we find that there is no length bias in the scores the evaluator VLM gives (i.e., it doesn’t prefer longer responses). 💡As mentioned in the MT-Bench paper, exploring the advantages and disadvantages of
Tweet media one
1
1
3
@seungonekim
Seungone Kim
6 months
⁉️ How did previous works typically evaluate how good a VLM is? Basically, there were two different approaches. 1️⃣ For Visual Question Answering and Captioning tasks, you would measure the word overlap or edit distance between the model’s prediction and the ground-truth label.
Tweet media one
1
1
3
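The two classic approaches mentioned above, word overlap and edit distance against a ground-truth label, can be computed in a few lines of plain Python. This is just a generic illustration of the metrics, not any specific benchmark's scorer.

```python
def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between prediction and reference tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = len(set(pred) & set(ref))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (single-row dynamic programming)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i          # prev = dp[i-1][0]
        for j, cb in enumerate(b, 1):
            # dp[j] is dp[i-1][j]; dp[j-1] is dp[i][j-1]; prev is dp[i-1][j-1]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

print(token_f1("a red apple on the table", "red apple on a table"))
print(edit_distance("kitten", "sitting"))  # 3
```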
@seungonekim
Seungone Kim
6 months
We test Prometheus-Vision’s evaluation capabilities on 3,560 instances across 8 benchmarks ranging from Visual Instruction Following, Visual Question Answering, and Captioning tasks. 😃Surprisingly, it consistently outperforms (1) all the other open-source VLMs, (2) Prometheus,
Tweet media one
1
1
3
@seungonekim
Seungone Kim
6 months
@openchatdev @_philschmid I'd be glad to help if there is anything unclear!
0
0
1
@seungonekim
Seungone Kim
7 months
@_philschmid Thank you for the shout out!! It's very rewarding as a grad student ☺️
0
0
2
@seungonekim
Seungone Kim
17 days
@Teknium1 I personally enjoyed reading through all the verbal feedback for each response instead of just having an average score 🙂
0
0
1
@seungonekim
Seungone Kim
9 months
@NeuralNeron Thanks for your interest in our work! It has the same speed as using Llama-2-Chat 7B & 13B. Using 4 A100 GPUs on huggingface tgi, it took less than 0.33 seconds to generate a feedback & score:)
0
0
1
@seungonekim
Seungone Kim
6 months
@alignment_lab @altryne That sounds great! Expanding to other modalities such as speech & video definitely seems like an interesting direction to pursue 🙂 We could have @sylee_ai join as well to make our discussions more fruitful!
0
0
0
@seungonekim
Seungone Kim
6 months
@alignment_lab Interested in seeing where this will go:)
0
0
0
@seungonekim
Seungone Kim
18 days
@LChoshen I think this is a good topic to explore. One thing that first comes to my mind is that we should be very careful when using AI models to judge humans; there might be ethical issues depending on the task!
0
0
1
@seungonekim
Seungone Kim
1 year
0
0
1
@seungonekim
Seungone Kim
1 year
@OrenElbaum Hello Oren, thank you for your interest in our work! ToT is effective at solving hard problems at the cost of additional inference. While one could obtain high-quality rationale data using ToT, investigating whether smaller LMs could learn it would be an interesting direction!
0
0
1
@seungonekim
Seungone Kim
18 days
@LChoshen We added our conjecture on why that might be the case in the paper. For this experiment, we wanted to show how the correlations might look if you use evaluation criteria/rubrics that are currently available in the community. Hence, for coarse-grained, we chose MT-Bench
1
0
1
@seungonekim
Seungone Kim
1 year
@tugot17 Hello Piotr, thanks for your interest in our work! We're planning to open source all of our data / models, so stay tuned🙂
0
0
1
@seungonekim
Seungone Kim
1 year
@hwchung27 Thanks for open sourcing the checkpoints, appreciate it!!😃
0
0
1
@seungonekim
Seungone Kim
2 years
The main motivation of this paper was to investigate whether adding commonsense inferences at the input (a typical method for commonsense knowledge injection) could work for datasets not explicitly made for commonsense reasoning. (2/N)
1
0
1
@seungonekim
Seungone Kim
2 years
To check how commonsense inferences might be helpful in summarizing dialogues, we experiment in a zero-shot setting where the only difference is whether inferences are given or not. We find that even with no training, commonsense inferences help. (6/N)
Tweet media one
1
0
1
@seungonekim
Seungone Kim
2 years
Lastly, we check the effect of the "commonsense supervision" task. We find that the auxiliary task makes the average attention values of the commonsense inference tokens higher in the upper layers. We conjecture that this reflects the fusion of semantic meaning / commonsense injection. (8/N)
Tweet media one
0
0
1
@seungonekim
Seungone Kim
2 years
As the main methodology, we cross-concatenate utterances and their corresponding commonsense inferences. Also, to encourage the model to use the given inferences, we add an auxiliary task and loss function whose objective is to generate the inferences of the summary. (5/N)
Tweet media one
1
0
1
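The cross-concatenation described above interleaves each utterance with its commonsense inferences before the dialogue is fed to the summarizer. A toy sketch of that input construction; the separator token and inference text are illustrative, not the paper's exact format.

```python
def build_input(utterances, inferences, sep="<I>"):
    """Interleave each utterance with its commonsense inferences (illustrative format)."""
    chunks = []
    for utt, infs in zip(utterances, inferences):
        chunks.append(utt)
        chunks.extend(f"{sep} {inf}" for inf in infs)
    return " ".join(chunks)

dialogue = ["Amy: I missed the bus again.", "Tom: I can drive you tomorrow."]
commonsense = [["Amy is likely frustrated."], ["Tom intends to help Amy."]]
print(build_input(dialogue, commonsense))
```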