Beyond their use in assisting human evaluation (e.g. CriticGPT), can critiques directly enhance preference learning? During my @Cohere internship, we explored using synthetic critiques from large language models to improve reward models.
📑Preprint:
Scaling experiments show that high-quality critiques significantly improve data efficiency, especially in low-data regimes: one critique-augmented example is roughly worth 40 vanilla preference pairs. Our method uses open-source models, making it accessible and budget-friendly.
Instead of relying solely on better/worse annotations, we enrich reward models with synthetic critiques from LLMs that dissect features of each completion. The model is then trained to predict scalar rewards conditioned on these critiques.
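For intuition, here is a minimal sketch of that idea, not the exact setup from the preprint: the backbone name, prompt template, and pooling choice are placeholder assumptions, and the pairwise Bradley-Terry loss is the standard reward-model objective.

```python
# Sketch: a reward model that scores a completion *conditioned on* a critique.
# Backbone, template, and pooling are illustrative assumptions, not the paper's.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "distilroberta-base"  # placeholder backbone

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
backbone = AutoModel.from_pretrained(MODEL_NAME)
reward_head = torch.nn.Linear(backbone.config.hidden_size, 1)  # scalar reward


def score(prompt: str, completion: str, critique: str) -> torch.Tensor:
    # Concatenate prompt, completion, and critique into one context so the
    # predicted reward is conditioned on the critique.
    text = f"Prompt: {prompt}\nCompletion: {completion}\nCritique: {critique}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = backbone(**inputs).last_hidden_state[:, 0]  # first-token pooling
    return reward_head(hidden).squeeze(-1)


def pairwise_loss(prompt, chosen, chosen_crit, rejected, rejected_crit):
    # Standard Bradley-Terry loss on a (chosen, rejected) pair, where each
    # completion is paired with its own synthetic critique.
    r_chosen = score(prompt, chosen, chosen_crit)
    r_rejected = score(prompt, rejected, rejected_crit)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
```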
Results show that adding critiques improves reward model accuracy. We found that critique quality matters: high-quality critiques boost performance, while low-quality ones can hinder it.
Thank you @stefan_fee, @JinlanFu, and Professor @gneubig for all your guidance and support! I'm interested in exploring the use of performance prediction in many more scenarios.
How can we reliably estimate a system's performance without performing experiments? Check out our work (EACL)
1. Formulate performance prediction as a tensor completion problem (a toy sketch of this idea is below)
2. Establish a set of reliability analysis mechanisms (confidence, calibration)
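A toy illustration of point 1, under assumptions of my own: scores are arranged as a (model x dataset x metric) tensor with missing entries, and a rank-R CP factorization fit by gradient descent fills in the unobserved cells. This is just one way to instantiate tensor completion, not necessarily the algorithm in the paper.

```python
# Toy tensor completion for performance prediction (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
n_models, n_datasets, n_metrics, rank = 8, 6, 3, 2

# Ground-truth scores; mask marks which (model, dataset, metric) cells were
# actually measured (~60% observed).
true_scores = rng.uniform(0.3, 0.9, size=(n_models, n_datasets, n_metrics))
mask = rng.random(true_scores.shape) > 0.4

# Factor matrices of a rank-R CP decomposition, fit by gradient descent on the
# observed entries only.
A = rng.normal(0, 0.1, (n_models, rank))
B = rng.normal(0, 0.1, (n_datasets, rank))
C = rng.normal(0, 0.1, (n_metrics, rank))
lr = 0.05

for _ in range(2000):
    pred = np.einsum("ir,jr,kr->ijk", A, B, C)
    resid = np.where(mask, pred - true_scores, 0.0)  # ignore missing cells
    gA = np.einsum("ijk,jr,kr->ir", resid, B, C)
    gB = np.einsum("ijk,ir,kr->jr", resid, A, C)
    gC = np.einsum("ijk,ir,jr->kr", resid, A, B)
    A -= lr * gA
    B -= lr * gB
    C -= lr * gC

# Predictions for the cells that were never measured.
completed = np.einsum("ir,jr,kr->ijk", A, B, C)
```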
@Zhuang_Li_NLP Thank you for your comment! Yes, the RM is trained on instruction-response-critique triplets. We haven't run experiments that train the LLM itself with critiques yet, but in practice, yes, this means passing the critique as part of the context.