🧵(2/n) We introduce PIE, a new benchmark of 853 questions across six fine-grained question types, based on a three-tiered hierarchy of a model’s interactive behaviors: challenging invalid question settings, seeking clarifications, and uncovering latent human