PhD Dissertation Proposal Defense: Yixiao Song, Advancing AI Factuality via Comprehensive Evaluation
Content
Speaker
Abstract
The rapid advancement of AI models in natural language generation often outpaces the development of reliable evaluation metrics, making it difficult to assess model quality and capture the nuances of model performance across tasks such as long-form question answering (LFQA) [Krishna et al., 2021], machine translation [Karpinska and Iyyer, 2023], and instruction following [Pham et al., 2024]. Consequently, there is a pressing need for evaluation methodologies that not only keep pace with AI’s rapid development but also set higher standards for AI systems to achieve.
This thesis addresses the urgent need for more robust evaluation methodologies by developing scalable, accurate tools and benchmarks for factuality assessment in AI-generated content. It begins by identifying key limitations in current evaluation practices through an analysis of long-form question answering, where automatic metrics and crowd-sourced annotations often fail to detect factual errors that experts can reliably catch. Building on these insights, the thesis introduces VeriScore, a general-purpose factuality metric that leverages improved claim extraction, web-based evidence retrieval, and fine-tuned verification models to offer accurate and efficient assessment aligned with human judgments. Finally, the thesis presents BearCubs, a new benchmark designed to evaluate AI agents' ability to identify factual information from open-ended, real-world, and multimodal environments, such as interacting with live web content and navigating complex visual tasks. BearCubs reveals substantial performance gaps between humans and current state-of-the-art agents, motivating new directions in agent evaluation and training. Together, these contributions aim to close the gap between human-level evaluation and scalable automation, setting higher standards for factuality and reliability in AI systems.
Advisor
Mohit Iyyer