PhD Seminar: Yen-Chieh Lien, Data Generation for Weakly Supervised Neural Retrieval
Speaker
Yen-Chieh Lien
Abstract
To address the scarcity of training data in neural retrieval, weak supervision leverages existing ranking methods to automatically generate pseudo relevance judgments. However, several limitations remain. First, the size and accessibility of the query collection make it difficult for existing methods to generate a sufficiently large and diverse set of weak signals. Second, while empirical and theoretical evidence shows that a weakly supervised neural ranking model can outperform the original ranker, the quality of the signal provided by the original ranker correlates strongly with, and ultimately constrains, the performance of the weakly supervised model.
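As a minimal illustration of this setup, the sketch below uses BM25 (via the rank_bm25 package) as the weak labeler whose scores become pseudo relevance judgments; the ranker, library, and toy data are illustrative assumptions, not the dissertation's exact pipeline.

    # Weak supervision sketch: an unsupervised ranker (BM25) scores
    # documents per query, and its scores serve as noisy relevance labels.
    from rank_bm25 import BM25Okapi

    corpus = [
        "neural ranking models need large training sets",
        "bm25 is a classic lexical retrieval baseline",
        "weak supervision derives labels from existing rankers",
    ]
    queries = ["weak supervision for ranking", "lexical retrieval baseline"]

    tokenized = [doc.split() for doc in corpus]
    bm25 = BM25Okapi(tokenized)

    pseudo_judgments = []  # (query, doc_id, weak_label) triples
    for q in queries:
        scores = bm25.get_scores(q.split())
        # Treat each BM25 score as a noisy relevance judgment for training.
        for doc_id, score in enumerate(scores):
            pseudo_judgments.append((q, doc_id, score))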
To overcome these limitations, the dissertation employs neural generative approaches to data generation in weak supervision settings. To address query scarcity, we design a query augmentation framework that uses GAN-based methods to expand an insufficient query set. Evaluation results indicate that augmentation improves ranking performance, particularly when the original query set is too small to support weakly supervised training. For E-commerce applications, we devise an ensemble approach that generates pseudo queries from customer reviews. The generated queries resemble real customer queries and effectively improve ranking performance within the weak supervision framework.
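As one illustration of neural pseudo-query generation (a simple stand-in for the GAN-based and ensemble generators described above), the sketch below samples candidate queries from a review with a doc2query-style sequence-to-sequence model; the Hugging Face checkpoint name is an assumption, and an ensemble would add filtering or voting over the samples.

    # Pseudo-query generation sketch: sample queries a customer might
    # issue for a product, conditioned on one of its reviews.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "doc2query/msmarco-t5-base-v1"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    review = "The battery lasts two full days and charges in under an hour."
    inputs = tokenizer(review, return_tensors="pt")
    # Sample several candidates; an ensemble would score or filter them here.
    outputs = model.generate(**inputs, max_new_tokens=24,
                             do_sample=True, num_return_sequences=3)
    pseudo_queries = tokenizer.batch_decode(outputs, skip_special_tokens=True)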
To tackle the challenge of weak labeler quality, we propose a framework called generalized weak supervision (GWS), which extends the definition of a weak labeler to include the weakly supervised model itself. Through iterative re-labeling, the quality of the pseudo relevance judgments improves without any additional data. We present four implementations of the GWS framework and demonstrate significant gains on ranking tasks over standard weak supervision.
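The iterative re-labeling idea admits a compact sketch; train() and label() below are hypothetical placeholders for a ranker-training routine and a pseudo-judgment generator, not the four implementations themselves.

    # GWS sketch: the weakly supervised model itself becomes the weak
    # labeler for the next round, refreshing the pseudo judgments.
    def generalized_weak_supervision(initial_labeler, queries, docs,
                                     train, label, rounds=3):
        labeler = initial_labeler
        model = None
        for _ in range(rounds):
            # Re-label the collection with the current weak labeler.
            pseudo_judgments = label(labeler, queries, docs)
            # Train a new ranker on the refreshed pseudo judgments.
            model = train(pseudo_judgments)
            # The trained model replaces the labeler for the next round.
            labeler = model
        return model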
Finally, we extend the weak signals generated by large language models (LLMs) from labels to natural-language explanations. In this extended form, the ranking ability of LLMs transfers into smaller models more effectively.
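One hedged way to picture this extension: each distillation example pairs the LLM's relevance label with its rationale, so the student model learns to reproduce both; llm_judge() below is a hypothetical stand-in for the LLM call, not the dissertation's actual interface.

    # Sketch: extend a weak label to a label-plus-explanation target
    # for distilling an LLM ranker into a smaller model.
    def build_distillation_example(query, passage, llm_judge):
        # llm_judge is assumed to return (label, free-text rationale).
        label, explanation = llm_judge(query, passage)
        source = f"Query: {query}\nPassage: {passage}\nRelevant?"
        target = f"{label}. Explanation: {explanation}"
        return {"input": source, "output": target}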
Advisors
W. Bruce Croft and Hamed Zamani