PhD Dissertation Proposal Defense: Oindrila Saha, Fine-Grained Recognition with Limited Supervision
Speaker
Oindrila Saha
Abstract
Enabling machines to perform visual recognition has been a long-standing challenge. While humans can understand and recognize concepts from just a few examples, deep learning models require large-scale datasets to learn effectively. The cost of obtaining precise annotations at scale becomes prohibitive for tasks that demand expert knowledge. These tasks, termed fine-grained recognition, remain a significant challenge for machine learning systems. Fine-grained recognition encompasses a variety of problems that require complex and detailed understanding, ranging from classifying images of visually similar bird species to spatially fine-grained tasks such as identifying individual parts within objects. To advance progress in this domain, methods that minimize reliance on costly supervision while maintaining or improving fine-grained recognition performance are essential.
This thesis advances fine-grained recognition under limited supervision through several complementary approaches. First, we evaluate generative and discriminative representation learning strategies for few-shot part segmentation, identifying strengths and trade-offs with respect to performance, robustness, and computational demands. We then introduce a novel method to discover and contrast object parts within images, improving both classification and segmentation accuracy. Next, we show how integrating coarse annotation modalities, such as keypoint or foreground-background labels, can improve dense part segmentation beyond what is achievable using only limited dense annotations. Finally, we explore natural language as a powerful source of fine-grained information, leveraging large language models to generate weakly supervised text descriptions that adapt vision-language representations, leading to better classification on fine-grained tasks.
Future directions for improving vision-language models with limited supervision involve several strategies. First, we can exploit the structure among test images to improve classification without the need for labeled data. Another direction is training generative language models that can reason about images with limited annotations; this can be accomplished by generating synthetic data through language-based generative models and incorporating external or self-generated feedback to guide the training process. Finally, large-scale, community-driven platforms such as iNaturalist can be utilized for training vision-language models.
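To make the vision-language direction concrete, below is a minimal sketch of the general recipe the abstract alludes to: scoring images against class prompts augmented with LLM-generated descriptions using a pretrained CLIP model. This is an illustrative example under assumed inputs, not the specific adaptation method developed in the thesis; the bird classes and descriptions are hypothetical placeholders for what an LLM would produce.

import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated descriptions per class (placeholders).
class_descriptions = {
    "indigo bunting": ["a small songbird with vivid blue plumage",
                       "a bird with a short conical beak"],
    "blue grosbeak":  ["a stocky blue bird with chestnut wing bars",
                       "a bird with a large silver beak"],
}

# Build one text embedding per class by averaging over its descriptions.
with torch.no_grad():
    class_embs = []
    for name, descs in class_descriptions.items():
        prompts = [f"a photo of a {name}, {d}" for d in descs]
        tokens = clip.tokenize(prompts).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_embs.append(emb.mean(dim=0))
    text_matrix = torch.stack(class_embs)
    text_matrix = text_matrix / text_matrix.norm(dim=-1, keepdim=True)

def classify(image):
    """Return the class whose description embedding best matches a PIL image."""
    with torch.no_grad():
        x = preprocess(image).unsqueeze(0).to(device)
        img_emb = model.encode_image(x).float()
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        scores = img_emb @ text_matrix.T  # cosine similarity per class
    return list(class_descriptions)[scores.argmax().item()]

The key design choice this sketch illustrates is that no image labels are needed: supervision enters only through text, which an LLM can generate cheaply at scale.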
Advisor
Subhransu Maji