Visual perception involves a range of functions for processing diverse kinds of visual input. It also requires a high-level understanding of the relationships among, and intrinsic properties of, various types of visual data. For instance, autonomous driving depends on the coordinated interplay of lane detection, tracking of humans and vehicles, recognition of traffic signs, and more. In this talk, I will explore methods for efficiently combining insights from different perception tasks to build an integrated AI vision system. I will also review my work on integrating various forms of visual supervision and jointly training a deep-learning model for all tasks. This integrated system can be further improved by studying scaling laws in large-scale joint vision training.
Advisor: Erik Learned-Miller