Uncertainty-Aware Computer Vision in Resource-Constrained Environments
Abstract
Recent breakthroughs in deep learning have led to staggering performance improvements across many domains. This has made autonomous systems a critical component of many real-world use cases, including Internet of Things (IoT) environments. This domain is especially challenging, as it imposes considerable environmental, networking, and hardware constraints on the models. Further compounding this challenge are the high-stakes decisions required in many real-world deployments. This thesis explores how to leverage uncertainty awareness to create more robust vision models for use in constrained environments. By estimating and communicating a model's uncertainty, we can generate more reliable and trustworthy predictions.
We first consider the problem of zero-shot image classification, where no labeled data is available for some classes. By utilizing a textual class hierarchy, we expose an accuracy-specificity trade-off that lets systems make more accurate, albeit less specific, predictions under the uncertainty induced by resource constraints. We then address the distributed execution of image classifiers. We split a neural network between an edge device and the cloud by performing a partial execution on the edge and sending the resulting latent features to the cloud for completion. This approach achieves lower latency than conventional methods. Merging these strategies, we build a distributed, hierarchical object detector and validate it with a prototype on ultra-low-power edge hardware.
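To make the hierarchical fallback concrete, the sketch below shows one way a classifier could trade specificity for accuracy by backing off to a coarser ancestor class when its confidence in the most likely leaf class is low. The toy hierarchy, class names, and confidence threshold are illustrative assumptions, not the hierarchy or method used in the thesis.

```python
# Hypothetical sketch of hierarchy-aware prediction under uncertainty.
# The hierarchy and threshold below are invented for illustration.

PARENT = {            # child -> parent in a toy class hierarchy
    "husky": "dog",
    "beagle": "dog",
    "tabby": "cat",
    "dog": "animal",
    "cat": "animal",
}

def is_ancestor(anc: str, leaf: str) -> bool:
    """True if `anc` lies on the path from `leaf` to the root."""
    node = leaf
    while node in PARENT:
        node = PARENT[node]
        if node == anc:
            return True
    return False

def class_mass(node: str, leaf_probs: dict) -> float:
    """Probability mass of a node = sum over leaf classes it subsumes."""
    return sum(p for leaf, p in leaf_probs.items()
               if leaf == node or is_ancestor(node, leaf))

def hedged_prediction(leaf_probs: dict, threshold: float = 0.9) -> str:
    """Back off from the top leaf class to coarser ancestors until the
    accumulated probability mass exceeds `threshold`."""
    node = max(leaf_probs, key=leaf_probs.get)
    while class_mass(node, leaf_probs) < threshold and node in PARENT:
        node = PARENT[node]   # trade specificity for accuracy
    return node

# An uncertain classifier hedges to "dog" instead of guessing "husky".
print(hedged_prediction({"husky": 0.55, "beagle": 0.40, "tabby": 0.05}))
```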
We next evaluate the edge runtime of recent transformer-based object detectors and show how their unique characteristics simplify reasoning about bounding-box uncertainty compared to earlier methods. Reasoning about uncertainty over bounding boxes nevertheless has various downsides. For that reason, we turn to geospatial tracking, where 3D points in space are predicted rather than boxes in the image plane. Using a multi-camera dataset with geospatial ground truth, we train a deep probabilistic model of an object's position, and the resulting predictions are fused using multi-observation Kalman trackers. We demonstrate how modeling the geometric transformation between the image plane and the world coordinate frame allows us to train geospatial detectors for tracking with far less data than end-to-end deep learning approaches. Furthermore, we output intuitive geospatial uncertainty estimates, generalize to unseen viewpoints, and provide straightforward support for multi-object tracking.
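As a sketch of the fusion step, the snippet below folds two cameras' probabilistic ground-plane position estimates (each a mean plus covariance) into a constant-velocity Kalman tracker. The state layout, noise levels, and camera readings are invented for illustration and do not reflect the thesis's actual tracker implementation.

```python
# Hypothetical multi-observation Kalman fusion of per-camera position
# estimates on the ground plane. All numeric values are assumptions.
import numpy as np

class GroundPlaneTracker:
    """Toy constant-velocity Kalman tracker over state [x, y, vx, vy]."""

    def __init__(self, dt: float = 0.1):
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)   # state transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)    # observe position only
        self.Q = 0.01 * np.eye(4)   # process noise (assumed value)
        self.x = np.zeros(4)        # state estimate
        self.P = np.eye(4)          # state covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q

    def update(self, z: np.ndarray, R: np.ndarray):
        """Fuse one camera's position estimate z (2,) with covariance R (2x2)."""
        y = z - self.H @ self.x                       # innovation
        S = self.H @ self.P @ self.H.T + R            # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

# One step: predict, then fold in each camera's probabilistic observation.
tracker = GroundPlaneTracker()
tracker.predict()
for z, R in [(np.array([2.0, 3.1]), 0.2 * np.eye(2)),   # camera A
             (np.array([2.2, 2.9]), 0.5 * np.eye(2))]:  # camera B, less certain
    tracker.update(z, R)
print(tracker.x[:2], np.diag(tracker.P)[:2])  # fused position and its variance
```

Because each observation carries its own covariance, a less certain camera automatically receives less weight in the fused estimate, which is what makes per-detection uncertainty estimates directly useful for tracking.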