Speaker

Sohaib Ahmad

Abstract

The exponential growth of deep learning (DL) usage has led to a significant increase in the demand for computational resources. However, the computational capabilities of the underlying hardware used to train and deploy these models have not progressed at the same rate, leading to resource constraints and increased operational costs. Model serving, which dominates the lifecycle of DL models, constitutes the majority of these costs. Therefore, it has become increasingly critical to develop resource-efficient methods to serve DL models.

This thesis aims to maximize the resource efficiency of DL model serving by optimizing resource allocation, thereby reducing serving costs while ensuring high performance and response quality. We first introduce a model serving system that employs accuracy scaling, which adjusts the accuracy of served requests in response to demand variations, to increase serving capacity with minimal accuracy degradation. We then generalize accuracy scaling to inference pipelines with complex dependencies and integrate it with traditional hardware scaling to minimize serving costs and latency violations. Using model cascades, we enhance accuracy scaling with query awareness, identifying easier queries and routing them to lightweight models to improve serving throughput without sacrificing accuracy. Finally, we present a distributed edge-cloud model serving system that selectively offloads inference queries from expensive cloud servers to the edge in a query-aware manner to further reduce serving costs.
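
To give a flavor of the query-aware routing idea behind model cascades, the minimal sketch below shows a two-stage, confidence-threshold cascade in Python. The model interfaces, threshold value, and function names are illustrative assumptions, not the specific systems developed in the thesis.

```python
# Minimal sketch of a confidence-threshold model cascade (illustrative only).
# The models, threshold, and query format are assumptions made for this example,
# not the systems described in the thesis.
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CascadeConfig:
    confidence_threshold: float = 0.85  # assumed cutoff for treating a query as "easy"


def serve_with_cascade(
    query: object,
    light_model: Callable[[object], Tuple[str, float]],  # fast, lower-accuracy model
    heavy_model: Callable[[object], Tuple[str, float]],  # slow, higher-accuracy model
    config: CascadeConfig = CascadeConfig(),
) -> str:
    """Route a query through a two-stage cascade.

    Easy queries are answered by the lightweight model when its confidence
    clears the threshold; the rest fall through to the heavyweight model.
    """
    label, confidence = light_model(query)
    if confidence >= config.confidence_threshold:
        return label                  # easy query: the cheap path suffices
    label, _ = heavy_model(query)     # hard query: escalate to the large model
    return label
```

In such a cascade, raising the confidence threshold trades throughput for accuracy: more queries escalate to the heavyweight model, increasing cost but reducing the chance of a wrong answer from the lightweight model.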

Advisor

Ramesh Sitaraman

Hybrid event