Speaker

Sohaib Ahmad

Abstract

The exponential growth of deep learning (DL) usage has led to a significant increase in the demand for computational resources. However, the computational capabilities of the underlying hardware used to train and deploy these models have not progressed at the same rate, leading to resource constraints and increased operational costs. Model serving, which dominates the lifecycle of DL models, constitutes the majority of these costs. Therefore, it has become increasingly critical to develop resource-efficient methods to serve DL models.

This thesis aims to maximize the resource efficiency of DL model serving by optimizing resource allocation, thereby reducing serving costs while ensuring high performance and response quality. We first introduce a model serving system that employs accuracy scaling, which adjusts the accuracy of served requests in response to demand variations, to increase serving capacity with minimal accuracy degradation. We then generalize accuracy scaling to inference pipelines with complex dependencies and integrate it with traditional hardware scaling to minimize serving costs and latency violations. Using model cascades, we enhance accuracy scaling with query awareness, identifying easier queries and routing them to lightweight models to improve serving throughput without sacrificing accuracy. Finally, we present a distributed edge-cloud model serving system that selectively offloads inference queries from expensive cloud servers to the edge in a query-aware manner to further reduce serving costs.
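
To give a flavor of the query-aware routing idea behind model cascades, the minimal sketch below shows a two-stage, confidence-threshold cascade in Python. The model interfaces, threshold value, and function names are illustrative assumptions, not the specific systems developed in the thesis.

```python
# Minimal sketch of a confidence-threshold model cascade (illustrative only).
# The models, threshold, and query format are assumptions made for this example,
# not the systems described in the thesis.
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CascadeConfig:
    confidence_threshold: float = 0.85  # assumed cutoff for treating a query as "easy"


def serve_with_cascade(
    query: object,
    light_model: Callable[[object], Tuple[str, float]],  # fast, lower-accuracy model
    heavy_model: Callable[[object], Tuple[str, float]],  # slow, higher-accuracy model
    config: CascadeConfig = CascadeConfig(),
) -> str:
    """Route a query through a two-stage cascade.

    Easy queries are answered by the lightweight model when its confidence
    clears the threshold; the rest fall through to the heavyweight model.
    """
    label, confidence = light_model(query)
    if confidence >= config.confidence_threshold:
        return label                  # easy query: the cheap path suffices
    label, _ = heavy_model(query)     # hard query: escalate to the large model
    return label
```

In such a cascade, raising the confidence threshold trades throughput for accuracy: more queries escalate to the heavyweight model, increasing cost but reducing the chance of a wrong answer from the lightweight model.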

Advisor

Ramesh Sitaraman

Hybrid event