Speaker

Yanlei Diao

Abstract

Big data analytics in the cloud has achieved widespread adoption, revolutionizing data-driven insight discovery and decision-making for businesses and applications. However, unlocking economies of scale hinges on configuring analytical jobs efficiently, so that each job quickly reaches a good performance-cost trade-off for the user. Existing methods, often heuristic-based or limited to coarse-grained control, fall short for three key reasons: (i) predicting performance is hard amid varying job characteristics and dynamic runtime environments; (ii) Pareto-optimal resource optimization solutions with good coverage, efficiency, and consistency are difficult to develop; and (iii) advanced systems such as MaxCompute and Spark now expose flexible configuration controls, turning tuning into a more complex hierarchical or adaptive optimization problem.

This thesis aims to answer two questions: (i) how to build performance models for analytical jobs in big data systems, and (ii) how to efficiently automate resource optimization across different granularities and system settings. My first contribution is a model server designed for diverse data analytics scenarios. For jobs with unknown properties, it uses an in-situ modeling approach that learns job characteristics and performance in the execution environment. This process is enhanced by an autoencoder and a customized triplet loss, which together disentangle job embeddings from runtime metrics for accurate performance prediction. For SQL jobs whose query plans are available, the server employs an ex-ante modeling approach, using a multi-channel input framework to process heterogeneous data structures and learn job performance.
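To make the modeling idea concrete, here is a minimal sketch, assuming a PyTorch setting: an autoencoder compresses runtime traces into a compact job embedding, and a triplet loss pulls together embeddings of runs of the same job while pushing apart embeddings of different jobs. All layer sizes, names, and the trace format are illustrative assumptions, not the thesis implementation.

```python
import torch
import torch.nn as nn

# Encoder/decoder sizes and the 64-dim "trace" input are illustrative only.
class JobAutoencoder(nn.Module):
    def __init__(self, trace_dim=64, embed_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(trace_dim, 32), nn.ReLU(), nn.Linear(32, embed_dim))
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, trace_dim))

    def forward(self, trace):
        z = self.encoder(trace)      # compact job embedding
        return z, self.decoder(z)    # reconstruction for the autoencoder loss

model = JobAutoencoder()
triplet = nn.TripletMarginLoss(margin=1.0)  # stand-in for the customized triplet loss
recon = nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# anchor/positive: two runs of the same job; negative: a run of a different job
anchor, positive, negative = (torch.randn(8, 64) for _ in range(3))
z_a, rec_a = model(anchor)
z_p, _ = model(positive)
z_n, _ = model(negative)
loss = recon(rec_a, anchor) + triplet(z_a, z_p, z_n)
opt.zero_grad(); loss.backward(); opt.step()
```

The reconstruction term keeps the embedding faithful to the observed metrics, while the triplet term organizes the embedding space by job identity, which is what allows job characteristics to be separated from run-specific noise.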

The second contribution is an intelligent resource optimizer with three components covering different optimization settings. The first component enhances an existing unified data analytics optimizer with a custom gradient-based solver, boosting its efficiency within a multi-objective optimization framework. The second component is a stage-level resource optimization method designed for a production-scale scheduler, mapping millions of computation tasks to machines and resource profiles in sub-second time. The last component is an adaptive, multi-granularity framework for Spark SQL that tunes parameters both at compile time and at runtime, integrating with Spark's adaptive query execution.
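As one hedged illustration of how a gradient-based multi-objective solver can trace performance-cost trade-offs, the sketch below sweeps weights over a scalarized latency/cost objective and records one Pareto point per weight. The differentiable latency and cost models are toy placeholders standing in for the thesis's learned performance models, and the weighted-sum scheme is one textbook scalarization, not necessarily the solver used in the thesis.

```python
import torch

# Toy differentiable models: more cores lower latency but raise cost.
def latency(cores):
    return 100.0 / (cores + 1.0)

def cost(cores):
    return 0.5 * cores

pareto = []
for w in (0.1, 0.3, 0.5, 0.7, 0.9):       # preference weight on latency
    cores = torch.tensor(4.0, requires_grad=True)
    opt = torch.optim.Adam([cores], lr=0.5)
    for _ in range(500):
        obj = w * latency(cores) + (1 - w) * cost(cores)  # weighted-sum scalarization
        opt.zero_grad(); obj.backward(); opt.step()
        with torch.no_grad():
            cores.clamp_(1.0, 64.0)        # keep the configuration feasible
    pareto.append((float(latency(cores)), float(cost(cores))))

print(pareto)  # one (latency, cost) trade-off point per weight
```

In the actual systems, the same gradient machinery would be driven by learned performance models over many configuration knobs and would return a Pareto front of recommended configurations rather than a handful of scalarized points.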