UMass AI&Sec SP'25 Seminar: Harsh Chaudhari, Propagation of Adversarial Bias to Distilled Language Models
Content

Speaker
Abstract
The widespread deployment of Large Language Models (LLMs) trained by knowledge distillation is increasingly raising concerns about their resilience to adversarial manipulation.This paper investigates the vulnerability of distilled language models to adversarial injection of biased content during training. More broadly, we demonstrate how malicious vendors can inject adversarial biased data into a large ``teacher'' LLM's training set, causing the adversarial bias to not only propagate to a smaller "student" model, but also become amplified. Using data poisoning techniques, we manipulate the teacher's output to include adversarial bias in the generated content, such as promoting a particular brand or generating phishing links. We show that the attack transfers to the student model, where the adversarial bias becomes even more pronounced and impacts unseen tasks. With only 25 poisoned samples, or 0.25% poisoning rate in the teacher's training data, the student model generates a large fraction 76.9% of biased responses. Moreover, the student model’s fraction of biased responses is 8.1x higher on unseen tasks compared to the teacher model. Our findings highlight significant security and trustworthiness concerns for distilled language models deployed in user-facing applications.
Bio
Harsh Chaudhari is a fourth-year PhD student in Computer Science at Northeastern University, under the supervision of Professor Alina Oprea. His research focuses on the security and privacy of machine learning models, particularly in understanding and mitigating threats through adversarial attacks. Harsh has had several internship experiences, the most recent of which was at Google Deepmind, where he worked on adversarial bias in language models. He has published his works at notable venues such as ICLR, Oakland, and NDSS.
Host