TY - JOUR
T1 - Private Fine-tuning of Large Language Models with Zeroth-order Optimization
AU - Tang, Xinyu
AU - Panda, Ashwinee
AU - Nasr, Milad
AU - Mahloujifar, Saeed
AU - Mittal, Prateek
N1 - Publisher Copyright: © 2025, Transactions on Machine Learning Research. All rights reserved.
PY - 2025/1
Y1 - 2025/1
AB - Differentially private stochastic gradient descent (DP-SGD) allows models to be trained in a privacy-preserving manner but has proven difficult to scale to the era of foundation models. We introduce DP-ZO, a private fine-tuning method for large language models that privatizes zeroth-order optimization. A key insight behind the design of our method is that the gradient direction in the zeroth-order optimization we use is random, and the only information from the training data is the step size, i.e., a scalar. Therefore, we only need to privatize the scalar step size, which is memory-efficient. DP-ZO provides a strong privacy-utility trade-off across different tasks and model sizes, comparable to DP-SGD under (ε, δ)-DP. Notably, DP-ZO has significant advantages over DP-SGD in memory efficiency and obtains higher utility under pure ε-DP when using the Laplace mechanism.
UR - http://www.scopus.com/inward/record.url?scp=85219547470&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85219547470&partnerID=8YFLogxK
M3 - Article
SN - 2835-8856
VL - 2025-January
SP - 1
EP - 27
JO - Transactions on Machine Learning Research
JF - Transactions on Machine Learning Research
ER -
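
A minimal Python sketch of the idea described in the abstract, appended here for illustration only (it is not part of the bibliographic record): a two-point, SPSA-style zeroth-order estimate along a random direction, where only the data-dependent scalar is clipped and noised. All names, signatures, and parameters below are hypothetical; the noise calibration is schematic, not the paper's exact mechanism.

```python
import numpy as np

def dp_zo_step(theta, loss_fn, batch, lr=1e-4, mu=1e-3,
               clip=1.0, noise_scale=1.0, rng=None):
    """One DP-ZO-style update (sketch): only the scalar step size is privatized.

    loss_fn(theta, example) -> float is a hypothetical per-example loss.
    """
    rng = rng or np.random.default_rng()
    # Random perturbation direction; it is data-independent, so it needs no noise.
    z = rng.standard_normal(theta.shape)
    # Per-example scalar: two-point finite-difference estimate of the
    # directional derivative of the loss along z.
    scalars = np.array([
        (loss_fn(theta + mu * z, ex) - loss_fn(theta - mu * z, ex)) / (2 * mu)
        for ex in batch
    ])
    # Clip each per-example scalar to bound its sensitivity, then aggregate.
    clipped = np.clip(scalars, -clip, clip)
    # Gaussian noise on the aggregated scalar targets (eps, delta)-DP;
    # swapping in Laplace noise here would target pure eps-DP instead.
    noisy = (clipped.sum() + rng.normal(0.0, noise_scale * clip)) / len(batch)
    # The update moves along the random direction z, scaled by the private scalar,
    # so no per-parameter gradient ever needs to be stored.
    return theta - lr * noisy * z
```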