Research
I develop the science of data: a perspective that views data as an optimizable, expandable, and predictable component of modern AI systems, rather than as i.i.d. samples in statistical learning theory or a static input for model training in deep learning practice. To this end, I think about the following questions:
- How can we quantify the value of data in a principled way, and use this understanding to guide better data selection and filtering?
- When data is user-contributed and privacy-sensitive, how can we fully leverage it without compromising privacy?
- As web-scale corpora plateau, can synthetic data close the access gap, and under what guarantees?
More specifically, I work on data attribution, synthetic data, and privacy. Below are selected works that best reflect my research focus and style. For the full list of publications, see my Google Scholar.
Selected Research
- A Unified Theory of Random Projection for Influence Functions
Pingbang Hu*, Yuzheng Hu*, Jiaqi W. Ma*, Han Zhao*
Preprint 2026
- ACTG-ARL: Differentially Private Conditional Text Generation with RL-Boosted Control
Yuzheng Hu, Ryan McKenna, Da Yu, Shanshan Wu, Han Zhao, Zheng Xu, Peter Kairouz
The 43rd International Conference on Machine Learning (ICML 2026)
- A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning
Yuzheng Hu*, Fan Wu*, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao
The 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025, Oral)
- Empirical Privacy Variance
Yuzheng Hu*, Fan Wu*, Ruicheng Xian, Yuhang Liu, Lydia Zakynthinou, Pritish Kamath, Chiyuan Zhang, David Forsyth
The 42nd International Conference on Machine Learning (ICML 2025)
- Most Influential Subset Selection: Challenges, Promises, and Beyond
Yuzheng Hu, Pingbang Hu, Han Zhao, Jiaqi W. Ma
The 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
- SoK: Privacy-Preserving Data Synthesis
Yuzheng Hu*, Fan Wu*, Qinbin Li, Yunhui Long, Gonzalo Munilla Garrido, Chang Ge, Bolin Ding, David Forsyth, Bo Li, Dawn Song
The 45th IEEE Symposium on Security and Privacy (S&P 2024)
- Towards Understanding the Data Dependency of Mixup-style Training
Muthu Chidambaram, Xiang Wang, Yuzheng Hu, Chenwei Wu, Rong Ge
The 10th International Conference on Learning Representations (ICLR 2022, Spotlight)

