Tianle Cai

Machine learning PhD @Princeton. Life-long learner, hacker, and builder.


About me

Hey there! I'm Tianle Cai (蔡天乐, pronounced Tyen-luh Tseye), a PhD candidate at Princeton advised by Professors Kai Li and Jason D. Lee. I completed my undergrad at Peking University in applied mathematics and computer science, guided by Professors Liwei Wang and Di He.

I was a part-time researcher at Together.ai, working with Tri Dao. I've also worked at Google DeepMind with Xuezhi Wang and Denny Zhou, and at Microsoft Research with Sébastien Bubeck and Debadeepta Dey.

My recent research focuses on designing more efficient systems for large models through system-architecture co-design. I'm also broadly interested in network architecture design, representation learning, and optimization.

If we share common interests, you're interested in a potential collaboration, or you simply want to connect for a chat, feel free to contact me. I'm always open to conversation :)


Selected Projects


Inference Efficiency of Large Models

(ICML 2024) Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Tianle Cai*, Yuhong Li*, Zhengyang Geng, Hongwu Peng, Tri Dao

🌟Highlight: Medusa is a simple framework that democratizes acceleration techniques for LLM generation by adding multiple decoding heads, achieving an impressive speedup of more than 2x.

[arXiv] [🖥️Code] [📖Blog]  

(ICLR 2024) Large Language Models as Tool Makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, Denny Zhou

🌟Highlight: Tools can boost the productivity of LLMs, but what if there isn't a suitable tool? Let LLMs build their own! We introduce `LLMs As Tool Makers`: one LLM serves as the Tool Maker👩🏻‍🎓 to make new tools🔨, while another LLM serves as the Tool User👨🏻‍🔧 to solve new problems with those tools.

[arXiv] [🖥️Code]   

BitDelta: Your Fine-Tune May Only Be Worth One Bit

James Liu*, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai*

🌟Highlight: BitDelta compresses the weight delta between a fine-tuned LLM and its base model down to 1 bit, enabling accurate and efficient multi-tenant serving.

[arXiv] [🖥️Code] [📖Blog]  

(NAACL 2024) REST: Retrieval-Based Speculative Decoding

Zhenyu He*, Zexuan Zhong*, Tianle Cai*, Jason D Lee, Di He

🌟Highlight: REST is a plug-and-play method for accelerating language model decoding without any training. It retrieves common phrases from a datastore and lets models verify them in parallel.

[arXiv]  [📖Blog]  

(CVPR 2024) DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Muyang Li*, Tianle Cai*, Jiaxin Cao, Qinsheng Zhang, Han Cai, Junjie Bai, Yangqing Jia, Ming-Yu Liu, Kai Li, Song Han

🌟Highlight: DistriFusion is a training-free algorithm to harness multiple GPUs to accelerate diffusion model inference without sacrificing image quality.

[arXiv] [🖥️Code] [📖Blog]  


Long-Context Sequence Modeling

(ICLR 2023) What Makes Convolutional Models Great on Long Sequence Modeling?

Yuhong Li*, Tianle Cai*, Yi Zhang, Deming Chen, Debadeepta Dey

🌟Highlight: We find two simple principles behind the success of global convolutional models on long sequence modeling: 1) Parameterization efficiency; 2) Decaying structure in kernel weights.

[arXiv] [🖥️Code]   

(NeurIPS 2021) Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu

🌟Highlight: Enables fast relative positional encoding in kernelized attention and stabilizes training via the Fast Fourier Transform.

[arXiv]    


Learning with Graph Structure

(NeurIPS 2021) Do Transformers Really Perform Bad for Graph Representation?

Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, Tie-Yan Liu

🌟Highlight: Makes Transformers great again on graph classification by introducing three graph structural encodings! Achieves SOTA performance on several benchmarks and was the winning solution of the OGB-LSC challenge!

[arXiv] [🖥️Code]   

(ICML 2021) GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training

Tianle Cai*, Shengjie Luo*, Keyulu Xu, Di He, Tie-Yan Liu, Liwei Wang

🌟Highlight: A principled normalization scheme specially designed for graph neural networks. Achieves SOTA on several graph classification benchmarks.

[arXiv] [🖥️Code]   


Distribution Shift

(NeurIPS 2021) Towards a Theoretical Framework of Out-of-Distribution Generalization

Haotian Ye, Chuanlong Xie, Tianle Cai, Ruichen Li, Zhenguo Li, Liwei Wang

🌟Highlight: We formalize what out-of-distribution (OOD) generalization is and derive generalization bounds and a model selection algorithm within our framework.

[arXiv]    

(ICML 2021) A Theory of Label Propagation for Subpopulation Shift

Tianle Cai*, Ruiqi Gao*, Jason D. Lee*, Qi Lei*

🌟Highlight: Subpopulation shift is a ubiquitous component of natural distribution shift. We propose a general theoretical framework for learning under subpopulation shift based on label propagation, and our insights can help improve domain adaptation algorithms.

[arXiv]    


Miscellaneous

(ICML 2021) Towards Certifying L_inf Robustness using Neural Networks with L_inf-dist Neurons

Bohang Zhang, Tianle Cai, Zhou Lu, Di He, Liwei Wang

🌟Highlight: A new architecture with inherent L_inf robustness and a tailored training pipeline. Achieves SOTA performance on several benchmarks!

[arXiv]    

(NeurIPS 2020) Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot

Jingtong Su*, Yihang Chen*, Tianle Cai*, Tianhao Wu, Ruiqi Gao, Liwei Wang, Jason D. Lee

🌟Highlight: We sanity-check several existing pruning methods and find that the performance of a large group of them relies only on the per-layer pruning ratio. This finding inspires us to design an efficient, data-independent, training-free pruning method as a byproduct.

[arXiv]    

(NeurIPS 2019) Convergence of Adversarial Training in Overparametrized Neural Networks

Ruiqi Gao*, Tianle Cai*, Haochuan Li, Liwei Wang, Cho-Jui Hsieh, Jason D. Lee

🌟Highlight: For overparameterized neural networks, we prove that adversarial training can converge to a global minimum (with loss 0).

[arXiv]