About Me

Hello! I am a professor at NYU. I'm also a research scientist at Google DeepMind, part of the GenAI/nano 🍌 team. Before that, I was a research scientist at Facebook AI Research (FAIR) in Menlo Park for four years.

Most human and animal knowledge arises from sensory experience and interaction with the environment. I believe that achieving human-like intelligence requires moving beyond language-only systems toward world models that learn directly from continuous, real-world sensory input: systems capable of understanding, creating, reasoning, planning, and developing commonsense about the physical world.

To advance understanding, I have developed some of the most widely adopted architectures and representation learning systems in computer vision. Our recent work pushes the boundaries of multimodal intelligence and spatial supersensing across images and videos; see the selected publications below. For generation, I co-created diffusion transformers (DiT), a framework that powers most of today's leading generative systems, such as Sora. We have since accelerated DiT training by orders of magnitude.

My research has been cited over 85,000 times (as of Sep 2025), and I am honored to be a recipient of the Marr Prize Honorable Mention, the NSF CAREER Award, the AISTATS Test-of-Time Award, and the PAMI Young Researcher Award.

I will be on leave from NYU in Spring/Summer 2026.

Research Group

I have been fortunate to work with many exceptionally talented students and interns during my time at FAIR, GDM, and NYU, many of whom are now leaders in the AI field. These include Sanghyun Woo (nano banana, GDM), Bill Peebles (Head of Sora, OpenAI), Eric Mintun (Sora, OpenAI), Zihan Zhang (OpenAI), Zhuang Liu (professor at Princeton), Jiaxuan You (professor at UIUC), Bingyi Kang (researcher at TikTok), Demi Guo (Cofounder, Pika), Xun Huang (Cofounder, stealth startup), Ji Hou (Meta GenAI), Chenxi Liu (Meta, TBD Lab), Norman Mu (xAI), Suppakit Waiwitlikhit (xAI), and many others.

Selected Publications

(* indicates equal contribution)
For a full publication list, please refer to my Google Scholar page.
(Actually, the best way to stay updated on my latest research is to check there, as I may not update this website regularly.)
Diffusion Transformers with Representation Autoencoders
Technical Report, arXiv 2025
Boyang Zheng, Nanye Ma, Shengbang Tong, Saining Xie
Cambrian-S: Towards Spatial Supersensing in Video
Technical Report, arXiv 2025
Shusheng Yang*, Jihan Yang*, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
ICLR 2025
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin*, Saining Xie*
Oral Presentation
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps
CVPR 2025
Nanye Ma*, Shangyuan Tong*, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, Saining Xie
Highlight Presentation
Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces
CVPR 2025
Jihan Yang*, Shusheng Yang*, Anjali W. Gupta*, Rilyn Han*, Li Fei-Fei, Saining Xie
Oral Presentation
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
NeurIPS 2024
Shengbang Tong*, Ellis Brown*, Penghao Wu*, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie
Oral Presentation
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
NeurIPS 2024
Yuexiang Zhai*, Hao Bai†, Zipeng Lin†, Jiayi Pan†, Shengbang Tong†, Yifei Zhou†, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
On Scaling Up 3D Gaussian Splatting Training
ICLR 2025
Hexu Zhao, Haoyang Weng, Daohan Lu, Ang Li, Jinyang Li, Aurojit Panda, Saining Xie
Oral Presentation
V-IRL: Grounding Virtual Intelligence in Real Life
ECCV 2024
Jihan Yang, Runyu Ding, Ellis Brown, Xiaojuan Qi, Saining Xie
SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers
ECCV 2024
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, Saining Xie
Deconstructing Denoising Diffusion Models for Self-Supervised Learning
arXiv 2024
Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
CVPR 2024
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
Oral Presentation
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
CVPR 2024
Penghao Wu, Saining Xie
Image Sculpting: Precise Object Editing with 3D Geometry Control
CVPR 2024
Jiraphon Yenphraphai, Xichen Pan, Sainan Liu, Daniele Panozzo, Saining Xie
Demystifying CLIP Data
ICLR 2024
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
Spotlight Presentation
Scalable Diffusion Models with Transformers
ICCV 2023
William Peebles, Saining Xie
Oral Presentation
CiT: Curation in Training for Effective Vision-Language Data
ICCV 2023
Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
Going Denser with Open-Vocabulary Part Segmentation
ICCV 2023
Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
CVPR 2023
A ConvNet for the 2020s
CVPR 2022
SLIP: Self-supervision meets Language-Image Pre-training
ECCV 2022
Masked Feature Prediction for Self-Supervised Visual Pre-Training
CVPR 2022
Benchmarking Detection Transfer Learning with Vision Transformers
arXiv 2021
Masked Autoencoders Are Scalable Vision Learners
CVPR 2022
Oral Presentation
Pri3D: Can 3D Priors Help 2D Representation Learning?
ICCV 2021
An Empirical Study of Training Self-supervised Vision Transformers
ICCV 2021
Xinlei Chen*, Saining Xie*, Kaiming He
Oral Presentation
On Interaction Between Augmentations and Corruptions in Natural Corruption Robustness
NeurIPS 2021
Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts
CVPR 2021
Oral Presentation
Sample-Efficient Neural Architecture Search by Learning Action Space
TPAMI 2021

2020
Graph Structure of Neural Networks
ICML 2020
PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding
ECCV 2020
Spotlight Presentation
Are Labels Necessary for Neural Architecture Search?
ECCV 2020
Spotlight Presentation
Momentum Contrast for Unsupervised Visual Representation Learning
CVPR 2020
Best Paper Nomination (top 30)
Decoupling Representation and Classifier for Long-Tailed Recognition
ICLR 2020

2019
On Network Design Spaces for Visual Recognition
ICCV 2019
Exploring Randomly Wired Neural Networks for Image Recognition
ICCV 2019
Oral Presentation

Previous
Deep Representation Learning with Induced Structural Priors
Ph.D. Thesis, UC San Diego 2018
Saining Xie
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
ECCV 2018
Attentional ShapeContextNet for Point Cloud Recognition
CVPR 2018
Saining Xie*, Sainan Liu*, Zeyu Chen, Zhuowen Tu
Aggregated Residual Transformations for Deep Neural Networks
CVPR 2017
Top-down Learning for Structured Labeling with Convolutional Pseudoprior
ECCV 2016
Saining Xie*, Xun Huang*, Zhuowen Tu
Holistically-Nested Edge Detection
ICCV 2015
Saining Xie, Zhuowen Tu
Marr Prize Honorable Mention
Deeply-Supervised Nets
AISTATS 2015
Chen-Yu Lee*, Saining Xie*, Patrick Gallagher*, Zhengyou Zhang, Zhuowen Tu
Oral Presentation at the NeurIPS'14 Deep Learning Workshop
Hyper-class Augmented and Regularized Deep Learning for Fine-grained Image Classification
CVPR 2015