Pichao Wang, PhD, Nvidia

In addition to research, I publish books that distill years of experience in AI and computer vision.

👉 Check out curated content on my official book site

Email: pichaowang@gmail.com Google Scholar ResearchGate LinkedIn

Biography

I am a principal machine learning engineer at Nvidia. Before joining Nvidia, I worked as a senior applied scientist at Amazon AGI and Prime Video for more than 3 years. Prior to that, I worked as a staff/senior engineer at DAMO Academy, Alibaba Group (U.S.) for more than 4 years. I received my Ph.D. in Computer Science from the University of Wollongong, Australia, in Oct. 2017, supervised by Prof. Wanqing Li and Prof. Philip Ogunbona. I received my M.E. in Information and Communication Engineering from Tianjin University, China, in 2013, supervised by Prof. Yonghong Hou, and my B.E. in Network Engineering from Nanchang University, China, in 2010.

Research Interests

Computer Vision · World Model · Visual Generation · LLM · MLLM · Speech Processing

Selected Awards and Honors

Apr. 2024, The Tony Stark Award of Prime Video
Oct. 2023, World’s Top 1% Scientist
Jun. 2022, Best Student Paper Award @CVPR2022
Jan. 2022, AI 2000 Most Influential Scholars certificate
Oct. 2021, World’s Top 2% Scientists
Jun. 2020, Second Prize, Multiple Object Tracking and Segmentation@CVPR2020
May 2018, EIS Faculty Postgraduate Thesis Award
Aug. 2017, Second Prize, Action, Gesture, and Emotion Recognition Workshop and Competitions: Large Scale Multimodal Gesture Recognition and Real versus Fake expressed emotions@ICCV2017
Apr. 2017, First Prize (Winner), Large Scale 3D Human Activity Analysis Challenge in Depth Video@ICME2017
Dec. 2016, Second Prize, Joint Contest on Multimedia Challenges Beyond Visual Analysis@ICPR2016
Dec. 2016, Third Prize, Joint Contest on Multimedia Challenges Beyond Visual Analysis@ICPR2016
Jan. 2013, Excellent Postgraduate Award
Dec. 2011, Excellent Prize, National Campus CUDA Programming Contest. certificate

Publications

Ph.D. Dissertation

Action Recognition from RGB-D Data. The University of Wollongong, 2017. (Best Postgraduate Thesis Award) link

Large Model Reports

The Amazon Nova Family of Models: Technical Report and Model Card link

Conference Papers (selected papers, full paper list)

Penghui Ruan, Bojia Zi, Xianbiao Qi, Youze Huang, Rong Xiao, Pichao WANG, Jiannong Cao, Yuhui Shi, “CTRL&SHIFT: High-quality Geometry-Aware Object Manipulation in Visual Generation”, ICLR 2026.
Jingyi Chen, Zhimeng Guo, Jiyun Chun, Pichao Wang, Andrew Perrault, Micha Elsner, “Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance”, EACL 2026.
Shuning Chang, Pichao Wang, Jiasheng Tang, Fan Wang, Yi Yang, “SparseDiT: Token Sparsification for Efficient Diffusion Transformer”, NeurIPS 2025.
Yufei Wang, Lanqing Guo, Zhihao Li, Jiaxing Huang, Pichao Wang, Bihan Wen, Jian Wang, “Training-Free Text-Guided Image Editing with Visual Autoregressive Model”, ICCV 2025
Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat, “CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation”, ACL 2025.
Jingyi Chen, Ju-Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault, “Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback”, Interspeech 2025
Zechen Bai, Tianjun Xiao, Tong He, Pichao WANG, Zheng Zhang, Thomas Brox, Mike Zheng Shou, “Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach”, ICLR 2025
Mingyue Huo, Abhinav Jain, Cong Phuoc Huynh, Fanjie Kong, Pichao Wang, Zhu Liu, Vimal Bhat, “Beyond Speaker Identity: Text Guided Target Speech Extraction”, ICASSP 2025
Minxue Niu, Najmeh Sadoughi, Abhishek Yanamandra, Pichao Wang, Zhu Liu, Vimal Bhat, and Sarah Norred, “Learning Rich Speech Representations with Acoustic-Semantic Factorization”, ICASSP 2025.
Penghui Ruan, Pichao Wang, Divya Saxena, Jiannong Cao, Yuhui Shi, “Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning”, NeurIPS 2024.
Jiamian Wang, Pichao Wang, Dongfang Liu, Qiang Guan, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao, “Diffusion-Inspired Truncated Sampler for Text-Video Retrieval”, NeurIPS 2024.
Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou, “One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos”, NeurIPS 2024.
Jiaqi Wang, Pichao Wang, Yi Feng, Huafeng Liu, Chang Gao, Liping Jing, “Align2Concept: Language Guided Interpretable Image Recognition by Visual Prototype and Textual Concept Alignment”, ACM MM 2024.
Bo Dong, Pichao Wang, Hao Luo, Fan Wang, “Adaptive Query Selection for Camouflaged Instance Segmentation”, ACM MM 2024.
Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao, “Text is MASS: Modelling as Stochastic Embedding for Text-to-Video Retrieval”, CVPR 2024 (Highlight). paper code
Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Jialun Cai, and Nicu Sebe, “Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation”, CVPR 2024 (Highlight). paper code
Yujun Ma, Benjia Zhou, Ruili Wang, Pichao WANG, “Multi-stage Factorized Spatio-Temporal Representation for RGB-D Action and Gesture Recognition”, ACM MM 2023.
Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, and Mike Zheng Shou, “Revisiting Vision Transformer from the View of Path Ensemble”, Oral, ICCV 2023.
Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, Amanmeet Garg, Ashutosh Sanan, Mohamed Omar, “Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment”, Oral, ICCV 2023.
Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin and Mike Zheng Shou, (first two authors make equal contributions), “Making Vision Transformers Efficient from A Token Sparsification View”, CVPR 2023. paper code
Qitao Zhao, Ce Zheng, Mengyuan Liu, Pichao Wang, Chen Chen, “PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation”, CVPR 2023 paper code
Jue Wang, Wentao Zhu, Pichao Wang, Xiang Yu, Linda Liu, Mohamed Omar, and Raffay Hamid, “Selective Structured State-Spaces for Long-Form Video Understanding”, CVPR 2023. paper
Bo Dong, Pichao Wang@, Fan Wang, (@ Corresponding author), “Head-Free Lightweight Semantic Segmentation with Linear Transformer”, AAAI 2023. paper code
Dongyang Li, Hao Luo, Pichao Wang, Zhibin Wang, Shang Liu, Fan Wang, “Frequency Domain Disentanglement for Arbitrary Neural Style Transfer”, AAAI 2023.
Zhenyu Wang, Hao Luo, Pichao Wang, Feng Ding, Fan Wang, Hao Li, “VTC-LFC: Vision Transformer Compression with Low-Frequency Components”, NeurIPS 2022. paper code
Pichao Wang, Xue Wang, Fan Wang, Ming Lin, Shuning Chang, Hao Li, Rong Jin, (first two authors make equal contributions), “KVT: k-NN Attention for Boosting Vision Transformers”, ECCV 2022. paper. code
Zhaoyuan Yin, Pichao Wang@, Fan Wang, Xianzhe Xu, Hanling Zhang, Hao Li, Rong Jin, (@ Corresponding author), “TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation”, ECCV 2022, Oral (2.7% of submitted papers). paper. code
Benjia Zhou, Pichao Wang@, Jun Wan, Yanyan Liang, Fan Wang, Du Zhang, Zhen Lei, Hao Li, Rong Jin, (@ Corresponding author), “Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition”, CVPR 2022. paper. code
Hansheng Chen, Pichao Wang@, Fan Wang, Wei Tian, Lu Xiong, Hao Li, (@ Corresponding author), “EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation”, CVPR 2022, Best Student Paper Award. paper code
Wenhao Li, Hong Liu, Hao Tang, Pichao Wang, Luc Van Gool, “MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation”, CVPR 2022. paper. code
Pichao Wang, Fan Wang, Hao Li, “Image-to-Video Re-Identification via Mutual Discriminative Knowledge Transfer”, ICASSP 2022. paper
Tongkun Xu, Weihua Chen, Pichao Wang, Fan Wang, Hao Li, Rong Jin, “Cdtrans: Cross-domain transformer for unsupervised domain adaptation”, ICLR 2022. paper. code
Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, and Rong Jin,(first two authors make equal contributions), “Scaled relu matters for training vision transformers”, AAAI 2022. paper. video
Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang, “TransReid: Transformer-based Object Re-identification”, ICCV 2021. paper. code
Ming Lin, Pichao Wang, Zhenhong Sun, Hesen Chen, Xiuyu Sun, Qi Qian, Hao Li, and Rong Jin, “Zen-NAS: A Zero-Shot NAS for High-Performance Deep Image Recognition”, ICCV 2021. paper. code
Liang Han, Pichao Wang, Zhaozheng Yin, Fan Wang, and Hao Li, (first two authors make equal contributions) “Exploiting Better Feature Aggregation for Video Object Detection”, ACM MM 2020. paper
Chang Tang, Xinwang Liu, Xinzhong Zhu, En Zhu, Kun Sun, Pichao Wang, Lizhe Wang and Albert Zomaya, “R2MRF: Defocus Blur Detection via Recurrently Refining Multi-scale Residual Features”, AAAI 2020. paper. code
Pichao Wang, Wanqing Li, Jun Wan, Philip Ogunbona, and Xinwang Liu, “Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition”, AAAI 2018, ORAL paper. code
Huogen Wang, Pichao Wang, Zhanjie Song, and Wanqing Li, (first two authors make equal contributions) “Large-scale Multimodal Gesture Recognition Using Heterogeneous Networks”, ICCV 2017. paper. code
Huogen Wang, Pichao Wang, Zhanjie Song, and Wanqing Li, (first two authors make equal contributions) “Large-scale Multimodal Gesture Segmentation and Recognition based on Convolutional Neural Network”, ICCV 2017. paper. code
Pichao Wang, Wanqing Li, Zhimin Gao, Yuyao Zhang, Chang Tang, and Philip Ogunbona, “Scene Flow to Action Map: A New Representation for RGB-D Based Action Recognition with Convolutional Neural Networks”, CVPR 2017. paper
Pichao Wang, Zhaoyang Li, Yonghong Hou, and Wanqing Li, (first two authors make equal contributions) “Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks”, ACM MM 16. paper. code
Pichao Wang, Wanqing Li, Zhimin Gao, Chang Tang, Jing Zhang, and Philip Ogunbona, “ConvNets-Based Action Recognition from Depth Maps Through Virtual Cameras and Pseudocoloring”, ACM MM 15. paper. code

Journal Articles (selected papers, full paper list)

Wenhao Li, Mengyuan Liu, Hong Liu, Pichao Wang, Shijian Lu, and Nicu Sebe, “H2OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers”, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025. project
Hansheng Chen, Wei Tian, Pichao Wang, Fan Wang, Lu Xiong, Hao Li, “EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation”, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024, paper code
Henry Hengyuan Zhao, Pichao Wang, Yuyang Zhao, Hao Luo, Fan Wang, Mike Zheng Shou, “SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels”, International Journal of Computer Vision (IJCV), 2023. paper. code
Jingkai Zhou, Pichao Wang@, Jiasheng Tang, Fan Wang, Qiong Liu, Hao Li, Rong Jin,(@project lead), “What limits the performance of local self-attention?”, International Journal of Computer Vision (IJCV), 2023. paper code
Benjia Zhou, Pichao Wang@, Jun Wan, Liangliang Yan, and Fan Wang, (@ Corresponding author), “A Unified Multimodal De-and Re-coupling Framework for RGB-D Motion Recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023. paper code
Wenhao Li, Hong Liu, Hao Tang, and Pichao Wang, “Multi-Hypothesis Representation Learning for Transformer-Based 3D Human Pose Estimation”, Pattern Recognition, 2023
Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang, “Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation”, IEEE Transactions on Multimedia, 2021. paper. code
Liang Han, Pichao Wang, Zhaozheng Yin, Fan Wang, and Hao Li, (first two authors make equal contributions), “Context and Structure Mining Network for Video Object Detection”, International Journal of Computer Vision (IJCV), 2021. paper
Liang Han, Pichao Wang, Zhaozheng Yin, Fan Wang, and Hao Li, (first two authors make equal contributions), “Class-aware Feature Aggregation Network for Video Object Detection”, IEEE Transactions on Circuits and Systems for Video Technology, 2021. paper
Zitong Yu, Benjia Zhou, Jun Wan, Pichao Wang, Haoyu Chen, Xin Liu, Stan Z Li, and Guoying Zhao, “Searching Multi-Rate and Multi-Modal Temporal Enhanced Network for Gesture Recognition”, IEEE Transactions on Image Processing, 2021. paper. code
Xiangyu Li, Yonghong Hou, Pichao Wang@, Zhimin Gao, Mingliang Xu, and Wanqing Li, (@ Corresponding author), “Trear: Transformer-based RGB-D Egocentric Action Recognition”, IEEE Transactions on Cognitive and Developmental Systems, 2021. paper
Chang Tang, Xinwang Liu, Shan An, and Pichao Wang, “BR2NET: Defocus Blur Detection via Bidirectional Channel Attention Residual Refining Network”, IEEE Transactions on Multimedia, 2020. paper
Chang Tang, Xinwang Liu, Pichao Wang, Changqing Zhang, Miaomiao Li and Lizhe Wang, “Adaptive Hypergraph Embedded Semi-supervised Multi-label Image Annotation”, IEEE Transactions on Multimedia, 2019. paper
Chang Tang, Xinzhong Zhu, Xinwang Liu, Miaomiao Li, Pichao Wang, Changqing Zhang and Lizhe Wang, “Learning Joint Affinity Graph for Multi-view Subspace Clustering”, IEEE Transactions on Multimedia, 2019. paper
Chuankun Li, Yonghong Hou, Pichao Wang@, and Wanqing Li, (@Corresponding author), “Multi-view Based 3D Action Recognition Using Deep Networks”, IEEE Transactions on Human Machine Systems, 2018. paper
Chang Tang, Wanqing Li, Pichao Wang@, and Lizhe Wang, (@ Corresponding author), “Online Human Action Recognition Based on Incremental Learning of Weighted Covariance Descriptors”, Information Sciences, 2018. code
Pichao Wang, Wanqing Li, Philip Ogunbona, Jun Wan and Sergio Escalera, “RGB-D-based Human Motion Recognition with Deep Learning: A Survey”, Computer Vision and Image Understanding, 2018.
Pichao Wang, Wanqing Li, Zhimin Gao, Chang Tang, and Philip Ogunbona, “Depth Pooling Based Large-scale 3D Action Recognition with Deep Convolutional Neural Networks”, IEEE Transactions on Multimedia, 2018. paper. code
Pichao Wang, Wanqing Li, Chuankun Li, and Yonghong Hou, “Action Recognition Based on Joint Trajectory Maps with Convolutional Neural Networks”, Knowledge-Based Systems, 2018. paper. code
Yonghong Hou, Zhaoyang Li, Pichao Wang@ and Wanqing Li, (@ Corresponding author), “Skeleton Optical Spectra Based Action Recognition Using Convolutional Neural Networks”, IEEE Transactions on Circuits and Systems for Video Technology, 2016. code
Jing Zhang, Wanqing Li, Philip Ogunbona, Pichao Wang and Chang Tang, “RGB-D based Action Recognition Datasets: A Survey”, Pattern Recognition, 2016.
Pichao Wang, Wanqing Li, Zhimin Gao, Jing Zhang, Chang Tang, and Philip Ogunbona, “Action Recognition from Depth Maps Using Deep Convolutional Neural Networks”, IEEE Transactions on Human Machine Systems, 2016. code

Preprints (selected papers, full paper list)

Hao Luo, Pichao Wang, Yi Xu, Feng Ding, Yanxin Zhou, Fan Wang, Hao Li, Rong Jin, “Self-Supervised Pre-Training for Transformer-Based Person Re-Identification”, arXiv 2021. paper. code

Academic Activities

Editorial Works:

Associate Editor, Computer Engineering («计算机工程», Chinese Journal), 2019-2024
Editorial Board of Young Scientists, Journal of Computer Science and Technology (JCST) (Tier 1, CCF B), 2022.7.1-2024.6.30
Area Chair, ICME, 2021,2022: Area Chair for Multimedia Analysis and Understanding (main area)

Selected Invited Journal Reviewer:

IEEE Transactions on Pattern Analysis and Machine Intelligence
IEEE Transactions on Image Processing
IEEE Transactions on Circuits and Systems for Video Technology
IEEE Transactions on Cybernetics
IEEE Transactions on Neural Networks and Learning Systems
IEEE Transactions on Industrial Informatics
IEEE Transactions on Audio, Speech and Language Processing
IEEE Transactions on Multimedia
ACM Transactions on Interactive Intelligent Systems
ACM Transactions on Multimedia Computing, Communications and Applications

Conference Technical Program Committee Member:

ICCV 2017, 2019, 2021, 2023
CVPR 2018, 2019, 2020, 2021, 2022, 2023, 2024
ICME 2018, 2019, 2020, 2021, 2022
IJCAI 2018, 2019, 2020, 2021
ACCV 2018, 2020
WACV 2019, 2020, 2021
AAAI 2019, 2020, 2021, 2022
ECCV 2020, 2022, 2024
NeurIPS 2020, 2021, 2022, 2023
ICML 2021, 2022
ICLR 2022, 2023, 2024

Work Experience

2018.9-2022.10: I was employed as a staff/senior algorithm engineer at DAMO Academy, Alibaba Group (U.S.), where I conducted research on various computer vision tasks.
2017.10-2018.6: I was employed as a researcher at Motovis Inc., where I was in charge of fixed-point quantization networks, pixel-level semantic labeling, and intelligent headlight control.
2013.07-2013.11: I was employed as a Software Engineer at Beijing Hanze Technology Co., Ltd., where I was in charge of developing video enhancement software, including FFmpeg video decoding, video enhancement algorithms, denoising algorithms, and H.264 coding with CUDA.
2011.05-2011.12: I was employed as a Software Engineer at Beijing Maystar Information Technology Co., Ltd., where I was in charge of GPU-based decryption of Office documents.
2010.07-2011.06: I participated in a National High-tech R&D Program (863 Program) project at the Institute of Wideband Wireless Communication and 3D Imaging (IWWC&3DI): Multi-view video acquisition and demonstration system (2009AA011507). I was in charge of adaptive definition adjustment and format conversion in the 3D video network, and I implemented the 3D video combination algorithm using parallel methods based on CUDA.

Datasets

FT-HID Dataset: The dataset contains more than 38K RGB samples, 38K depth samples, and about 20K skeleton sequences. It covers 30 classes of daily actions designed specifically for multi-person interaction, captured with a wearable device and three fixed cameras. The FT-HID dataset is comparable to other RGB-D action recognition datasets in the amount of data, the number of action classes, and the number of scenes. It is more complex, as the data is collected from 109 distinct subjects with large variations in gender, age, and physical condition. More importantly, to the best of our knowledge, it is the first large-scale RGB-D dataset collected from both TPV and FPV perspectives for action recognition. Please cite the following paper if you use the dataset: Zihui Guo, Yonghong Hou, Pichao Wang, Zhimin Gao, Mingliang Xu and Wanqing Li, “FT-HID: A Large Scale RGB-D Dataset for First and Third Person Human Interaction Analysis”, Neural Computing and Applications, 2022. paper code
UOW Online Action3D Dataset: This dataset consists of skeleton-video action sequences; the 20 actions are taken from the original MSR Action3D Dataset. The videos were recorded with a Microsoft Kinect V.2 at an average frame rate of 20 frames/s. Twenty participants performed the actions, each according to his or her personal habits. Each participant first repeated each action 3–5 times, then performed the 20 actions continuously in a random order. The continuous sequences can be used for online action recognition testing, and the repeated sequences for training. To support cross-dataset testing, the 20 participants performed the actions in 4 different environments. Please cite the following paper if you use the dataset:
Chang Tang, Wanqing Li, Pichao Wang, Lizhe Wang, “Online Human Action Recognition Based on Incremental Learning of Weighted Covariance Descriptors”, Information Sciences, vol. 467, pp. 219-237, 2018. paper. code
