Muhammad Usama Saleem

Ph.D. Student

University of North Carolina at Charlotte

About me

I am a Ph.D. candidate in Computer Science at the University of North Carolina at Charlotte, supervised by Dr. Pu Wang in the GENIUS Lab. In industry, I work as a researcher with the Multimodal GenAI teams at Amazon and Lowe’s, where I develop large-scale multimodal language models (MLLMs) to improve operational efficiency and customer experience in complex, real-world environments. I also joined Google as a Research Scientist Intern on the Extended Reality (AR/VR) team, where I work on advancing multimodal and generative AI for immersive technologies.

Research Interests

My research focuses on building multimodal foundation models that unify real-time perception with high-fidelity synthesis. I aim to develop context-aware AI systems capable of perceiving, reconstructing, and interacting with complex human behavior across both physical and virtual environments. My work centers on multimodal motion synthesis frameworks that enable controllable, high-quality 3D human animation for real-time applications, as well as 3D human pose estimation and mesh reconstruction using generative masked modeling. Ultimately, I seek to leverage these generative foundations to create AI systems that can both understand human behavior in the physical world and synthesize interactive digital counterparts within immersive XR environments.

If you have any research opportunities or open positions, please feel free to reach out at msaleem2@charlotte.edu.

News

Feb 2026: A paper on “LiveGesture: Streamable Co-Speech Gesture Generation Model” accepted at CVPR 2026!
Nov 2025: “Monocular Models are Strong Learners for Multi-View Human Mesh Recovery” will be available on arXiv!
Nov 2025: “Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior” was accepted to AAAI 2026!
October 2025: Joined Google as a Research Scientist Intern on the Extended Reality (AR/VR) team!
October 2025: MaskControl paper selected for Oral Presentation and 🏆 Award Candidate at ICCV 2025!
Aug 2025: Available for Research Scientist / Engineer opportunities. Please reach out if there’s a good match.
July 2025: Poster selected for the Amazon WWAS Science Fair in Seattle; presented a next-gen multimodal shopping demo to VPs!
June 2025: Joined Amazon as an Applied Scientist II Intern
June 2025: A paper on “MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild” was accepted to ICCV 2025!
June 2025: A paper on “Spatio-Temporal Control for Masked Motion Synthesis” was accepted to ICCV 2025 (Oral)!
April 2025: “Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior” is now available on arXiv.
Dec 2024: “GenHMR: Generative Human Mesh Recovery” was accepted to AAAI 2025, presented in Philadelphia, and received a travel award.
Oct 2024: “BioPose: Biomechanically-Accurate 3D Pose Estimation from Monocular Videos” was accepted to WACV 2025!
July 2024: “BAMM: Bidirectional Autoregressive Motion Model” was accepted to ECCV 2024!
Sept 2023: Joined Lowe’s as Research Lead of the Computer Vision UNCC Team
June 2023: “Private Data Synthesis from Decentralized Non-IID Data” accepted to IJCNN 2023, presented in Queensland, Australia, and received a $5,500 travel grant!
April 2023: Presented at the SIAM International Conference on Data Mining (SDM’23) Doctoral Forum; awarded a $1,400 NSF travel grant.
July 2022: “Privacy Enhancement for Cloud-Based Few-Shot Learning” accepted to IJCNN 2022!
Jan 2022: “DP-Shield: Face Obfuscation with Differential Privacy” accepted to EDBT 2022!

Recent Publications

LiveGesture: Streamable Co-Speech Gesture Generation Model

CVPR 2026

Muhammad Usama Saleem, Mayur Jagdishbhai Patel, Ekkasit Pinyoanuntapong, Zhongxing Qin, Li Yang, Hongfei Xue, Ahmed Helmy, Chen Chen, Pu Wang

LiveGesture is the first fully streamable, zero–look-ahead co-speech gesture generation framework that produces expressive, region-coordinated full-body motion in real time using a causal SVQ tokenizer and hierarchical autoregressive transformers.

PDF Webpage

Monocular Models are Strong Learners for Multi-View Human Mesh Recovery

arXiv

Muhammad Usama Saleem (co-author)

The M2M-HMR (Monocular to Multi-view Human Mesh Recovery) framework is a training-free approach to multi-view 3D human mesh recovery that leverages pre-trained single-view priors to overcome occlusion and generalization limits. It eliminates the need for multi-view supervision by applying robust prediction fusion followed by geometric test-time optimization.

Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior

AAAI 2026

Foram Shah*, Parshwa Shah*, Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Ahmed Helmy

*Equal Contribution

DanceMosaic is a novel multimodal masked motion framework that fuses text, music, and pose adapters via progressive generative masking, with inference-time optimization for precise, editable motion.

PDF Webpage

MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild

ICCV 2025

Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Mayur Jagdishbhai Patel, Hongfei Xue, Ahmed Helmy, Srijan Das, Pu Wang

MaskHand is a probabilistic masked modeling framework that tokenizes articulations with VQ-MANO and uses a context-aware masked transformer to fuse multi-scale image features and 2D cues for iterative, confidence-guided sampling.

PDF Webpage

MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

ICCV 2025 (Oral) - Best Paper Award Nominee

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, Sergey Tulyakov

MaskControl is a novel masked generative motion model that combines masked consistency and inference-time logit editing in a parallel decoder for fast, high-fidelity, spatially precise motion generation.

PDF Webpage Code

GenHMR: Generative Human Mesh Recovery

AAAI 2025

Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen

GenHMR reframes monocular HMR as an image-conditioned generative task. It employs a VQ-VAE pose tokenizer and a masked transformer to model 2D→3D uncertainty, iteratively sampling high-confidence tokens and refining them with 2D cues for accurate mesh recovery.

PDF Webpage

BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos

WACV 2025

Muhammad Usama Saleem*, Farnoosh Koleini*, Pu Wang, Hongfei Xue, Ahmed Helmy, Abbey Fenwick

BioPose is a novel biomechanics-guided 3D pose estimation framework that combines a multi-query deformable transformer for precise mesh recovery, a neural IK network that enforces anatomical constraints, and 2D-informed iterative pose refinement.

PDF Webpage

BAMM: Bidirectional Autoregressive Motion Model

ECCV 2024

Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Pu Wang, Minwoo Lee, Srijan Das, Chen Chen

BAMM is a novel text-to-motion framework that employs a hybrid-masked self-attention transformer, merging generative masking with autoregression to handle dynamic sequence lengths and enable editable, high-quality motion.

PDF Webpage Code

Private Data Synthesis from Decentralized Non-IID Data

IJCNN 2023

Muhammad Usama Saleem, L. Fan

DPFedProxGAN is a federated, differentially-private GAN that generates realistic synthetic images from non-IID distributed data using local DP and FedProx optimization.

PDF Supplementary Webpage

Privacy Enhancement for Cloud-Based Few-Shot Learning

IJCNN 2022

A. Parnami, Muhammad Usama Saleem, L. Fan, M. Lee

A novel few-shot framework uses a joint privacy–classification loss to learn embeddings that protect image data while maintaining high few-shot accuracy in cloud-based vision.

PDF

DP-Shield: Face Obfuscation with Differential Privacy

EDBT 2022

Muhammad Usama Saleem, D. Reilly, L. Fan

DP-Shield safeguards against unauthorized face recognition by applying differential privacy–based obfuscation and providing image quality and recognition-risk metrics.

PDF Webpage Code

Work Experience

Research Scientist Intern, Extended Reality (AR/VR) Team, Google

Applied Scientist II Intern, Amazon

Research Lead, Computer Vision UNCC Team, Lowe’s
