Muhammad Usama Saleem

Ph.D. Student

University of North Carolina at Charlotte

About me

I am a Ph.D. candidate in Computer Science at the University of North Carolina at Charlotte, supervised by Dr. Pu Wang in the GENIUS Lab. In industry, I have worked as a researcher with the Multimodal GenAI teams at Amazon and Lowe’s, developing large-scale multimodal large language models (MLLMs) to improve operational efficiency and customer experience in complex, real-world environments. Most recently, I joined Google as a Research Scientist Intern on the Extended Reality (AR/VR) team, where I work on advancing multimodal and generative AI for immersive technologies.

Research Interests

My research focuses on building multimodal foundation models that unify real-time perception with high-fidelity synthesis. I aim to develop context-aware AI systems capable of perceiving, reconstructing, and interacting with complex human behavior across both physical and virtual environments. My work centers on multimodal motion synthesis frameworks that enable controllable, high-quality 3D human animation for real-time applications, as well as 3D human pose estimation and mesh reconstruction using generative masked modeling. Ultimately, I seek to leverage these generative foundations to create AI systems that can both understand human behavior in the physical world and synthesize interactive digital counterparts within immersive XR environments.

If you have any research opportunities or open positions, please feel free to reach out at msaleem2@charlotte.edu.

News

  • Nov 2025: “LiveGesture: Streamable Co-Speech Gesture Generation Model” will be available on arXiv!
  • Nov 2025: “Monocular Models are Strong Learners for Multi-View Human Mesh Recovery” will be available on arXiv!
  • Nov 2025: “Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior” was accepted to AAAI 2026!
  • October 2025: Joined Google as a Research Scientist Intern on the Extended Reality (AR/VR) team!
  • October 2025: The MaskControl paper was selected for an Oral Presentation and as an 🏆 Award Candidate at ICCV 2025!
  • Aug 2025: Available for Research Scientist / Engineer opportunities. Please reach out if there’s a good match.
  • July 2025: Poster selected for the Amazon WWAS Science Fair in Seattle; presented a next-gen multimodal shopping demo to VPs!
  • June 2025: Joined Amazon as an Applied Scientist II Intern!
  • June 2025: “MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild” was accepted to ICCV 2025!
  • June 2025: “Spatio-Temporal Control for Masked Motion Synthesis” was accepted to ICCV 2025 (Oral)!
  • April 2025: “Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior” is now available on arXiv.
  • Dec 2024: “GenHMR: Generative Human Mesh Recovery” was accepted to AAAI 2025, presented in Philadelphia, and received a travel award.
  • Oct 2024: “BioPose: Biomechanically-Accurate 3D Pose Estimation from Monocular Videos” was accepted to WACV 2025!
  • July 2024: “BAMM: Bidirectional Autoregressive Motion Model” was accepted to ECCV 2024!
  • Sept 2023: Joined Lowe’s as Research Lead of the Computer Vision UNCC Team
  • June 2023: “Private Data Synthesis from Decentralized Non-IID Data” accepted to IJCNN 2023, presented in Queensland, Australia, and received a $5500 travel grant!
  • April 2023: Presented at the SIAM International Conference on Data Mining (SDM’23) Doctoral Forum; awarded a $1,400 NSF travel grant.
  • July 2022: “Privacy Enhancement for Cloud-Based Few-Shot Learning” accepted to IJCNN 2022!
  • Jan 2022: “DP-Shield: Face Obfuscation with Differential Privacy” accepted to EDBT 2022!

Work Experience

Research Scientist Intern, Extended Reality (AR/VR) Team

Google, San Francisco, CA Nov. 2025 – Present

Applied Scientist II Intern

Amazon Inc., Boston, MA June 2025 – Aug 2025

Research Lead, Computer Vision UNCC Team

Lowe’s, Charlotte, NC Sept. 2023 – Oct. 2025

Recent Publications

LiveGesture: Streamable Co-Speech Gesture Generation Model

arXiv
LiveGesture is the first fully streamable, zero-look-ahead co-speech gesture generation framework that produces expressive, region-coordinated full-body motion in real time using a causal SVQ tokenizer and hierarchical autoregressive transformers.

Monocular Models are Strong Learners for Multi-View Human Mesh Recovery

arXiv
The M2M-HMR (Monocular-to-Multi-view Human Mesh Recovery) framework presents a training-free approach to multi-view 3D human mesh recovery that leverages pre-trained single-view priors to overcome occlusion and generalization limits. It eliminates the need for multi-view supervision by combining robust prediction fusion with geometric test-time optimization.

Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior

AAAI 2026
*Equal Contribution
DanceMosaic is a multimodal masked motion framework that fuses text, music, and pose adapters via progressive generative masking, with inference-time optimization for precise, editable motion.

MaskHand: Generative Masked Modeling for Robust Hand Mesh Reconstruction in the Wild

ICCV 2025
MaskHand is a probabilistic masked modeling framework that tokenizes hand articulations with VQ-MANO and uses a context-aware masked transformer to fuse multi-scale image features and 2D cues for iterative, confidence-guided sampling.

MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

ICCV 2025 (Oral) - Best Paper Award Nominee
MaskControl is a masked generative motion model that combines masked consistency and inference-time logit editing in a parallel decoder for fast, high-fidelity, spatially precise motion generation.

GenHMR: Generative Human Mesh Recovery

AAAI 2025
GenHMR reframes monocular HMR as an image-conditioned generative task, employing a VQ-VAE pose tokenizer and a masked transformer to model 2D→3D uncertainty, iteratively sampling high-confidence tokens and refining them with 2D cues for accurate mesh recovery.

BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos

WACV 2025
BioPose is a biomechanics-guided 3D pose estimation framework that combines a multi-query deformable transformer for precise mesh recovery, a neural inverse kinematics (IK) network that enforces anatomical constraints, and 2D-informed iterative pose refinement.

BAMM: Bidirectional Autoregressive Motion Model

ECCV 2024
BAMM is a text-to-motion framework that employs a hybrid-masked self-attention transformer, merging generative masking with autoregression to handle dynamic sequence lengths and enable editable, high-quality motion.

Private Data Synthesis from Decentralized Non-IID Data

IJCNN 2023
DPFedProxGAN is a federated, differentially-private GAN that generates realistic synthetic images from non-IID distributed data using local DP and FedProx optimization.

Privacy Enhancement for Cloud-Based Few-Shot Learning

IJCNN 2022
A novel few-shot framework uses a joint privacy–classification loss to learn embeddings that protect image data while maintaining high few-shot accuracy in cloud-based vision.

DP-Shield: Face Obfuscation with Differential Privacy

EDBT 2022
DP-Shield safeguards against unauthorized face recognition by applying differential privacy–based obfuscation and providing image quality and recognition-risk metrics.