Nonisotropic Gaussian Diffusion for Realistic 3D Human Motion Prediction

CVPR 2025
Technical University of Munich, Munich Center for Machine Learning

Abstract

Probabilistic human motion prediction aims to forecast multiple possible future movements from past observations. While current approaches report high diversity and realism, they often generate motions with undetected limb stretching and jitter. To address this, we introduce SkeletonDiffusion, a latent diffusion model that embeds an explicit inductive bias on the human body within its architecture and training. Our model is trained with a novel nonisotropic Gaussian diffusion formulation that aligns with the natural kinematic structure of the human skeleton. Results show that our approach outperforms conventional isotropic alternatives, consistently generating realistic predictions while avoiding artifacts such as limb distortion. Additionally, we identify a limitation in commonly used diversity metrics, which may inadvertently favor models that produce inconsistent limb lengths within the same sequence. SkeletonDiffusion sets a new benchmark on three real-world datasets, outperforming various baselines across multiple evaluation metrics.


From the past, predict the future.

Stochastic Human Motion Prediction

In this work, we address the problem of predicting human motion from observed past movements, known as Human Motion Prediction (HMP). Specifically, given a temporal sequence of human joint positions, we forecast how they evolve in subsequent frames. Rather than predicting a single, deterministic future, we generate a wide range of diverse future motions (Stochastic HMP).

Nonisotropic Gaussian Diffusion

We replace the conventional isotropic Gaussian diffusion training and sampling procedure with a novel nonisotropic formulation that accounts for joint relations directly in the generation process.

In more detail, the forward diffusion process used to train denoising diffusion models adds Gaussian noise to the data at each diffusion timestep. In conventional diffusion approaches, this noise is sampled from a Gaussian distribution whose diagonal covariance Σt is a scalar multiple of the identity, so every dimension receives independent noise of equal variance. Such a process is called isotropic.

Isotropic Gaussian Diffusion
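
For reference, the conventional forward process described above can be written in standard DDPM notation. The symbols β_t (noise schedule) and ᾱ_t (cumulative signal level) are the usual DDPM conventions and are an assumption here, since this page does not define them; in this notation Σt = β_t I.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\, I\right),
\qquad
\bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s).
```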

The isotropic formulation does not take into account that the HMP problem is defined by the skeleton kinematic graph (given by the adjacency matrix A). We exploit this knowledge to define a fixed, non-diagonal noise covariance for the diffusion process, based on correlations ΣN extracted from the adjacency matrix.

Correlation Matrix
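
As a rough illustration of the idea, the following Python sketch turns an adjacency matrix into a valid correlation matrix and draws noise that is correlated across connected joints. It is not the authors' exact construction: the function names, the toy 5-joint chain, the eigenvalue shift used to make the matrix positive definite, and the rescaling to unit diagonal are all assumptions made for this example.

```python
import numpy as np

def correlation_from_adjacency(adj, eps=1e-3):
    """Hypothetical construction: map a skeleton adjacency matrix to a
    positive-definite correlation matrix with unit diagonal."""
    n = adj.shape[0]
    m = adj.astype(float) + np.eye(n)            # add self-connections
    lam_min = np.linalg.eigvalsh(m).min()
    if lam_min < eps:                            # shift the spectrum so all eigenvalues are >= eps
        m = m + (eps - lam_min) * np.eye(n)
    d = 1.0 / np.sqrt(np.diag(m))
    return m * d[:, None] * d[None, :]           # rescale so the diagonal is exactly 1

def sample_correlated_noise(sigma_n, feat_dim, rng):
    """Draw noise of shape (J, feat_dim): correlated across joints,
    independent across feature channels."""
    chol = np.linalg.cholesky(sigma_n)           # sigma_n = chol @ chol.T
    z = rng.standard_normal((sigma_n.shape[0], feat_dim))
    return chol @ z

# Toy 5-joint chain 0-1-2-3-4 (assumption); a real skeleton would use its own graph.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

sigma_n = correlation_from_adjacency(A)
rng = np.random.default_rng(0)
noise = sample_correlated_noise(sigma_n, feat_dim=10000, rng=rng)
```

In this toy setup, neighbouring joints such as 0 and 1 receive positively correlated noise, while joints that are not directly connected, such as 0 and 2, have zero correlation in `sigma_n`.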

From here we derive all necessary equations for diffusion training and sampling. To the best of our knowledge, this is the first nonisotropic formulation for a structured problem.

Nonisotropic Gaussian Diffusion
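
One simple way to write a nonisotropic forward process is sketched below, under the assumption that the fixed joint covariance ΣN simply takes the place of the identity at every step. This is an illustrative special case in standard DDPM notation (β_t, ᾱ_t are assumed symbols); the formulation derived in the paper may schedule the covariance differently.

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\,\Sigma_N\right)
\quad\Longrightarrow\quad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,\Sigma_N\right),
\qquad
\bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s).
```

The closed form follows because the covariances accumulate as (1-β_t)(1-ᾱ_{t-1}) ΣN + β_t ΣN = (1-ᾱ_t) ΣN, exactly as in the isotropic case with ΣN in place of I.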

The correlated noise eases the generation task for the denoiser network: body joints are no longer assumed to be independent of each other (i.i.d.), and the noise suggests that connected joints should be diffused similarly. As shown in our experiments, the nonisotropic formulation achieves better performance than the isotropic approach, requires fewer parameters, and comes at no extra computational cost.

SkeletonDiffusion

We present SkeletonDiffusion, a latent diffusion model that considers the skeleton structure and joint categories throughout the entire network, using a graph convolutional architecture and joint-type attention. In contrast, existing stochastic HMP (SHMP) approaches either ignore the skeleton's graph structure or leverage it only at intermediate stages.

Method inference

Our learned latent space is two-dimensional: one dimension represents the human body joints, and the other holds temporally and spatially compressed features. Since the notion of human body joints is preserved in latent space, we diffuse the feature dimension isotropically, as is conventional in denoising diffusion models, and the joint dimension nonisotropically with our nonisotropic Gaussian diffusion formulation.
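
The mixed treatment of the two latent dimensions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: `forward_noise`, the toy 3-joint covariance, the latent size, and the use of a single cumulative signal level `alpha_bar_t` are hypothetical choices for this example. Multiplying i.i.d. noise by a Cholesky factor along the joint axis yields noise whose covariance is ΣN across joints and the identity across features.

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, chol_joints, rng):
    """Sample a noised latent with joint-correlated, feature-independent noise.

    x0          : clean latent of shape (J, F), joints by features
    alpha_bar_t : cumulative signal level in [0, 1] at the chosen timestep
    chol_joints : Cholesky factor of the (J, J) joint covariance Sigma_N
    """
    z = rng.standard_normal(x0.shape)      # i.i.d. standard normal noise
    eps = chol_joints @ z                  # mix along the joint axis only
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

# Toy example (assumed sizes): 3 joints in a chain, 4 latent features per joint.
sigma_n = np.array([[1.0, 0.5, 0.0],
                    [0.5, 1.0, 0.5],
                    [0.0, 0.5, 1.0]])
chol = np.linalg.cholesky(sigma_n)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 4))
x_t, eps = forward_noise(x0, alpha_bar_t=0.7, chol_joints=chol, rng=rng)
```

Setting `sigma_n` to the identity recovers the conventional isotropic forward process, which makes it easy to compare the two settings with the same code path.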

Hugging Face Demo

We test our model on human poses collected from casual YouTube videos. We use Neural Localizer Fields (NLF) to extract 3D poses from a given input video. Our model is confronted with scenes and actions that were not present at training time. In addition, SkeletonDiffusion has to deal with noisy input, as the pose extraction from NLF is not perfect and the videos naturally contain ambiguities. Even so, SkeletonDiffusion generates plausible and realistic predictions. Try it out yourself in our Hugging Face demo!






Acknowledgements

This work was supported by the ERC Advanced Grant SIMULACRON. We thank Dr. Almut Sophia Koepke, Yuesong Shen, and Shenhan Qian for proofreading and feedback, Lu Sang for the discussion, and Stefania Zunino and the whole CVG team for their support.