The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance.
Previous diffusion-based methods rely primarily on separate spatial and temporal attention mechanisms within a 3D U-Net. In contrast, state-of-the-art Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. This entanglement of the spatial and temporal dimensions makes decoupling motion from appearance more challenging for DiT models.
In this paper, we propose DeT, a method that adapts DiT models to improve their motion transfer capability. Our approach introduces a simple yet effective temporal kernel that smooths DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. At the same time, the temporal kernel captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer, together with a hybrid motion fidelity metric that considers both the global and local similarity of motion, providing a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
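To illustrate the idea of supervision along trajectories, the following is a minimal sketch rather than the exact loss used in the paper: it bilinearly samples latent features at pre-computed trajectory points (e.g., from CoTracker) and penalizes their deviation from the per-trajectory temporal mean. The function name `trajectory_consistency_loss` and the tensor layouts are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def trajectory_consistency_loss(latents: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
    """Sketch of a trajectory-based latent supervision term (assumed layouts).

    latents: (T, C, H, W) latent features for T frames.
    tracks:  (N, T, 2) per-point (x, y) coordinates normalized to [-1, 1].
    """
    T, C, H, W = latents.shape
    # Build a sampling grid of shape (T, N, 1, 2) and bilinearly sample
    # one feature vector per tracked point per frame.
    grid = tracks.permute(1, 0, 2).unsqueeze(2)                 # (T, N, 1, 2)
    feats = F.grid_sample(latents, grid, align_corners=True)    # (T, C, N, 1)
    feats = feats.squeeze(-1).permute(2, 0, 1)                  # (N, T, C)
    # Penalize deviation of each point's feature from its temporal mean,
    # pulling features along a trajectory toward a shared representation.
    mean_feat = feats.mean(dim=1, keepdim=True)                 # (N, 1, C)
    return ((feats - mean_feat) ** 2).mean()
```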
We first implement two existing methods compatible with DiT models as baselines. Through the toy experiments shown in Fig. 2(a), we observe that these baselines struggle to control the background through text. By visualizing the outputs of 3D full attention in Fig. 1, we find that foreground and background features are difficult to distinguish, indicating that foreground motion and background appearance are not correctly decoupled. We argue that this is due to temporal inconsistencies of the background features during the denoising process, as shown in Fig. 2(a), which make it challenging to differentiate foreground from background in certain frames.
To encourage the model to decouple background appearance from foreground motion, we smooth the features in 3D full attention along the temporal dimension, which yields a clearer separation between foreground and background.
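A minimal sketch of this temporal smoothing is shown below, assuming frame-major token ordering and a simple depthwise box filter as the temporal kernel (the paper's kernel may differ); `temporal_smooth`, `num_frames`, and `kernel_size` are illustrative names.

```python
import torch
import torch.nn.functional as F


def temporal_smooth(tokens: torch.Tensor, num_frames: int, kernel_size: int = 3) -> torch.Tensor:
    """tokens: (B, T*S, C) video tokens in frame-major order, S spatial tokens per frame."""
    B, L, C = tokens.shape
    S = L // num_frames
    x = tokens.view(B, num_frames, S, C).permute(0, 2, 3, 1)    # (B, S, C, T)
    x = x.reshape(B * S, C, num_frames)                          # one temporal signal per spatial location
    # Depthwise box filter over time: a simple instance of a temporal smoothing kernel.
    kernel = torch.ones(C, 1, kernel_size, device=x.device, dtype=x.dtype) / kernel_size
    x = F.conv1d(x, kernel, padding=kernel_size // 2, groups=C)
    return x.view(B, S, C, num_frames).permute(0, 3, 1, 2).reshape(B, L, C)
```

Such a filter could be applied to the attention features inside each DiT block; the kernel size trades off how aggressively high-frequency temporal variation is suppressed.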
We construct a more general benchmark based on the DAVIS and YouTube-VOS datasets, which includes 100 high-quality videos with 500 evaluation prompts. We use Qwen2.5-VL-7B to caption the videos. Based on each caption, we generate five evaluation text prompts with Qwen2.5-14B, swapping the foreground and background while keeping the verb unchanged. Additionally, we use SAM and CoTracker to annotate masks and trajectories for the foreground in the videos.
We then perform automatic K-means clustering of all trajectories, selecting the number of clusters with the silhouette coefficient, and categorize the motion in each video into easy, medium, and hard levels based on the resulting number of clusters. We visualize the motion distribution of the evaluation prompts in Fig. (a), present examples of the difficulty division in Fig. (b), and visualize the difficulty distribution in Fig. (c).
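A minimal sketch of this difficulty labeling step, assuming flattened (x, y) trajectories as clustering features; the cluster-count thresholds and the search range `max_k` are illustrative, not the benchmark's exact values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def difficulty_level(tracks: np.ndarray, max_k: int = 8) -> str:
    """tracks: (N, T, 2) foreground point trajectories, e.g., from CoTracker."""
    feats = tracks.reshape(len(tracks), -1)              # flatten each trajectory
    best_k, best_score = 2, -1.0
    # Pick the number of clusters that maximizes the silhouette coefficient.
    for k in range(2, min(max_k, len(tracks) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
        score = silhouette_score(feats, labels)
        if score > best_score:
            best_k, best_score = k, score
    # Bin the video by cluster count (thresholds are illustrative).
    if best_k <= 2:
        return "easy"
    if best_k <= 4:
        return "medium"
    return "hard"
```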
Our method with HunyuanVideo achieves the highest motion fidelity. While MOFT scores slightly higher in edit fidelity, our method achieves the best balance between edit fidelity and motion fidelity. Additionally, we adapt MotionInversion, DreamBooth, and DMT to the DiT model, all of which underperform our method across all metrics. Furthermore, for previous methods, motion fidelity decreases as difficulty increases while edit fidelity improves, indicating that our difficulty division is reasonable. Notably, our method scores highest at the medium difficulty level, where its improvement over the baselines is most pronounced.
Our method accurately transfers the motion from the source video without overfitting to the appearance. This enables flexible text control over both the foreground and background. Additionally, our method supports motion transfer across different categories, such as from a human to a panda or from a train to a boat.
Compared with other motion transfer methods, our method accurately transfers motion patterns while allowing flexible text-based control over both the foreground and background appearance.