Optimization time: t₁, t₂, t₃
We present Animus3D, a text-driven 3D animation framework that addresses limitations in current motion generation techniques based on Score Distillation Sampling (SDS). Our approach introduces a novel Motion Score Distillation (MSD) strategy, in which a LoRA-enhanced video diffusion model defines a static source distribution in a canonical space, and a noise inversion technique preserves appearance while guiding motion. To regularize the motion field, we incorporate temporal and spatial terms that reduce geometric distortions across time and space. Additionally, we propose a motion refinement module that extends temporal resolution and enhances motion details, overcoming the constraints of fixed video diffusion resolution.
Our method enables high-quality motion generation for static 3D assets from diverse text prompts, while maintaining visual integrity. We also provide comparisons with state-of-the-art methods, demonstrating that our approach generates more substantial and fine-grained motion than existing baselines.
Given a canonical 3D Gaussian representation \(\mathcal G\), the motion field predicts offsets to the Gaussian properties at each timestamp, yielding the Gaussian sequence \(\mathcal G_{0:T-1}\). Given camera parameters, we can then render images and videos from \(\mathcal G\) and \(\mathcal G_{0:T-1}\). A video diffusion model and the same model with LoRA represent the dynamic distribution and the static distribution, respectively. Given a dynamic text prompt \(c\) and a static text prompt \(c'\), the loss gradient is computed from the two predicted noises and guides the optimization of the motion field. We further design temporal and spatial regularization terms for the motion field to reduce geometric distortions.
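The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, only Gaussian centers are deformed (the text speaks of Gaussian properties in general), the noise predictions are random placeholders rather than outputs of a diffusion model, and combining the two predicted noises as a weighted difference is an assumption borrowed from other score-distillation variants.

```python
import numpy as np

def deform_gaussians(canonical_xyz, offsets):
    # canonical_xyz: (N, 3) centers of the canonical Gaussians G.
    # offsets: (T, N, 3) per-timestamp offsets predicted by the motion field.
    # Returns (T, N, 3): centers of the Gaussian sequence G_{0:T-1}.
    return canonical_xyz[None, :, :] + offsets

def msd_video_gradient(eps_dynamic, eps_static, weight=1.0):
    # ASSUMPTION: the two predicted noises are combined as a weighted
    # difference. The source only states that the gradient is computed
    # with two predicted noises.
    return weight * (eps_dynamic - eps_static)

rng = np.random.default_rng(0)
T, N = 4, 5
canonical_xyz = rng.normal(size=(N, 3))

# Toy motion field output: no motion at t = 0, a small shift along x afterwards.
offsets = np.zeros((T, N, 3))
offsets[1:, :, 0] = 0.1
sequence = deform_gaussians(canonical_xyz, offsets)

# Placeholder noise predictions for a toy rendered video of shape (T, H, W, C).
eps_dynamic = rng.normal(size=(T, 8, 8, 3))  # stands in for video diffusion given c
eps_static = rng.normal(size=(T, 8, 8, 3))   # stands in for the LoRA model given c'
grad = msd_video_gradient(eps_dynamic, eps_static)
```

In a real pipeline this gradient would be backpropagated through a differentiable renderer to the motion field parameters; the canonical Gaussians themselves stay fixed, which is what keeps the appearance of the input asset unchanged.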
"A lion is wandering"
4D-fy [1]
Dream-in-4D [2]
AYG [3]
TC4D [4]
Ours
"A clown fish is swimming"
"A giraffe is walking"
"A astronaut is walking"
AKD [5]
Ours
"A hippo is walking"
"A cat is walking"
"A middle-age knight riding a horse is walking forward, HD, 4K"
"An astronaut shreds an electric guitar with full, unbridled enthusiasm, 4K"
"A humanoid robot is playing the violin with two legs stamping"
"A monkey with hat is playing base guitar excitedly, HD, high-quality"
"A roman soldier raising his right hand, HD, high-quality"
"A roman soldier raising his right leg while raising his both arms, HD, high-quality"
Input 3D model
"... is walking"
"... is dancing"
"... is squating down"
"... is raising his arms"
"A lion is wandering"
SDS (CFG=7.5) [6]
VSD [7]
SDS (CFG=100)
MSD w/o dual distribution modeling
MSD w/o faithful noise
MSD (Ours)
"A elephant is walking"
Static prompt
LoRA approximation (Ours)
"A camel is walking"
w/o TV-3D loss
w/o ARAP loss
Full regularization (Ours)
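The ablation labels above name a TV-3D loss and an ARAP loss without defining them. As a rough illustration of what temporal and spatial regularizers on a motion field can look like, here is a minimal NumPy sketch. Both functions are generic stand-ins chosen for this example, not the paper's formulations: the temporal term penalizes frame-to-frame changes in the offsets, and the spatial term is a simplified distance-preservation penalty over nearest neighbors (a full as-rigid-as-possible energy would also estimate local rotations). All names and the neighbor count `k` are hypothetical.

```python
import numpy as np

def temporal_smoothness(offsets):
    # offsets: (T, N, 3). Mean squared change between consecutive timestamps.
    diffs = offsets[1:] - offsets[:-1]
    return float(np.mean(np.sum(diffs ** 2, axis=-1)))

def distance_preservation(canonical_xyz, deformed_xyz, k=3):
    # canonical_xyz: (N, 3); deformed_xyz: (T, N, 3); requires N > k.
    # Penalizes changes in the distance between each point and its k nearest
    # canonical neighbors. Simplified rigidity term, not a full ARAP energy.
    d_canon = np.linalg.norm(
        canonical_xyz[:, None, :] - canonical_xyz[None, :, :], axis=-1
    )                                                   # (N, N)
    nbrs = np.argsort(d_canon, axis=1)[:, 1:k + 1]      # (N, k), skipping self
    ref = np.take_along_axis(d_canon, nbrs, axis=1)     # (N, k)
    neighbor_pos = deformed_xyz[:, nbrs]                # (T, N, k, 3)
    cur = np.linalg.norm(deformed_xyz[:, :, None, :] - neighbor_pos, axis=-1)
    return float(np.mean((cur - ref[None]) ** 2))

rng = np.random.default_rng(0)
T, N = 4, 6
canonical_xyz = rng.normal(size=(N, 3))

# Rigid translation that is constant over time: both penalties vanish.
rigid_offsets = np.broadcast_to(np.array([0.5, 0.0, 0.0]), (T, N, 3))
rigid_seq = canonical_xyz[None] + rigid_offsets

# Uniform scaling that grows over time: both penalties are positive.
scales = np.linspace(1.0, 2.0, T)[:, None, None]
scaled_seq = canonical_xyz[None] * scales
scaled_offsets = scaled_seq - canonical_xyz[None]
```

A rigid, time-constant translation is free under both terms, while a stretch that changes over time is penalized by both, which is the qualitative behavior one would want from regularizers meant to reduce geometric distortion.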
[1] Bahmani et al., 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling, CVPR 2024.
[2] Zheng et al., A Unified Approach for Text- and Image-guided 4D Scene Generation, CVPR 2024.
[3] Ling et al., Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models, CVPR 2024.
[4] Bahmani et al., TC4D: Trajectory-Conditioned Text-to-4D Generation, ECCV 2024.
[5] Li et al., Articulated Kinematics Distillation from Video Diffusion Models, CVPR 2025.
[6] Poole et al., DreamFusion: Text-to-3D using 2D Diffusion, ICLR 2023.
[7] Wang et al., ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation, NeurIPS 2023.
@inproceedings{sun2025animus3d,
author = {Qi Sun and Can Wang and Jiaxiang Shang and Wensen Feng and Jing Liao},
title = {Animus3D: Text-driven 3D Animation via Motion Score Distillation},
booktitle = {ACM SIGGRAPH Asia},
year = {2025},
}