Animus3D: Text-driven 3D Animation via Motion Score Distillation

SIGGRAPH Asia 2025

¹City University of Hong Kong, ²Central Media Technology Institute, Huawei

Demo Video (Low-res)

Animus3D Teaser

Animus3D transforms a static 3D object into an animated object sequence from a text description via motion score distillation.

Snapshots at optimization times t₁, t₂, and t₃.

Abstract

We present Animus3D, a text-driven 3D animation framework that addresses limitations of current motion generation techniques based on Score Distillation Sampling (SDS). Our approach introduces a novel Motion Score Distillation (MSD) strategy, in which a LoRA-enhanced video diffusion model defines a static source distribution in a canonical space and a noise inversion technique preserves appearance while guiding motion. To regularize the motion field, we incorporate temporal and spatial terms that reduce geometric distortions across time and space. Additionally, we propose a motion refinement module that extends temporal resolution and enhances motion details, overcoming the fixed resolution of the video diffusion model.

Our method enables high-quality motion generation for static 3D assets from diverse text prompts, while maintaining visual integrity. We also provide comparisons with state-of-the-art methods, demonstrating that our approach generates more substantial and fine-grained motion than existing baselines.

Pipeline

Animus3D Pipeline

Given a canonical 3D Gaussian representation \(\mathcal G\), the motion field predicts offsets of the Gaussian properties at each timestamp, yielding the Gaussian sequence \(\mathcal G_{0:T-1}\). Given camera parameters, we render an image from \(\mathcal G\) and a video from \(\mathcal G_{0:T-1}\). A video diffusion model and its LoRA-tuned counterpart model the dynamic and static distributions, respectively. Given the dynamic text prompt \(c\) and the static text prompt \(c'\), the loss gradient is computed from the two predicted noises and guides the optimization of the motion field. We further design temporal and spatial regularization terms for the motion field to improve its quality.
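
Conceptually, the update can be sketched in an SDS-like form (illustrative only; the exact weighting and conditioning of each branch follow the paper):

\[
\nabla_{\theta}\mathcal{L}_{\mathrm{MSD}} \approx \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_{\phi}(\mathbf{x}_t;\, c,\, t) - \epsilon_{\phi,\mathrm{LoRA}}(\mathbf{x}'_t;\, c',\, t)\big)\, \frac{\partial \mathbf{x}}{\partial \theta} \right],
\]

where \(\mathbf{x}\) is the video rendered from \(\mathcal G_{0:T-1}\), \(\mathbf{x}_t\) and \(\mathbf{x}'_t\) are the noised latents of the dynamic and canonical renderings at diffusion timestep \(t\), \(\epsilon_{\phi}\) is the video diffusion model conditioned on the dynamic prompt \(c\), \(\epsilon_{\phi,\mathrm{LoRA}}\) is its LoRA-tuned static counterpart conditioned on \(c'\), \(w(t)\) is a timestep weighting, and \(\theta\) denotes the motion-field parameters.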

Results

1. Comparison with the state-of-the-art

"A lion is wandering"

Comparison Image

4D-fy [1]

Dream-in-4D [2]

AYG [3]

TC4D [4]

Ours

"A clown fish is swimming"

Comparison Image

"A giraffe is walking"

Comparison Image

"A astronaut is walking"

Comparison Image

AKD [5]

Ours

"A hippo is walking"

Comparison Image

"A cat is walking"

Comparison Image

2. More results on text-driven 3D Animation

Complex Cases

"A middle-age knight riding a horse is walking forward, HD, 4K"

"An astronaut shreds an electric guitar with full, unbridled enthusiasm, 4K"

"A humanoid robot is playing the violin with two legs stamping"

"A monkey with hat is playing base guitar excitedly, HD, high-quality"

"A roman soldier raising his right hand, HD, high-quality"

"A roman soldier raising his right leg while raising his both arms, HD, high-quality"

3. 3D Animation results with different text prompt

Input 3D model

Comparison Image

"... is walking"

"... is dancing"

"... is squating down"

"... is raising his arms"

Input 3D model

Comparison Image

Input 3D model

Comparison Image

Input 3D model

Comparison Image

Input 3D model

Comparison Image

Ablation Studies

on score distillation methods

"A lion is wandering"

Lion

SDS (CFG=7.5) [6]

VSD [7]

SDS (CFG=100)

MSD w/o dual distribution modeling

MSD w/o faithful noise

MSD (Ours)
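
This ablation isolates the two ingredients of MSD: dual distribution modeling (the anchor is the noise predicted by the static LoRA branch rather than the injected Gaussian noise used in SDS) and faithful noise (the noise is obtained by inversion so the noisy latents stay consistent with the object's appearance). The PyTorch-style sketch below contrasts the residuals driving the update in SDS and in a simplified view of MSD; the names (dynamic_unet, static_lora_unet, add_noise, invert) are hypothetical placeholders, and the exact inputs to each branch are assumptions rather than the exact implementation.

import torch

def sds_residual(dynamic_unet, add_noise, video_latents, t, dynamic_emb):
    # Plain SDS: the anchor is the Gaussian noise injected into the rendering.
    noise = torch.randn_like(video_latents)
    noisy = add_noise(video_latents, noise, t)
    return dynamic_unet(noisy, t, dynamic_emb) - noise

def msd_residual(dynamic_unet, static_lora_unet, add_noise, invert,
                 video_latents, canonical_latents, t, dynamic_emb, static_emb):
    # Faithful noise: invert the canonical rendering instead of sampling fresh
    # Gaussian noise, so the noisy latents preserve appearance (assumed reading
    # of the noise inversion step).
    noise = invert(canonical_latents, t) if invert is not None else torch.randn_like(video_latents)
    noisy_dynamic = add_noise(video_latents, noise, t)      # rendered animated video
    noisy_static = add_noise(canonical_latents, noise, t)   # rendered canonical (static) video
    # Dual distribution modeling: the anchor is the noise predicted by a
    # LoRA-tuned copy of the video diffusion model on the static prompt c'.
    pred_dynamic = dynamic_unet(noisy_dynamic, t, dynamic_emb)    # dynamic prompt c
    pred_static = static_lora_unet(noisy_static, t, static_emb)   # static prompt c'
    return pred_dynamic - pred_static

Dropping the static branch falls back to the SDS-style anchor ("w/o dual distribution modeling"); sampling fresh noise instead of inverting corresponds to "w/o faithful noise".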

on static distribution modeling

"A elephant is walking"

Tiger

Static prompt

LoRA approximation (Ours)

on motion regularization

"A camel is walking"

Tiger

w/o TV-3D loss

w/o ARAP loss

Full regularization (Ours)
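
The "w/o TV-3D loss" and "w/o ARAP loss" rows remove the temporal and spatial regularization terms, respectively. As a rough, simplified illustration of what such terms compute (not the exact losses used in the paper; function names and tensor shapes are illustrative assumptions), the sketch below penalizes frame-to-frame variation of the predicted offsets and changes in distances between neighboring Gaussians under deformation.

import torch

def tv3d_loss(offsets):
    # offsets: (T, N, 3) per-frame position offsets of the N Gaussians.
    # Temporal total variation: discourage abrupt changes between frames.
    return (offsets[1:] - offsets[:-1]).abs().mean()

def arap_loss(canonical_xyz, deformed_xyz, neighbor_idx):
    # canonical_xyz: (N, 3) Gaussian centers in the canonical space.
    # deformed_xyz:  (T, N, 3) centers after applying the motion field.
    # neighbor_idx:  (N, K) indices of each Gaussian's K nearest neighbors,
    #                built once on the canonical Gaussians.
    # Preserve edge lengths to neighbors under deformation (a simplified,
    # rotation-free as-rigid-as-possible-style term).
    can_edges = canonical_xyz[:, None, :] - canonical_xyz[neighbor_idx]      # (N, K, 3)
    def_edges = deformed_xyz[:, :, None, :] - deformed_xyz[:, neighbor_idx]  # (T, N, K, 3)
    return (def_edges.norm(dim=-1) - can_edges.norm(dim=-1)).pow(2).mean()

In practice, both terms would be added to the distillation loss with small weights.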

BibTeX

@inproceedings{sun2025animus3d,
  author    = {Qi Sun and Can Wang and Jiaxiang Shang and Wensen Feng and Jing Liao},
  title     = {Animus3D: Text-driven 3D Animation via Motion Score Distillation},
  booktitle = {ACM SIGGRAPH Asia},
  year      = {2025},
}