Commit 04f9d2b

zhangjiewu, wjay, yiyixuxu, and sayakpaul authored
add ChronoEdit (#12593)
* add ChronoEdit
* add ref to original function & remove wan2.2 logics
* Update src/diffusers/pipelines/chronoedit/pipeline_chronoedit.py
  Co-authored-by: YiYi Xu <yixu310@gmail.com>
* Update src/diffusers/pipelines/chronoedit/pipeline_chronoedit.py
  Co-authored-by: YiYi Xu <yixu310@gmail.com>
* add ChronoeEdit test
* add docs
* add docs
* make fix-copies
* fix chronoedit test

---------

Co-authored-by: wjay <wjay@nvidia.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
1 parent bc8fd86 commit 04f9d2b

15 files changed (+1961, -0 lines)


docs/source/en/_toctree.yml

Lines changed: 4 additions & 0 deletions
@@ -329,6 +329,8 @@
   title: BriaTransformer2DModel
 - local: api/models/chroma_transformer
   title: ChromaTransformer2DModel
+- local: api/models/chronoedit_transformer_3d
+  title: ChronoEditTransformer3DModel
 - local: api/models/cogvideox_transformer3d
   title: CogVideoXTransformer3DModel
 - local: api/models/cogview3plus_transformer2d
@@ -628,6 +630,8 @@
 - sections:
   - local: api/pipelines/allegro
     title: Allegro
+  - local: api/pipelines/chronoedit
+    title: ChronoEdit
   - local: api/pipelines/cogvideox
     title: CogVideoX
   - local: api/pipelines/consisid
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
<!-- Copyright 2025 The ChronoEdit Team and HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ChronoEditTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in [ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) from NVIDIA and University of Toronto, by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling.

> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import ChronoEditTransformer3DModel

transformer = ChronoEditTransformer3DModel.from_pretrained("nvidia/ChronoEdit-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## ChronoEditTransformer3DModel

[[autodoc]] ChronoEditTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
<!-- Copyright 2025 The ChronoEdit Team and HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

<div style="float: right;">
  <div class="flex flex-wrap space-x-1">
    <a href="https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference" target="_blank" rel="noopener">
      <img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
    </a>
  </div>
</div>

# ChronoEdit

[ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) from NVIDIA and University of Toronto, by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling.

> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.

*Recent advances in large generative models have greatly enhanced both image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Project page for code and models: [this https URL](https://research.nvidia.com/labs/toronto-ai/chronoedit).*

The ChronoEdit pipeline is developed by the ChronoEdit Team. The original code is available on [GitHub](https://github.com/nv-tlabs/ChronoEdit), and pretrained models can be found in the [nvidia/ChronoEdit](https://huggingface.co/collections/nvidia/chronoedit) collection on Hugging Face.


### Image Editing

```py
import torch
import numpy as np
from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel
from PIL import Image

# Load the image encoder and VAE in float32 and the transformer in bfloat16.
model_id = "nvidia/ChronoEdit-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png"
)
# Resize to roughly max_area pixels, keeping the aspect ratio and rounding
# both sides down to a multiple of mod_value.
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
print("width", width, "height", height)
image = image.resize((width, height))
prompt = (
"The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. "
"The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood."
)

# Generate a short 5-frame clip without temporal reasoning; the last frame is saved as the edited image.
output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=5,
    num_inference_steps=50,
    guidance_scale=5.0,
    enable_temporal_reasoning=False,
    num_temporal_reasoning_steps=0,
).frames[0]
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```
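The resolution arithmetic in the example above can be tried in isolation without loading the model. The following is a minimal sketch, not part of diffusers: `fit_resolution` is a hypothetical helper name, and the default `mod_value=16` is an assumed placeholder for `pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]`, which should be read from the loaded pipeline in practice.

```python
import math


def fit_resolution(image_height, image_width, max_area=720 * 1280, mod_value=16):
    """Return (height, width) near max_area pixels, keeping the aspect ratio.

    Both sides are rounded down to a multiple of mod_value.
    mod_value=16 is an assumption for illustration; take the real value
    from the pipeline as the example above does.
    """
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width


print(fit_resolution(720, 1280))   # landscape input
print(fit_resolution(1024, 1024))  # square input
```

Under these assumptions a 720x1280 input keeps its size and a 1024x1024 input maps to 960x960, since 960 * 960 equals the 720 * 1280 pixel budget.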

Optionally, enable **temporal reasoning** for improved physical consistency. This snippet reuses `pipe`, `image`, `prompt`, `height`, and `width` from the example above:

```py
output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=29,
    num_inference_steps=50,
    guidance_scale=5.0,
    enable_temporal_reasoning=True,
    num_temporal_reasoning_steps=50,
).frames[0]
# Save the full editing trajectory as a video and the last frame as the edited image.
export_to_video(output, "output.mp4", fps=16)
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```

### Inference with 8-Step Distillation LoRA

```py
import torch
import numpy as np
from diffusers import AutoencoderKLWan, ChronoEditPipeline, ChronoEditTransformer3DModel, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image
from huggingface_hub import hf_hub_download
from transformers import CLIPVisionModel
from PIL import Image

model_id = "nvidia/ChronoEdit-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
# Download the distillation LoRA from the model repo, fuse it, and switch the scheduler.
lora_path = hf_hub_download(repo_id=model_id, filename="lora/chronoedit_distill_lora.safetensors")
pipe.load_lora_weights(lora_path)
pipe.fuse_lora(lora_scale=1.0)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=2.0)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
print("width", width, "height", height)
image = image.resize((width, height))
prompt = (
"The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. "
"The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood."
)

# The distilled setup runs 8 inference steps with guidance_scale=1.0.
output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=5,
    num_inference_steps=8,
    guidance_scale=1.0,
    enable_temporal_reasoning=False,
    num_temporal_reasoning_steps=0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```
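The three `pipe(...)` calls on this page differ only in a handful of arguments. As a reading aid, the sketch below collects those values in a plain dictionary. `CHRONOEDIT_PRESETS` and `call_kwargs` are hypothetical names that do not exist in diffusers, and the numbers are copied from the examples above rather than taken from any official recommendation.

```python
# Argument sets copied from the three examples on this page (illustrative only).
CHRONOEDIT_PRESETS = {
    "edit": dict(
        num_frames=5,
        num_inference_steps=50,
        guidance_scale=5.0,
        enable_temporal_reasoning=False,
        num_temporal_reasoning_steps=0,
    ),
    "temporal_reasoning": dict(
        num_frames=29,
        num_inference_steps=50,
        guidance_scale=5.0,
        enable_temporal_reasoning=True,
        num_temporal_reasoning_steps=50,
    ),
    "distill_lora": dict(
        num_frames=5,
        num_inference_steps=8,
        guidance_scale=1.0,
        enable_temporal_reasoning=False,
        num_temporal_reasoning_steps=0,
    ),
}


def call_kwargs(preset, **overrides):
    """Return a fresh kwargs dict for the named preset, with optional overrides."""
    if preset not in CHRONOEDIT_PRESETS:
        raise KeyError(f"unknown preset {preset!r}; choose from {sorted(CHRONOEDIT_PRESETS)}")
    kwargs = dict(CHRONOEDIT_PRESETS[preset])  # copy so the table is never mutated
    kwargs.update(overrides)
    return kwargs


print(call_kwargs("distill_lora"))
```

With a loaded pipeline, a call could then look like `pipe(image=image, prompt=prompt, height=height, width=width, **call_kwargs("temporal_reasoning"))`. The `distill_lora` values only make sense after the LoRA has been loaded and the scheduler replaced as shown in the example above.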

## ChronoEditPipeline

[[autodoc]] ChronoEditPipeline
  - all
  - __call__

## ChronoEditPipelineOutput

[[autodoc]] pipelines.chronoedit.pipeline_output.ChronoEditPipelineOutput

src/diffusers/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -202,6 +202,7 @@
 "BriaTransformer2DModel",
 "CacheMixin",
 "ChromaTransformer2DModel",
+"ChronoEditTransformer3DModel",
 "CogVideoXTransformer3DModel",
 "CogView3PlusTransformer2DModel",
 "CogView4Transformer2DModel",
@@ -436,6 +437,7 @@
 "BriaPipeline",
 "ChromaImg2ImgPipeline",
 "ChromaPipeline",
+"ChronoEditPipeline",
 "CLIPImageProjection",
 "CogVideoXFunControlPipeline",
 "CogVideoXImageToVideoPipeline",
@@ -909,6 +911,7 @@
 BriaTransformer2DModel,
 CacheMixin,
 ChromaTransformer2DModel,
+ChronoEditTransformer3DModel,
 CogVideoXTransformer3DModel,
 CogView3PlusTransformer2DModel,
 CogView4Transformer2DModel,
@@ -1113,6 +1116,7 @@
 BriaPipeline,
 ChromaImg2ImgPipeline,
 ChromaPipeline,
+ChronoEditPipeline,
 CLIPImageProjection,
 CogVideoXFunControlPipeline,
 CogVideoXImageToVideoPipeline,

src/diffusers/models/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -86,6 +86,7 @@
 _import_structure["transformers.transformer_bria"] = ["BriaTransformer2DModel"]
 _import_structure["transformers.transformer_bria_fibo"] = ["BriaFiboTransformer2DModel"]
 _import_structure["transformers.transformer_chroma"] = ["ChromaTransformer2DModel"]
+_import_structure["transformers.transformer_chronoedit"] = ["ChronoEditTransformer3DModel"]
 _import_structure["transformers.transformer_cogview3plus"] = ["CogView3PlusTransformer2DModel"]
 _import_structure["transformers.transformer_cogview4"] = ["CogView4Transformer2DModel"]
 _import_structure["transformers.transformer_cosmos"] = ["CosmosTransformer3DModel"]
@@ -179,6 +180,7 @@
 BriaFiboTransformer2DModel,
 BriaTransformer2DModel,
 ChromaTransformer2DModel,
+ChronoEditTransformer3DModel,
 CogVideoXTransformer3DModel,
 CogView3PlusTransformer2DModel,
 CogView4Transformer2DModel,
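
The hunks above register the new class twice: once as a string in `_import_structure`, keyed by submodule, and once in a second list of names later in the file. One plausible reading of the string table is that it lets a package resolve names on first access instead of importing every submodule up front. The sketch below illustrates that idea with the standard library only. `LazyNamespace` is a hypothetical class written for this note, not the mechanism diffusers actually uses, and the stdlib modules stand in for real submodules.

```python
import importlib
import types


class LazyNamespace(types.ModuleType):
    """Illustrative lazy module: imports a submodule only when one of its names is used."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Invert {module: [names]} into {name: module} for direct lookup.
        self._name_to_module = {
            attr: module for module, attrs in import_structure.items() for attr in attrs
        }
        self.loaded = []  # modules that have actually been imported so far

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        name_to_module = self.__dict__.get("_name_to_module", {})
        if attr not in name_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module_name = name_to_module[attr]
        module = importlib.import_module(module_name)
        self.loaded.append(module_name)
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later lookups skip __getattr__
        return value


# Stand-in table in the same {module: [exported names]} shape as the diff.
demo = LazyNamespace("demo", {"math": ["sqrt"], "json": ["dumps"]})
print(demo.loaded)      # nothing imported yet
print(demo.sqrt(9.0))   # triggers the import of "math"
print(demo.loaded)
```

In this sketch, adding a new export means adding one string to the table, which mirrors the one-line `_import_structure` addition in the hunk.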

src/diffusers/models/transformers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@
 from .transformer_bria import BriaTransformer2DModel
 from .transformer_bria_fibo import BriaFiboTransformer2DModel
 from .transformer_chroma import ChromaTransformer2DModel
+from .transformer_chronoedit import ChronoEditTransformer3DModel
 from .transformer_cogview3plus import CogView3PlusTransformer2DModel
 from .transformer_cogview4 import CogView4Transformer2DModel
 from .transformer_cosmos import CosmosTransformer3DModel
