LTX 2 Improve encode_video by Accepting More Input Types
#13057
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sayakpaul left a comment:
Thanks! Left some comments.
```python
video_np = video.cpu().numpy()
"""
Encodes a video with audio using the PyAV library. Based on code from the original LTX-2 repo:
https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/utils/media_io.py
```
(nit): let's put a permalink.
```python
        The sampling rate of the audio waveform. For LTX 2, this is typically 24000 (24 kHz).
    output_path (`str`):
        The path to save the encoded video to.
    video_chunks_number (`int`, *optional*, defaults to `1`):
```
When is this option helpful?
The original LTX-2 code uses a video_chunks_number calculated from the video VAE tiling config, for example in two-stage inference.
For the default num_frames value of 121 and default tiling config TilingConfig.default(), I believe this works out to 3 chunks. The idea seems to be that the chunks correspond to each tiled stride when decoding.
In practice, I haven't had any issues with the current code, which is equivalent to just using one chunk. I don't fully understand the reasoning behind why the original code supports it; my guess is that it is useful for very long videos or if there are compute constraints.
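For a rough sense of the arithmetic, here is a minimal sketch; the temporal stride below is an illustrative assumption (not the actual TilingConfig.default() value), chosen only to reproduce the 3-chunk figure.

```python
import math

num_frames = 121      # pipeline default mentioned above
temporal_stride = 48  # assumed temporal tile stride in pixel frames (illustrative only)
video_chunks_number = math.ceil(num_frames / temporal_stride)
print(video_chunks_number)  # 3 with these illustrative numbers
```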
See #13057 (comment) for a discussion of some complications with supporting video_chunks_number.
```python
    video = torch.from_numpy(video)
elif isinstance(video, np.ndarray):
    # Pipeline output_type="np"
    video = (video * 255).round().astype("uint8")
```
Should we check for the value range before doing this? Just to be safe.
```python
for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
```
Should we let the users control this format? 👀
I think we could allow the users to specify the format, but this would be in tension with value checking as suggested in #13057 (comment): for example, if we always convert denormalized inputs with values in [0, 1] to uint8 values in [0, 255], that conversion only makes sense for a format like rgb24.
We could conditionally convert based on the supplied video_format, but my understanding is that there are a lot of video formats, and I don't think we can anticipate all of the use cases that users may have. So I think we could support a video_format argument with a "use at your own risk" caveat:
```python
elif isinstance(video, np.ndarray):
    # Pipeline output_type="np"
    is_denormalized = np.logical_and(np.zeros_like(video) <= video, video <= np.ones_like(video))
    if np.all(is_denormalized) and video_format == "rgb24":
        video = (video * 255).round().astype("uint8")
    else:
        logger.warning(
            f"The video will be encoded using the input `video` values as-is with format {video_format}. Make sure"
            " the values are in the proper range for the supplied format."
        )
    video = torch.from_numpy(video)
```

An alternative would be to only support "rgb24" as the original LTX-2 code does, with the idea that power users can use their own video encoding code if they have a different use case.
EDIT: the right terminology here might be "pixel format" rather than "video format".
> An alternative would be to only support "rgb24" as the original LTX-2 code does, with the idea that power users can use their own video encoding code if they have a different use case.
Okay let's go with this.
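For reference, a minimal sketch of what the rgb24-only handling could look like, combining the value-range check suggested earlier with the decision above; the helper name is hypothetical and this is not the final PR code.

```python
import numpy as np
import torch


def _ndarray_to_uint8_frames(video: np.ndarray) -> torch.Tensor:
    """Hypothetical helper: prepare pipeline output_type="np" frames for rgb24 encoding."""
    if np.issubdtype(video.dtype, np.floating) and video.min() >= 0.0 and video.max() <= 1.0:
        # Denormalized float frames in [0, 1] -> uint8 frames in [0, 255].
        video = (video * 255).round().astype("uint8")
    elif video.dtype != np.uint8:
        # Reject anything else rather than guessing a pixel format.
        raise ValueError(
            f"Expected float frames in [0, 1] or uint8 frames for rgb24 encoding, got dtype {video.dtype}."
        )
    return torch.from_numpy(video)
```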
```python
    yield first_chunk
    yield from tiles_generator


for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
```
WDYT of getting rid of all_tiles() and doing it like so?
```python
from itertools import chain

for video_chunk in tqdm(chain([first_chunk], video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)
```
This does the right thing but appears not to work well with tqdm, which doesn't update properly from the chain object:
```
33%|████████████████████████████████▋ | 1/3 [00:04<00:09, 4.57s/it]
```
Okay!
Actually, I think #13057 (comment) is wrong - as we generally supply a single torch.Tensor to encode_video (for example from a pipeline output), this line creates an iterator with one element:
diffusers/src/diffusers/pipelines/ltx2/export_utils.py, lines 146 to 149 in 857735f:

```python
if isinstance(video, torch.Tensor):
    video = iter([video])

first_chunk = next(video)
```
So when we then call next(video), the video iterator is exhausted. Even if we set video_chunks_number > 1 in this case, our for loop through first_chunk and video will only yield one element in total, whether we use all_tiles or chain. Thus, the progress bar ends up being wrong, since we tell tqdm that we have video_chunks_number > 1 elements when in fact we only have one.
I think the underlying difference is that the original LTX 2 code will return an iterator over decoded tiles when performing tiled VAE decoding, whereas we will return the whole decoded output as a single tensor with the tiles stitched back together. So maybe it doesn't make sense to support video_chunks_number as this will only work well when we supply an Iterator[torch.Tensor] to encode_video (in the current implementation).
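A small self-contained sketch of the single-tensor case described above (the tensor shape is arbitrary):

```python
import torch

video = torch.zeros(4, 64, 64, 3)  # a single decoded video tensor, as typically returned by the pipeline
video = iter([video])              # encode_video wraps the tensor in a one-element iterator
first_chunk = next(video)          # consumes the only element

print(len(list(video)))  # 0 -> iterating over first_chunk plus the now-exhausted iterator yields one
                         # chunk in total, so tqdm(total=video_chunks_number) with video_chunks_number > 1
                         # never reaches 100%
```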
What does this PR do?
This PR improves the `diffusers.pipelines.ltx2.export_utils.encode_video` function for exporting videos with audio by allowing it to accept input types besides `torch.Tensor`, such as `np.ndarray` and `List[PIL.Image.Image]`. This is meant to make the function easier to use: users will no longer have to do special processing (such as converting the frames to a uint8 `torch.Tensor`) before calling `encode_video`, because the PR handles this logic inside `encode_video`. (Such preprocessing will still work after the change, but will no longer be necessary.) A minimal usage sketch follows the changelist below.

Changelist

- Support `np.ndarray` and `List[PIL.Image.Image]` `video` inputs for `encode_video`, which are assumed to be outputs from pipelines using `output_type`s of `"np"` and `"pil"`, respectively.
- Allow pipeline outputs to be passed to `encode_video` without any special processing.
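The following is a minimal usage sketch of the workflow this PR enables. The checkpoint id, the `.frames` attribute, and the prompt are illustrative assumptions; `output_path` is taken from the docstring quoted earlier in the thread, and the audio/fps arguments of `encode_video` are omitted for brevity.

```python
import torch

from diffusers import DiffusionPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video

# Checkpoint id is illustrative; substitute the actual LTX 2 checkpoint.
pipe = DiffusionPipeline.from_pretrained("Lightricks/LTX-2", torch_dtype=torch.bfloat16).to("cuda")
output = pipe(prompt="A red fox running through snow", output_type="np")

# Before this PR, the float np.ndarray frames had to be converted to a uint8 torch.Tensor by hand;
# after it, the np.ndarray (or a list of PIL images from output_type="pil") can be passed directly.
encode_video(output.frames[0], output_path="fox.mp4")
```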
Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul
@yiyixuxu