A presenter at a conference stands by a large screen displaying a slide titled "VideoPoet: LLM for Video Generation." The slide contains diagrams and text explaining a deep learning model that integrates video, audio, and text to generate media content. Attendees sit on the floor and in chairs, attentively observing the presentation. The room is well-lit with modern design elements, featuring white walls and a carpeted floor. A projector can be seen on a table in the foreground, indicating the use of visual aids to support the technical discussion. The atmosphere seems focused and educational, as participants engage with the latest advancements in machine learning.

Text transcribed from the image:

VideoPoet: LLM for video generation

[Diagram: the VideoPoet LLM takes a prefix of text, visual, and audio tokens, delimited by special begin/end-of-modality tokens, with bidirectional attention over the prefix; output tokens are generated autoregressively. Inputs (image, depth & optical flow, cropped or masked video, audio) are tokenized by a MAGVIT-v2 encoder and a SoundStream encoder; generated visual and audio tokens are decoded back into video and audio by the MAGVIT-v2 and SoundStream decoders.]

• Tokenizers: MAGVIT-v2 (video) and SoundStream (audio) convert media into discrete codes compatible with text-based models.
• Autoregressive Model: learns across video, image, audio, and text to predict the next token.
• Multimodal Learning: supports text-to-video, image-to-video, video continuation, inpainting, stylization, video-to-audio, and zero-shot tasks.

Kondratyuk et al., "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML 2024. Copyright © Mike Shou, NUS.
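
The slide bullets summarize the core mechanism: media tokenizers produce discrete codes, and a single decoder-only transformer models them with next-token prediction, using bidirectional attention over the conditioning prefix as the diagram indicates. Below is a minimal sketch of those two ideas, assuming hypothetical special-token ids and helper names rather than VideoPoet's actual vocabulary or code.

```python
import torch

# Assumed special-token ids marking modality boundaries (illustrative only;
# not VideoPoet's real vocabulary layout).
BOS, BOT, EOT, BOV, EOV, BOA, EOA = range(7)

def build_sequence(text_tokens, visual_tokens, audio_tokens):
    """Flatten discrete tokens from all modalities into one sequence so a
    single decoder-only transformer can model them with next-token prediction."""
    return torch.tensor(
        [BOS, BOT, *text_tokens, EOT,
         BOV, *visual_tokens, EOV,
         BOA, *audio_tokens, EOA]
    )

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Prefix-LM attention mask: conditioning tokens attend bidirectionally to
    one another, while generated tokens attend causally (to the prefix and to
    earlier outputs only)."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

# Example: condition on a text prompt, generate visual and audio tokens.
seq = build_sequence(text_tokens=[101, 102], visual_tokens=[7, 8, 9], audio_tokens=[3, 4])
mask = prefix_lm_mask(seq_len=len(seq), prefix_len=5)  # prefix = BOS, BOT, text, EOT
```

Because every modality shares one token space, choosing which segments sit in the prefix versus the generated suffix is what yields the different tasks listed above (text-to-video, video-to-audio, inpainting, and so on).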