A presenter at a conference stands by a large screen displaying a slide titled "VideoPoet: LLM for Video Generation." The slide contains diagrams and text explaining a deep learning model that integrates video, audio, and text to generate media content. Attendees sit on the floor and in chairs, attentively observing the presentation. The room is well-lit with modern design elements, featuring white walls and a carpeted floor. A projector can be seen on a table in the foreground, indicating the use of visual aids to support the technical discussion. The atmosphere seems focused and educational, as participants engage with the latest advancements in machine learning.

Text transcribed from the image:

VideoPoet: LLM for video generation

[Diagram: the VideoPoet LLM takes a prefix of text, visual, and audio tokens, delimited by special begin/end-of-modality tokens, with bidirectional attention over the prefix; output tokens are generated autoregressively. Inputs (image, depth & optical flow, cropped or masked video, audio) are tokenized by a MAGVIT-v2 encoder and a SoundStream encoder; generated visual and audio tokens are decoded back into video and audio by the MAGVIT-v2 and SoundStream decoders.]

• Tokenizers: MAGVIT-v2 (video) and SoundStream (audio) convert media into discrete codes compatible with text-based models.
• Autoregressive Model: learns across video, image, audio, and text to predict the next token.
• Multimodal Learning: supports text-to-video, image-to-video, video continuation, inpainting, stylization, video-to-audio, and zero-shot tasks.

Kondratyuk et al., "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML 2024. Copyright © Mike Shou, NUS.
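
The slide bullets summarize the core mechanism: media tokenizers produce discrete codes, and a single decoder-only transformer models them with next-token prediction, using bidirectional attention over the conditioning prefix as the diagram indicates. Below is a minimal sketch of those two ideas, assuming hypothetical special-token ids and helper names rather than VideoPoet's actual vocabulary or code.

```python
import torch

# Assumed special-token ids marking modality boundaries (illustrative only;
# not VideoPoet's real vocabulary layout).
BOS, BOT, EOT, BOV, EOV, BOA, EOA = range(7)

def build_sequence(text_tokens, visual_tokens, audio_tokens):
    """Flatten discrete tokens from all modalities into one sequence so a
    single decoder-only transformer can model them with next-token prediction."""
    return torch.tensor(
        [BOS, BOT, *text_tokens, EOT,
         BOV, *visual_tokens, EOV,
         BOA, *audio_tokens, EOA]
    )

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Prefix-LM attention mask: conditioning tokens attend bidirectionally to
    one another, while generated tokens attend causally (to the prefix and to
    earlier outputs only)."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

# Example: condition on a text prompt, generate visual and audio tokens.
seq = build_sequence(text_tokens=[101, 102], visual_tokens=[7, 8, 9], audio_tokens=[3, 4])
mask = prefix_lm_mask(seq_len=len(seq), prefix_len=5)  # prefix = BOS, BOT, text, EOT
```

Because every modality shares one token space, choosing which segments sit in the prefix versus the generated suffix is what yields the different tasks listed above (text-to-video, video-to-audio, inpainting, and so on).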