Caption: Participants attentively listen to a presentation on "VideoPoet," a language model for video generation, during a tech conference. The speaker explains the architecture and functionality of the model, elucidating how it integrates video, image, audio, and text to predict subsequent tokens. The presentation highlights the capabilities of the MAGVIT-v2 (video) and SoundStream (audio) tokenizers in converting media into discrete codes, enabling various forms of multimedia transformation. The attendees, seated on the floor due to the packed room, follow along as the speaker delves into key aspects such as text-to-video conversion, video continuation, video stylization, and more. The environment reflects the intense interest and engagement in cutting-edge AI and machine learning technologies.

Text transcribed from the image:

VideoPoet: LLM for video generation

[Architecture diagram: text, image, depth and optical flow, cropped or masked video, and audio inputs are encoded into discrete tokens by the MAGVIT-v2 encoder (visual) and SoundStream encoder (audio). The VideoPoet LLM applies bidirectional attention over the conditioning prefix and autoregressively generates output tokens, delimited by special boundary tokens (bos, bot, bov, boa, eos, etc.). The MAGVIT-v2 decoder and SoundStream decoder convert the generated visual and audio tokens back into video output and audio output.]

• Tokenizers: MAGVIT-v2 (video) and SoundStream (audio) convert media into discrete codes compatible with text-based models.
• Autoregressive Model: Learns across video, image, audio, and text to predict the next token.
• Multimodal Learning: Supports text-to-video, image-to-video, video continuation, inpainting, stylization, video-to-audio, and zero-shot tasks.

Kondratyuk et al., "VideoPoet: A Large Language Model for Zero-Shot Video Generation," ICML 2024. Copyright © Mike Shou, NUS.
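The slide describes a single decoder-only transformer operating over one shared discrete vocabulary, with bidirectional attention over the conditioning prefix and causal attention over the autoregressively generated suffix. The sketch below illustrates that setup in PyTorch; it is a minimal illustration, not VideoPoet's actual code: the vocabulary size, the special-token IDs, and the random stand-in token sequences are assumptions (real visual and audio codes would come from the MAGVIT-v2 and SoundStream tokenizers).

```python
# Minimal sketch of a prefix-LM over multimodal discrete tokens.
# All sizes and token IDs below are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 10_000  # assumed shared vocabulary: text + visual + audio codes
BOS, BOT, EOT, BOV, EOV, BOA, EOA, EOS = range(8)  # assumed boundary-token IDs

def prefix_causal_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Bidirectional attention over the conditioning prefix, causal elsewhere."""
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    mask[:, :prefix_len] = False  # every position may attend to the full prefix
    return mask                   # True = attention blocked, per nn.Transformer

class TinyMultimodalLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens, prefix_len):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        mask = prefix_causal_mask(tokens.size(1), prefix_len)
        h = self.blocks(self.embed(tokens) + self.pos(positions), mask=mask)
        return self.head(h)  # next-token logits over the shared vocabulary

# Assemble one sequence: [bos, bot, text..., eot, bov, visual..., eov, eos].
text_tokens = torch.randint(8, VOCAB, (1, 12))    # stand-in for a text prompt
visual_tokens = torch.randint(8, VOCAB, (1, 32))  # stand-in for MAGVIT-v2 codes
seq = torch.cat([torch.tensor([[BOS, BOT]]), text_tokens,
                 torch.tensor([[EOT, BOV]]), visual_tokens,
                 torch.tensor([[EOV, EOS]])], dim=1)

model = TinyMultimodalLM()
logits = model(seq, prefix_len=14)  # bos + bot + text form the bidirectional prefix
print(logits.shape)                 # torch.Size([1, 50, 10000])
```

Because every modality is reduced to tokens in one vocabulary, the same next-token objective covers all the tasks the slide lists (text-to-video, video continuation, video-to-audio, and so on); only the contents of the prefix change.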