A speaker is delivering a presentation on "Multimodal-Guided Video Generation" in a conference room. The projection screen displays schematics and block diagrams for several models, including "MovieFactory," "CoDi," "MM-Diffusion," and "NExT-GPT." The audience, including at least three visible attendees, is focused on the speaker and the screen. The professional setting and the advanced technical content suggest a detailed presentation aimed at researchers or practitioners at an academic or industry conference.

Text transcribed from the image (diagram labels are only partially legible):

Multimodal-Guided Video Generation: More Works

MovieFactory (Zhu et al.), "MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images," arXiv 2023. [Diagram: Step 1: Spatial training; Step 2: Temporal training; legend distinguishing pretrained (fixed) modules from added (trainable) modules.]

CoDi (Tang et al.), "Any-to-Any Generation via Composable Diffusion," NeurIPS 2023. [Diagram: composable conditioning and generation; remaining labels illegible.]

MM-Diffusion (Ruan et al.), "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation," CVPR 2023. [Diagram: audio branch with encoding; remaining labels illegible.]

NExT-GPT (Wu et al.), "NExT-GPT: Any-to-Any Multimodal LLM," arXiv 2023. [Diagram: Encoding → Alignment (LLM) → Generation, with audio, video, and more modalities as outputs.]

Xing et al., "A Survey on Video Diffusion Models," arXiv 2023.

Copyright Mike Shou, NUS. Slide 160.