A speaker is delivering a presentation on "Multimodal-Guided Video Generation" in a conference room. The projection screen displays schematics and block diagrams for several models, including "MovieFactory," "CoDi," "MM-Diffusion," and "NExT-GPT." The audience, including at least three visible attendees, is focused on the speaker and the screen. The professional setting and the advanced technical content suggest a detailed presentation aimed at researchers or practitioners at an academic or industry conference.

Text transcribed from the image (diagram labels are only partially legible):

Multimodal-Guided Video Generation: More Works

MovieFactory (Zhu et al.), "MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images," arXiv 2023. [Diagram: Step 1: Spatial training; Step 2: Temporal training; legend distinguishing pretrained (fixed) modules from added (trainable) modules.]

CoDi (Tang et al.), "Any-to-Any Generation via Composable Diffusion," NeurIPS 2023. [Diagram: composable conditioning and generation; remaining labels illegible.]

MM-Diffusion (Ruan et al.), "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation," CVPR 2023. [Diagram: audio branch with encoding; remaining labels illegible.]

NExT-GPT (Wu et al.), "NExT-GPT: Any-to-Any Multimodal LLM," arXiv 2023. [Diagram: Encoding → Alignment (LLM) → Generation, with audio, video, and more modalities as outputs.]

Xing et al., "A Survey on Video Diffusion Models," arXiv 2023.

Copyright Mike Shou, NUS. Slide 160.