The image captures a conference setting where attendees are focused on a presentation slide titled "What metrics are used in existing models?". The slide surveys the evaluation metrics reported by several state-of-the-art video generative models. Gen-1 (Runway) and Tune-A-Video both report CLIP similarity between consecutive frames and CLIP similarity between the text prompt and each frame. EMU-Video (Meta) reports Fréchet Video Distance (FVD), while Video Diffusion Models (Google), Imagen Video (Google), and Lumiere (Google) variously report FVD, Inception Score, and CLIP score. Stable Video Diffusion lists Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS), and CLIP score. The audience members, partially visible with one individual resting a hand on his chin, are evidently engaged with the detailed technical content. The setting suggests an academic or professional seminar on advanced topics in AI and video generation.

Text transcribed from the image:

What metrics are used in existing models?

Gen-1, Runway [Esser et al., 2023]
• CLIP (consecutive frames)
• CLIP (text, frame)

EMU-Video, Meta [Giridhar et al., 2024]
• FVD

Video Diffusion Models, Google [Ho et al., 2022]
• Inception score
• FVD

Imagen Video, Google [Ho et al., 2023]
• CLIP
• Inception Scores

Tune-a-video [Wu et al., 2023]
• CLIP (consecutive frames)
• CLIP (text, frame)

Lumiere, Google [Tal et al., 2024]
• FVD
• Inception Scores

Stable diffusion video [Blattmann et al., 2024]
• Peak Signal-to-Noise Ratio (PSNR)
• Perceptual Image Patch Similarity [Zhang et al., CVPR'18]
• CLIP
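For readers implementing the metrics named on the slide, here is a minimal sketch of three of them: CLIP consecutive-frame consistency, CLIP text-frame alignment, and PSNR. The checkpoint name "openai/clip-vit-base-patch32" and the helper functions `clip_scores` and `psnr` are illustrative assumptions, not from the slide.

```python
# Minimal sketch of three slide metrics; checkpoint and helper names
# below are illustrative choices, not prescribed by the slide.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_scores(frames: list[Image.Image], prompt: str) -> tuple[float, float]:
    """Return (consecutive-frame consistency, text-frame alignment).

    Both are mean cosine similarities in CLIP embedding space:
    adjacent frame pairs for the first score, the prompt against
    every frame for the second.
    """
    img = model.get_image_features(**processor(images=frames, return_tensors="pt"))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = model.get_text_features(
        **processor(text=[prompt], return_tensors="pt", padding=True)
    )
    txt = txt / txt.norm(dim=-1, keepdim=True)
    frame_consistency = (img[:-1] * img[1:]).sum(dim=-1).mean().item()
    text_alignment = (img @ txt.T).mean().item()
    return frame_consistency, text_alignment


def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between two same-shaped frames."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```

The remaining slide metrics depend on pretrained networks: FVD uses an I3D video classifier, Inception Score uses Inception-v3, and LPIPS uses the learned network of Zhang et al. (CVPR'18), so in practice those are computed with their reference implementations rather than re-derived by hand.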