In a conference room, participants are viewing a presentation slide titled "What metrics are used in existing models?" The slide compares the evaluation metrics used by recent video generation models, including Gen-1 (Runway), EMU-Video (Meta), Google's Video Diffusion Models, Imagen Video, and Lumiere, as well as Tune-a-Video and Stable Video Diffusion. Metrics referenced include CLIP similarity (between consecutive frames, and between text and frame), Fréchet Video Distance (FVD), Inception Score, Peak Signal-to-Noise Ratio (PSNR), and Learned Perceptual Image Patch Similarity (LPIPS), with CLIP featuring most prominently across models. The slide serves as a comparative overview of current video and image model evaluation metrics; the setting suggests a formal academic or professional environment, and the audience appears engaged with the material.

Text transcribed from the image:

What metrics are used in existing models?
- Gen-1, Runway [Esser et al., 2023]: CLIP (consecutive frames); CLIP (text, frame)
- EMU-Video, Meta [Giridhar et al., 2024]: FVD
- Video Diffusion Models, Google [Ho et al., 2022]: Inception Score; FVD
- Imagen Video, Google [Ho et al., 2023]: CLIP; Inception Scores
- Tune-a-video [Wu et al., 2023]: CLIP (consecutive frames); CLIP (text, frame)
- Lumiere, Google [Tal et al., 2024]: FVD; Inception Scores
- Stable diffusion video [Blattmann et al., 2024]: Peak Signal-to-Noise Ratio (PSNR); Perceptual Image Patch Similarity [Zhang et al., CVPR'18]; CLIP
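Of the metrics named on the slide, PSNR is the simplest to compute directly: it is a log-scaled ratio of the maximum possible pixel value to the mean squared error between two images, with higher values meaning the images are closer. A minimal sketch using NumPy (the function name and signature are illustrative, not from the slide):

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between two same-shaped images, in dB.

    Higher is better; identical images yield infinity.
    """
    # Compute in float64 to avoid overflow/precision issues with integer inputs.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # no distortion at all
    return 10.0 * np.log10(max_val ** 2 / mse)
```

In video evaluation, PSNR is typically computed per frame against a reference and then averaged over the clip; note it measures pixel-level fidelity only, which is why slides like this pair it with perceptual metrics such as LPIPS and CLIP similarity.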