In a conference room, participants are viewing a presentation slide titled "What metrics are used in existing models?" The slide compares the evaluation metrics used by recent video generation models, including Gen-1 (Runway), EMU-Video (Meta), Google's Video Diffusion Models, Imagen Video, and Lumiere, as well as Tune-a-Video and Stable Video Diffusion. Metrics referenced include CLIP similarity (between consecutive frames, and between text and frame), Fréchet Video Distance (FVD), Inception Score, Peak Signal-to-Noise Ratio (PSNR), and Learned Perceptual Image Patch Similarity (LPIPS), with CLIP featuring most prominently across models. The slide serves as a comparative overview of current video and image model evaluation metrics; the setting suggests a formal academic or professional environment, and the audience appears engaged with the material.

Text transcribed from the image:

What metrics are used in existing models?
- Gen-1, Runway [Esser et al., 2023]: CLIP (consecutive frames); CLIP (text, frame)
- EMU-Video, Meta [Giridhar et al., 2024]: FVD
- Video Diffusion Models, Google [Ho et al., 2022]: Inception Score; FVD
- Imagen Video, Google [Ho et al., 2023]: CLIP; Inception Scores
- Tune-a-video [Wu et al., 2023]: CLIP (consecutive frames); CLIP (text, frame)
- Lumiere, Google [Tal et al., 2024]: FVD; Inception Scores
- Stable diffusion video [Blattmann et al., 2024]: Peak Signal-to-Noise Ratio (PSNR); Perceptual Image Patch Similarity [Zhang et al., CVPR'18]; CLIP
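Of the metrics named on the slide, PSNR is the simplest to compute directly: it is a log-scaled ratio of the maximum possible pixel value to the mean squared error between two images, with higher values meaning the images are closer. A minimal sketch using NumPy (the function name and signature are illustrative, not from the slide):

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between two same-shaped images, in dB.

    Higher is better; identical images yield infinity.
    """
    # Compute in float64 to avoid overflow/precision issues with integer inputs.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # no distortion at all
    return 10.0 * np.log10(max_val ** 2 / mse)
```

In video evaluation, PSNR is typically computed per frame against a reference and then averaged over the clip; note it measures pixel-level fidelity only, which is why slides like this pair it with perceptual metrics such as LPIPS and CLIP similarity.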