A participant attentively watches a conference presentation on "Transformer-based video diffusion models". The slide on display shows a flow diagram that begins on the left with a stack of video frames, drawn as an image sequence. The frames pass through a "Visual encoder", depicted by a funnel-shaped icon, and become a large three-dimensional grid. The grid then narrows to a one-dimensional row of cells at the far right, representing the encoded video. The slide credits Peebles et al., Sora, 2024 in the bottom left corner and also carries the attribution "Shou, NUS". The setting is professional, with audience members seated and engaged with the content.
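The last step in the diagram, where the three-dimensional grid becomes a one-dimensional row of cells, can be illustrated with a minimal sketch. This is not taken from the slide or from Sora itself: the function names `make_grid` and `flatten_grid`, the tiny grid size, and the time-then-row-then-column ordering are all assumptions chosen only for illustration.

```python
def make_grid(t, h, w):
    """Build a hypothetical T x H x W grid; each cell is labelled (t, y, x)."""
    return [[[(ti, yi, xi) for xi in range(w)]
             for yi in range(h)]
            for ti in range(t)]


def flatten_grid(grid):
    """Flatten the 3-D grid into a 1-D list of cells.

    Assumed ordering: time first, then rows, then columns.
    """
    return [cell for frame in grid for row in frame for cell in row]


# Toy example: 2 time steps, each a 2 x 3 spatial grid.
grid = make_grid(2, 2, 3)
tokens = flatten_grid(grid)

print(len(tokens))            # number of cells in the 1-D row
print(tokens[0], tokens[-1])  # first and last cell labels
```

Each element of `tokens` keeps its original `(t, y, x)` label, so the position of a cell in the grid can still be recovered after flattening.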
Text transcribed from the image:
Sora
Transformer-based video diffusion models
Peebles et al., Sora, 2024.
Visual
encoder
Shou, NUS
101