A researcher stands in front of a detailed poster presentation titled "Panacea: Panoramic and Controllable Video Generation for Autonomous Driving" at the CVPR 2024 conference. The poster is authored by a team from the University of Science and Technology of China, MEGVII Technology, and Mach Drive. The left section of the poster outlines the motivation and an overview of the research, including applications such as BEV-guided multi-view text-to-video and text-image-to-video generation. The central part details the method, describing the diffusion training process and the two-stage inference pipeline. The right section presents experimental results and quantitative evaluations. The researcher, wearing a CVPR 2024 t-shirt, is engaged with their phone, possibly reviewing notes or additional data related to the presentation. The event appears well attended, with other attendees visible in the background. Text transcribed from the image:

Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
Yuqing Wen, Yucheng Zhao, Yingfei Liu, Fan Jia, Yanhui Wang, Chong Luo, Chi Zhang³, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang²
¹University of Science and Technology of China, ²MEGVII Technology, ³Mach Drive (迈驰智行)
Equal-contribution and corresponding-author marks appear on the poster.

1. Overview
Motivation: In autonomous driving, collecting high-quality annotated video datasets is labor-intensive, and the collected datasets often suffer from limited diversity.
What can Panacea do?
- BEV-guided multi-view text-to-video generation: given a BEV sequence and a text prompt, Panacea generates multi-view driving video (front, back, and surrounding views).
- Panacea generates high-quality multi-view videos with controllability.

2. Method
Diffusion Training Process: Panacea follows the typical diffusion training process (diffusion encoder and decoder, with frozen and trainable modules) while incorporating decomposed 4D attention and control signals through controllable modules.
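The "typical diffusion training process" the poster refers to can be illustrated with a minimal DDPM-style training step. This is a generic sketch, not Panacea's actual implementation: the schedule values, the toy `denoiser`, and the tensor shapes are all illustrative assumptions (in the paper, the denoiser is a UNet that additionally receives BEV-layout and text control signals).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear noise schedule over 1000 steps (illustrative values).
NUM_STEPS = 1000
betas = np.linspace(1e-4, 0.02, NUM_STEPS)
alphas_cumprod = np.cumprod(1.0 - betas)

def diffusion_training_step(x0, t, denoiser):
    """One DDPM-style training step: noise a clean latent x0 to timestep t,
    ask the denoiser to predict the added noise, and return the MSE loss."""
    eps = rng.standard_normal(x0.shape)
    a = np.sqrt(alphas_cumprod[t])
    b = np.sqrt(1.0 - alphas_cumprod[t])
    x_t = a * x0 + b * eps            # forward noising of the clean latent
    eps_pred = denoiser(x_t, t)       # in Panacea this model is also conditioned
                                      # on BEV layouts and the text prompt
    return np.mean((eps_pred - eps) ** 2)

# Toy usage: a trivial denoiser that predicts zeros, so the loss is ~E[eps^2].
x0 = rng.standard_normal((4, 8))
loss = diffusion_training_step(x0, t=500, denoiser=lambda x, t: np.zeros_like(x))
```

In practice the loss would be backpropagated into the trainable modules while the pretrained image-diffusion weights stay frozen, as the poster's frozen/trainable diagram indicates.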
Decomposed 4D Attention: intra-view attention, cross-view attention, and cross-frame attention over the multi-view, multi-frame token grid, with queries attending to keys and values from neighboring views and frames.
➤ Enhancing both multi-view and temporal coherence.
➤ Enabling the integration of diverse control signals.
Details of the conditions: the conditional frame, text prompt, other image conditions, and BEV sequence are injected through ControlNet-style controllable modules.

Two-stage Inference Pipeline:
➤ Creating multi-view images with BEV layouts first, and then using these images and the subsequent BEV layouts to generate the following frames (Stage-1 image generation followed by Stage-2 video generation).

Further capabilities:
- BEV-guided multi-view text-image-to-video generation (BEV sequence + text prompt + conditional image).
- Attribute-controllable video generation via attributes in the text prompt, e.g. spring, autumn, day-time, night-time, fog, snow, rain, sandstorm, countryside, mountain.

3. Experiments
Quantitative Results: FID and FVD metrics are compared with SOTA methods (BEVGen [41], BEVControl [52], DriveDreamer [46]) on the validation set of the nuScenes dataset; the exact table values are illegible in the photo. The synthetic video dataset supports the training of BEV perception models.
✰ Generating video datasets, or elevating image-only datasets into video datasets, augments perception models' training: adding synthetic data improves NDS by +2.3, and using synthetic videos improves NDS by +5.8.
Qualitative Results: real versus generated comparisons show that Panacea generates realistic, controllable multi-view videos.
Ablations: settings compared include full Panacea, w/o cross-view attention, and w/o the two-stage pipeline; the remaining table values are illegible.
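The decomposed 4D attention described above factorizes full spatio-temporal attention into three cheaper passes over a (frames, views, tokens, channels) grid. The sketch below is a simplified NumPy illustration of that factorization only: it omits the learned query/key/value projections, multi-head splitting, and the neighboring-view restriction of the actual method, and all shapes are made-up examples.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def decomposed_4d_attention(x):
    """x: (T frames, V views, N tokens, C channels).
    Attend along one axis at a time instead of over all T*V*N tokens at once."""
    # 1) Intra-view: tokens within each (frame, view) image attend to each other.
    x = attention(x, x, x)                 # attention over the N axis
    # 2) Cross-view: the same token position attends across the V camera views.
    xv = x.transpose(0, 2, 1, 3)           # (T, N, V, C)
    x = attention(xv, xv, xv).transpose(0, 2, 1, 3)
    # 3) Cross-frame: the same token/view position attends across the T frames.
    xt = x.transpose(1, 2, 0, 3)           # (V, N, T, C)
    return attention(xt, xt, xt).transpose(2, 0, 1, 3)

# Toy usage: 3 frames, 6 surround-view cameras, 16 tokens per view, dim 8.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 6, 16, 8))
out = decomposed_4d_attention(tokens)
```

The design point the poster makes is the cost: three passes of axis-wise attention scale far better than joint attention over every token in every view of every frame, while the cross-view and cross-frame passes still propagate information for multi-view and temporal coherence.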