This image captures a detailed poster titled "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval," presented as a Highlight at CVPR 2024 in Seattle, WA. The research is a joint effort by Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao of the Rochester Institute of Technology and Amazon Prime Video, in collaboration with the Army Research Laboratory. The poster is systematically divided into several sections:

1. **Motivation**: Explains the context for the research, emphasizing that text can hardly describe a video's redundant content in full and motivating a more expressive text representation for text-video matching.
2. **Learning in Joint Space**: Discusses the methodology for learning embeddings in a shared space to enable effective text-video matching.
3. **Method: T-MASS**: Details the proposed technique for text-video feature extraction, the stochastic embedding process, and how similarity-aware modeling improves over existing methods. Illustrations and flow diagrams elucidate the process.
4. **Similarity-Aware Radius Modeling**: A deeper dive into the radius-based modeling used to improve retrieval accuracy, with graphs and tables depicting performance metrics and comparisons.
5. **Experiment Results**: Comprehensive tables provide quantitative results across benchmarks (MSRVTT, LSMDC, DiDeMo, VATEX, and Charades), comparing the proposed T-MASS model against several baselines. Bold figures indicate the best performance.
6. **Contact Information and References**: The poster concludes with contact information for further inquiries and references to previously published works by the authors.

The layout combines text, tables, graphs, and images under clear headings, helping the audience follow the research findings.

Text transcribed from the image:

JUNE 17-21, 2024 · CVPR · Seattle, WA · CVPR 2024 Highlight

**Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval**

Jiamian Wang¹, Guohao Sun¹, Pichao Wang², Dongfang Liu¹, Sohail Dianat¹, Majid Rabbani¹, Raghuveer Rao³, Zhiqiang Tao¹
¹Rochester Institute of Technology · ²Amazon Prime Video (the work does not relate to the author's position at Amazon) · ³DEVCOM Army Research Laboratory
[QR codes: Paper · Code · Supplementary]

**Motivation.** Text content can hardly describe the redundant semantics of a video in full. Accordingly, a single text embedding may be less expressive for handling the video information in the joint space. [Figure: the query "women are modeling clothes" is shown against Video 1 (relevant), Video 2 (irrelevant), and Video 3 (irrelevant); a second example pairs the description "a pirate man tries to lift a lantern with his sword while on a boat" with phrase-level matches such as "lightless lantern," "shocked w/o hat," "before mast," and "sword."]

**Learning in Joint Space.** Rather than the original text embedding t, introduce a stochastic embedding t_s that represents a "text mass" around t in the joint space. With reparameterization,

    t_s = t + R · ε,   ε ∼ p,

where R is the text-mass radius (defined under Method below) and ε is sampled noise. [Figure: the existing embedding places a single test point among v-t pairs in the joint space, whereas the proposed embedding places a test mass that covers the relevant video embedding more flexibly.] A minimal code sketch of this sampling follows.
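Since the poster gives only the formulas, the following is a minimal PyTorch sketch of the text-mass sampling, a reading of the poster rather than the authors' released code. The class name `TextMass` and all tensor shapes are hypothetical; s(·,·) is assumed to be cosine similarity (the natural choice for CLIP features), and ε is assumed Gaussian, since the poster only writes ε ∼ p. The radius formula R = exp(SW) is taken from the Method section below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextMass(nn.Module):
    """Sketch of the stochastic text embedding ("text mass").

    Sampling uses the reparameterization  t_s = t + R * eps,
    with a similarity-aware radius
        S_i = s(t, f_i), i = 1..T',   R = exp(S @ W).
    """

    def __init__(self, dim: int, num_frames: int):
        super().__init__()
        # W: learnable linear map from the T' frame similarities to a d-dim radius,
        # in line with the poster's finding that the linear implementation works best.
        self.W = nn.Linear(num_frames, dim, bias=False)

    def radius(self, t: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        """t: (B, d) text embeddings; frames: (B, T', d) fused frame features."""
        S = F.cosine_similarity(t.unsqueeze(1), frames, dim=-1)  # (B, T')
        return torch.exp(self.W(S))                              # (B, d), strictly positive

    def sample(self, t: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        """Draw one stochastic embedding t_s from the text mass."""
        eps = torch.randn_like(t)  # assumption: eps ~ N(0, I); the poster writes eps ~ p
        return t + self.radius(t, frames) * eps
```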
**Method: T-MASS.** Text-video feature extraction: a video encoder φ_v(·) with feature fusion produces frame features [f_1, f_2, …, f_{T′}], f_i ∈ R^d, while a text encoder φ_t(·) produces the text embedding t ∈ R^d. In the joint space, the single text point t is replaced by a text mass with a similarity-aware radius R ∈ R^d. Training randomly samples t_s from the mass; testing samples t_s = t + R · ε repeatedly and chooses the sample closest to the video embedding (detailed below).

It is non-trivial to determine an optimal value for the radius of the text mass R: an oversized radius improperly encompasses less relevant or irrelevant video embeddings, while a too-small text mass may lack the expressiveness to bridge the video. We therefore propose a similarity-aware radius,

    S_i = s(t, f_i), i = 1, …, T′,   S = [S_1, …, S_{T′}],   R = exp(SW),

with W learnable. We further identify a support text embedding t_sup, lying along the direction from v to t and placed at the surface of the text mass, which serves as a proxy to control the text mass (both shifting and scaling):

    t_sup = t + (v − t) / ‖v − t‖ · R.

Training is based on symmetric cross entropy; as printed, the overall objective takes the form L = L_1 + α · L_2, with the support embedding entering the α-weighted term (α is swept over 0.5 to 1.5 in the sensitivity plots).

During inference, the pipeline is modified to take advantage of the text mass: for a given text-video pair {t, v}, sampling is repeated for M trials and the optimal t_s is selected,

    t_s = argmax_{t_s^i} s(t_s^i, v),   i = 1, …, M.

(A sketch of the support embedding and this selection rule follows.)
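Continuing the same hypothetical sketch, the support embedding and the M-trial inference rule translate directly from the two formulas above. The default M = 20 here is chosen only because the sampling-trials ablation (Table 6 below) stops at 20, where it matches the poster's main MSRVTT numbers.

```python
def support_embedding(t: torch.Tensor, v: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """t_sup = t + (v - t) / ||v - t|| * R  (element-wise product with the d-dim radius)."""
    direction = (v - t) / (v - t).norm(dim=-1, keepdim=True)
    return t + direction * R

@torch.no_grad()
def text_video_score(text_mass: TextMass, t, frames, v, M: int = 20) -> torch.Tensor:
    """Sample M trials and keep the best match: score = max_i s(t_s^i, v).

    Keeping the max similarity is equivalent to first selecting
    t_s = argmax_i s(t_s^i, v) and then scoring that t_s against v.
    """
    best = None
    for _ in range(M):
        t_s = text_mass.sample(t, frames)
        sim = F.cosine_similarity(t_s, v, dim=-1)  # (B,) per-pair similarity
        best = sim if best is None else torch.maximum(best, sim)
    return best
```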
**Similarity-Aware Radius Modeling.** Three observations: (1) using the text mass (t_s) yields a performance boost; (2) R is insensitive to varied implementations; (3) the linear implementation performs best. Ablation on the radius implementation (text-to-video):

| Radius R | MSRVTT R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ | DiDeMo R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o R | 46.9 | 72.8 | 82.2 | 2.0 | 14.3 | 44.6 | 73.2 | 82.0 | 2.0 | 15.4 |
| exp(S) | 48.7 | 74.7 | 83.7 | 2.0 | 12.7 | 48.0 | 75.4 | 85.0 | 2.0 | 13.0 |
| exp(S), second variant | 49.2 | 75.7 | 84.7 | 2.0 | 11.7 | 49.7 | 75.8 | 85.3 | 2.0 | 12.6 |
| exp(SW) | 49.1 | 75.7 | 85.7 | 2.0 | 11.9 | 49.8 | 78.1 | 86.0 | 2.0 | 11.8 |

(The notation distinguishing the two middle exp(S) rows is not legible in the image.)

Dynamics of the radius R: T-MASS learns precise text semantics for the relevant text-video pairs; the smallest ‖R‖₁ (around 443.9, the red curve, versus roughly 484.9 and 497.0 for irrelevant pairs) corresponds to the relevant pair, as typically observed for correctly retrieved pairs. Both desired and failing examples are provided in the supplementary. Comparing t_s with t: for the irrelevant pairs, t_s enables smaller cosine-similarity values, while for the relevant pairs, t_s enables smaller loss values; over the MSRVTT-1K test queries, Avg s(t, v) = 0.588 versus Avg s(t_s, v) = 0.635.

| #Trials (M) | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|
| w/o sampling | 44.4 | 72.4 | 81.9 | 2.0 | 13.1 |
| 5 | 46.8 | 74.7 | 84.0 | 2.0 | 12.5 |
| 10 | 50.0 | 75.2 | 84.1 | 2.0 | 12.3 |
| 20 | 50.2 | 75.3 | 85.1 | 1.0 | 11.9 |

Table 6. Stochastic sampling trials.

[Figures: R@1/R@5/R@10 versus training epochs and versus the number of frames (12 to 24), comparing X-Pool and T-MASS; sensitivity to the loss weight α over {0.5, 0.8, 1.0, 1.2, 1.5}; ‖R‖₁ over training epochs for relevant versus irrelevant (positive versus negative) text-video pairs; per-query similarity of t versus t_s on the MSRVTT-1K test set (batch size 32).]

**Experiment Results.**

| Method | MSRVTT R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ | LSMDC R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *CLIP-ViT-B/32* | | | | | | | | | | |
| X-Pool [17] | 46.9 | 72.8 | 82.2 | 2.0 | 14.3 | 25.2 | 43.7 | 53.5 | 8.0 | 53.2 |
| DiffusionRet [26] | 49.0 | 75.2 | 82.7 | 2.0 | 12.1 | 24.4 | 43.1 | 54.3 | 8.0 | 40.7 |
| UATVR [13] | 47.5 | 73.9 | 83.5 | 2.0 | 12.3 | – | – | – | – | – |
| TEFAL [21] | 49.4 | 75.9 | 83.9 | 2.0 | 12.0 | 26.8 | 46.1 | 56.5 | 7.0 | 44.4 |
| CLIP-ViP [57] | 50.1 | 74.8 | 84.6 | 1.0 | – | 25.6 | 45.3 | 54.4 | 8.0 | – |
| T-MASS (Ours) | 50.2 | 75.3 | 85.1 | 1.0 | 11.9 | 28.9 | 48.2 | 57.6 | 6.0 | 43.3 |
| *CLIP-ViT-B/16* | | | | | | | | | | |
| X-Pool [17] | 48.2 | 73.7 | 82.6 | 2.0 | 12.7 | 26.1 | 46.8 | 56.7 | 7.0 | 47.3 |
| UATVR [13] | 50.8 | 76.3 | 85.5 | 1.0 | 12.4 | – | – | – | – | – |
| CLIP-ViP [57] | 54.2 | 77.2 | 84.8 | 1.0 | – | 29.4 | 50.6 | 59.0 | 5.0 | – |
| T-MASS (Ours) | 52.7 | 77.1 | 85.6 | 1.0 | 10.5 | 30.3 | 52.2 | 61.3 | 5.0 | 40.1 |

Table 2. Text-to-video comparisons on MSRVTT and LSMDC. Bold denotes the best.

| Method | DiDeMo R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ | VATEX R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *CLIP-ViT-B/32* | | | | | | | | | | |
| X-Pool [17] | 44.6 | 73.2 | 82.0 | 2.0 | 15.4 | 60.0 | 90.0 | 95.0 | 1.0 | 3.8 |
| DiffusionRet [26] | 46.7 | 74.7 | 82.7 | 2.0 | 14.3 | – | – | – | – | – |
| UATVR [13] | 43.1 | 71.8 | 82.3 | 2.0 | 15.1 | 61.3 | 91.0 | 95.6 | 1.0 | 3.3 |
| CLIP-ViP [57] | 48.6 | 77.1 | 84.4 | 2.0 | – | – | – | – | – | – |
| T-MASS (Ours) | 50.9 | 77.2 | 85.3 | 1.0 | 12.1 | 63.0 | 92.3 | 96.4 | 1.0 | 3.2 |
| *CLIP-ViT-B/16* | | | | | | | | | | |
| X-Pool [17] | 47.3 | 74.8 | 82.8 | 2.0 | 14.2 | 62.6 | 91.7 | 96.0 | 1.0 | 3.4 |
| UATVR [13] | 45.8 | 73.7 | 83.3 | 2.0 | 13.5 | 64.5 | 92.6 | 96.8 | 1.0 | 2.8 |
| CLIP-ViP [57] | 50.5 | 78.4 | 87.1 | 1.0 | – | – | – | – | – | – |
| T-MASS (Ours) | 53.3 | 80.1 | 87.7 | 1.0 | 9.8 | 65.6 | 93.9 | 97.2 | 1.0 | 2.7 |

Table 3. Text-to-video comparisons on DiDeMo and VATEX. Bold denotes the best.

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|
| *CLIP-ViT-B/32* | | | | | |
| CLIP4Clip [39] | 42.7 | 70.9 | 80.6 | 2.0 | 11.6 |
| CenterCLIP [60] | 42.8 | 71.7 | 82.2 | 2.0 | 10.9 |
| X-Pool [17] | 44.4 | 73.3 | 84.0 | 2.0 | 9.0 |
| TS2-Net [36] | 45.3 | 74.1 | 83.7 | 2.0 | 9.2 |
| DiffusionRet [26] | 47.7 | 73.8 | 84.5 | 2.0 | 8.8 |
| UATVR [13] | 46.9 | 73.8 | 83.8 | 2.0 | 8.6 |
| T-MASS (Ours) | 47.7 | 78.0 | 86.3 | 2.0 | 8.0 |
| *CLIP-ViT-B/16* | | | | | |
| X-Pool [17] | 46.4 | 73.9 | 84.1 | 2.0 | 8.4 |
| TS2-Net [36] | 46.6 | 75.9 | 84.9 | 2.0 | 8.9 |
| CenterCLIP [60] | 47.7 | 75.0 | 83.3 | 2.0 | 10.2 |
| UATVR [13] | 48.1 | 76.3 | 85.4 | 2.0 | 8.0 |
| T-MASS (Ours) | 50.9 | 80.2 | 88.0 | 1.0 | 7.4 |

Table 4. Video-to-text performance (MSRVTT).

Table 5. Text-to-video on Charades: rows include ClipBERT [28], CLIP4Clip [39], X-Pool [17], and T-MASS (Ours) under both CLIP-ViT backbones, but the printed values are too garbled in the image to transcribe reliably.

**Contact.** E-mail: jw4905@rit.edu · Website: https://jiamian-wang.github.io/

**References.**
[1] Wang, J., Zhang, Y., Yuan, X., Meng, Z., & Tao, Z. (2022). Modeling Mask Uncertainty in Hyperspectral Image Reconstruction. In ECCV 2022 (Oral).
[2] Wang, J., Wang, H., Zhang, Y., Fu, Y., & Tao, Z. (2023). Iterative Soft … (title truncated in the image).
[3] Wang, J., Wu, … for Efficient Image Super-Resolution. In ICC… (truncated in the image).