This image shows a research poster titled "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval" presented at CVPR 2024 in Seattle, WA. The poster features work from researchers at the Rochester Institute of Technology and Amazon Prime Video. The main sections include:

1. **Motivation**: Discusses the limitations of a single text embedding in conveying video information. Visual examples and annotations highlight the redundant semantics in video content that a text query fails to capture, and the need for a more expressive representation.
2. **Method: T-MASS**: Explains the T-MASS methodology, covering text-video feature extraction and a stochastic text embedding that models text as a "mass" rather than a point. Illustrations and mathematical formulas support the explanation.
3. **Learning in Joint Space**: Details the training objective used to better embed text and video in a shared space, introducing a support text embedding and two symmetric cross-entropy loss terms.
4. **Similarity-Aware Radius Modeling**: Introduces a mechanism that adapts the radius of the text mass to text-video similarity, and compares different radius implementations and their impact on retrieval performance.
5. **Experiment Results**: Presents tables comparing T-MASS with other methods on benchmark datasets such as MSRVTT and DiDeMo, reporting recall rates (R@K) and median ranks, alongside comparative bar charts.
6. **Contact Information**: Provides the authors' contact details for further inquiries and collaborations.

The poster is rich with diagrams, plots, and mathematical notation, reflecting a comprehensive approach to improving text-video retrieval through a novel embedding technique.

Text transcribed from the image:

CVPR, June 17–21, 2024, Seattle, WA — CVPR 2024 Highlight

**Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval**

Jiamian Wang¹, Guohao Sun¹, Pichao Wang², Dongfang Liu¹, Sohail Dianat¹, Majid Rabbani¹, Raghuveer Rao³, Zhiqiang Tao¹
¹Rochester Institute of Technology, ²Amazon Prime Video (the work does not relate to the author's position at Amazon), ³Army Research Laboratory
[QR codes: Paper | Code | Supplementary]

**Motivation**

[Figure: video frames for the description "a pirate man tries to lift a lantern with his sword while on a boat", with annotations marking video semantics the text misses: lightless lantern, shocked expression, w/o hat, before the mast, sword. Existing embedding: a single text point in the joint space; proposed embedding: a text mass covering the relevant video embedding, alongside other video-text pairs.]

The text content can hardly describe the redundant semantics of a video in full. Accordingly, a single text embedding may be less expressive in handling the video information in the joint space.

**Method: T-MASS**

Text-video feature extraction: a video encoder $\phi_v(\cdot)$ extracts frame features $[f_1, f_2, \dots, f_T]$, $f_i \in \mathbb{R}^d$, which a feature-fusion module aggregates into a video embedding $v$; a text encoder $\phi_t(\cdot)$ produces the text embedding $t \in \mathbb{R}^d$. [Pipeline figure: the query "women are modeling clothes" scored against Video 1 (relevant, $\|R\|_1 = 443.9$) and Videos 2–3 (irrelevant, $\|R\|_1 = 497.0$ and $484.9$).]

Rather than the original text embedding $t$, we introduce a stochastic text embedding $t_s$ to implement a text mass, using reparameterization:

$$t_s = t + R \cdot \epsilon, \quad \epsilon \sim P.$$

**Learning in Joint Space**

Besides $t_s$, we identify a support text embedding $t_{\text{sup}}$, located on the surface of the text mass along the direction from $t$ to $v$, which serves as a proxy to better control the text mass (both shifting and scaling):

$$t_{\text{sup}} = t + \frac{v - t}{\|v - t\|} \cdot R.$$

We introduce two loss terms based on symmetric cross entropy, obtained by substituting $t_s$ and $t_{\text{sup}}$ for $t$, and combine them as

$$\mathcal{L}_{\text{total}} = \mathcal{L}_s + \alpha \, \mathcal{L}_{\text{sup}}.$$

Training: randomly sample $t_s = t + R \cdot \epsilon$, $\epsilon \sim P$. Testing: choose the sampled $t_s$ closest to $v$.

Dynamics of radius $R$: T-MASS learns precise text semantics for the relevant text-video pairs (the smallest $\|R\|_1$ corresponds to the red curve), which is typically observed on correctly retrieved pairs. We provide both desired and failing examples in the supplementary. [Plots: $\ell_1$-norm of $R$ over training epochs 0–5 for relevant/positive versus irrelevant/negative text-video pairs.]

Comparing $t_s$ with $t$ over the 1,000 query texts of the MSRVTT-1K test set: for the irrelevant pairs, $t_s$ enables smaller cosine similarity values (left side), with Avg($t_s$ vs. $v$) = 0.588 against Avg($t$ vs. $v$) = 0.635; for the relevant pairs, $t_s$ enables smaller loss values (right side).
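The reparameterized sampling, the support embedding, and the paired symmetric cross-entropy terms translate directly into code. Below is a minimal PyTorch sketch, assuming $\epsilon$ is drawn from a standard normal, $s(\cdot,\cdot)$ is cosine similarity, and the loss is a CLIP-style in-batch contrastive objective; the function names and the logit scale are illustrative, not taken from the authors' released code.

```python
import torch
import torch.nn.functional as F

def sample_text_mass(t: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """Stochastic text embedding t_s = t + R * eps via reparameterization.

    t: (B, d) text embeddings; R: (B, d) per-dimension radius.
    The poster writes eps ~ P; a standard normal is assumed here.
    """
    eps = torch.randn_like(t)
    return t + R * eps

def support_text_embedding(t: torch.Tensor, v: torch.Tensor, R: torch.Tensor) -> torch.Tensor:
    """t_sup = t + (v - t) / ||v - t|| * R: a point on the text-mass surface
    along the direction from t toward the video embedding v."""
    direction = F.normalize(v - t, dim=-1)
    return t + direction * R

def symmetric_ce(text_emb: torch.Tensor, video_emb: torch.Tensor,
                 logit_scale: float = 100.0) -> torch.Tensor:
    """CLIP-style symmetric cross entropy over in-batch text-video pairs.
    Applied once with t_s (giving L_s) and once with t_sup (giving L_sup)."""
    logits = logit_scale * F.normalize(text_emb, dim=-1) @ F.normalize(video_emb, dim=-1).T
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

# L_total = L_s + alpha * L_sup, with alpha a loss-weight hyperparameter:
# loss = symmetric_ce(t_s, v) + alpha * symmetric_ce(t_sup, v)
```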
**Similarity-Aware Radius Modeling**

It is non-trivial to determine an optimal value for the radius of the text mass (i.e., $R$): an oversized radius improperly encompasses less relevant or irrelevant video embeddings, while a too-small text mass may lack the expressiveness to bridge the video. We propose a similarity-aware radius,

$$S_i = s(t, f_i),\ i = 1, \dots, T', \qquad S = [S_1, \dots, S_{T'}], \qquad R = \exp(SW),$$

where $s(\cdot,\cdot)$ is the text-frame similarity, $W \in \mathbb{R}^{T' \times d}$ is learnable, and $R \in \mathbb{R}^d$.

During inference, we modify the pipeline to take advantage of the text mass. For a given text-video pair $\{t, v\}$, we repeat the sampling for $M$ trials and select the optimal $t_s$:

$$t_s = \arg\max_{t_s^i} s(t_s^i, v),\ i = 1, \dots, M.$$

Ablation findings: (1) using the text mass ($t_s$) results in a performance boost; (2) $R$ is insensitive to varied implementations; (3) the linear form $\exp(SW)$ performs best. The compared radius implementations include w/o $R$, $\exp(S)$, and $\exp(SW)$; on MSRVTT (ViT-B/32), any similarity-aware radius lifts R@1 from the 46.9 baseline to 48.7–49.2, with similar gains on DiDeMo (44.6 to 48.0–49.8).
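As a companion sketch, here is one way to realize the linear radius $R = \exp(SW)$ and the $M$-trial test-time selection, under the same assumptions as above (cosine similarity for $s$, Gaussian $\epsilon$); the module and function names, the zero initialization of $W$, and the default $M$ are choices made here for illustration, not details confirmed by the poster.

```python
import torch
import torch.nn.functional as F

class SimilarityAwareRadius(torch.nn.Module):
    """Linear radius R = exp(S W), the variant the poster reports as best.

    S = [s(t, f_1), ..., s(t, f_T')] collects text-frame similarities, and a
    learnable W in R^{T' x d} maps them to a positive per-dimension radius.
    """
    def __init__(self, num_frames: int, dim: int):
        super().__init__()
        self.W = torch.nn.Parameter(torch.zeros(num_frames, dim))  # zero init -> R = 1

    def forward(self, t: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # t: (B, d) text embeddings; frames: (B, T', d) frame features.
        S = torch.einsum('bd,btd->bt', F.normalize(t, dim=-1),
                         F.normalize(frames, dim=-1))   # (B, T') cosine similarities
        return torch.exp(S @ self.W)                    # (B, d), strictly positive

def select_ts(t: torch.Tensor, v: torch.Tensor, R: torch.Tensor, M: int = 20) -> torch.Tensor:
    """Inference: draw M stochastic embeddings and keep the one most similar
    to v, i.e., t_s = argmax_i s(t_s^i, v)."""
    eps = torch.randn(M, *t.shape, device=t.device, dtype=t.dtype)
    candidates = t.unsqueeze(0) + R.unsqueeze(0) * eps              # (M, B, d)
    sims = F.cosine_similarity(candidates, v.unsqueeze(0), dim=-1)  # (M, B)
    best = sims.argmax(dim=0)                                       # (B,)
    return candidates[best, torch.arange(t.size(0), device=t.device)]
```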
**Experiment Results**

Here ↑ marks metrics where higher is better (recall at rank K) and ↓ where lower is better (median rank MdR, mean rank MnR). A sketch of how these metrics are typically computed follows the contact note below.

Table 2. Text-to-video comparisons on MSRVTT.

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ |
| --- | --- | --- | --- | --- |
| *CLIP-ViT-B/32* | | | | |
| X-Pool [17] | 46.9 | 72.8 | 82.2 | 2.0 |
| DiffusionRet [26] | 49.0 | 75.2 | 82.7 | 2.0 |
| UATVR [13] | 47.5 | 73.9 | 83.5 | 2.0 |
| TEFAL [21] | 49.4 | 75.9 | 83.9 | 2.0 |
| CLIP-ViP [57] | 50.1 | 74.8 | 84.6 | 1.0 |
| T-MASS (Ours) | 50.2 | 75.3 | 85.1 | 1.0 |
| *CLIP-ViT-B/16* | | | | |
| X-Pool [17] | 48.2 | 73.7 | 82.6 | 2.0 |
| UATVR [13] | 50.8 | 76.3 | 85.5 | 1.0 |
| CLIP-ViP [57] | 54.2 | 77.2 | 84.8 | 1.0 |
| T-MASS (Ours) | 52.7 | 77.1 | 85.6 | 1.0 |

Table 3. Text-to-video comparisons on DiDeMo.

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ |
| --- | --- | --- | --- | --- |
| *CLIP-ViT-B/32* | | | | |
| X-Pool [17] | 44.6 | 73.2 | 82.0 | 2.0 |
| DiffusionRet [26] | 46.7 | 74.7 | 82.7 | 2.0 |
| UATVR [13] | 43.1 | 71.8 | 82.3 | 2.0 |
| CLIP-ViP [57] | 48.6 | 77.1 | 84.4 | 2.0 |
| T-MASS (Ours) | 50.9 | 77.2 | 85.3 | 1.0 |
| *CLIP-ViT-B/16* | | | | |
| X-Pool [17] | 47.3 | 74.8 | 82.8 | 2.0 |
| UATVR [13] | 45.8 | 73.7 | 83.3 | 2.0 |
| CLIP-ViP [57] | 50.5 | 78.4 | 87.1 | 1.0 |
| T-MASS (Ours) | 53.3 | 80.1 | 87.7 | 1.0 |

Table 4. Video-to-text performance (MSRVTT).

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
| --- | --- | --- | --- | --- | --- |
| *CLIP-ViT-B/32* | | | | | |
| CLIP4Clip [39] | 42.7 | 70.9 | 80.6 | 2.0 | 11.6 |
| CenterCLIP [60] | 42.8 | 71.7 | 82.2 | 2.0 | 10.9 |
| X-Pool [17] | 44.4 | 73.3 | 84.0 | 2.0 | 9.0 |
| TS2-Net [36] | 45.3 | 74.1 | 83.7 | 2.0 | 9.2 |
| DiffusionRet [26] | 47.7 | 73.8 | 84.5 | 2.0 | 8.8 |
| UATVR [13] | 46.9 | 73.8 | 83.8 | 2.0 | 8.6 |
| T-MASS (Ours) | 47.7 | 78.0 | 86.3 | 2.0 | 8.0 |
| *CLIP-ViT-B/16* | | | | | |
| X-Pool [17] | 46.4 | 73.9 | 84.1 | 2.0 | 8.4 |
| TS2-Net [36] | 46.6 | 75.9 | 84.9 | 2.0 | 8.9 |
| CenterCLIP [60] | 47.7 | 75.0 | 83.3 | 2.0 | 10.2 |
| UATVR [13] | 48.1 | 76.3 | 85.4 | 2.0 | 8.0 |
| T-MASS (Ours) | 50.9 | 80.2 | 88.0 | 1.0 | 7.4 |

Table 1. Ablation study of losses and text embedding (MSRVTT, ViT-B/32): from the baseline's 46.9/72.8/82.2/2.0/14.3 (R@1/R@5/R@10/MdR/MnR), adding the stochastic text embedding and loss terms improves results stepwise (48.5/74.8/84.3/2.0/12.3, then 49.1/75.7/85.7/2.0/11.9), and the full T-MASS reaches 50.2/75.3/85.1/1.0/11.9.

[Bar charts: R@1/R@5/R@10 on MSRVTT versus the number of frames (12–24) for X-Pool and T-MASS (Ours).]

[Figure: qualitative retrieval results.]

**Contact** — E-mail: [not legible in the image]
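As a companion to the tables above, here is a small self-contained sketch of the standard computation of R@K, MdR, and MnR from an $N \times N$ text-video similarity matrix, assuming ground-truth pairs lie on the diagonal (query text $i$ matches video $i$); this is generic retrieval-evaluation code, not the authors' script.

```python
import torch

def retrieval_metrics(sims: torch.Tensor) -> dict:
    """R@K, MdR, and MnR from an (N, N) similarity matrix whose (i, j)
    entry scores query text i against video j; ground truth is assumed
    to be the diagonal."""
    gt_scores = sims.diag().unsqueeze(1)                 # (N, 1) score of the true pair
    ranks = (sims > gt_scores).sum(dim=1).float() + 1.0  # 1-based rank of the true video
    return {
        'R@1': (ranks <= 1).float().mean().item() * 100,
        'R@5': (ranks <= 5).float().mean().item() * 100,
        'R@10': (ranks <= 10).float().mean().item() * 100,
        'MdR': ranks.median().item(),
        'MnR': ranks.mean().item(),
    }
```

For text-to-video results (Tables 2 and 3) the matrix is scored text-against-video as above; for video-to-text (Table 4), the transposed matrix `sims.T` would be passed instead.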