Caption: A detailed research poster presented at CVPR 2024 (June 17-21, 2024, Seattle, WA), titled "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval" (a CVPR 2024 Highlight), by researchers from Rochester Institute of Technology (RIT), Amazon Prime Video, and the DEVCOM Army Research Laboratory. The poster presents the T-MASS method, which combines text-video feature extraction with similarity-aware radius modeling to improve text-video retrieval. It lays out the motivation behind the research, the methodology, and experimental results on benchmarks including MSRVTT, LSMDC, DiDeMo, and VATEX. The left section covers the motivation and learning in the joint space, the center section the T-MASS methodology, and the right section tabular comparisons with other methods. Diagrams of the radius modeling and detailed visual graphs clarify the approach and its effectiveness.

Text transcribed from the image:

Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval (CVPR 2024 Highlight)
CVPR, June 17-21, 2024, Seattle, WA
Jiamian Wang¹, Guohao Sun¹, Pichao Wang², Dongfang Liu¹, Sohail Dianat¹, Majid Rabbani¹, Raghuveer Rao³, Zhiqiang Tao¹
¹Rochester Institute of Technology   ²Amazon Prime Video (the work does not relate to the author's position at Amazon)   ³Army Research Laboratory
QR codes: Paper, Code, Supplementary.

❖ Motivation
Query: "women are modeling clothes", shown in the joint text-video embedding space against Video 1 (relevant), Video 2 (irrelevant), and Video 3 (irrelevant).
[Figure: joint-space illustration of the text mass against the three videos, with example radii ‖R‖₁ = 443.9, 497.0, 484.9.]

❖ Method: T-MASS
Text-Video Feature Extraction. A video encoder φ_v(·) extracts frame features [f_1, f_2, ..., f_T], f_i ∈ R^d; a text encoder φ_t(·) produces the text embedding t ∈ R^d; a feature fusion module ψ(·) aggregates them. Frame-wise similarities S_i = s(f_i, t), i = 1, ..., T, drive a similarity-aware radius R ∈ R^d (detailed under Similarity-Aware Radius Modeling). Training: randomly sample a stochastic text embedding from the text mass. Testing: choose the sample closest to the video.

❖ Experiment Results
MSRVTT and LSMDC text-to-video retrieval:

Method            | MSRVTT: R@1↑ R@5↑ R@10↑ MdR↓ MnR↓ | LSMDC: R@1↑ R@5↑ R@10↑
CLIP-ViT-B/32
X-Pool [17]       | 46.9  72.8  82.2  2.0  14.3 | 25.2  43.7  53
DiffusionRet [26] | 49.0  75.2  82.7  2.0  12.1 | 24.4  43.1  54
UATVR [13]        | 47.5  73.9  83.5  2.0  12.3 | –     –     –
TEFAL [21]        | 49.4  75.9  83.9  2.0  12.0 | 26.8  46.1  56
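The feature-extraction step above can be sketched in plain Python. This is a minimal illustration with toy 2-d features: the poster does not spell out the fusion module ψ(·), so the softmax similarity-weighted pooling below (in the spirit of X-Pool) is an assumption, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity s(a, b) between two vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def fuse_frames(t, frames, tau=0.01):
    """Hypothetical fusion psi(.): pool frame features with
    softmax(s(f_i, t) / tau) weights. Similarity-weighted pooling
    is an assumption; the poster only names psi(.)."""
    sims = [cosine(t, f) for f in frames]            # S_i = s(f_i, t)
    m = max(s / tau for s in sims)                   # for numerical stability
    w = [math.exp(s / tau - m) for s in sims]
    z = sum(w)
    w = [x / z for x in w]                           # softmax weights
    d = len(frames[0])
    return [sum(w[i] * frames[i][j] for i in range(len(frames))) for j in range(d)]

# Toy example: the fused video embedding leans toward the frame aligned with the text.
t = [1.0, 0.0]
frames = [[1.0, 0.0], [0.0, 1.0]]   # frame 1 aligned with t, frame 2 orthogonal
v = fuse_frames(t, frames)
```

With a small temperature, nearly all pooling weight lands on the text-aligned frame, so the fused embedding is approximately that frame.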
CLIP-ViP [57]     | 50.1  74.8  84.6  1.0  –    | 25.6  45.3  54
T-MASS (Ours)     | 50.2  75.3  85.1  1.0  11.9 | 28.9  48.2  57
CLIP-ViT-B/16
X-Pool [17]       | 48.2  73.7  82.6  2.0  12.7 | 26.1  46.8  56
UATVR [13]        | 50.8  76.3  85.5  1.0  12.4 | –     –     –
CLIP-ViP [57]     | 54.2  77.2  84.8  1.0  –    | 29.4  50.6  59
T-MASS (Ours)     | 52.7  77.1  85.6  1.0  10.5 | 30.3  52.2  61
Table 2. Text-to-video comparisons on MSRVTT and LSMDC. Bold denotes the best result.

Motivating example. Description: "a pirate man tries to lift a lantern with his sword while on a boat" — individual frames mismatch the text: lightless lantern (✗), shocked face w/o hat (✗), before the mast (✗), sword (✗). The text content can hardly describe the redundant semantics of the video in full. Accordingly, a single text embedding may be too inexpressive to handle the video information in the joint space.

Existing embedding vs. proposed embedding: a single text point vs. a text mass in the joint space, shown alongside the video embedding and other text-video pairs.

❖ Learning in Joint Space
Rather than the original text embedding t, we introduce a stochastic text embedding t_s to implement a text mass using reparameterization:

    t_s = t + R ⊙ ε,   ε ~ P,

where R is the radius of the text mass. Besides t_s, we identify a support text embedding t_sup, located along the direction from v to t and placed on the surface of the text mass, which serves as a proxy to better control the text mass (both shifting and scaling):

    t_sup = t + (v − t) / ‖v − t‖ ⊙ R.

We introduce two loss terms based on the symmetric cross entropy, one on t_s and one on t_sup.

It is non-trivial to determine an optimal value for the radius R of the text mass: an oversized radius improperly encompasses less relevant or irrelevant video embeddings, while a too-small text mass may lack the expressiveness to bridge to the video.
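The reparameterized sampling and the support embedding above can be sketched in plain Python. This is a minimal illustration with toy 3-d embeddings; the prior P is assumed Gaussian here, which the transcription does not specify.

```python
import math
import random

def sample_stochastic_text(t, R, rng=None):
    """Reparameterized sample t_s = t + R * eps, eps ~ P.
    P is assumed to be a standard Gaussian for this sketch."""
    rng = rng or random.Random(0)
    return [ti + Ri * rng.gauss(0.0, 1.0) for ti, Ri in zip(t, R)]

def support_text_embedding(t, v, R):
    """t_sup = t + (v - t)/||v - t|| * R: a point on the surface of the
    text mass along the direction from t toward the video embedding v."""
    diff = [vi - ti for vi, ti in zip(v, t)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [ti + (di / norm) * Ri for ti, di, Ri in zip(t, diff, R)]

t = [0.2, -0.5, 0.7]    # toy text embedding (d = 3)
v = [0.3, -0.1, 0.6]    # toy video embedding
R = [0.1, 0.1, 0.1]     # per-dimension radius of the text mass

t_s = sample_stochastic_text(t, R)
t_sup = support_text_embedding(t, v, R)
```

With a uniform radius of 0.1, t_sup sits exactly 0.1 away from t along the unit direction toward v, matching the "surface of the text mass" reading.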
❖ Similarity-Aware Radius Modeling
We propose a similarity-aware radius:

    S_i = s(t, f_i), i = 1, ..., T,   S = [S_1, ..., S_T],   R = exp(SW),

with a learnable matrix W ∈ R^{T×d}.

During inference, we modify the pipeline to take advantage of the text mass. For a given text-video pair {t, v}, we repeat the sampling for M trials and select the optimal sample:

    t_s = argmax_{t_s^i} s(t_s^i, v),   i = 1, ..., M.

Findings: (1) Using the text mass (t_s) yields a performance boost. (2) R is insensitive to varied implementations. (3) The linear form performs best.

Dynamics of radius R. T-MASS learns precise text semantics for the relevant text-video pairs (the smallest ‖R‖₁ corresponds to the red curve); this is typically observed for correctly retrieved pairs. We provide both successful and failing examples in the supplementary.

DiDeMo and VATEX text-to-video retrieval:

Method            | DiDeMo: R@1↑ R@5↑ R@10↑ MdR↓ MnR↓ | VATEX: R@1↑ R@5↑ R@10↑
CLIP-ViT-B/32
X-Pool [17]       | 44.6  73.2  82.0  2.0  15.4 | 60.0  90.0  95.0
DiffusionRet [26] | 46.7  74.7  82.7  2.0  14.3 | –     –     –
UATVR [13]        | 43.1  71.8  82.3  2.0  15.1 | 61.3  91.0  95
CLIP-ViP [57]     | 48.6  77.1  84.4  2.0  –    | –     –     –
T-MASS (Ours)     | 50.9  77.2  85.3  1.0  12.1 | 63.0  92.3  96.4
CLIP-ViT-B/16
X-Pool [17]       | 47.3  74.8  82.8  2.0  14.2 | 62.6  91.7  96.0
UATVR [13]        | 45.8  73.7  83.3  2.0  13.5 | 64.5  92.6  96.8
CLIP-ViP [57]     | 50.5  78.4  87.1  1.0  –    | –     –     –
T-MASS (Ours)     | 53.3  80.1  87.7  1.0  9.8  | 65.6  93.9  97.2
Table 3. Text-to-video comparisons on DiDeMo and VATEX. Bold denotes the best result.
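The radius computation R = exp(SW) and the test-time "choose the closest sample" rule can be sketched as follows. This is an illustrative sketch only: W is learnable in the method but fixed to arbitrary values here, cosine similarity is assumed for s(·,·), and the sampling prior is assumed Gaussian.

```python
import math
import random

def cos_sim(a, b):
    """Assumed form of s(., .): cosine similarity."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def radius(t, frames, W):
    """R = exp(S W), with S_i = s(t, f_i) over T frames and W a T x d
    matrix (learnable in the method, fixed here for illustration)."""
    S = [cos_sim(t, f) for f in frames]      # S in R^{1 x T}
    T, d = len(W), len(W[0])
    return [math.exp(sum(S[i] * W[i][j] for i in range(T))) for j in range(d)]

def select_closest(t, R, v, M=10, rng=None):
    """Testing: draw M stochastic embeddings t_s = t + R * eps and keep
    the one with the largest similarity s(t_s, v)."""
    rng = rng or random.Random(0)
    best, best_sim = None, -2.0
    for _ in range(M):
        t_s = [ti + Ri * rng.gauss(0.0, 1.0) for ti, Ri in zip(t, R)]
        sim = cos_sim(t_s, v)
        if sim > best_sim:
            best, best_sim = t_s, sim
    return best, best_sim

# Toy run: T = 2 frames, d = 3 dimensions; W values are hypothetical.
t = [0.5, -0.2, 0.8]
v = [0.4, 0.1, 0.9]
frames = [[0.6, -0.1, 0.7], [0.1, 0.9, 0.2]]
W = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.1]]
R = radius(t, frames, W)
t_s, sim = select_closest(t, R, v, M=16)
```

Because exp(·) is applied elementwise, every radius component is strictly positive, and larger M gives more chances to land a sample close to the video embedding.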
Ablation on the radius implementation (MSRVTT and DiDeMo):

Radius R   | MSRVTT: R@1↑ R@5↑ R@10↑ MdR↓ MnR↓ | DiDeMo: R@1↑ R@5↑ R@10↑ MdR↓ MnR↓
w/o R      | 46.9  72.8  82.2  2.0  14.3 | 44.6  73.2  82.0  2.0  15.4
exp(S)     | 48.7  74.7  83.7  2.0  12.7 | 48.0  75.4  85.0  2.0  13.0
exp(S)     | 49.2  75.7  84.7  2.0  11.7 | 49.7  75.8  85.3  2.0  12.6
exp(SW)    | 49.1  75.7  85.7  2.0  11.9 | 49.8  78.1  86.0  2.0  11.8

Method            | MSRVTT: R@1↑ R@5↑ R@10↑ MdR↓ MnR↓
CLIP-ViT-B/32
CLIP4Clip [39]    | 42.7  70.9  80.6  2.0  11.6
CenterCLIP [60]   | 42.8  71.7  82.2  2.0  10.9
X-Pool [17]       | 44.4  73.3  84.0  2.0  9.0
TS2-Net [36]      | 45.3  74.1  83.7  2.0  9.2
DiffusionRet [26] | 47.7  73.8  84.5  2.0  8.8
UATVR [13]        | 46.9  73.8  83.8  2.0  8.6
T-MASS (Ours)     | 47.7  78.0  86.3  2.0  8.0
CLIP-ViT-B/16
X-Pool [17]       | 46.4  73.9  84.1  2.0  8.4
TS2-Net [36]      | 46.6  75.9  84.9  2.0  8.9
CenterCLIP [60]   | 47.7  75.0  83.3  2.0  10.2
UATVR [13]        | 48.1  76.3  85.4  2.0  8.0
T-MASS (Ours)     | 50.9  80.2  88.0  1.0  7.4
Table 4. Video-to-text performance (MSRVTT).

Method            | R@1↑  R@5↑
CLIP-ViT-B/32
ClipBERT [28]     | 6.7   17.3
CLIP4Clip [39]    | 9.9   27.1
X-Pool [17]       | 11.2  28.3
T-MASS (Ours)     | 14.2  36.2
CLIP-ViT-B/16
CLIP4Clip [39]    | 16.0  38.2
X-Pool [17]       | 20.7  42.5
T-MASS (Ours)     | 26.7  51.7
Table 5. Text-to-video comparisons.

#Trials (M)   | R@1↑  R@5↑
w/o sampling  | 44.4  72.4
5             | 46.8  74.7
10            | 50.0  75.2
20            | 50.2  75.3
Table 6. Stochastic sampling trials.

[Figure: retrieval results (R@1, R@5, R@10) vs. the number of frames (12-24), X-Pool vs. T-MASS.]
[Figure: retrieval results (R@1, R@5) vs. the scaling (0.5-1.0) of the L₁-norm of R, X-Pool vs. T-MASS.]
[Figure: ‖R‖₁ (≈420-520) over training epochs 0-5 for positive and negative text-video pairs.]
[Figure: text-video cosine similarity on irrelevant pairs, t_s vs. v against t vs. v; Avg(t_s, v) = 0.588 vs. Avg(t, v) = 0.635.]

Compare t_s with t. For the irrelevant pairs, t_s enables smaller cosine similarity values (left side).
For the relevant pairs, t_s enables smaller loss values (right side).
[Figure: per-query loss comparison between t_s and t on relevant pairs.]

The total training objective combines the two symmetric cross-entropy terms:

    L_total = L_{t_s} + α · L_sup.

Ablated components (✗ = disabled, ✓ = enabled) | R@1↑  R@5↑  R@10↑ MdR↓ MnR↓
✗ ✗ ✗ (baseline)                               | 46.9  72.8  82.2  2.0  14.3
✓ ✓ ✓ ✗                                        | 48.5  74.8  84.3  2.0  12.3
—                                              | 49.1  75.7  85.7  2.0  11.9
All ✓ (T-MASS, full)                           | 50.2  75.3  85.1  1.0  11.9
Table 1. Ablation study of losses and text embedding.
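The loss combination above can be illustrated on a toy B×B similarity matrix. This is a sketch rather than the training code: the temperature τ and weight α values are placeholders, and the symmetric cross entropy is written in its common text-to-video plus video-to-text InfoNCE form.

```python
import math

def symmetric_ce(sim, tau=0.05):
    """Symmetric cross-entropy over a B x B similarity matrix sim[i][j]:
    text-to-video and video-to-text InfoNCE, averaged. Diagonal entries
    are the matched pairs."""
    B = len(sim)
    def nll(rows):
        total = 0.0
        for i in range(B):
            logits = [rows[i][j] / tau for j in range(B)]
            m = max(logits)                                   # log-sum-exp trick
            lse = m + math.log(sum(math.exp(l - m) for l in logits))
            total += lse - logits[i]                          # -log softmax at the match
        return total / B
    t2v = nll(sim)
    v2t = nll([[sim[j][i] for j in range(B)] for i in range(B)])
    return 0.5 * (t2v + v2t)

def total_loss(sim_ts, sim_tsup, alpha=0.5):
    """L_total = L_ts + alpha * L_sup, with similarity matrices computed
    from t_s and t_sup respectively (alpha is a placeholder value)."""
    return symmetric_ce(sim_ts) + alpha * symmetric_ce(sim_tsup)
```

A well-separated similarity matrix (large diagonal, small off-diagonal) drives the loss toward zero, while a uniform matrix leaves it at roughly log B per direction.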