In the image, a large poster is displayed on the wall, showcasing diagrams, graphs, tables, and text describing a stochastic text-embedding technique for text-video retrieval. A person stands in front of the poster, possibly presenting the information to others. The overall scene suggests a conference presentation intended to teach viewers about modeling text as a "mass" (a stochastic embedding) for text-video retrieval.

Text transcribed from the image:

CVPR, June 17-21, 2024, Seattle, WA (CVPR 2024 Highlight)
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang¹, Guohao Sun¹, Pichao Wang², Dongfang Liu¹, Sohail Dianat¹, Majid Rabbani¹, Raghuveer …
¹Rochester Institute of Technology  ²Amazon Prime Video (the work does not relate to the author's position at Amazon)  ³Arm…
[Paper] [Code] [Supplementary]

Method: T-MASS

Rather than the original text embedding t, we introduce a stochastic text embedding t_s to implement a "text mass" around t in the joint text-video embedding space, using reparameterization:

    t_s = t + R ε,  ε ~ P.

Besides t_s, we identify a support text embedding t_sup, located along the direction from v to t and placed at the surface of the text mass, which serves as a proxy to better control the text mass (both shifting and scaling):

    t_sup = t + ((v − t) / ||v − t||) ||R||.

We introduce two loss terms based on symmetric cross entropy, combined as L_total = L_{t_s} + α L_{sup}.

Dynamics of the radius R: T-MASS learns precise text semantics for the relevant text-video pairs (the smallest ||R||₁ corresponds to the red curve). This is typically observed for correctly retrieved pairs. We provide both desired and failing examples in the supplementary.

Findings: (1) Using the text mass (t_s) yields a performance boost. (2) Performance is insensitive to varied implementations of R. (3) The linear implementation performs best.
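The two embeddings above can be sketched in plain Python. This is a minimal illustration, not the authors' released code: the Gaussian choice for ε ~ P, the function names, and the list-based vectors are our assumptions; the formulas implemented are t_s = t + Rε and t_sup = t + ((v − t)/||v − t||) ||R||.

```python
import math
import random

def stochastic_text_embedding(t, R, rng=None):
    """Sample t_s = t + R * eps with eps ~ N(0, 1) per dimension.
    The reparameterization keeps t and R inside the sampled value,
    so (in a real framework) gradients can flow through them."""
    rng = rng or random.Random(0)
    return [ti + Ri * rng.gauss(0.0, 1.0) for ti, Ri in zip(t, R)]

def support_text_embedding(t, v, R):
    """Place t_sup on the text-mass surface along the t -> v direction:
    t_sup = t + ((v - t) / ||v - t||) * ||R||."""
    d = [vi - ti for vi, ti in zip(v, t)]
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_R = math.sqrt(sum(r * r for r in R))
    return [ti + di / norm_d * norm_R for ti, di in zip(t, d)]
```

Because t_sup sits on the surface of the mass, ||t_sup − t|| equals ||R|| by construction, which is what lets a loss on t_sup both shift and scale the mass.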
Motivation

Description: "a pirate man tries to lift a lantern with his sword while on a boat." The text content can hardly describe the rich semantics of the video in full; accordingly, a single text embedding may be too inexpressive to handle the video information in the joint space.

[Figure: existing embedding (a single text point) vs. proposed embedding (a text mass) in the joint space, alongside the video embeddings of other text-video pairs; video details the text misses are marked: lightless lantern, shocked, w/o hat, before mast, sword.]

Learning in the Joint Space

Text-video feature extraction: a video encoder φ_v(·) extracts frame features [f_1, f_2, ..., f_T], f_i ∈ R^d, followed by feature fusion; a text encoder φ_t(·) gives the text embedding t ∈ R^d. Training: randomly sample t_s = t + R ε, ε ~ P. Testing: choose the closest t_s.

Similarity-Aware Radius Modeling

It is non-trivial to determine an optimal value for the radius of the text mass (i.e., R): an oversized radius improperly encompasses less relevant or irrelevant video embeddings, while a too-small text mass may lack the expressiveness to bridge to the video. We propose a similarity-aware radius:

    S_i = s(t, f_i),  i = 1, ..., T',
    R = exp(S W),  S = [S_1, ..., S_{T'}],  W ∈ R^{T'×d}.

[Figure: example query "women are modeling clothes" with one relevant video (Video 1) and two irrelevant videos (Videos 2 and 3); the learned radii are ||R||₁ = 443.9, 497.0, and 484.9.]

Inference

During inference, we modify the inference pipeline to take advantage of the text mass. For a given text-video pair {t, v}, we repeat the sampling for M trials and select the optimal t_s:

    t_s = argmax_{t_s^i} s(t_s^i, v),  i = 1, ..., M.

Experiment Results

Table 2. Text-to-video comparisons on MSRVTT.

  Method             R@1↑   R@5↑   R@10↑  MdR↓
  CLIP-ViT-B/32
  X-Pool [17]        46.9   72.8   82.2   2.0
  DiffusionRet [26]  49.0   75.2   82.7   2.0
  UATVR [13]         47.5   73.9   83.5   2.0
  TEFAL [21]         49.4   75.9   83.9   2.0
  CLIP-ViP [57]      50.1   74.8   84.6   1.0
  T-MASS (Ours)      50.2   75.3   85.1   1.0
  CLIP-ViT-B/16
  X-Pool [17]        48.2   73.7   82.6   2.0
  UATVR [13]         50.8   76.3   85.5   1.0
  CLIP-ViP [57]      54.2   77.2   84.8   1.0
  T-MASS (Ours)      52.7   77.1   85.6   1.0
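The similarity-aware radius and the M-trial inference selection described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: cosine similarity for s(·,·), a Gaussian for ε ~ P, and the helper names are our assumptions; W has shape T'×d as on the poster.

```python
import math
import random

def cos(a, b):
    """Cosine similarity between two vectors (our choice for s(., .))."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def similarity_aware_radius(t, frames, W):
    """R = exp(S W): S_i = s(t, f_i) collects text-frame similarities,
    W (shape T' x d) maps them to a per-dimension radius; exp keeps R > 0."""
    S = [cos(t, f) for f in frames]                    # S in R^{T'}
    SW = [sum(S[i] * W[i][j] for i in range(len(S)))   # S W in R^d
          for j in range(len(W[0]))]
    return [math.exp(x) for x in SW]

def best_sample(t, v, R, M=10, rng=None):
    """Inference: draw M samples t_s^i = t + R * eps and keep the one
    closest to v, i.e. t_s = argmax_i s(t_s^i, v)."""
    rng = rng or random.Random(0)
    samples = [[ti + Ri * rng.gauss(0.0, 1.0) for ti, Ri in zip(t, R)]
               for _ in range(M)]
    return max(samples, key=lambda ts: cos(ts, v))
```

With W = 0 the radius collapses to all-ones (exp(0) = 1), so the text mass degenerates to a fixed-size ball; training W is what makes the radius adapt to text-frame similarity.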
Table 3. Text-to-video comparisons on DiDeMo.

  Method             R@1↑   R@5↑   R@10↑  MdR↓
  CLIP-ViT-B/32
  X-Pool [17]        44.6   73.2   82.0   2.0
  DiffusionRet [26]  46.7   74.7   82.7   2.0
  UATVR [13]         43.1   71.8   82.3   2.0
  CLIP-ViP [57]      48.6   77.1   84.4   2.0
  T-MASS (Ours)      50.9   77.2   85.3   1.0
  CLIP-ViT-B/16
  X-Pool [17]        47.3   74.8   82.8   2.0
  UATVR [13]         45.8   73.7   83.3   2.0
  CLIP-ViP [57]      50.5   78.4   87.1   1.0
  T-MASS (Ours)      53.3   80.1   87.7   1.0

Table 4. Video-to-text performance (MSRVTT).

  Method             R@1↑   R@5↑   R@10↑  MdR↓  MnR↓
  CLIP-ViT-B/32
  CLIP4Clip [39]     42.7   70.9   80.6   2.0   11.6
  CenterCLIP [60]    42.8   71.7   82.2   2.0   10.9
  X-Pool [17]        44.4   73.3   84.0   2.0   9.0
  TS2-Net [36]       45.3   74.1   83.7   2.0   9.2
  DiffusionRet [26]  47.7   73.8   84.5   2.0   8.8
  UATVR [13]         46.9   73.8   83.8   2.0   8.6
  T-MASS (Ours)      47.7   78.0   86.3   2.0   8.0
  CLIP-ViT-B/16
  X-Pool [17]        46.4   73.9   84.1   2.0   8.4
  TS2-Net [36]       46.6   75.9   84.9   2.0   8.9
  CenterCLIP [60]    47.7   75.0   83.3   2.0   10.2
  UATVR [13]         48.1   76.3   85.4   2.0   8.0
  T-MASS (Ours)      50.9   80.2   88.0   1.0   7.4

[Table: ablation of radius implementations (w/o R, exp(S), exp(SW)) on MSRVTT and DiDeMo retrieval; the row-to-number assignment is partially illegible in the image.]

Table 1. Ablation study of losses and text embedding (MSRVTT). The rows toggle the stochastic text embedding t_s and the support loss L_sup (the checkmark columns are partially illegible); the first row is the baseline and the last is the full model trained with L_total = L_{t_s} + α L_{sup}.

  R@1↑   R@5↑   R@10↑  MdR↓  MnR↓
  46.9   72.8   82.2   2.0   14.3
  48.5   74.8   84.3   2.0   12.3
  49.1   75.7   85.7   2.0   11.9
  50.2   75.3   85.1   1.0   11.9

[Figure: ℓ1-norm of R over training epochs 0-5, for relevant (positive) vs. irrelevant (negative) text-video pairs; the relevant pairs attain the smaller norm.]

[Figure: R@1/R@5/R@10 vs. number of frames (12-24), X-Pool vs. T-MASS (Ours).]

Comparing t_s with t: for the irrelevant pairs, t_s enables smaller cosine similarity values (left side), with Avg(t_s vs. v) = 0.588 against Avg(t vs. v) = 0.635; for the relevant pairs, t_s enables smaller loss values (right side). [Figure: per-query maxima of s(t_s, v) and s(t, v) over the 1,000 query texts in the MSRVTT-1K testing data.]

Retrieval Results: [panel of qualitative retrieval examples]

Contact E-mail: … References: [1] Wang, J. …
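The poster's objective combines two symmetric cross-entropy terms, one driven by t_s and one by t_sup. As a sketch of what one such term computes over a batch, here is a generic symmetric cross entropy on a B×B text-video similarity matrix whose diagonal holds the matched pairs. The temperature tau, the log-sum-exp layout, and all names are our assumptions, not the paper's code.

```python
import math

def symmetric_ce(sim, tau=0.07):
    """Symmetric cross entropy over a B x B similarity matrix:
    the average of text->video and video->text InfoNCE losses,
    with the diagonal entries sim[i][i] as positives."""
    B = len(sim)

    def nce(rows):
        loss = 0.0
        for i in range(B):
            logits = [s / tau for s in rows[i]]
            m = max(logits)  # stabilize log-sum-exp
            lse = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += lse - logits[i]  # -log softmax at the positive
        return loss / B

    t2v = nce(sim)
    v2t = nce([[sim[j][i] for j in range(B)] for i in range(B)])
    return 0.5 * (t2v + v2t)
```

A matrix with high similarity on the diagonal scores a lower loss than one whose positives sit off-diagonal, which is the behavior the "t_s enables smaller loss values" panel reports for relevant pairs.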