Caption: 

A detailed research poster presented at CVPR 2024, titled "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval," by researchers from Rochester Institute of Technology (RIT), Amazon Prime Video, and Army Research Laboratory. The poster highlights the T-MASS method, which involves Text-Video Feature Extraction and Similarity-Aware Radius Modeling to enhance text-video retrieval. It includes a comprehensive presentation of the motivation behind the research, methodologies, and experimental results across various benchmarks like MSRVTT, LSMDC, DiDeMo, and VATEX. The left section delves into the motivation and learning in the joint space, while the center focuses on the T-MASS methodology, and the right section showcases experimental comparisons with other methods in tabular form. Techniques such as radius modeling and detailed visual graphs are also included to clarify the innovative approach and its effectiveness.
Text transcribed from the image:
CVPR, JUNE 17-21, 2024, SEATTLE, WA
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
CVPR 2024 Highlight
[Logos: DEVCOM Army Research Laboratory; Rochester Institute of Technology]
Jiamian Wang Guohao Sun¹ Pichao Wang² Dongfang Liu¹ Sohail Dianat¹ Majid Rabbani¹ Raghuveer Rao³ Zhiqi (name truncated in image)
¹Rochester Institute of Technology  ²Amazon Prime Video (The work does not relate to the author's position at Amazon)  ³Army Research Laboratory
[QR codes: Paper, Code, Supplementary]

❖ Motivation
❖ Method: T-MASS

Text-Video Feature Extraction
Video Encoder φ_v(·): frame features [f_1, f_2, ..., f_T], f_i ∈ R^d
[Diagram: Video 1 (relevant), Video 2 (irrelevant)]
Method diagram (center column):
Text-video Embedding: Feature Fusion ψ(·) over Video 1 (relevant), Video 2 (irrelevant), and Video 3 (irrelevant)
Similarity: S_i = s(f_i, t), i = 1, ..., T
Similarity-aware Radius R ∈ R^d, with W ∈ R^{T'×d} and R = exp(S W); in the illustration, ‖R‖₁ = 443.9 for the relevant video vs. ‖R‖₁ = 497.0 and ‖R‖₁ = 484.9 for the irrelevant videos
Text Encoder φ_t(·): t ∈ R^d for the query "women are modeling clothes"
Text Mass: t_s = t + R · ε, ε ~ P
Training: randomly sample t_s.  Testing: choose the closest t_s.

❖ Experiment Results

Table 2. Text-to-video comparisons on MSRVTT and LSMDC. ("-" marks cells that are cut off or not legible; the rightmost LSMDC column is truncated at the poster edge, so those values are kept as transcribed.)

                    MSRVTT Retrieval                     LSMDC
Method              R@1↑  R@5↑  R@10↑  MdR↓  MnR↓    R@1↑  R@5↑  R@10↑
CLIP-ViT-B/32
X-Pool [17]         46.9  72.8  82.2   2.0   14.3    25.2  43.7  53
DiffusionRet [26]   49.0  75.2  82.7   2.0   12.1    24.4  43.1  54
UATVR [13]          47.5  73.9  83.5   2.0   12.3    -     -     -
TEFAL [21]          49.4  75.9  83.9   2.0   12.0    26.8  46.1  56
CLIP-VIP [57]       50.1  74.8  84.6   1.0   -       25.6  45.3  54
T-MASS (Ours)       50.2  75.3  85.1   1.0   11.9    28.9  48.2  57
CLIP-ViT-B/16
X-Pool [17]         48.2  73.7  82.6   2.0   12.7    26.1  46.8  56.
UATVR [13]          50.8  76.3  85.5   1.0   12.4    -     -     -
CLIP-VIP [57]       54.2  77.2  84.8   1.0   -       29.4  50.6  59.
T-MASS (Ours)       52.7  77.1  85.6   1.0   10.5    30.3  52.2  61.
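The text-video feature extraction and similarity pipeline described in the method diagram can be sketched minimally in Python/NumPy. This is only an illustration under stated assumptions: toy dimensions, random stand-ins for the encoders φ_v and φ_t, mean pooling assumed for the fusion module ψ, and cosine similarity assumed for s.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 12, 8   # toy frame count and embedding dimension

def fuse_frames(frames):
    # Feature fusion psi(.): mean pooling, a minimal stand-in for the
    # poster's fusion module.
    return frames.mean(axis=0)

def cosine(a, b):
    # Cosine similarity, assumed form of s(., .)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for the video encoder phi_v (frame features
# [f_1, ..., f_T]) and the text encoder phi_t (text embedding t).
frames = rng.standard_normal((T, d))
t = rng.standard_normal(d)

v = fuse_frames(frames)   # fused video embedding
score = cosine(t, v)      # similarity s(t, v) used to rank videos
```

In an actual retrieval system the score would be computed against every candidate video and the list sorted by it; here a single pair suffices to show the data flow.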
[Example frame with annotations: Lightless lantern ✗, Shocked, w/o hat, Before mast ✗, Sword]
Description: "a pirate man tries to lift a lantern with his sword while on a boat"
[Diagram: Existing Embedding (a single text point) vs. Proposed Embedding (a text mass), shown against video embeddings and other video pairs in the joint space, colored by similarity]

The text content can hardly describe the redundant semantics of the video in full. Accordingly, a single text embedding may be less expressive for handling the video information in the joint space.
❖ Learning in Joint Space

Rather than the original text embedding t, here we introduce a stochastic text embedding t_s to implement a text mass using reparameterization:

    t_s = t + R · ε,  ε ~ P.

[Diagram: text mass around t, support text embedding, video embedding, stochastic text embedding]

Besides t_s, we identify a support text embedding t_sup, located along the direction from v to t and placed at the surface of the text mass, which serves as a proxy to better control the text mass (both shifting and scaling):

    t_sup = t + (v − t)/‖v − t‖ · R.
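The reparameterization and the support embedding can be sketched as below. Assumptions are labeled in the comments: the noise distribution P is taken to be a standard normal (the poster only writes ε ~ P), the radius R ∈ R^d is applied elementwise, and the embeddings are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

def stochastic_text_embedding(t, R):
    # t_s = t + R * eps (reparameterization); eps ~ N(0, I) is an
    # assumption here -- the poster only specifies eps ~ P.
    eps = rng.standard_normal(t.shape)
    return t + R * eps

def support_text_embedding(t, v, R):
    # t_sup = t + (v - t) / ||v - t|| * R: a point on the text-mass
    # surface along the direction between t and v (R applied
    # elementwise, an assumption for R in R^d).
    direction = (v - t) / np.linalg.norm(v - t)
    return t + direction * R

t = rng.standard_normal(d)   # text embedding (random stand-in for phi_t)
v = rng.standard_normal(d)   # fused video embedding (random stand-in)
R = np.full(d, 0.1)          # toy radius vector, R in R^d

t_s = stochastic_text_embedding(t, R)
t_sup = support_text_embedding(t, v, R)
```

Because t_sup is offset from t along the t-to-v direction, it always lies on the side of the text mass facing the video embedding, which is what lets it act as a proxy for shifting and scaling the mass.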
We introduce two loss terms based on symmetric cross entropy: L_s, computed between t_s and v, and L_sup, computed between t_sup and v (the exact formulas are not legible in the image).
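A minimal NumPy sketch of a symmetric cross entropy over a batch similarity matrix is given below. This is the standard CLIP-style form, shown as an illustration only: the temperature scaling and the poster's exact formulation are omitted, since they are not legible in the transcription.

```python
import numpy as np

def symmetric_cross_entropy(sim):
    # sim[i, j] = s(t_i, v_j); matched text-video pairs lie on the
    # diagonal. Averages the text-to-video and video-to-text
    # cross-entropy terms (temperature omitted for brevity).
    def ce_rows(m):
        m = m - m.max(axis=1, keepdims=True)              # stability
        logprob = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logprob))
    return 0.5 * (ce_rows(sim) + ce_rows(sim.T))

# Toy 2x2 batch: diagonal entries (matched pairs) dominate.
sim = np.array([[0.9, 0.1],
                [0.2, 0.8]])
loss = symmetric_cross_entropy(sim)
```

The same function would be called once with similarities built from t_s (giving L_s) and once with similarities built from t_sup (giving L_sup).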
It is non-trivial to determine an optimal value for the radius of the text mass (i.e., R): an oversized radius improperly encompasses less relevant or irrelevant video embeddings, while a too-small text mass may lack the expressiveness to bridge the video. We propose a similarity-aware radius:

    S_i = s(t, f_i), i = 1, ..., T',  S = [S_1, ..., S_{T'}],  R = exp(S W).
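The similarity-aware radius can be sketched as follows. Assumptions: cosine similarity stands in for s, and W is a random matrix standing in for the learned linear weights.

```python
import numpy as np

rng = np.random.default_rng(0)
T_prime, d = 12, 8   # toy frame count and embedding dimension

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_aware_radius(t, frames, W):
    # S_i = s(t, f_i) collects text-frame similarities; a linear map W
    # followed by exp keeps every entry of R = exp(S W) positive.
    S = np.array([cosine(t, f) for f in frames])   # S in R^{T'}
    return np.exp(S @ W)                           # R in R^d

t = rng.standard_normal(d)
frames = rng.standard_normal((T_prime, d))
W = 0.1 * rng.standard_normal((T_prime, d))   # stand-in for learned W
R = similarity_aware_radius(t, frames, W)
```

The exp keeps the radius strictly positive, so the text mass never collapses, while the dependence on S lets a well-matched text-video pair shrink R.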
During inference, we modify the pipeline to take advantage of the text mass. For a given text-video pair {t, v}, we repeat the sampling for M trials and select the optimal t_s:

    t_s = argmax_{t_s^i} s(t_s^i, v),  i = 1, ..., M.
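The test-time rule above can be sketched as below, assuming Gaussian noise for ε (the poster leaves the distribution as a generic P) and cosine similarity for s.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 8, 20   # toy embedding dimension and number of trials

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_stochastic_embedding(t, v, R, M):
    # Draw M samples t_s^i = t + R * eps_i (eps ~ N(0, I) is an
    # assumption) and keep the sample most similar to v.
    samples = [t + R * rng.standard_normal(t.shape) for _ in range(M)]
    return max(samples, key=lambda ts: cosine(ts, v))

t = rng.standard_normal(d)   # text embedding (random stand-in)
v = rng.standard_normal(d)   # video embedding (random stand-in)
R = np.full(d, 0.1)
ts_best = best_stochastic_embedding(t, v, R, M)
```

Setting R to zero recovers the deterministic embedding t, which matches the "w/o sampling" row of the sampling-trials ablation.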
❖ Similarity-Aware Radius Modeling

(1) Using the text mass (t_s) can result in a performance boost. (2) R is insensitive to varied implementations. (3) The linear form performs best.

Dynamics of radius R: T-MASS learns precise text semantics for the relevant text-video pairs (the smallest ‖R‖₁ corresponds to the red curve). This is typically observed on correctly retrieved pairs. We provide both desired and failing examples in the supplementary.
Table 3. Text-to-video comparisons on DiDeMo and VATEX. ("-" marks cells that are cut off or not legible in the image.)

                    DiDeMo Retrieval                     VATEX Retrieval
Method              R@1↑  R@5↑  R@10↑  MdR↓  MnR↓    R@1↑  R@5↑  R@10↑
CLIP-ViT-B/32
X-Pool [17]         44.6  73.2  82.0   2.0   15.4    60.0  90.0  95.0
DiffusionRet [26]   46.7  74.7  82.7   2.0   14.3    -     -     -
UATVR [13]          43.1  71.8  82.3   2.0   15.1    61.3  91.0  95.
CLIP-VIP [57]       48.6  77.1  84.4   2.0   -       -     -     -
T-MASS (Ours)       50.9  77.2  85.3   1.0   12.1    63.0  92.3  96.4
CLIP-ViT-B/16
X-Pool [17]         47.3  74.8  82.8   2.0   14.2    62.6  91.7  96.0
UATVR [13]          45.8  73.7  83.3   2.0   13.5    64.5  92.6  96.8
CLIP-VIP [57]       50.5  78.4  87.1   1.0   -       -     -     -
T-MASS (Ours)       53.3  80.1  87.7   1.0   9.8     65.6  93.9  97.2
Radius R ablation (MSRVTT and DiDeMo retrieval). Two row labels are both printed exp(S); the distinction between them is not legible in the image, so they are transcribed as printed.

                    MSRVTT Retrieval                         DiDeMo Retrieval
Radius R            R@1↑  R@5↑  R@10↑  MdR↓  MnR↓    R@1↑  R@5↑  R@10↑  MdR↓  MnR↓
w/o R               46.9  72.8  82.2   2.0   14.3    44.6  73.2  82.0   2.0   15.4
exp(S)              48.7  74.7  83.7   2.0   12.7    48.0  75.4  85.0   2.0   13.0
exp(S)              49.2  75.7  84.7   2.0   11.7    49.7  75.8  85.3   2.0   12.6
exp(SW)             49.1  75.7  85.7   2.0   11.9    49.8  78.1  86.0   2.0   11.8

Table 4. Video-to-text performance (MSRVTT).

Method              R@1↑  R@5↑  R@10↑  MdR↓  MnR↓
CLIP-ViT-B/32
CLIP4Clip [39]      42.7  70.9  80.6   2.0   11.6
CenterCLIP [60]     42.8  71.7  82.2   2.0   10.9
X-Pool [17]         44.4  73.3  84.0   2.0   9.0
TS2-Net [36]        45.3  74.1  83.7   2.0   9.2
DiffusionRet [26]   47.7  73.8  84.5   2.0   8.8
UATVR [13]          46.9  73.8  83.8   2.0   8.6
T-MASS (Ours)       47.7  78.0  86.3   2.0   8.0
CLIP-ViT-B/16
X-Pool [17]         46.4  73.9  84.1   2.0   8.4
TS2-Net [36]        46.6  75.9  84.9   2.0   8.9
CenterCLIP [60]     47.7  75.0  83.3   2.0   10.2
UATVR [13]          48.1  76.3  85.4   2.0   8.0
T-MASS (Ours)       50.9  80.2  88.0   1.0   7.4

Table 5. Text-to-video (caption truncated; dataset name not legible in the image).

Method              R@1↑  R@5↑
CLIP-ViT-B/32
ClipBERT [28]       6.7   17.3
CLIP4Clip [39]      9.9   27.1
X-Pool [17]         11.2  28.3
T-MASS (Ours)       14.2  36.2
CLIP-ViT-B/16
CLIP4Clip [39]      16.0  38.2
X-Pool [17]         20.7  42.5
T-MASS (Ours)       26.7  51.7

Table 6. Stochastic sampling trials (caption truncated in the image).

#Trials (M)         R@1↑  R@5↑
w/o sampling        44.4  72.4
5                   46.8  74.7
10                  50.0  75.2
20                  50.2  75.3
[Plots: retrieval results (R@1, R@5, R@10) vs. #Frames, comparing X-Pool and T-MASS (Ours); R@1 and R@5 vs. the L1-norm of R; the L1-norm of R (roughly 420 to 520) over training epochs 0 to 5 for the relevant text-video pair vs. positive and negative text-video pairs]
Compare t_s with t (text-video cosine similarity): for the irrelevant pairs, t_s enables smaller cosine similarity values (left side); for the relevant pairs, t_s enables smaller loss values (right side).
[Histogram: text-video cosine similarity for irrelevant vs. relevant text-video pairs over roughly 400 queries, with maxima marked for t vs. v and t_s vs. v; average values 0.588 and 0.635 (their assignment to t vs. t_s is not fully legible)]

Loss terms: L_s (formula not legible in the image) and L_sup over t_sup, combined as L_total = L_s + α · L_sup.

Table 1. Ablation study of losses and text embedding. (The ✓/✗ column assignments are only partially legible; marks are transcribed as printed.)

Configuration         R@1↑  R@5↑  R@10↑  MdR↓  MnR↓
✗ ✗ ✗                 46.9  72.8  82.2   2.0   14.3
✓ ✓ ✓ ✗               48.5  74.8  84.3   2.0   12.3
(marks not legible)   49.1  75.7  85.7   2.0   11.9
(marks not legible)   50.2  75.3  85.1   1.0   11.9