A detailed research poster titled "Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval" is displayed at a conference. The poster was presented at CVPR 2024 in Seattle, WA. The work is a collaboration between researchers from the Rochester Institute of Technology, Amazon Prime Video, and the U.S. Army Research Laboratory. Key contributors include Jiamian Wang, Guohao Sun, Pichao Wang, Dongfang Liu, Sohail Dianat, Majid Rabbani, Raghuveer Rao, and Zhiqiang Tao. The poster is organized into multiple sections:

- **Motivation**: Discussing the challenges in linking text to video data.
- **Learning in Joint Space**: Explaining the integration of text and video embeddings in a common space.
- **Method: T-MASS**: Detailing the text-video feature extraction and stochastic radius sampling methodology.
- **Experiment Results**: Presenting quantitative comparisons and performance metrics across different methods and datasets.
- **Similarity-Aware Radius Modeling**: An approach for enhancing text-video matching precision.

Results are shown on benchmarks including MSRVTT, LSMDC, DiDeMo, and VATEX, with tables and graphs illustrating the significant improvements offered by the approach. The poster includes QR codes to access the paper, code, and supplementary materials. Contact information for further inquiries and related references is provided at the bottom. The CVPR 2024 Highlight designation indicates the significant impact of this research in the field.

Text transcribed from the image:

JUNE 17-21, 2024 · CVPR · SEATTLE, WA

**Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval** (CVPR 2024 Highlight)

Jiamian Wang¹, Guohao Sun¹, Pichao Wang², Dongfang Liu¹, Sohail Dianat¹, Majid Rabbani¹, Raghuveer Rao³, Zhiqiang Tao¹

¹Rochester Institute of Technology · ²Amazon Prime Video (the work does not relate to the author's position at Amazon) · ³DEVCOM Army Research Laboratory

[QR codes: Paper · Code · Supplementary]

**Motivation**

[Figure: the query "women are modeling clothes" is compared against Video 1 (relevant) and Videos 2 and 3 (irrelevant) in the text-video embedding space, with per-video similarity curves.]

The text content can hardly describe the redundant semantics of a video in full. Accordingly, a single text embedding may be less expressive in handling the video information in the joint space.

**Learning in Joint Space**

[Figure: existing embeddings treat a text as a single test point in the joint space, whereas the proposed embedding treats it as a test mass; video-text embedding pairs and their similarities are shown alongside. For the description "a pirate man tries to lift a lantern with his sword while on a boat", semantic variants such as "lightless lantern", "shocked w/o hat", "before mast", and "sword" all fall within the text mass.]

**Method: T-MASS**

*Text-Video Feature Extraction.* A video encoder $\phi_v(\cdot)$ extracts frame features $[f_1, f_2, \ldots, f_{T'}]$, which a feature fusion module aggregates into a video embedding $v \in \mathbb{R}^d$; a text encoder $\phi_t(\cdot)$ produces the text embedding $t \in \mathbb{R}^d$.

*Stochastic Text Embedding.* Rather than the original text embedding $t$, we introduce a stochastic embedding $t_s$ to represent the text as a text mass in the joint space. By reparameterization,

$$t_s = t + R \cdot \epsilon, \quad \epsilon \sim P,$$

where $R \in \mathbb{R}^d$ is a similarity-aware radius (defined below). Training: randomly sample $t_s$. Testing: sample $t_s = t + R\epsilon$, $\epsilon \sim P$, and choose the sample closest to the video.
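Below is a minimal PyTorch sketch of this stochastic embedding. The module name, the use of cosine similarity for $s(\cdot,\cdot)$, and the choice of a Gaussian for $P$ are assumptions; the poster only specifies $t_s = t + R\epsilon$ with the similarity-aware radius $R = \exp(SW)$ built from the text-frame similarities $S_i = s(t, f_i)$ defined in the next section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticTextEmbedding(nn.Module):
    """Sketch of t_s = t + R * eps with a similarity-aware radius R = exp(SW)."""

    def __init__(self, embed_dim: int, num_frames: int):
        super().__init__()
        # Learnable linear map W from the T' similarity scores to a per-dimension radius.
        self.W = nn.Linear(num_frames, embed_dim, bias=False)

    def forward(self, t: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # t: (B, d) text embeddings; frames: (B, T', d) frame features.
        S = F.cosine_similarity(t.unsqueeze(1), frames, dim=-1)  # S_i = s(t, f_i), (B, T')
        R = torch.exp(self.W(S))                                 # R = exp(SW), (B, d)
        eps = torch.randn_like(t)                                # eps ~ P (Gaussian assumed)
        return t + R * eps                                       # t_s = t + R * eps
```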
*Similarity-Aware Radius.* It is non-trivial to determine an optimal value for the radius of the text mass (i.e., $R$): an oversized radius improperly encompasses less relevant or irrelevant video embeddings, while a too-small text mass may lack the expressiveness to bridge the video. We propose a similarity-aware radius,

$$S_i = s(t, f_i), \quad i = 1, \ldots, T', \qquad R = \exp(SW), \quad S = [S_1, \ldots, S_{T'}].$$

*Support Text Embedding.* We identify a support text embedding $t_{\mathrm{sup}}$ along the direction from $t$ to $v$, placed on the surface of the text mass, which serves as a proxy to control the text mass (both shifting and scaling):

$$t_{\mathrm{sup}} = t + \frac{v - t}{\|v - t\|_2} \cdot \|R\|_2.$$

*Training Objective.* The loss is based on the symmetric cross entropy and combines both embeddings, $\mathcal{L} = \mathcal{L}_{t_s} + \alpha \mathcal{L}_{t_{\mathrm{sup}}}$.

*Inference.* During inference, we modify the pipeline to take advantage of the text mass. For a given text-video pair $\{t, v\}$, we repeat sampling for $M$ trials and select the optimal $t_s$ (see the sketch below):

$$t_s = t_s^{i^*}, \quad i^* = \arg\max_i s(t_s^i, v), \quad i = 1, \ldots, M.$$
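A minimal sketch of the support embedding, the combined loss, and the $M$-trial inference rule follows. The InfoNCE-style form of the symmetric cross entropy, the temperature `tau`, and all function names are assumptions; `sampler` stands in for a module like the `StochasticTextEmbedding` sketch above.

```python
import torch
import torch.nn.functional as F


def support_embedding(t, v, R):
    """t_sup = t + (v - t) / ||v - t||_2 * ||R||_2 (on the text-mass surface)."""
    direction = (v - t) / (v - t).norm(dim=-1, keepdim=True)
    return t + direction * R.norm(dim=-1, keepdim=True)


def symmetric_ce(text, video, tau=0.05):
    """Symmetric cross entropy over a batch of matched text-video pairs."""
    logits = F.normalize(text, dim=-1) @ F.normalize(video, dim=-1).T / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))


def t_mass_loss(t_s, t_sup, v, alpha=1.0):
    """L = L_{t_s} + alpha * L_{t_sup}."""
    return symmetric_ce(t_s, v) + alpha * symmetric_ce(t_sup, v)


@torch.no_grad()
def select_t_s(sampler, t, frames, v, M=20):
    """Inference: draw M samples of t_s and keep the one most similar to v."""
    candidates = torch.stack([sampler(t, frames) for _ in range(M)])  # (M, B, d)
    sims = F.cosine_similarity(candidates, v.unsqueeze(0), dim=-1)    # (M, B)
    best = sims.argmax(dim=0)                                         # (B,)
    return candidates[best, torch.arange(t.size(0), device=t.device)]
```

Table 6 below reports how retrieval quality varies with the number of sampling trials $M$.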
**Experiment Results**

Table 2. Text-to-video comparisons on MSRVTT (left) and LSMDC (right). Bold denotes the best.

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *CLIP-ViT-B/32* | | | | | | | | | | |
| X-Pool [17] | 46.9 | 72.8 | 82.2 | 2.0 | 14.3 | 25.2 | 43.7 | 53.5 | 8.0 | 53.2 |
| DiffusionRet [26] | 49.0 | 75.2 | 82.7 | 2.0 | 12.1 | 24.4 | 43.1 | 54.3 | 8.0 | 40.7 |
| UATVR [13] | 47.5 | 73.9 | 83.5 | 2.0 | 12.3 | – | – | – | – | – |
| TEFAL [21] | 49.4 | 75.9 | 83.9 | 2.0 | 12.0 | 26.8 | 46.1 | 56.5 | 7.0 | 44.4 |
| CLIP-ViP [57] | 50.1 | 74.8 | 84.6 | 1.0 | – | 25.6 | 45.3 | 54.4 | 8.0 | – |
| T-MASS (Ours) | 50.2 | 75.3 | 85.1 | 1.0 | 11.9 | 28.9 | 48.2 | 57.6 | 6.0 | 43.3 |
| *CLIP-ViT-B/16* | | | | | | | | | | |
| X-Pool [17] | 48.2 | 73.7 | 82.6 | 2.0 | 12.7 | 26.1 | 46.8 | 56.7 | 7.0 | 47.3 |
| UATVR [13] | 50.8 | 76.3 | 85.5 | 1.0 | 12.4 | – | – | – | – | – |
| CLIP-ViP [57] | 54.2 | 77.2 | 84.8 | 1.0 | – | 29.4 | 50.6 | 59.0 | 5.0 | – |
| T-MASS (Ours) | 52.7 | 77.1 | 85.6 | 1.0 | 10.5 | 30.3 | 52.2 | 61.3 | 5.0 | 40.1 |

Table 3. Text-to-video comparisons on DiDeMo (left) and VATEX (right). Bold denotes the best.

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| *CLIP-ViT-B/32* | | | | | | | | | | |
| X-Pool [17] | 44.6 | 73.2 | 82.0 | 2.0 | 15.4 | 60.0 | 90.0 | 95.0 | 1.0 | 3.8 |
| DiffusionRet [26] | 46.7 | 74.7 | 82.7 | 2.0 | 14.3 | – | – | – | – | – |
| UATVR [13] | 43.1 | 71.8 | 82.3 | 2.0 | 15.1 | 61.3 | 91.0 | 95.6 | 1.0 | 3.3 |
| CLIP-ViP [57] | 48.6 | 77.1 | 84.4 | 2.0 | – | – | – | – | – | – |
| T-MASS (Ours) | 50.9 | 77.2 | 85.3 | 1.0 | 12.1 | 63.0 | 92.3 | 96.4 | 1.0 | 3.2 |
| *CLIP-ViT-B/16* | | | | | | | | | | |
| X-Pool [17] | 47.3 | 74.8 | 82.8 | 2.0 | 14.2 | 62.6 | 91.7 | 96.0 | 1.0 | 3.4 |
| UATVR [13] | 45.8 | 73.7 | 83.3 | 2.0 | 13.5 | 64.5 | 92.6 | 96.8 | 1.0 | 2.8 |
| CLIP-ViP [57] | 50.5 | 78.4 | 87.1 | 1.0 | – | – | – | – | – | – |
| T-MASS (Ours) | 53.3 | 80.1 | 87.7 | 1.0 | 9.8 | 65.6 | 93.9 | 97.2 | 1.0 | 2.7 |

Table 4. Video-to-text performance (MSRVTT).

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|
| *CLIP-ViT-B/32* | | | | | |
| CLIP4Clip [39] | 42.7 | 70.9 | 80.6 | 2.0 | 11.6 |
| CenterCLIP [60] | 42.8 | 71.7 | 82.2 | 2.0 | 10.9 |
| X-Pool [17] | 44.4 | 73.3 | 84.0 | 2.0 | 9.0 |
| TS2-Net [36] | 45.3 | 74.1 | 83.7 | 2.0 | 9.2 |
| DiffusionRet [26] | 47.7 | 73.8 | 84.5 | 2.0 | 8.8 |
| UATVR [13] | 46.9 | 73.8 | 83.8 | 2.0 | 8.6 |
| T-MASS (Ours) | 47.7 | 78.0 | 86.3 | 2.0 | 8.0 |
| *CLIP-ViT-B/16* | | | | | |
| X-Pool [17] | 46.4 | 73.9 | 84.1 | 2.0 | 8.4 |
| TS2-Net [36] | 46.6 | 75.9 | 84.9 | 2.0 | 8.9 |
| CenterCLIP [60] | 47.7 | 75.0 | 83.3 | 2.0 | 10.2 |
| UATVR [13] | 48.1 | 76.3 | 85.4 | 2.0 | 8.0 |
| T-MASS (Ours) | 50.9 | 80.2 | 88.0 | 1.0 | 7.4 |

Table 5. Text-to-video comparisons on Charades.

| Method | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|
| CLIP4Clip [39] | 16.0 | 38.2 | 48.5 | 12.0 | 54.1 |
| X-Pool [17] | 20.7 | 42.5 | 53.5 | 9.0 | 47.4 |
| T-MASS (Ours) | 26.7 | 51.7 | 63.9 | 5.0 | 30.0 |

Table 6. Stochastic sampling trials (MSRVTT).

| #Trials (M) | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|
| w/o sampling | 44.4 | 72.4 | 81.9 | 2.0 | 13.1 |
| 5 | 46.8 | 74.7 | 84.0 | 2.0 | 12.5 |
| 10 | 50.0 | 75.2 | 84.1 | 2.0 | 12.3 |
| 20 | 50.2 | 75.3 | 85.1 | 1.0 | 11.9 |

[Figure: R@1/R@5/R@10 over training epochs, versus the number of frames (12-24), and versus the loss weight $\alpha$ (0.5-1.5), comparing X-Pool and T-MASS (Ours).]

**Similarity-Aware Radius Modeling**

(1) Using the text mass ($t_s$) can result in a performance boost. (2) $R$ is insensitive to varied implementations. (3) The linear implementation performs best.

| Radius R | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ | R@1↑ | R@5↑ | R@10↑ | MdR↓ | MnR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o R | 46.9 | 72.8 | 82.2 | 2.0 | 14.3 | 44.6 | 73.2 | 82.0 | 2.0 | 15.4 |
| exp(S) | 48.7 | 74.7 | 83.7 | 2.0 | 12.7 | 48.0 | 75.4 | 85.0 | 2.0 | 13.0 |
| exp(S)W | 49.2 | 75.7 | 84.7 | 2.0 | 11.7 | 49.7 | 75.8 | 85.3 | 2.0 | 12.6 |
| exp(SW) | 49.1 | 75.7 | 85.7 | 2.0 | 11.9 | 49.8 | 78.1 | 86.0 | 2.0 | 11.8 |

(MSRVTT Retrieval left, DiDeMo Retrieval right.)

Dynamics of radius $R$: T-MASS learns precise text semantics for the relevant text-video pairs (the smallest $\|R\|_1$ corresponds to the red curve); this is typically observed for correctly retrieved pairs. We provide both desired and failing examples in the supplementary.

[Figure: $\|R\|_1$ over training epochs (roughly 420-520, with curve values such as 443.9, 484.9, and 497.0) for relevant/positive versus irrelevant/negative text-video pairs.]

Comparing $t_s$ with $t$ on the MSRVTT-1K testing set: for the irrelevant pairs, $t_s$ enables smaller cosine similarity values (left side); for the relevant pairs, $t_s$ enables smaller loss values (right side).

[Figure: per-query cosine similarity on the MSRVTT-1K testing set; Avg($t_s$ vs. $v$): 0.588, Avg($t$ vs. $v$): 0.635.]

Contact: E-mail: jw4905@rit.edu · Website: https://jiamian-wang.github.io/

[1] Wang, J., Zhang, Y., Yuan, X., Meng, Z., & Tao, Z. (2022). Modeling Mask Uncertainty in Hyperspectral Image Reconstruction. In ECCV 2022 (Oral).
[2] Wang, J., Wang, H., Zhang, Y., Fu, Y., & Tao, Z. (2023). Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution. In ICCV 2023.
[3] Wang, J., Wu, …