In the image, a large poster is displayed on the wall, showcasing diagrams, graphs, tables, and text describing a stochastic text-embedding technique for text-video retrieval. A person stands in front of the poster, possibly presenting the information to others. The overall scene suggests a conference presentation intended to teach viewers about modeling text as a "mass" (a stochastic embedding) for text-video retrieval.

Text transcribed from the image:

CVPR, June 17-21, 2024, Seattle, WA (CVPR 2024 Highlight)
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang¹, Guohao Sun¹, Pichao Wang², Dongfang Liu¹, Sohail Dianat¹, Majid Rabbani¹, Raghuveer …
¹Rochester Institute of Technology  ²Amazon Prime Video (the work does not relate to the author's position at Amazon)  ³Arm…
[Paper] [Code] [Supplementary]

Method: T-MASS

Rather than the original text embedding t, we introduce a stochastic text embedding t_s to implement a "text mass" around t in the joint text-video embedding space, using reparameterization:

    t_s = t + R ε,  ε ~ P.

Besides t_s, we identify a support text embedding t_sup, located along the direction from v to t and placed at the surface of the text mass, which serves as a proxy to better control the text mass (both shifting and scaling):

    t_sup = t + ((v − t) / ||v − t||) ||R||.

We introduce two loss terms based on symmetric cross entropy, combined as L_total = L_{t_s} + α L_{sup}.

Dynamics of the radius R: T-MASS learns precise text semantics for the relevant text-video pairs (the smallest ||R||₁ corresponds to the red curve). This is typically observed for correctly retrieved pairs. We provide both desired and failing examples in the supplementary.

Findings: (1) Using the text mass (t_s) yields a performance boost. (2) Performance is insensitive to varied implementations of R. (3) The linear implementation performs best.
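The two embeddings above can be sketched in plain Python. This is a minimal illustration, not the authors' released code: the Gaussian choice for ε ~ P, the function names, and the list-based vectors are our assumptions; the formulas implemented are t_s = t + Rε and t_sup = t + ((v − t)/||v − t||) ||R||.

```python
import math
import random

def stochastic_text_embedding(t, R, rng=None):
    """Sample t_s = t + R * eps with eps ~ N(0, 1) per dimension.
    The reparameterization keeps t and R inside the sampled value,
    so (in a real framework) gradients can flow through them."""
    rng = rng or random.Random(0)
    return [ti + Ri * rng.gauss(0.0, 1.0) for ti, Ri in zip(t, R)]

def support_text_embedding(t, v, R):
    """Place t_sup on the text-mass surface along the t -> v direction:
    t_sup = t + ((v - t) / ||v - t||) * ||R||."""
    d = [vi - ti for vi, ti in zip(v, t)]
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_R = math.sqrt(sum(r * r for r in R))
    return [ti + di / norm_d * norm_R for ti, di in zip(t, d)]
```

Because t_sup sits on the surface of the mass, ||t_sup − t|| equals ||R|| by construction, which is what lets a loss on t_sup both shift and scale the mass.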
Motivation

Description: "a pirate man tries to lift a lantern with his sword while on a boat." The text content can hardly describe the rich semantics of the video in full; accordingly, a single text embedding may be too inexpressive to handle the video information in the joint space.

[Figure: existing embedding (a single text point) vs. proposed embedding (a text mass) in the joint space, alongside the video embeddings of other text-video pairs; video details the text misses are marked: lightless lantern, shocked, w/o hat, before mast, sword.]

Learning in the Joint Space

Text-video feature extraction: a video encoder φ_v(·) extracts frame features [f_1, f_2, ..., f_T], f_i ∈ R^d, followed by feature fusion; a text encoder φ_t(·) gives the text embedding t ∈ R^d. Training: randomly sample t_s = t + R ε, ε ~ P. Testing: choose the closest t_s.

Similarity-Aware Radius Modeling

It is non-trivial to determine an optimal value for the radius of the text mass (i.e., R): an oversized radius improperly encompasses less relevant or irrelevant video embeddings, while a too-small text mass may lack the expressiveness to bridge to the video. We propose a similarity-aware radius:

    S_i = s(t, f_i),  i = 1, ..., T',
    R = exp(S W),  S = [S_1, ..., S_{T'}],  W ∈ R^{T'×d}.

[Figure: example query "women are modeling clothes" with one relevant video (Video 1) and two irrelevant videos (Videos 2 and 3); the learned radii are ||R||₁ = 443.9, 497.0, and 484.9.]

Inference

During inference, we modify the inference pipeline to take advantage of the text mass. For a given text-video pair {t, v}, we repeat the sampling for M trials and select the optimal t_s:

    t_s = argmax_{t_s^i} s(t_s^i, v),  i = 1, ..., M.

Experiment Results

Table 2. Text-to-video comparisons on MSRVTT.

  Method             R@1↑   R@5↑   R@10↑  MdR↓
  CLIP-ViT-B/32
  X-Pool [17]        46.9   72.8   82.2   2.0
  DiffusionRet [26]  49.0   75.2   82.7   2.0
  UATVR [13]         47.5   73.9   83.5   2.0
  TEFAL [21]         49.4   75.9   83.9   2.0
  CLIP-ViP [57]      50.1   74.8   84.6   1.0
  T-MASS (Ours)      50.2   75.3   85.1   1.0
  CLIP-ViT-B/16
  X-Pool [17]        48.2   73.7   82.6   2.0
  UATVR [13]         50.8   76.3   85.5   1.0
  CLIP-ViP [57]      54.2   77.2   84.8   1.0
  T-MASS (Ours)      52.7   77.1   85.6   1.0
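The similarity-aware radius and the M-trial inference selection described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: cosine similarity for s(·,·), a Gaussian for ε ~ P, and the helper names are our assumptions; W has shape T'×d as on the poster.

```python
import math
import random

def cos(a, b):
    """Cosine similarity between two vectors (our choice for s(., .))."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def similarity_aware_radius(t, frames, W):
    """R = exp(S W): S_i = s(t, f_i) collects text-frame similarities,
    W (shape T' x d) maps them to a per-dimension radius; exp keeps R > 0."""
    S = [cos(t, f) for f in frames]                    # S in R^{T'}
    SW = [sum(S[i] * W[i][j] for i in range(len(S)))   # S W in R^d
          for j in range(len(W[0]))]
    return [math.exp(x) for x in SW]

def best_sample(t, v, R, M=10, rng=None):
    """Inference: draw M samples t_s^i = t + R * eps and keep the one
    closest to v, i.e. t_s = argmax_i s(t_s^i, v)."""
    rng = rng or random.Random(0)
    samples = [[ti + Ri * rng.gauss(0.0, 1.0) for ti, Ri in zip(t, R)]
               for _ in range(M)]
    return max(samples, key=lambda ts: cos(ts, v))
```

With W = 0 the radius collapses to all-ones (exp(0) = 1), so the text mass degenerates to a fixed-size ball; training W is what makes the radius adapt to text-frame similarity.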
Table 3. Text-to-video comparisons on DiDeMo.

  Method             R@1↑   R@5↑   R@10↑  MdR↓
  CLIP-ViT-B/32
  X-Pool [17]        44.6   73.2   82.0   2.0
  DiffusionRet [26]  46.7   74.7   82.7   2.0
  UATVR [13]         43.1   71.8   82.3   2.0
  CLIP-ViP [57]      48.6   77.1   84.4   2.0
  T-MASS (Ours)      50.9   77.2   85.3   1.0
  CLIP-ViT-B/16
  X-Pool [17]        47.3   74.8   82.8   2.0
  UATVR [13]         45.8   73.7   83.3   2.0
  CLIP-ViP [57]      50.5   78.4   87.1   1.0
  T-MASS (Ours)      53.3   80.1   87.7   1.0

Table 4. Video-to-text performance (MSRVTT).

  Method             R@1↑   R@5↑   R@10↑  MdR↓  MnR↓
  CLIP-ViT-B/32
  CLIP4Clip [39]     42.7   70.9   80.6   2.0   11.6
  CenterCLIP [60]    42.8   71.7   82.2   2.0   10.9
  X-Pool [17]        44.4   73.3   84.0   2.0   9.0
  TS2-Net [36]       45.3   74.1   83.7   2.0   9.2
  DiffusionRet [26]  47.7   73.8   84.5   2.0   8.8
  UATVR [13]         46.9   73.8   83.8   2.0   8.6
  T-MASS (Ours)      47.7   78.0   86.3   2.0   8.0
  CLIP-ViT-B/16
  X-Pool [17]        46.4   73.9   84.1   2.0   8.4
  TS2-Net [36]       46.6   75.9   84.9   2.0   8.9
  CenterCLIP [60]    47.7   75.0   83.3   2.0   10.2
  UATVR [13]         48.1   76.3   85.4   2.0   8.0
  T-MASS (Ours)      50.9   80.2   88.0   1.0   7.4

[Table: ablation of radius implementations (w/o R, exp(S), exp(SW)) on MSRVTT and DiDeMo retrieval; the row-to-number assignment is partially illegible in the image.]

Table 1. Ablation study of losses and text embedding (MSRVTT). The rows toggle the stochastic text embedding t_s and the support loss L_sup (the checkmark columns are partially illegible); the first row is the baseline and the last is the full model trained with L_total = L_{t_s} + α L_{sup}.

  R@1↑   R@5↑   R@10↑  MdR↓  MnR↓
  46.9   72.8   82.2   2.0   14.3
  48.5   74.8   84.3   2.0   12.3
  49.1   75.7   85.7   2.0   11.9
  50.2   75.3   85.1   1.0   11.9

[Figure: ℓ1-norm of R over training epochs 0-5, for relevant (positive) vs. irrelevant (negative) text-video pairs; the relevant pairs attain the smaller norm.]

[Figure: R@1/R@5/R@10 vs. number of frames (12-24), X-Pool vs. T-MASS (Ours).]

Comparing t_s with t: for the irrelevant pairs, t_s enables smaller cosine similarity values (left side), with Avg(t_s vs. v) = 0.588 against Avg(t vs. v) = 0.635; for the relevant pairs, t_s enables smaller loss values (right side). [Figure: per-query maxima of s(t_s, v) and s(t, v) over the 1,000 query texts in the MSRVTT-1K testing data.]

Retrieval Results: [panel of qualitative retrieval examples]

Contact E-mail: … References: [1] Wang, J. …
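The poster's objective combines two symmetric cross-entropy terms, one driven by t_s and one by t_sup. As a sketch of what one such term computes over a batch, here is a generic symmetric cross entropy on a B×B text-video similarity matrix whose diagonal holds the matched pairs. The temperature tau, the log-sum-exp layout, and all names are our assumptions, not the paper's code.

```python
import math

def symmetric_ce(sim, tau=0.07):
    """Symmetric cross entropy over a B x B similarity matrix:
    the average of text->video and video->text InfoNCE losses,
    with the diagonal entries sim[i][i] as positives."""
    B = len(sim)

    def nce(rows):
        loss = 0.0
        for i in range(B):
            logits = [s / tau for s in rows[i]]
            m = max(logits)  # stabilize log-sum-exp
            lse = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += lse - logits[i]  # -log softmax at the positive
        return loss / B

    t2v = nce(sim)
    v2t = nce([[sim[j][i] for j in range(B)] for i in range(B)])
    return 0.5 * (t2v + v2t)
```

A matrix with high similarity on the diagonal scores a lower loss than one whose positives sit off-diagonal, which is the behavior the "t_s enables smaller loss values" panel reports for relevant pairs.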