A technical poster titled "Task-Aware Encoder Control for Deep Video" presents a framework for controlling a learned video codec so that one pretrained decoder can serve both human and machine vision tasks. The listed authors include Xingtong Ge, Jixiang Luo, Xinjie Zhang, and Tongda Xu, with affiliations including BIT, SenseTime, HKUST, Tsinghua University, and SJTU. The poster covers the "Controlling DVC for Machine" framework and its two mechanisms, "Dynamic Vision Mode Prediction (DVMP)" and "DivGoP & GoP Selection". Diagrams show the DVMP module (Hyperprior Information, ResBlock and convolution layers, Gumbel Softmax selection, Entropy Coding) and the GoP Selection Module (Pre-Analysis, GoP Feature Extractor, GoP structure sampling). Plots and a BD-BR table report bitrate reduction while preserving semantic information, with Pm frames used to cut bitrate and P frames used to limit the propagation of reconstruction error.
Text transcribed from the image:
Task-Aware Encoder Control for Deep Video
Xingtong Ge1,2, Jixiang Luo, Xinjie Zhang, Tongda Xu, Guo Lu, Dailan He, Jing Geng, Yan
BIT & SenseTime & HKUST & Tsinghua University & SJTU & CU
"Controlling DVC for Machine" Framework DivGoP & GoP Selection
[Figure: "Controlling DVC for Machine" framework. Diagram labels: GoP Structure Vector (shown as 01010101), x_t, x_ref, Controlled Encoder, Learned Video Codec, AE, AD, Decoded Frame Buffer, Pretrained Decoder, Residual, Feature Encoder, DVMP, f_t, f_m, FC.]
[Plots: curves against Bpp with legend entries FVC DFS, FVC DivGoP, and FVC; axis tick values omitted.]
[Figure: GoP Selection Module. (a) Pre-Analysis: input frames {x1, x2, x3, ...}, Detector, RAFT, Motion Encoder. (b) GoP Prediction: GoP Feature Extractor, S_logit, Gumbel-Softmax Sampling / Logit Distribution, GoP Structure Vector. (c) GoP Feature Extractor: Conv(32), LReLU, Conv(32), AdaAvePool, Linear with output size 2.]
Previous works require individually customized codecs to support different downstream
tasks, which is complex and difficult to deploy.
How to use one pre-trained decoder to support both human and machine vision tasks?
1. Dividing the original P frames into two types: P frames and new Pm frames (predicted
with DVMP).
2. Using GoP Selection Module to control the encoding GoP structure for different
objectives, such as vision tasks and video reconstruction.
3. Maintaining the decoder weights constant to ensure compatibility across multiple tasks.
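The split into P and Pm frames under a GoP structure vector can be illustrated with a small sketch. The poster shows a binary GoP Structure Vector (01010101) but does not state which bit value denotes which frame type, so the mapping below (1 for a regular P frame, 0 for a Pm frame) and the function name `frame_types` are assumptions made purely for illustration.

```python
from collections import Counter


def frame_types(gop_vector: str) -> list[str]:
    """Map a binary GoP structure vector to per-frame coding types.

    Assumption (not stated on the poster): bit '1' marks a regular P frame
    and bit '0' marks a Pm frame (predicted with DVMP).
    """
    if set(gop_vector) - {"0", "1"}:
        raise ValueError("GoP structure vector must contain only 0 and 1")
    return ["P" if bit == "1" else "Pm" for bit in gop_vector]


# The vector shown on the poster, interpreted under the assumed mapping.
types = frame_types("01010101")
print(types)
print(Counter(types))
```

Because only the encoder-side choice of frame type changes, a sketch like this is compatible with keeping the decoder weights fixed, as the third point requires.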
Dynamic Vision Mode Prediction (DVMP)
[Figure: DVMP module. Inputs: Hyperprior Information, Encoded Residual Feature, Contextual Information. Layers: Conv(C, 3, 1), ResBlock(C, 3), Gumbel Softmax. Outputs: Selection, Entropy Coding.]
Effectively reduce the bitrate
while preserving critical semantic
information.
(Up) DVMP for hyperprior entropy models (suitable for FVC and DCVC-TCM).
(Down) DVMP for entropy models with autoregressive components (suitable for DCVC).
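The DVMP diagrams end in a Gumbel Softmax that feeds a Selection step before Entropy Coding. One way to read this, offered as an assumption rather than the authors' confirmed design, is a per-element binary mode: each residual feature element is either kept for entropy coding or skipped. The sketch below shows that idea in plain Python. The names `predict_modes` and `apply_selection`, the (skip, keep) logit layout, and zeroing skipped elements are all hypothetical.

```python
import math
import random


def gumbel_noise(rng: random.Random) -> float:
    """Draw one Gumbel(0, 1) sample; u is clamped to keep both logs finite."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def predict_modes(mode_logits, rng=None, tau=1.0):
    """Return one mode per element: 1 = keep, 0 = skip.

    mode_logits holds one (skip_logit, keep_logit) pair per feature element.
    With an rng, Gumbel noise is added before the comparison (hard
    Gumbel-Softmax draw, as used in training). Without an rng, the larger
    logit wins (deterministic inference).
    """
    modes = []
    for skip_logit, keep_logit in mode_logits:
        if rng is not None:
            skip_logit = (skip_logit + gumbel_noise(rng)) / tau
            keep_logit = (keep_logit + gumbel_noise(rng)) / tau
        modes.append(1 if keep_logit > skip_logit else 0)
    return modes


def apply_selection(features, modes):
    """Zero out skipped elements (assumed: they cost no bits to code)."""
    return [f if m == 1 else 0.0 for f, m in zip(features, modes)]


features = [0.5, -1.2, 2.0]
logits = [(2.0, -1.0), (0.1, 0.3), (-3.0, 3.0)]
modes = predict_modes(logits)  # deterministic inference path
print(modes, apply_selection(features, modes))
```

In a real codec the mode logits would come from the convolutional layers shown in the diagram, and the hard draw would be paired with a straight-through gradient; both are omitted here.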
(Left) DFS optimal GoP Structure and DivGoP Structure: succeed in getting better Bpp-mAP trade-offs.
(Right) Simply fine-tuning FVC for the machine task: fails to get better Bpp-mAP trade-offs.
GoP Structure Optimization Target:
arg ming R(0)+(0)
GoP Selection Module
Stage a): Pre-Analysis
Detector + RAFT + Motion Encoder
Stage b): GoP Prediction
GoP Feature Extractor: produces S_logit for the current GoP sequence.
Vector Sampler:
(Training) Gumbel-Softmax sampling, GoP-1 times, using S_logit.
(Inference) Logit distribution using Softmax(S_logit).
Dynamically determines the GoP structure for different video sequences.
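The two sampler modes can be sketched in plain Python. The sketch assumes the logits (written `s_logit` here) hold one pair of values per predicted frame, giving GoP-1 pairs, which matches the two-way output of the feature extractor. Taking the argmax of the softmax at inference time is an assumption; the poster only says the logit distribution from the softmax is used. The function names are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a short list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Soft Gumbel-Softmax sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep logs finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    return softmax(noisy)


def sample_gop_vector(s_logit, training, tau=1.0, rng=None):
    """Return one bit per logit pair, i.e. GoP-1 bits.

    Training: draw a Gumbel-Softmax sample per pair, then take its argmax.
    Inference (assumed): take the argmax of the plain softmax.
    """
    if rng is None:
        rng = random.Random()
    bits = []
    for pair in s_logit:
        probs = gumbel_softmax(pair, tau, rng) if training else softmax(pair)
        bits.append(max(range(len(probs)), key=probs.__getitem__))
    return bits


# A GoP of 8 frames gives 7 logit pairs (illustrative values only).
s_logit = [[0.2, 1.5], [2.0, -0.5], [0.0, 0.3], [1.0, 0.9],
           [-1.0, 1.0], [0.5, 0.1], [0.0, 2.0]]
print(sample_gop_vector(s_logit, training=False))
print(sample_gop_vector(s_logit, training=True, rng=random.Random(0)))
```

The noise in the training path lets the module explore different GoP structures while remaining differentiable in its soft form; the inference path is deterministic for a given sequence.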
Loss Function & Training Strategy
Training Stage 1: Train DVMP (frame-wise)
L= R+
Training Stage 2: Train GoP Selection Module (GoP-wise).
L₂ = R + Agh
Pm frame (semantic friendly): low bitrate, low reconstruction quality.
P frame: high bitrate, high reconstruction quality.
Hybrid encoding: using Pm frames to reduce bitrate and P frames to suppress reconstruction error propagation.
BD-BR Results
TCM [31]
Ours+TCM
-25.19 -32.34 -31.02 -39.85 -26.44
40.82 45.10 46.15 -51.66 -38.98-
HEVC [35] -33.88 -31.35 40.43 15.80 -34.02
0.0
0.0
0.0
MOTA MAP MAP50 MOTP FN
0.0
0.0
Method
DCVC [20]
-5.32 -9.75 -14.53 -1.39 6.20 -4
Ours+DCVC 41.82 -39.43 40.60 -37.73 41.74 40
-34.28 -31.27 -32.09 -35.02 -32.89 3
FVC [16]
Ours+FVC
[Figure: DVMP diagram labels, continued: ResBlock(C, 3), MaskC, Conv(C, 1, 1), Gumbel Softmax, Max.]