The image depicts a conference poster describing a wide-baseline pose optimization method. The poster presents the approach and results, with diagrams, charts, and qualitative examples, along with details about the computational method. Text transcribed from the image:

Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization
Lahav Lipson and Jia Deng
Princeton Vision & Learning Lab

Multi-Session SLAM
We introduce a new system for monocular Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with optimization layers to estimate camera pose.

Simultaneous Localization and Mapping (SLAM) is the task of estimating camera motion and a 3D map from video. Video data in the wild often consists not of a single continuous stream but of multiple disjoint sessions, either deliberately, as in collaborative mapping when multiple robots perform joint rapid 3D reconstruction, or inadvertently, due to visual discontinuities in the video stream caused by camera failures, extreme parallax, rapid turns, auto-exposure lag, dark areas, or occlusions.

Key challenges: real-time camera pose estimation, plus scale/location/position ambiguities between monocular videos.

Recurrent Sparse Optical Flow for Two-View Camera Pose
A core subsystem of our Multi-Session SLAM method is a new approach to wide-baseline, two-view relative camera pose. Given two views as input, we alternate between estimating sparse optical flow residuals with a weight-tied network and updating the relative pose estimate with an optimization layer. Our method also implicitly learns a confidence measure for each predicted flow vector.
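The poster states that a weighted version of the classical 8-point algorithm initializes the pose optimizer, with per-match confidences learned by the network. As a minimal NumPy sketch of that building block: the function name, the Hartley normalization details, and the specific weighting scheme (scaling each constraint row by its confidence) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_eight_point(x1, x2, w):
    """Estimate a fundamental matrix F from N >= 8 correspondences,
    weighting each match by a confidence w (a sketch of how predicted
    per-match confidences can enter the classical 8-point algorithm).

    x1, x2 : (N, 2) pixel coordinates in images 1 and 2.
    w      : (N,) non-negative confidence weights.
    Returns F (3, 3), rank 2, satisfying x2h^T F x1h ~ 0.
    """
    def normalize(pts):
        # Hartley normalization: zero mean, average distance sqrt(2).
        mean = pts.mean(axis=0)
        scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
        T = np.array([[scale, 0.0, -scale * mean[0]],
                      [0.0, scale, -scale * mean[1]],
                      [0.0, 0.0, 1.0]])
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ T.T
        return ph, T

    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each row encodes the epipolar constraint x2^T F x1 = 0 for one match.
    A = np.stack([p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
                  p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
                  p1[:, 0], p1[:, 1], np.ones(len(p1))], axis=1)
    A = A * w[:, None]  # confidence-weighted least squares
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # A valid fundamental matrix is singular: enforce rank 2.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    return T2.T @ F @ T1  # undo the normalization
```

Because the weighted least-squares problem is solved globally by SVD, this initializer has no local minima, which matches the poster's motivation for using it as a preconditioner.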
Quantitative Results
We evaluate our full system on the EuRoC and ETH-3D datasets, in which all the ground-truth trajectories are in a unified coordinate system. Compared to existing methods, our approach is significantly more robust and accurate. All reported methods run in real time (camera rate = 20 FPS).

[Table: Multi-Session SLAM on EuRoC. Scene groups MH01-03, MH01-05, V101-103, and V201-203, with the number of disjoint trajectories per group; rows compare CCM-SLAM [36] (mono-visual), ORB-SLAM3 (mono-visual), VINS [23] (mono-inertial), ORB-SLAM3 (mono-inertial), and ours (mono-visual). Cell values are not reliably legible in the transcription.]

[Table: Multi-Session SLAM on ETH-3D. Scenes Sofa, Table, Plant Scene, Einstein, and Planar, with the number of disjoint trajectories per scene; the mono-visual baseline fails ("FAIL") on several scenes, while our mono-visual system succeeds on all of them. Cell values are not reliably legible in the transcription.]

[Figure: Example of Multi-Session SLAM — our prediction for sequence group V101-V103 of EuRoC.]

CVPR, June 17-21, 2024, Seattle, WA

Visual SLAM with Sparse Optical Flow
Our approach uses sparse optical flow to track keypoints between frames, as opposed to keypoint-matching approaches like SuperGlue and ORB. Using sparse flow has both advantages and disadvantages.
Benefit: robust in low-texture environments; no keypoint detector required.
Drawback: existing approaches to Multi-Session SLAM, e.g., ORB-SLAM3, are incompatible with it.

Connecting Disjoint Trajectories
The goal in Multi-Session SLAM is to estimate camera motion for all monocular video streams under a single global reference. Our approach (1) estimates camera motion from each video stream individually (like a typical SLAM system) and (2) connects disjoint sequences by aligning their respective coordinate systems. To perform (2), we identify co-visible image pairs using image retrieval and then run our two-view pose estimator to predict a 7-DOF alignment between sequences.
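The 7-DOF alignment between sequences is a Sim(3) transform: a rotation, a translation, and a scale, the last needed because monocular reconstructions have scale ambiguity. Given corresponding camera positions from two sessions, a least-squares Sim(3) fit has a standard closed form (Umeyama's method); this sketch shows that classical construction, not the poster's learned two-view estimator.

```python
import numpy as np

def align_sim3(src, dst):
    """Closed-form least-squares Sim(3) fit (Umeyama's method): find a
    scale s, rotation R, and translation t such that dst ~ s * R @ src + t.

    src, dst : (N, 3) corresponding camera positions from two sessions.
    Returns (s, R, t).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d           # centered point sets
    cov = xd.T @ xs / len(src)                # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0.0:
        S[2, 2] = -1.0                        # keep R a proper rotation
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)        # variance of the source set
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```

In a multi-session pipeline, a fit like this (or the two-view pose estimate it seeds) brings one session's trajectory into the other's coordinate frame before graph merging.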
Our approach to connecting disjoint sequences runs in three stages: Stage 1, image retrieval (NetVLAD descriptors); Stage 2, Sim(3) estimation; Stage 3, graph merging.

Updating Camera Poses and Optical Flow
Each update iteration follows three steps: (1) we predict an update to the sparse optical flow; (2) we then update the camera pose estimates to align with the predicted flow; (3) finally, we clamp the optical flow to the newly induced epipolar lines.

The optimizer in (2) parameterizes the epipolar lines as a function of the camera poses and seeks to minimize the Symmetric Epipolar Distance (SED) between the predicted matches and the epipolar lines.

Differentiable Camera Pose Optimizer(s)
Unfortunately, the SED optimizer will converge to local minima if initialized far from the true optimum. To remedy this, we adopt a preconditioning stage that uses a weighted version of the 8-point algorithm. The combined optimizers are fully differentiable, meaning we can supervise on the final pose output.
Weighted 8-point algorithm (used for initialization): no local minima, but less accurate.
SED optimizer (used for refinement): local minima / non-convex, but more accurate.

[Figure: Convergence basin of each optimization layer — translation-direction error (degrees) as a function of the input camera pose, for the preconditioner and the SED solver.]

Ablation: Two-View Relative Pose on ScanNet
We evaluate our two-view relative pose subsystem in isolation; we outperform existing methods on ScanNet. Existing approaches perform matching as a preprocessing step, whereas ours is a weight-tied network alternating between optimization and matching.

[Table: Quantitative two-view results on ScanNet.]

Qualitative Two-View Matching Results
Qualitative results on ScanNet. Our two-view subsystem estimates accurate relative poses across wide camera baselines. It initializes all matches with uniform depth and an identity relative pose.
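The SED objective mentioned above has a standard closed form: for a fundamental matrix F induced by the current pose estimate and a match (x1, x2), it is the squared epipolar residual (x2ᵀ F x1)² normalized by the line-normal magnitudes in both images, i.e., a symmetric point-to-epipolar-line squared distance. A minimal NumPy sketch (the function name is illustrative; the poster's optimizer minimizes this quantity over poses rather than just evaluating it):

```python
import numpy as np

def symmetric_epipolar_distance(F, x1, x2):
    """Symmetric epipolar distance (SED) of matches (x1, x2) under a
    fundamental matrix F: the squared epipolar residual measured as a
    point-to-line distance in both images.

    F : (3, 3) fundamental matrix; x1, x2 : (N, 2) pixel coordinates.
    Returns an (N,) array of squared distances.
    """
    ones = np.ones((len(x1), 1))
    x1h = np.hstack([x1, ones])            # homogeneous coordinates
    x2h = np.hstack([x2, ones])
    l2 = x1h @ F.T                         # epipolar lines in image 2: F x1
    l1 = x2h @ F                           # epipolar lines in image 1: F^T x2
    resid = np.sum(x2h * l2, axis=1) ** 2  # (x2^T F x1)^2
    return resid * (1.0 / (l2[:, 0] ** 2 + l2[:, 1] ** 2)
                    + 1.0 / (l1[:, 0] ** 2 + l1[:, 1] ** 2))
```

Written with differentiable tensor ops, a residual like this can be minimized over the pose parameters inside an optimization layer and back-propagated through, which is the sense in which the poster's pose optimizer is differentiable.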
Progressive applications of our update operator lead to more accurate matches and higher predicted confidence.

[Figure: Our optimizer architecture, and the backward gradients (dotted lines).]

[Figure: Qualitative results on ScanNet — matches at initialization and after iterations 1, 3, and 12, colored from low to high predicted confidence.]