A detailed academic poster is displayed, featuring the research titled "Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization" by Lahav Lipson and Jia Deng. The poster is presented by the Princeton Vision & Learning Lab and was showcased at the CVPR conference in Seattle, WA, June 17-21, 2024.

**Core Content:**
- **Multi-Session SLAM**: Introduces a novel system for monocular Multi-Session SLAM that uses sparse optical flow and optimization layers to align views and refine poses.
- **Visual SLAM with Sparse Optical Flow**: Explains how the approach uses sparse optical flow to track keypoints across frames, highlighting its robustness in low-texture environments as well as drawbacks such as incompatibility with pre-existing approaches.
- **Connecting Disjoint Trajectories**: Details how the system estimates camera motion across multiple video streams under a single global reference, addressing the scale and alignment ambiguities between monocular videos.

**Technical Details:**
- **Recurrent Sparse Optical Flow for 2-view Camera Pose**: Describes the use of wide-baseline two-view optical flow to refine camera pose estimates, with optimization layers integrated to correct the pose estimates.
- **Updating Camera Poses and Optical Flow**: A step-by-step overview of how poses are iteratively updated to minimize residuals between the predicted matches and the optical flow vectors.
- **Differentiable Camera Pose Optimizer(s)**: A deep dive into the optimization machinery, including the weighted 8-point algorithm used for initialization and the SED (Symmetric Epipolar Distance) objective used for refinement, improving the accuracy of the pose estimates.

**Results:**
- **Quantitative Results**: Evaluations of the system on the EuRoC and ETH-3D datasets, with tables comparing performance metrics against existing methods.
- **Ablation: Two-view Relative Pose on ScanNet**: Analysis of performance variations when individual system components are removed.
- **Qualitative Two-View Matching Results**: Visual examples and descriptions of the two-view subsystem's ability to produce accurate relative poses across wide-baseline scenarios.

**Visuals:**
- Diagrams, tables, and images illustrate the system's design, workflow, and performance results.
- Charts visually compare the efficiency and accuracy of the methods.

The poster is beautifully arranged on a display board, with foot traffic visible in the background, indicating its presentation at a large academic or industry conference. The Princeton and CVPR logos are prominently displayed, reflecting the collaborative and prestigious nature of the research work.

Text transcribed from the image:

Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization
Lahav Lipson and Jia Deng

Multi-Session SLAM
We introduce a new system for Monocular Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with optimization layers to estimate camera pose.
Simultaneous Localization and Mapping (SLAM) is the task of estimating camera motion and a 3D map from video. Video data in the wild often consists of not a single continuous stream, but rather multiple disjoint sessions, either deliberately, such as in collaborative mapping where multiple robots perform joint rapid 3D reconstruction, or inadvertently, due to visual discontinuities in the video stream which can result from camera failures, extreme parallax, rapid turns, auto-exposure lag, dark areas, or occlusions.
Key Challenges: Real-time camera pose + scale/location/position ambiguities between monocular videos

PRINCETON VISION & LEARNING LAB
CVPR, JUNE 17-21, 2024, SEATTLE, WA

Recurrent Sparse Optical Flow for 2-view Camera Pose
A core subsystem of our Multi-Session SLAM method is a new approach to wide-baseline, 2-view relative camera pose. Given two views as input, we alternate between estimating sparse optical flow residuals using a weight-tied network, and updating the relative pose estimate with an optimization layer. Our method also implicitly learns a confidence measure for each predicted flow vector.

Quantitative Results
We evaluate our full system on the EuRoC and ETH-3D datasets, in which all the ground-truth trajectories are in a unified coordinate system. Compared to existing methods, our approach is significantly more robust and accurate. All reported methods run in real time (camera rate = 20 FPS).
[Table: Multi-Session SLAM on EuRoC — sequence groups MH01-03, MH01-05, V101-103, V201-203 with #disjoint trajectories per group, comparing Mono-Visual CCM-SLAM [36], Mono-Visual ORB-SLAM3, and our Mono-Visual method]
[Table: Multi-Session SLAM on ETH-3D — scenes Sofa, Table, Plant Scene, Einstein, Planar with #disjoint trajectories per scene, comparing Mono-Inertial VINS [23], Mono-Inertial ORB-SLAM3, Mono-Visual ORB-SLAM3 (fails on several scenes), and our Mono-Visual method]
Our prediction for sequence group V101-V103 of EuRoC
Example: Multi-Session SLAM

Visual SLAM with Sparse Optical Flow
Our approach uses sparse optical flow to track keypoints between frames. This is as opposed to keypoint-matching approaches like SuperGlue and ORB. Using sparse flow has both advantages and disadvantages.
Benefit: Robust in low-texture environments. No requirement for a keypoint detector.
Drawback: Existing approaches to Multi-Session SLAM, e.g., ORB-SLAM3, are incompatible.

Connecting Disjoint Trajectories
The goal in Multi-Session SLAM is to estimate camera motion for all monocular video streams under a single global reference. Our approach attempts (1) to estimate camera motion from video streams individually (like a typical SLAM system) and (2) to connect disjoint sequences by aligning their respective coordinate systems. To perform (2), we identify co-visible image pairs using image retrieval and then run our two-view pose estimator to predict a 7-DoF alignment between sequences.
[Diagram: NetVLAD descriptors → Stage 1: Image Retrieval → Stage 2: Sim(3) Estimation → Stage 3: Graph Merging]
Our approach to connecting disjoint sequences

Updating Camera Poses and Optical Flow
Each update iteration follows three steps: (1) We predict an update to sparse optical flow. (2) We then update the camera pose estimates to align with the predicted flow. (3) Finally, we clamp the optical flow to the newly-induced epipolar lines.
[Diagram: predicted flow between Image 1 and Image 2]
The optimizer in (2) parameterizes the epipolar lines as a function of the camera poses, and seeks to minimize the Symmetric Epipolar Distance (SED) between the predicted matches and the epipolar lines.
[Plot: convergence basin of each optimization layer vs. input camera pose — Pre-cond., SED Solver]

Differentiable Camera Pose Optimizer(s)
Unfortunately, the SED optimizer will converge to local minima if initialized far from the true optimum. To remedy this, we adopt a pre-conditioning stage which uses a weighted version of the 8-point algorithm. The combined optimizers are fully differentiable, meaning we can supervise on the final pose output.
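The 7-DoF Sim(3) alignment between sequences described above (scale, rotation, translation) has a standard closed-form solution once corresponding 3D points in the two coordinate frames are known. A minimal NumPy sketch using the classic Umeyama solver as a stand-in; the function name and setup are illustrative, not the poster authors' implementation:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Estimate a 7-DoF Sim(3) transform (scale s, rotation R, translation t)
    mapping src points onto dst points via the Umeyama closed-form solution.
    src, dst: (N, 3) arrays of corresponding 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                            # guard against reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)          # variance of the source cloud
    s = np.trace(np.diag(D) @ S) / var_s        # optimal scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In a full pipeline, the point correspondences would come from the co-visible pairs found by image retrieval and the two-view pose estimator; here they are assumed given.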
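The SED objective that the optimization layer minimizes can be written compactly. A minimal NumPy sketch, assuming calibrated homogeneous image coordinates and the common convention E = [t]x R with x2 on the second camera's side; helper names are illustrative, not taken from the authors' code:

```python
import numpy as np

def skew(v):
    """3x3 cross-product matrix [v]x such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def sed_residuals(R, t, x1, x2):
    """Symmetric Epipolar Distance of matches x1 -> x2 under relative pose (R, t).
    x1, x2: (N, 3) homogeneous, normalized (calibrated) image coordinates."""
    E = skew(t) @ R                       # essential matrix
    l2 = x1 @ E.T                         # epipolar lines of x1 in image 2
    l1 = x2 @ E                           # epipolar lines of x2 in image 1
    num = np.sum(x2 * l2, axis=1) ** 2    # algebraic error (x2^T E x1)^2
    return num * (1.0 / (l2[:, 0]**2 + l2[:, 1]**2)
                + 1.0 / (l1[:, 0]**2 + l1[:, 1]**2))
```

The residuals vanish exactly when every predicted match lies on its epipolar line; in the poster's pipeline each residual would additionally be weighted by the learned confidence before the pose update.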
Weighted 8-pt algorithm (used for initialization): no local minima; less accurate.
SED Optimizer (used for refinement): local minima / non-convex; more accurate.
[Diagram: Preconditioning stage followed by SED Solver, with Pose Loss and Matching Loss; plot axis: Translation Direction Error (degrees)]

Ablation: Two-view Relative Pose on ScanNet
We evaluate our two-view relative pose subsystem in isolation. It outperforms existing methods on ScanNet. Existing approaches perform matching as a pre-processing step, whereas ours is a weight-tied network alternating between optimization and matching.
Quantitative two-view results on ScanNet

Qualitative Two-View Matching Results
Qualitative results on ScanNet. Our two-view subsystem estimates accurate relative poses across wide camera baselines. It initializes all matches with uniform depth and identity relative pose. Progressive applications of our update operator lead to more accurate matches and higher predicted confidence.
[Image panels: Initialization, Iteration 1, Iteration 3; colorbar from Low Confidence to High Confidence]
Our optimizer architecture, and the backward gradients (dotted lines)
Qualitative results on ScanNet
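The weighted 8-point initialization described above amounts to a weighted homogeneous least-squares problem on the epipolar constraint. A minimal NumPy sketch, assuming normalized coordinates and given per-match confidence weights w (a generic textbook solver standing in for the poster's learned, differentiable version):

```python
import numpy as np

def weighted_eight_point(x1, x2, w):
    """Weighted 8-point algorithm: essential matrix from N >= 8 normalized
    matches x1 -> x2 with per-match confidences w. Minimizes
    sum_i w_i * (x2_i^T E x1_i)^2 subject to ||E||_F = 1."""
    # each match contributes one row of the homogeneous system A @ vec(E) = 0
    A = np.sqrt(w)[:, None] * np.einsum('ni,nj->nij', x2, x1).reshape(-1, 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)              # null-space (smallest singular vector)
    # project onto the essential manifold: singular values (sigma, sigma, 0)
    U, s, Vt = np.linalg.svd(E)
    sigma = (s[0] + s[1]) / 2
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt
```

Because every step is an SVD or a matrix product, the solver admits gradients, which is consistent with the poster's point that the preconditioning stage can sit inside a fully differentiable pipeline.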