The image depicts a conference poster describing a wide-baseline pose optimization method. The poster presents the approach and results, with diagrams, charts, and qualitative examples, along with details about the computational method. Text transcribed from the image:

Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization
Lahav Lipson and Jia Deng
Princeton Vision & Learning Lab

Multi-Session SLAM
We introduce a new system for monocular Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with optimization layers to estimate camera pose.

Simultaneous Localization and Mapping (SLAM) is the task of estimating camera motion and a 3D map from video. Video data in the wild often consists not of a single continuous stream but of multiple disjoint sessions, either deliberately, as in collaborative mapping when multiple robots perform joint rapid 3D reconstruction, or inadvertently, due to visual discontinuities in the video stream caused by camera failures, extreme parallax, rapid turns, auto-exposure lag, dark areas, or occlusions.

Key challenges: real-time camera pose estimation, plus scale/location/position ambiguities between monocular videos.

Recurrent Sparse Optical Flow for Two-View Camera Pose
A core subsystem of our Multi-Session SLAM method is a new approach to wide-baseline, two-view relative camera pose. Given two views as input, we alternate between estimating sparse optical flow residuals with a weight-tied network and updating the relative pose estimate with an optimization layer. Our method also implicitly learns a confidence measure for each predicted flow vector.
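The poster states that a weighted version of the classical 8-point algorithm initializes the pose optimizer, with per-match confidences learned by the network. As a minimal NumPy sketch of that building block: the function name, the Hartley normalization details, and the specific weighting scheme (scaling each constraint row by its confidence) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def weighted_eight_point(x1, x2, w):
    """Estimate a fundamental matrix F from N >= 8 correspondences,
    weighting each match by a confidence w (a sketch of how predicted
    per-match confidences can enter the classical 8-point algorithm).

    x1, x2 : (N, 2) pixel coordinates in images 1 and 2.
    w      : (N,) non-negative confidence weights.
    Returns F (3, 3), rank 2, satisfying x2h^T F x1h ~ 0.
    """
    def normalize(pts):
        # Hartley normalization: zero mean, average distance sqrt(2).
        mean = pts.mean(axis=0)
        scale = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - mean, axis=1))
        T = np.array([[scale, 0.0, -scale * mean[0]],
                      [0.0, scale, -scale * mean[1]],
                      [0.0, 0.0, 1.0]])
        ph = np.hstack([pts, np.ones((len(pts), 1))]) @ T.T
        return ph, T

    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each row encodes the epipolar constraint x2^T F x1 = 0 for one match.
    A = np.stack([p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
                  p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
                  p1[:, 0], p1[:, 1], np.ones(len(p1))], axis=1)
    A = A * w[:, None]  # confidence-weighted least squares
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # A valid fundamental matrix is singular: enforce rank 2.
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    return T2.T @ F @ T1  # undo the normalization
```

Because the weighted least-squares problem is solved globally by SVD, this initializer has no local minima, which matches the poster's motivation for using it as a preconditioner.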
Quantitative Results
We evaluate our full system on the EuRoC and ETH-3D datasets, in which all the ground-truth trajectories are in a unified coordinate system. Compared to existing methods, our approach is significantly more robust and accurate. All reported methods run in real time (camera rate = 20 FPS).

[Table: Multi-Session SLAM on EuRoC. Scene groups MH01-03, MH01-05, V101-103, and V201-203, with the number of disjoint trajectories per group; rows compare CCM-SLAM [36] (mono-visual), ORB-SLAM3 (mono-visual), VINS [23] (mono-inertial), ORB-SLAM3 (mono-inertial), and ours (mono-visual). Cell values are not reliably legible in the transcription.]

[Table: Multi-Session SLAM on ETH-3D. Scenes Sofa, Table, Plant Scene, Einstein, and Planar, with the number of disjoint trajectories per scene; the mono-visual baseline fails ("FAIL") on several scenes, while our mono-visual system succeeds on all of them. Cell values are not reliably legible in the transcription.]

[Figure: Example of Multi-Session SLAM — our prediction for sequence group V101-V103 of EuRoC.]

CVPR, June 17-21, 2024, Seattle, WA

Visual SLAM with Sparse Optical Flow
Our approach uses sparse optical flow to track keypoints between frames, as opposed to keypoint-matching approaches like SuperGlue and ORB. Using sparse flow has both advantages and disadvantages.
Benefit: robust in low-texture environments; no keypoint detector required.
Drawback: existing approaches to Multi-Session SLAM, e.g., ORB-SLAM3, are incompatible with it.

Connecting Disjoint Trajectories
The goal in Multi-Session SLAM is to estimate camera motion for all monocular video streams under a single global reference. Our approach (1) estimates camera motion from each video stream individually (like a typical SLAM system) and (2) connects disjoint sequences by aligning their respective coordinate systems. To perform (2), we identify co-visible image pairs using image retrieval and then run our two-view pose estimator to predict a 7-DOF alignment between sequences.
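The 7-DOF alignment between sequences is a Sim(3) transform: a rotation, a translation, and a scale, the last needed because monocular reconstructions have scale ambiguity. Given corresponding camera positions from two sessions, a least-squares Sim(3) fit has a standard closed form (Umeyama's method); this sketch shows that classical construction, not the poster's learned two-view estimator.

```python
import numpy as np

def align_sim3(src, dst):
    """Closed-form least-squares Sim(3) fit (Umeyama's method): find a
    scale s, rotation R, and translation t such that dst ~ s * R @ src + t.

    src, dst : (N, 3) corresponding camera positions from two sessions.
    Returns (s, R, t).
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d           # centered point sets
    cov = xd.T @ xs / len(src)                # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0.0:
        S[2, 2] = -1.0                        # keep R a proper rotation
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)        # variance of the source set
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```

In a multi-session pipeline, a fit like this (or the two-view pose estimate it seeds) brings one session's trajectory into the other's coordinate frame before graph merging.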
Our approach to connecting disjoint sequences runs in three stages: Stage 1, image retrieval (NetVLAD descriptors); Stage 2, Sim(3) estimation; Stage 3, graph merging.

Updating Camera Poses and Optical Flow
Each update iteration follows three steps: (1) we predict an update to the sparse optical flow; (2) we then update the camera pose estimates to align with the predicted flow; (3) finally, we clamp the optical flow to the newly induced epipolar lines.

The optimizer in (2) parameterizes the epipolar lines as a function of the camera poses and seeks to minimize the Symmetric Epipolar Distance (SED) between the predicted matches and the epipolar lines.

Differentiable Camera Pose Optimizer(s)
Unfortunately, the SED optimizer will converge to local minima if initialized far from the true optimum. To remedy this, we adopt a preconditioning stage that uses a weighted version of the 8-point algorithm. The combined optimizers are fully differentiable, meaning we can supervise on the final pose output.
Weighted 8-point algorithm (used for initialization): no local minima, but less accurate.
SED optimizer (used for refinement): local minima / non-convex, but more accurate.

[Figure: Convergence basin of each optimization layer — translation-direction error (degrees) as a function of the input camera pose, for the preconditioner and the SED solver.]

Ablation: Two-View Relative Pose on ScanNet
We evaluate our two-view relative pose subsystem in isolation; we outperform existing methods on ScanNet. Existing approaches perform matching as a preprocessing step, whereas ours is a weight-tied network alternating between optimization and matching.

[Table: Quantitative two-view results on ScanNet.]

Qualitative Two-View Matching Results
Qualitative results on ScanNet. Our two-view subsystem estimates accurate relative poses across wide camera baselines. It initializes all matches with uniform depth and an identity relative pose.
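The SED objective mentioned above has a standard closed form: for a fundamental matrix F induced by the current pose estimate and a match (x1, x2), it is the squared epipolar residual (x2ᵀ F x1)² normalized by the line-normal magnitudes in both images, i.e., a symmetric point-to-epipolar-line squared distance. A minimal NumPy sketch (the function name is illustrative; the poster's optimizer minimizes this quantity over poses rather than just evaluating it):

```python
import numpy as np

def symmetric_epipolar_distance(F, x1, x2):
    """Symmetric epipolar distance (SED) of matches (x1, x2) under a
    fundamental matrix F: the squared epipolar residual measured as a
    point-to-line distance in both images.

    F : (3, 3) fundamental matrix; x1, x2 : (N, 2) pixel coordinates.
    Returns an (N,) array of squared distances.
    """
    ones = np.ones((len(x1), 1))
    x1h = np.hstack([x1, ones])            # homogeneous coordinates
    x2h = np.hstack([x2, ones])
    l2 = x1h @ F.T                         # epipolar lines in image 2: F x1
    l1 = x2h @ F                           # epipolar lines in image 1: F^T x2
    resid = np.sum(x2h * l2, axis=1) ** 2  # (x2^T F x1)^2
    return resid * (1.0 / (l2[:, 0] ** 2 + l2[:, 1] ** 2)
                    + 1.0 / (l1[:, 0] ** 2 + l1[:, 1] ** 2))
```

Written with differentiable tensor ops, a residual like this can be minimized over the pose parameters inside an optimization layer and back-propagated through, which is the sense in which the poster's pose optimizer is differentiable.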
Progressive applications of our update operator lead to more accurate matches and higher predicted confidence.

[Figure: Our optimizer architecture, and the backward gradients (dotted lines).]

[Figure: Qualitative results on ScanNet — matches at initialization and after iterations 1, 3, and 12, colored from low to high predicted confidence.]