A detailed academic poster is displayed, featuring the research titled "Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization" by Lahav Lipson and Jia Deng. The poster is presented by the Princeton Vision & Learning Lab and was showcased at the CVPR conference in Seattle, WA, June 17-21, 2024.

**Core Content:**
- **Multi-Session SLAM**: Introduces a novel system for monocular Multi-Session SLAM that uses sparse optical flow and optimization layers to align views and refine poses.
- **Visual SLAM with Sparse Optical Flow**: Explains how the approach uses sparse optical flow to track keypoints across frames, highlighting its robustness in low-texture environments as well as drawbacks such as incompatibility with pre-existing approaches.
- **Connecting Disjoint Trajectories**: Details how the system estimates camera motion across multiple video streams under a single global reference, addressing the scale and alignment ambiguities between monocular videos.

**Technical Details:**
- **Recurrent Sparse Optical Flow for 2-view Camera Pose**: Describes the use of wide-baseline two-view optical flow to refine camera pose estimates, with optimization layers integrated to correct the pose estimates.
- **Updating Camera Poses and Optical Flow**: A step-by-step overview of how poses are iteratively updated to minimize residuals between the predicted matches and the optical flow vectors.
- **Differentiable Camera Pose Optimizer(s)**: A deep dive into the optimization machinery, including the weighted 8-point algorithm used for initialization and the SED (Symmetric Epipolar Distance) objective used for refinement, improving the accuracy of the pose estimates.

**Results:**
- **Quantitative Results**: Evaluations of the system on the EuRoC and ETH-3D datasets, with tables comparing performance metrics against existing methods.
- **Ablation: Two-view Relative Pose on ScanNet**: Analysis of performance variations when individual system components are removed.
- **Qualitative Two-View Matching Results**: Visual examples and descriptions of the two-view subsystem's ability to produce accurate relative poses across wide-baseline scenarios.

**Visuals:**
- Diagrams, tables, and images illustrate the system's design, workflow, and performance results.
- Charts visually compare the efficiency and accuracy of the methods.

The poster is beautifully arranged on a display board, with foot traffic visible in the background, indicating its presentation at a large academic or industry conference. The Princeton and CVPR logos are prominently displayed, reflecting the collaborative and prestigious nature of the research work.

Text transcribed from the image:

Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization
Lahav Lipson and Jia Deng

Multi-Session SLAM
We introduce a new system for Monocular Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with optimization layers to estimate camera pose.
Simultaneous Localization and Mapping (SLAM) is the task of estimating camera motion and a 3D map from video. Video data in the wild often consists of not a single continuous stream, but rather multiple disjoint sessions, either deliberately, such as in collaborative mapping where multiple robots perform joint rapid 3D reconstruction, or inadvertently, due to visual discontinuities in the video stream which can result from camera failures, extreme parallax, rapid turns, auto-exposure lag, dark areas, or occlusions.
Key Challenges: Real-time camera pose + scale/location/position ambiguities between monocular videos

PRINCETON VISION & LEARNING LAB
CVPR, JUNE 17-21, 2024, SEATTLE, WA

Recurrent Sparse Optical Flow for 2-view Camera Pose
A core subsystem of our Multi-Session SLAM method is a new approach to wide-baseline, 2-view relative camera pose. Given two views as input, we alternate between estimating sparse optical flow residuals using a weight-tied network, and updating the relative pose estimate with an optimization layer. Our method also implicitly learns a confidence measure for each predicted flow vector.

Quantitative Results
We evaluate our full system on the EuRoC and ETH-3D datasets, in which all the ground-truth trajectories are in a unified coordinate system. Compared to existing methods, our approach is significantly more robust and accurate. All reported methods run in real time (camera rate = 20 FPS).
[Table: Multi-Session SLAM on EuRoC — sequence groups MH01-03, MH01-05, V101-103, V201-203 with #disjoint trajectories per group, comparing Mono-Visual CCM-SLAM [36], Mono-Visual ORB-SLAM3, and our Mono-Visual method]
[Table: Multi-Session SLAM on ETH-3D — scenes Sofa, Table, Plant Scene, Einstein, Planar with #disjoint trajectories per scene, comparing Mono-Inertial VINS [23], Mono-Inertial ORB-SLAM3, Mono-Visual ORB-SLAM3 (fails on several scenes), and our Mono-Visual method]
Our prediction for sequence group V101-V103 of EuRoC
Example: Multi-Session SLAM

Visual SLAM with Sparse Optical Flow
Our approach uses sparse optical flow to track keypoints between frames. This is as opposed to keypoint-matching approaches like SuperGlue and ORB. Using sparse flow has both advantages and disadvantages.
Benefit: Robust in low-texture environments. No requirement for a keypoint detector.
Drawback: Existing approaches to Multi-Session SLAM, e.g., ORB-SLAM3, are incompatible.

Connecting Disjoint Trajectories
The goal in Multi-Session SLAM is to estimate camera motion for all monocular video streams under a single global reference. Our approach attempts (1) to estimate camera motion from video streams individually (like a typical SLAM system) and (2) to connect disjoint sequences by aligning their respective coordinate systems. To perform (2), we identify co-visible image pairs using image retrieval and then run our two-view pose estimator to predict a 7-DoF alignment between sequences.
[Diagram: NetVLAD descriptors → Stage 1: Image Retrieval → Stage 2: Sim(3) Estimation → Stage 3: Graph Merging]
Our approach to connecting disjoint sequences

Updating Camera Poses and Optical Flow
Each update iteration follows three steps: (1) We predict an update to sparse optical flow. (2) We then update the camera pose estimates to align with the predicted flow. (3) Finally, we clamp the optical flow to the newly-induced epipolar lines.
[Diagram: predicted flow between Image 1 and Image 2]
The optimizer in (2) parameterizes the epipolar lines as a function of the camera poses, and seeks to minimize the Symmetric Epipolar Distance (SED) between the predicted matches and the epipolar lines.
[Plot: convergence basin of each optimization layer vs. input camera pose — Pre-cond., SED Solver]

Differentiable Camera Pose Optimizer(s)
Unfortunately, the SED optimizer will converge to local minima if initialized far from the true optimum. To remedy this, we adopt a pre-conditioning stage which uses a weighted version of the 8-point algorithm. The combined optimizers are fully differentiable, meaning we can supervise on the final pose output.
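The 7-DoF Sim(3) alignment between sequences described above (scale, rotation, translation) has a standard closed-form solution once corresponding 3D points in the two coordinate frames are known. A minimal NumPy sketch using the classic Umeyama solver as a stand-in; the function name and setup are illustrative, not the poster authors' implementation:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Estimate a 7-DoF Sim(3) transform (scale s, rotation R, translation t)
    mapping src points onto dst points via the Umeyama closed-form solution.
    src, dst: (N, 3) arrays of corresponding 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                            # guard against reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)          # variance of the source cloud
    s = np.trace(np.diag(D) @ S) / var_s        # optimal scale
    t = mu_d - s * R @ mu_s
    return s, R, t
```

In a full pipeline, the point correspondences would come from the co-visible pairs found by image retrieval and the two-view pose estimator; here they are assumed given.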
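The SED objective that the optimization layer minimizes can be written compactly. A minimal NumPy sketch, assuming calibrated homogeneous image coordinates and the common convention E = [t]x R with x2 on the second camera's side; helper names are illustrative, not taken from the authors' code:

```python
import numpy as np

def skew(v):
    """3x3 cross-product matrix [v]x such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def sed_residuals(R, t, x1, x2):
    """Symmetric Epipolar Distance of matches x1 -> x2 under relative pose (R, t).
    x1, x2: (N, 3) homogeneous, normalized (calibrated) image coordinates."""
    E = skew(t) @ R                       # essential matrix
    l2 = x1 @ E.T                         # epipolar lines of x1 in image 2
    l1 = x2 @ E                           # epipolar lines of x2 in image 1
    num = np.sum(x2 * l2, axis=1) ** 2    # algebraic error (x2^T E x1)^2
    return num * (1.0 / (l2[:, 0]**2 + l2[:, 1]**2)
                + 1.0 / (l1[:, 0]**2 + l1[:, 1]**2))
```

The residuals vanish exactly when every predicted match lies on its epipolar line; in the poster's pipeline each residual would additionally be weighted by the learned confidence before the pose update.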
Weighted 8-pt algorithm (used for initialization): no local minima; less accurate.
SED Optimizer (used for refinement): local minima / non-convex; more accurate.
[Diagram: Preconditioning stage followed by SED Solver, with Pose Loss and Matching Loss; plot axis: Translation Direction Error (degrees)]

Ablation: Two-view Relative Pose on ScanNet
We evaluate our two-view relative pose subsystem in isolation. It outperforms existing methods on ScanNet. Existing approaches perform matching as a pre-processing step, whereas ours is a weight-tied network alternating between optimization and matching.
Quantitative two-view results on ScanNet

Qualitative Two-View Matching Results
Qualitative results on ScanNet. Our two-view subsystem estimates accurate relative poses across wide camera baselines. It initializes all matches with uniform depth and identity relative pose. Progressive applications of our update operator lead to more accurate matches and higher predicted confidence.
[Image panels: Initialization, Iteration 1, Iteration 3; colorbar from Low Confidence to High Confidence]
Our optimizer architecture, and the backward gradients (dotted lines)
Qualitative results on ScanNet
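The weighted 8-point initialization described above amounts to a weighted homogeneous least-squares problem on the epipolar constraint. A minimal NumPy sketch, assuming normalized coordinates and given per-match confidence weights w (a generic textbook solver standing in for the poster's learned, differentiable version):

```python
import numpy as np

def weighted_eight_point(x1, x2, w):
    """Weighted 8-point algorithm: essential matrix from N >= 8 normalized
    matches x1 -> x2 with per-match confidences w. Minimizes
    sum_i w_i * (x2_i^T E x1_i)^2 subject to ||E||_F = 1."""
    # each match contributes one row of the homogeneous system A @ vec(E) = 0
    A = np.sqrt(w)[:, None] * np.einsum('ni,nj->nij', x2, x1).reshape(-1, 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)              # null-space (smallest singular vector)
    # project onto the essential manifold: singular values (sigma, sigma, 0)
    U, s, Vt = np.linalg.svd(E)
    sigma = (s[0] + s[1]) / 2
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt
```

Because every step is an SVD or a matrix product, the solver admits gradients, which is consistent with the poster's point that the preconditioning stage can sit inside a fully differentiable pipeline.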