Vision-centric state estimation and mapping for visually challenging scenarios
- Publication Date
- 2025
- Abstract
Reliable 3D scene understanding is essential for enabling autonomous robot operation in complex environments. This thesis addresses vision-based state estimation and mapping in visually challenging scenarios, where conventional methods often struggle due to motion blur, low light, or highly dynamic motion. The overarching goal is to develop vision-centric systems that enhance state estimation and scene interpretation by leveraging both novel sensing technologies and robust multi-session mapping strategies.
The first contribution of this thesis is a stereo event-based visual odometry (VO) system that fully exploits the asynchronous, high-temporal-resolution nature of event cameras. Unlike traditional frame-based VO systems, which estimate robot states discretely at a fixed rate, the proposed system models camera motion as a continuous-time trajectory, enabling per-event state estimation. It combines asynchronous feature tracking with a physically grounded motion prior to estimate a smooth trajectory whose pose can be queried at any time within the measurement window. Experimental results demonstrate that the system achieves competitive performance under high-speed motion and challenging lighting conditions, offering a promising alternative for continuous-time state estimation on asynchronous data streams.
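To make the continuous-time idea concrete, the sketch below shows how a pose can be queried at an arbitrary timestamp rather than only at frame boundaries. It is a minimal illustration only: the thesis estimates the trajectory with asynchronous feature tracks and a physically grounded motion prior, whereas this sketch simply interpolates on-manifold between hypothetical timestamped control poses; all class and function names are illustrative assumptions, not the thesis' implementation.

```python
# Minimal sketch: querying a continuous-time trajectory at any timestamp,
# e.g. at the timestamp of an individual event. Hypothetical names throughout.
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-9:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Rotation matrix -> axis-angle vector."""
    cos_theta = np.clip((np.trace(R) - 1) / 2, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-9:
        return np.zeros(3)
    return theta / (2 * np.sin(theta)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])

class ContinuousTrajectory:
    """Piecewise-geodesic interpolation between timestamped control poses."""
    def __init__(self, stamps, rotations, translations):
        self.t = np.asarray(stamps)   # (N,) increasing timestamps
        self.R = rotations            # list of (3,3) rotation matrices
        self.p = translations         # list of (3,) positions

    def query(self, t):
        """Return (R, p) at an arbitrary time inside the window."""
        i = int(np.clip(np.searchsorted(self.t, t) - 1, 0, len(self.t) - 2))
        a = (t - self.t[i]) / (self.t[i + 1] - self.t[i])  # blend in [0, 1]
        # Geodesic (slerp-like) blend of rotation, linear blend of position.
        dR = so3_log(self.R[i].T @ self.R[i + 1])
        R = self.R[i] @ so3_exp(a * dR)
        p = (1 - a) * self.p[i] + a * self.p[i + 1]
        return R, p

# Usage: the pose is available at an event's timestamp, not a frame rate.
traj = ContinuousTrajectory(
    stamps=[0.0, 0.1],
    rotations=[np.eye(3), so3_exp(np.array([0.0, 0.0, 0.2]))],
    translations=[np.zeros(3), np.array([0.05, 0.0, 0.0])],
)
R_evt, p_evt = traj.query(0.037)
```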
The second contribution introduces Exosense, a scene understanding system tailored to self-balancing exoskeletons. Built around a wide field-of-view multi-camera device, Exosense generates rich, semantically annotated elevation maps that integrate geometry, terrain traversability, and room-level semantics. The system supports indoor navigation by providing reusable environment representations for localization and planning. Designed as a wearable sensing platform, Exosense emphasizes modularity and adaptability, with the potential for integration into a broader wearable sensor ecosystem.
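A semantically annotated elevation map of this kind can be pictured as a 2.5D grid with per-cell layers for height, traversability, and a room-level label. The sketch below is an assumption-laden simplification of that representation: the layer set, the update rule, and all names are illustrative, not Exosense's actual data structure.

```python
# Minimal sketch of a semantically annotated elevation map: a 2.5D grid with
# geometry, traversability, and room-level semantic layers. Hypothetical API.
import numpy as np

class SemanticElevationMap:
    def __init__(self, size=(100, 100), resolution=0.05):
        self.resolution = resolution                    # metres per cell
        self.height = np.full(size, np.nan)             # elevation layer
        self.traversability = np.zeros(size)            # 0 (blocked) .. 1 (free)
        self.room_label = np.full(size, -1, dtype=int)  # semantic room id

    def _cell(self, x, y):
        return int(x / self.resolution), int(y / self.resolution)

    def update(self, x, y, z, trav, label):
        """Fuse one measurement; keep the latest values (illustrative rule)."""
        i, j = self._cell(x, y)
        self.height[i, j] = z
        self.traversability[i, j] = trav
        self.room_label[i, j] = label

    def is_traversable(self, x, y, threshold=0.5):
        i, j = self._cell(x, y)
        return self.traversability[i, j] >= threshold

# Usage: a planner queries geometry and semantics through one interface.
m = SemanticElevationMap()
m.update(x=1.0, y=2.0, z=0.02, trav=0.9, label=3)  # label 3 = e.g. "corridor"
print(m.is_traversable(1.0, 2.0))                   # True
```

Keeping all layers in one grid is what makes the representation reusable: localization, traversability analysis, and room-level reasoning can share the same cells.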
Building upon Exosense, the third contribution is LT-Exosense, a change-aware, multi-session mapping system designed for long-term operation in dynamic environments. LT-Exosense incrementally merges scene representations built during repeated traversals of an environment, detects environmental changes, and updates a unified global map. This representation enables adaptive path planning in response to those changes. The system supports persistent spatial memory and is compatible with different sensor configurations, offering a flexible and scalable foundation for lifelong assistive mobility.
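The sketch below illustrates the change-aware merging step in its simplest form, assuming two sessions already aligned in a common frame and reduced to elevation grids. The cell-wise height comparison, the threshold, and the overwrite rule are illustrative assumptions, not the thesis' actual change-detection criterion.

```python
# Minimal sketch of change-aware multi-session merging over aligned
# elevation grids. NaN marks unobserved cells. Hypothetical names and rules.
import numpy as np

def merge_sessions(global_height, new_height, change_thresh=0.10):
    """Return the updated global map and a boolean mask of changed cells."""
    both_seen = ~np.isnan(global_height) & ~np.isnan(new_height)
    changed = both_seen & (np.abs(global_height - new_height) > change_thresh)

    merged = global_height.copy()
    merged[changed] = new_height[changed]        # trust the newer traversal
    newly_seen = np.isnan(global_height) & ~np.isnan(new_height)
    merged[newly_seen] = new_height[newly_seen]  # extend map coverage
    return merged, changed

# Usage: flagged cells can trigger replanning around, e.g., moved furniture.
g = np.array([[0.0, 0.0], [np.nan, 0.5]])
n = np.array([[0.0, 0.4], [0.1, 0.5]])
merged, changed = merge_sessions(g, n)
print(changed)  # [[False  True]
                #  [False False]]
```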
Together, these contributions span complementary aspects of vision-centric state estimation and mapping in challenging scenarios: high-speed sensing, semantic scene interpretation, and long-term map maintenance. The thesis opens up new possibilities for robust autonomy on resource-constrained platforms, such as drones and self-balancing exoskeletons, where reliable environmental understanding is critical to safe and intelligent operation.
- Publication Details
- Type
- D.Phil. Thesis
- Institution
- University of Oxford
- Notes
- Co-supervised with Prof. Maurice Fallon
- BibTeX Entry
@phdthesis{wang_dphil25,
  author = {Jianeng Wang},
  title  = {Vision-centric state estimation and mapping for visually challenging scenarios},
  school = {University of Oxford},
  year   = {2025},
}