Modality-invariant Visual Odometry for Embodied Vision

1University of Washington, 2Swiss Federal Institute of Technology Lausanne (EPFL)
CVPR 2023

VOTransformer localizes the agent from visual observations and is agnostic to the input modality.


Effectively localizing an agent in a realistic, noisy setting is crucial for many embodied vision tasks. Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors, especially in indoor environments. While SLAM-based methods show solid performance without large data requirements, they are less flexible and less robust to noise and changes in the sensor suite than learning-based approaches. Recent deep VO models, however, limit themselves to a fixed set of input modalities, e.g., RGB and depth, while training on millions of samples. When sensors fail, sensor suites change, or modalities are intentionally dropped due to resource constraints, e.g., power consumption, these models fail catastrophically. Moreover, training them from scratch is even more expensive without simulator access or suitable existing models to fine-tune. While such scenarios are mostly ignored in simulation, they commonly hinder a model's reusability in real-world applications.

Navigation setup where GPS+Compass is missing and (visual) sensor availability varies.

We propose a Transformer-based modality-invariant VO approach that can deal with diverse or changing sensor suites of navigation agents. Our model outperforms previous methods while training on only a fraction of the data. We hope this method opens the door to a broader range of real-world applications that can benefit from flexible and learned VO models.


Explicit Modality Invariant Training

Navigation Performance

Explicitly training the model to be invariant to its input modalities is one way of dealing with missing sensory information at test time. We enforce modality invariance through an explicit training scheme: dropping modalities during training to simulate missing modalities at test time. We model this as a multinomial distribution over modality combinations (here: RGB, Depth, RGB-D) with equal probability. For each batch, we draw a sample from the distribution to determine which combination to train on. Try it yourself! Drop modalities from the observations and observe how the agent can still localize itself well enough to navigate to the goal. We show the shortest path and the actual path taken by the agent. Collisions with the environment are indicated by a red frame around the observations.

Hint: Use the buttons to remove modalities from the VOT's observations.
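The batch-wise modality sampling described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function names and the zero-masking of dropped modalities are assumptions for the sketch.

```python
import random
import numpy as np

# Equal-probability multinomial over modality combinations (RGB, Depth, RGB-D).
MODALITY_COMBOS = [("rgb",), ("depth",), ("rgb", "depth")]

def sample_batch_modalities():
    """Draw one modality combination per batch (uniform over the three combos)."""
    return random.choice(MODALITY_COMBOS)

def mask_observations(batch, keep):
    """Zero out modalities not in `keep`, simulating missing sensors at test time."""
    return {m: (x if m in keep else np.zeros_like(x)) for m, x in batch.items()}

# Example training step: sample a combination, then mask the batch accordingly.
batch = {"rgb": np.random.rand(8, 3, 64, 64), "depth": np.random.rand(8, 1, 64, 64)}
kept = sample_batch_modalities()
masked = mask_observations(batch, kept)
```

Because sampling happens per batch rather than per sample, every gradient step sees a consistent sensor suite, while over training the model is exposed to all combinations equally often.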

Side-by-side Comparison

The agent navigates the Cantwell scene from start to goal. We show the shortest path, the actual path taken by the agent, and the "imaginary" path the agent believes it took based on its VO estimates. When trained on RGB-D only (VOT-B), localization becomes inaccurate once modalities are unavailable (Drop RGB / Depth). The localization error accumulates over the course of the trajectory, causing the actual and "imaginary" paths to diverge; the result is a failure to complete the episode. Our explicit modality-invariant training (w/ inv.) enforces modality invariance by dropping modalities batch-wise during training. With this scheme, VOT w/ inv. learns not to rely on a single modality, leading to success even when modalities are missing!


Attention Maps

We condition the VOT on the noiseless action taken by the agent. Inspecting the attention maps, we find that different actions prime the VOT to attend to meaningful regions in the image. For instance, turning left leads the model to focus on regions visible at both time steps (see below). This makes intuitive sense, as a 30° turn strongly displaces visual features or even pushes them out of the agent's field of view. A similar behavior emerges for moving forward, which leads to attention on central regions, e.g., the walls and the end of a hallway (see below).
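One common way to condition a Transformer on a discrete action, and a plausible reading of the setup above, is to embed the action and prepend it as an extra token so that self-attention can route information depending on the motion performed. The sketch below is hypothetical: the embedding table, dimensions, and `build_tokens` helper are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical discrete action set of the navigation agent.
ACTIONS = {"move_forward": 0, "turn_left": 1, "turn_right": 2}
EMBED_DIM = 32

rng = np.random.default_rng(0)
# Action embedding table; in a real model these vectors would be learned.
action_table = rng.normal(size=(len(ACTIONS), EMBED_DIM))

def build_tokens(patch_tokens, action):
    """Prepend the action embedding to the patch tokens of the frame pair."""
    a = action_table[ACTIONS[action]][None, :]        # (1, D)
    return np.concatenate([a, patch_tokens], axis=0)  # (1 + N, D)

# Patch tokens from the two consecutive observations (e.g., 16 patches each).
patches = rng.normal(size=(2 * 16, EMBED_DIM))
tokens = build_tokens(patches, "turn_left")
```

With the action token in the sequence, every attention layer can attend to it, which is consistent with the observation that different actions produce visibly different attention patterns.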


Hint: Drag the slider to overlay the attention map over the observations.

Habitat Challenge

We submit our VOT (RGB-D) to the Habitat Challenge 2021 benchmark (test-std split). Using the same navigation policy as the Rank 2 entry, we achieve the highest SSPL while training on only 5% of the data. (Leaderboard)

Rank | Participant team           | Success | SPL | SSPL
1    | MultiModalVO (VOT) (ours)  | 93      | 74  | 77
2    | VO for Realistic PointGoal | 94      | 74  | 76
3    | robotics                   | 91      | 70  | 71
4    | VO2021                     | 78      | 59  | 69
5    | Differentiable SLAM-net    | 65      | 47  | 60


@inproceedings{memmel2023modality,
    title={Modality-invariant Visual Odometry for Embodied Vision},
    author={Memmel, Marius and Bachmann, Roman and Zamir, Amir},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2023}
}