Overview
Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures.
We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a cognitively inspired framework whose Ventral and Dorsal Streams mimic human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. We also develop Nav-Ensurance to keep navigation safe and efficient.
We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL) metrics, significantly outperforming existing methods. We also introduce a new evaluation metric (AORI, lower is better) to better assess navigation intelligence.
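
For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how SR and SPL are commonly computed over a set of episodes, using the usual definition SPL = (1/N) Σ S_i · l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path the agent actually took. The `Episode` class and its field names are hypothetical and are not taken from the DORAEMON code. AORI is not defined in this section, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool         # whether the agent stopped close enough to the goal
    shortest_path: float  # shortest-path length from start to goal, in metres (> 0)
    agent_path: float     # length of the path the agent actually travelled, in metres


def success_rate(episodes):
    """Percentage of episodes that ended in success."""
    return 100.0 * sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by path length, as a percentage."""
    total = 0.0
    for e in episodes:
        if e.success:
            # The ratio is 1.0 for an optimal path and shrinks as the agent detours.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return 100.0 * total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),    # optimal path
    Episode(success=True, shortest_path=4.0, agent_path=8.0),    # twice as long as needed
    Episode(success=False, shortest_path=6.0, agent_path=12.0),  # failure contributes 0
    Episode(success=True, shortest_path=3.0, agent_path=6.0),
]

print(success_rate(episodes))  # 75.0
print(spl(episodes))           # 50.0
```

Because failed episodes contribute zero and detours are penalized, SPL can never exceed SR, which is why a method can lead on SR while trailing on SPL.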
Method
Overall Architecture
DORAEMON employs a decentralized ontology-aware architecture inspired by human navigation systems, featuring coordinated Dorsal and Ventral streams with Nav-Ensurance for robust navigation.
Dorsal Stream
The Dorsal Stream handles spatial processing through Hierarchical Semantic-Spatial Fusion and Topology Mapping, addressing spatiotemporal discontinuities in navigation.
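
To make the idea of a topological memory concrete, here is a minimal sketch of a graph that links discrete observations into connected places. It is not the DORAEMON implementation: the `TopologyMap` class, the `merge_radius` parameter, and the rule of merging observations that fall within that radius are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A visited place in the hypothetical topological map."""
    node_id: int
    position: tuple  # (x, y) in metres
    step: int        # last time step at which the place was observed
    labels: set = field(default_factory=set)  # semantic labels seen here


class TopologyMap:
    def __init__(self, merge_radius=1.0):
        self.nodes = []
        self.edges = {}  # node_id -> set of neighbouring node_ids
        self.merge_radius = merge_radius  # assumed merging threshold
        self._last_id = None

    def _nearest(self, position):
        return min(self.nodes,
                   key=lambda n: math.dist(n.position, position),
                   default=None)

    def add_observation(self, position, step, labels):
        """Attach an observation to a nearby node, or create a new node."""
        node = self._nearest(position)
        if node is None or math.dist(node.position, position) > self.merge_radius:
            node = Node(len(self.nodes), position, step)
            self.nodes.append(node)
            self.edges[node.node_id] = set()
        node.labels |= set(labels)
        node.step = step
        # Connect consecutive places so the temporal order becomes graph structure.
        if self._last_id is not None and self._last_id != node.node_id:
            self.edges[self._last_id].add(node.node_id)
            self.edges[node.node_id].add(self._last_id)
        self._last_id = node.node_id
        return node

    def nodes_with(self, label):
        """Ids of all places where the given label was observed."""
        return [n.node_id for n in self.nodes if label in n.labels]


tmap = TopologyMap(merge_radius=1.0)
tmap.add_observation((0.0, 0.0), step=0, labels={"sofa"})
tmap.add_observation((2.0, 0.0), step=1, labels={"tv"})
tmap.add_observation((0.3, 0.1), step=2, labels={"plant"})  # merges into node 0
print(tmap.nodes_with("plant"))
```

Merging revisited places into one node is one simple way to keep a memory continuous across discrete observations: the third observation above updates the first node instead of creating a disconnected one.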
Ventral Stream
The Ventral Stream manages object recognition and decision-making through RAG-VLM (Retrieval-Augmented Generation with Vision-Language Models) and Policy-VLM components.
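
The source does not spell out how RAG-VLM retrieves stored knowledge, so the following only illustrates the generic retrieval step of retrieval-augmented generation: rank memory entries by cosine similarity to a query embedding and place the top entries in the prompt. The toy three-dimensional vectors, the `retrieve` and `build_prompt` helpers, and the prompt wording are hypothetical; a real system would obtain embeddings from a learned encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, memory, k=2):
    """Return the texts of the k memory entries most similar to the query."""
    ranked = sorted(memory,
                    key=lambda m: cosine(query_vec, m["vec"]),
                    reverse=True)
    return [m["text"] for m in ranked[:k]]


def build_prompt(goal, retrieved):
    """Assemble a text prompt for a policy model (hypothetical format)."""
    context = "\n".join(f"- {text}" for text in retrieved)
    return f"Goal: find the {goal}.\nRelevant memory:\n{context}\nNext action?"


# Toy memory with hand-written embeddings (assumption, for illustration only).
memory = [
    {"vec": [1.0, 0.0, 0.0], "text": "sofa seen in the living room"},
    {"vec": [0.0, 1.0, 0.0], "text": "toilet seen in the bathroom"},
    {"vec": [0.9, 0.1, 0.0], "text": "tv mounted opposite the sofa"},
]

hits = retrieve([1.0, 0.0, 0.0], memory, k=2)
print(build_prompt("sofa", hits))
```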
📺 Demo
🛋️ SOFA
🟦 TABLE
🛏️ BED
🌳 PLANT
🗄️ CABINET
💺 CHAIR
🌳 PLANT
🛋️ SOFA
📺 TV
🚽 TOILET
🛋️ SOFA
💺 CHAIR
Results
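Against the end-to-end baselines (Prompt-only, PIVOT, VLMNav), DORAEMON obtains the best SR, SPL, and AORI on both HM3Dv2 and GOAT. In the broader comparison it reports the highest SR among training-free (TF) methods on HM3Dv1, HM3Dv2, and MP3D, although several other methods retain a higher SPL.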
End-to-End Methods Comparison
(a) HM3Dv2 ObjectNav Benchmark
| Method | SR (%) ↑ | SPL (%) ↑ | AORI (%) ↓ |
|---|---|---|---|
| Prompt-only | 29.8 | 10.7 | - |
| PIVOT | 24.6 | 10.6 | 63.3 |
| VLMNav | 51.6 | 18.3 | 61.5 |
| DORAEMON (Ours) | 62.0 | 23.0 | 50.1 |
| Improvement (relative, %) | 20.2 | 10.0 | 18.5 |
(b) GOAT Benchmark
| Method | SR (%) ↑ | SPL (%) ↑ | AORI (%) ↓ |
|---|---|---|---|
| Prompt-only | 11.3 | 3.7 | - |
| PIVOT | 8.3 | 3.8 | 64.9 |
| VLMNav | 22.1 | 9.3 | 63.6 |
| DORAEMON (Ours) | 24.3 | 10.3 | 56.9 |
| Improvement (relative, %) | 10.0 | 10.8 | 10.5 |
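
The Improvement row of table (b) appears to be the relative change with respect to VLMNav, the strongest baseline. This reading is inferred from the numbers and is not stated in the source. The short check below reproduces the row under that assumption; `relative_improvement` is a hypothetical helper, and for AORI the sign is flipped because lower values are better.

```python
def relative_improvement(ours, baseline, lower_is_better=False):
    """Relative gain over a baseline, in percent."""
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / baseline


# (DORAEMON, VLMNav, lower_is_better) taken from the GOAT table above.
goat = {
    "SR": (24.3, 22.1, False),
    "SPL": (10.3, 9.3, False),
    "AORI": (56.9, 63.6, True),
}

improvements = {
    name: round(relative_improvement(ours, base, lower), 1)
    for name, (ours, base, lower) in goat.items()
}
print(improvements)  # {'SR': 10.0, 'SPL': 10.8, 'AORI': 10.5}
```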
Comprehensive Comparison with State-of-the-Art Methods
ZS: zero-shot; TF: training-free; E2E: end-to-end.

| Method | ZS | TF | E2E | HM3Dv1 SR (%) ↑ | HM3Dv1 SPL (%) ↑ | HM3Dv2 SR (%) ↑ | HM3Dv2 SPL (%) ↑ | MP3D SR (%) ↑ | MP3D SPL (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| ProcTHOR | ✗ | ✗ | ✗ | 54.4 | 31.8 | - | - | - | - |
| SemEXP | ✓ | ✗ | ✗ | - | - | - | - | 36.0 | 14.4 |
| Habitat-Web | ✓ | ✗ | ✗ | 41.5 | 16.0 | - | - | 31.6 | 8.5 |
| PONI | ✓ | ✗ | ✗ | - | - | - | - | 31.8 | 12.1 |
| ProcTHOR-ZS | ✓ | ✗ | ✗ | 13.2 | 7.7 | - | - | - | - |
| ZSON | ✓ | ✗ | ✗ | 25.5 | 12.6 | - | - | 15.3 | 4.8 |
| PSL | ✓ | ✗ | ✗ | 42.4 | 19.2 | - | - | - | - |
| Pixel-Nav | ✓ | ✗ | ✗ | 37.9 | 20.5 | - | - | - | - |
| SGM | ✓ | ✗ | ✗ | 60.2 | 30.8 | - | - | 37.7 | 14.7 |
| ImagineNav | ✓ | ✗ | ✗ | 53.0 | 23.8 | - | - | - | - |
| CoW | ✓ | ✓ | ✗ | - | - | - | - | 7.4 | 3.7 |
| ESC | ✓ | ✓ | ✗ | 39.2 | 22.3 | - | - | 28.7 | 14.2 |
| L3MVN | ✓ | ✓ | ✗ | 50.4 | 23.1 | 36.3 | 15.7 | 34.9 | 14.5 |
| VLFM | ✓ | ✓ | ✗ | 52.5 | 30.4 | 63.6 | 32.5 | 36.4 | 17.5 |
| VoroNav | ✓ | ✓ | ✗ | 42.0 | 26.0 | - | - | - | - |
| TopV-Nav | ✓ | ✓ | ✗ | 52.0 | 28.6 | - | - | 35.2 | 16.4 |
| SG-Nav | ✓ | ✓ | ✗ | 54.0 | 24.9 | 49.6 | 25.5 | 40.2 | 16.0 |
| DORAEMON (Ours) | ✓ | ✓ | ✓ | 55.6 | 21.4 | 66.5 | 20.6 | 41.1 | 15.8 |