Overview
Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures.
We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a novel cognitive-inspired framework consisting of Ventral and Dorsal Streams that mimics human navigation capabilities. The Dorsal Stream implements the Hierarchical Semantic-Spatial Fusion and Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. Our approach also develops Nav-Ensurance to ensure navigation safety and efficiency.
We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL), outperforming existing methods. We also introduce a new evaluation metric, AORI, to better assess navigation intelligence.
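For readers unfamiliar with the two headline metrics, the sketch below computes SR and SPL using the standard definition of SPL for embodied navigation (success weighted by the ratio of shortest-path length to the path actually travelled). The `Episode` class and the sample episodes are illustrative assumptions, not data or code from DORAEMON.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool          # did the agent stop at the goal?
    shortest_path: float   # shortest-path length from start to goal, in metres
    agent_path: float      # length of the path the agent actually travelled


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # A failed episode contributes 0; a successful one is
            # discounted by how much longer than optimal the path was.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Made-up episodes, purely for illustration.
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),    # optimal
    Episode(success=True, shortest_path=4.0, agent_path=8.0),    # twice as long
    Episode(success=False, shortest_path=6.0, agent_path=12.0),  # failure
    Episode(success=True, shortest_path=3.0, agent_path=6.0),    # twice as long
]

print(f"SR  = {100 * success_rate(episodes):.1f}%")   # SR  = 75.0%
print(f"SPL = {100 * spl(episodes):.1f}%")            # SPL = 50.0%
```

SPL can never exceed SR, since each successful episode contributes at most 1. This is why a method can lead on SR while trailing on SPL if its paths are longer.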
Method
Overall Architecture

DORAEMON employs a decentralized ontology-aware architecture inspired by human navigation systems, featuring coordinated Dorsal and Ventral streams with Nav-Ensurance for robust navigation.
Dorsal Stream

The Dorsal Stream handles spatial processing through Hierarchical Semantic-Spatial Fusion and Topology Mapping, addressing spatiotemporal discontinuities in navigation.
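To illustrate how a topological memory can connect discrete observations into one structure, here is a minimal sketch. It is not DORAEMON's implementation: the `TopologyMap` class, the `merge_radius` parameter, and the merge-by-distance rule are all assumptions made for illustration, and the Hierarchical Semantic-Spatial Fusion step is not modelled.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A place the agent has visited (hypothetical structure)."""
    node_id: int
    position: tuple
    labels: set = field(default_factory=set)      # objects seen from here
    neighbors: set = field(default_factory=set)   # ids of adjacent nodes


class TopologyMap:
    """Toy topological map: nearby observations merge into one node."""

    def __init__(self, merge_radius=1.0):
        self.merge_radius = merge_radius  # assumed threshold, in metres
        self.nodes = []
        self._last = None                 # id of the most recently visited node

    def _nearest(self, position):
        """Return the closest node within merge_radius, or None."""
        best, best_dist = None, self.merge_radius
        for node in self.nodes:
            dist = math.dist(node.position, position)
            if dist <= best_dist:
                best, best_dist = node, dist
        return best

    def add_observation(self, position, labels):
        """Record an observation and link it to the previous node."""
        node = self._nearest(position)
        if node is None:
            node = Node(len(self.nodes), position)
            self.nodes.append(node)
        node.labels.update(labels)
        if self._last is not None and self._last != node.node_id:
            # Consecutive visits imply the two places are traversable.
            node.neighbors.add(self._last)
            self.nodes[self._last].neighbors.add(node.node_id)
        self._last = node.node_id
        return node

    def find(self, label):
        """Ids of nodes from which an object with this label was seen."""
        return [n.node_id for n in self.nodes if label in n.labels]


topo = TopologyMap(merge_radius=1.0)
topo.add_observation((0.0, 0.0), {"sofa"})
topo.add_observation((0.3, 0.2), {"tv"})    # close to node 0, so it merges
topo.add_observation((3.0, 0.0), {"bed"})   # far away, so a new node 1 is created
print(topo.find("tv"))                      # [0]
```

Because observations taken a few steps apart fall into the same node, the agent keeps one persistent record per place rather than a sequence of unrelated frames.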
Ventral Stream

The Ventral Stream manages object recognition and decision-making through RAG-VLM (Retrieval-Augmented Generation with Vision-Language Models) and Policy-VLM components.
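The retrieval step of a retrieval-augmented prompt can be sketched as follows. This is an illustrative assumption rather than the RAG-VLM component itself: the bag-of-words `embed` function stands in for a learned embedding model, and the memory notes, the prompt wording, and every function name are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, memory, k=2):
    """Return the k memory notes most similar to the query."""
    q = embed(query)
    ranked = sorted(memory, key=lambda note: cosine(q, embed(note)), reverse=True)
    return ranked[:k]


def build_prompt(goal, memory, k=2):
    """Compose a text prompt that places retrieved memory before the question."""
    context = "\n".join(f"- {note}" for note in retrieve(goal, memory, k))
    return (
        f"Goal: find a {goal}.\n"
        f"Relevant memory:\n{context}\n"
        "Choose the next waypoint."
    )


# Hypothetical memory notes an agent might have accumulated.
memory = [
    "living room with sofa and tv near the entrance",
    "kitchen with table and chair",
    "bedroom with bed and cabinet",
]

print(build_prompt("bed", memory, k=1))
```

With the goal `"bed"`, only the third note shares a token with the query, so it is the one placed in the prompt. A real system would pass the prompt to a VLM together with the current image.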
📺 Demo

Demo episodes by target object: 🛋️ SOFA, 🟦 TABLE, 🛏️ BED, 🌳 PLANT, 🗄️ CABINET, 💺 CHAIR, 🌳 PLANT, 🛋️ SOFA, 📺 TV, 🚽 TOILET, 🛋️ SOFA, 💺 CHAIR
Results
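DORAEMON outperforms the other end-to-end methods on HM3Dv2 and GOAT in SR, SPL, and AORI. Against the broader set of methods it reaches the highest SR on HM3Dv2 (66.5%) and MP3D (41.1%), while several non-end-to-end methods report higher SPL.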
End-to-End Methods Comparison
(a) HM3Dv2 ObjectNav Benchmark
Method | SR (%) ↑ | SPL (%) ↑ | AORI (%) ↓ |
---|---|---|---|
Prompt-only | 29.8 | 10.7 | - |
PIVOT | 24.6 | 10.6 | 63.3 |
VLMNav | 51.6 | 18.3 | 61.5 |
DORAEMON (Ours) | 62.0 | 23.0 | 50.1 |
Improvement | 20.2 | 10.0 | 18.5 |
(b) GOAT Benchmark
Method | SR (%) ↑ | SPL (%) ↑ | AORI (%) ↓ |
---|---|---|---|
Prompt-only | 11.3 | 3.7 | - |
PIVOT | 8.3 | 3.8 | 64.9 |
VLMNav | 22.1 | 9.3 | 63.6 |
DORAEMON (Ours) | 24.3 | 10.3 | 56.9 |
Improvement | 10.0 | 10.8 | 10.5 |
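The tables do not state how the Improvement row is computed. The GOAT values in (b) are consistent with a relative change, in percent, against the strongest baseline (VLMNav). The sketch below reproduces them under that assumption; the function name and the `higher_is_better` flag are illustrative.

```python
def relative_improvement(ours, baseline, higher_is_better=True):
    """Relative change versus a baseline, in percent.

    For metrics where lower is better (such as AORI), the sign is
    flipped so that a reduction counts as a positive improvement.
    """
    delta = ours - baseline if higher_is_better else baseline - ours
    return 100.0 * delta / baseline


# GOAT benchmark values from table (b): (DORAEMON, VLMNav, higher_is_better)
goat = {
    "SR": (24.3, 22.1, True),
    "SPL": (10.3, 9.3, True),
    "AORI": (56.9, 63.6, False),
}

for name, (ours, baseline, higher_is_better) in goat.items():
    value = relative_improvement(ours, baseline, higher_is_better)
    print(f"{name}: {value:.1f}%")
# SR: 10.0%
# SPL: 10.8%
# AORI: 10.5%
```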
Comprehensive Comparison with State-of-the-Art Methods
Method | ZS | TF | E2E | HM3Dv1 SR (%) ↑ | HM3Dv1 SPL (%) ↑ | HM3Dv2 SR (%) ↑ | HM3Dv2 SPL (%) ↑ | MP3D SR (%) ↑ | MP3D SPL (%) ↑ |
---|---|---|---|---|---|---|---|---|---|
ProcTHOR | ✗ | ✗ | ✗ | 54.4 | 31.8 | - | - | - | - |
SemEXP | ✓ | ✗ | ✗ | - | - | - | - | 36.0 | 14.4 |
Habitat-Web | ✓ | ✗ | ✗ | 41.5 | 16.0 | - | - | 31.6 | 8.5 |
PONI | ✓ | ✗ | ✗ | - | - | - | - | 31.8 | 12.1 |
ProcTHOR-ZS | ✓ | ✗ | ✗ | 13.2 | 7.7 | - | - | - | - |
ZSON | ✓ | ✗ | ✗ | 25.5 | 12.6 | - | - | 15.3 | 4.8 |
PSL | ✓ | ✗ | ✗ | 42.4 | 19.2 | - | - | - | - |
Pixel-Nav | ✓ | ✗ | ✗ | 37.9 | 20.5 | - | - | - | - |
SGM | ✓ | ✗ | ✗ | 60.2 | 30.8 | - | - | 37.7 | 14.7 |
ImagineNav | ✓ | ✗ | ✗ | 53.0 | 23.8 | - | - | - | - |
CoW | ✓ | ✓ | ✗ | - | - | - | - | 7.4 | 3.7 |
ESC | ✓ | ✓ | ✗ | 39.2 | 22.3 | - | - | 28.7 | 14.2 |
L3MVN | ✓ | ✓ | ✗ | 50.4 | 23.1 | 36.3 | 15.7 | 34.9 | 14.5 |
VLFM | ✓ | ✓ | ✗ | 52.5 | 30.4 | 63.6 | 32.5 | 36.4 | 17.5 |
VoroNav | ✓ | ✓ | ✗ | 42.0 | 26.0 | - | - | - | - |
TopV-Nav | ✓ | ✓ | ✗ | 52.0 | 28.6 | - | - | 35.2 | 16.4 |
SG-Nav | ✓ | ✓ | ✗ | 54.0 | 24.9 | 49.6 | 25.5 | 40.2 | 16.0 |
DORAEMON (Ours) | ✓ | ✓ | ✓ | 55.6 | 21.4 | 66.5 | 20.6 | 41.1 | 15.8 |