DORAEMON: Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation

Overview

Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures.

We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a cognitively inspired framework whose Ventral and Dorsal Streams mimic human navigation capabilities. The Dorsal Stream implements Hierarchical Semantic-Spatial Fusion and a Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. We also develop Nav-Ensurance to keep navigation safe and efficient.

We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL), outperforming existing methods. We also introduce a new evaluation metric, AORI, to better assess navigation intelligence.
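For readers unfamiliar with the two metrics, the sketch below computes SR and SPL using the standard definition of SPL: each successful episode contributes the ratio of the shortest-path length to the length of the path the agent actually took, and failed episodes contribute zero. The `Episode` record and the sample numbers are illustrative assumptions, not data from DORAEMON's evaluation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    # Hypothetical per-episode record, used only for this illustration.
    success: bool          # did the agent stop at the goal?
    shortest_path: float   # shortest start-to-goal distance, in metres (> 0)
    agent_path: float      # length of the path the agent actually took, in metres


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of successful episodes, in [0, 1]."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by path length, in [0, 1]."""
    total = 0.0
    for e in episodes:
        if e.success:
            # A detour makes agent_path larger and shrinks the episode's credit.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),
    Episode(success=True, shortest_path=4.0, agent_path=4.0),
    Episode(success=False, shortest_path=6.0, agent_path=12.0),
    Episode(success=True, shortest_path=3.0, agent_path=6.0),
]
print(f"SR  = {100 * success_rate(episodes):.1f}%")   # SR  = 75.0%
print(f"SPL = {100 * spl(episodes):.1f}%")            # SPL = 50.0%
```

In this toy example three of four episodes succeed, but two of the successes take paths twice as long as necessary, so SPL is well below SR. A similar gap between SR and SPL is visible in the result tables.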

Method

Overall Architecture

DORAEMON employs a decentralized ontology-aware architecture inspired by human navigation systems, featuring coordinated Dorsal and Ventral streams with Nav-Ensurance for robust navigation.

Dorsal Stream

The Dorsal Stream handles spatial processing through Hierarchical Semantic-Spatial Fusion and Topology Mapping, addressing spatiotemporal discontinuities in navigation.
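As a rough illustration of what a topological memory with semantic annotations might look like, the sketch below stores visited positions as graph nodes, merges observations that fall close to an existing node, and links consecutively visited nodes. Every name here (`TopoMap`, `TopoNode`, `merge_radius`) is hypothetical; this page does not describe DORAEMON's actual data structures, so treat this as one possible reading rather than the authors' implementation.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class TopoNode:
    # Hypothetical node: a visited place plus the object labels seen there.
    node_id: int
    position: tuple[float, float]
    labels: set[str] = field(default_factory=set)
    neighbors: set[int] = field(default_factory=set)


class TopoMap:
    """Toy topological map (an assumption, not DORAEMON's implementation)."""

    def __init__(self, merge_radius: float = 1.0) -> None:
        self.merge_radius = merge_radius   # observations closer than this share a node
        self.nodes: list[TopoNode] = []
        self._last = None                  # id of the most recently visited node

    def _nearest(self, position):
        best, best_dist = None, float("inf")
        for node in self.nodes:
            dist = math.dist(node.position, position)
            if dist < best_dist:
                best, best_dist = node, dist
        return best, best_dist

    def observe(self, position, labels) -> int:
        """Record an observation and return the id of the node that holds it."""
        node, dist = self._nearest(position)
        if node is None or dist > self.merge_radius:
            node = TopoNode(len(self.nodes), position)
            self.nodes.append(node)
        node.labels.update(labels)
        # Link the previously visited node to this one, keeping temporal continuity.
        if self._last is not None and self._last != node.node_id:
            node.neighbors.add(self._last)
            self.nodes[self._last].neighbors.add(node.node_id)
        self._last = node.node_id
        return node.node_id

    def find(self, label: str) -> list[int]:
        """Ids of all nodes where the given label has been seen."""
        return [n.node_id for n in self.nodes if label in n.labels]


topo = TopoMap(merge_radius=1.0)
topo.observe((0.0, 0.0), {"sofa"})
topo.observe((0.3, 0.2), {"tv"})      # within 1 m of node 0, so it is merged into it
topo.observe((4.0, 0.0), {"plant"})   # far away, so a new node linked to node 0
print(topo.find("tv"))                  # [0]
print(sorted(topo.nodes[0].neighbors))  # [1]
```

Merging nearby observations into one node is what turns a sequence of discrete views into a persistent structure: labels seen at different time steps accumulate on the same place instead of being forgotten between frames.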

Ventral Stream

The Ventral Stream manages object recognition and decision-making through RAG-VLM (Retrieval-Augmented Generation with Vision-Language Models) and Policy-VLM components.
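The retrieval step of a retrieval-augmented pipeline can be sketched as follows: rank a small knowledge base against the navigation goal and place the best matches into the prompt given to the VLM. This is a generic illustration under stated assumptions, not the RAG-VLM described in the paper. The knowledge snippets, the prompt wording, and the bag-of-words similarity are all hypothetical stand-ins (a real system would more likely use learned embeddings).

```python
from __future__ import annotations

import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(goal: str, documents: list[str], k: int = 2) -> str:
    """Assemble a text prompt (hypothetical wording) with the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(goal, documents, k))
    return (
        f"Goal: find a {goal}.\n"
        f"Relevant knowledge:\n{context}\n"
        "Which direction should the robot explore next?"
    )


# Hypothetical knowledge base, written for this example only.
knowledge = [
    "A toilet is usually found in a bathroom, near a sink.",
    "A sofa is usually found in a living room, facing a TV.",
    "A bed is usually found in a bedroom.",
]
print(build_prompt("toilet", knowledge, k=1))
```

With `k=1` the prompt contains only the bathroom snippet, since it is the only entry that shares a word with the goal.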

📺 Demo

🛋️ SOFA

🟦 TABLE

🛏️ BED

🌳 PLANT

🗄️ CABINET

💺 CHAIR

🌳 PLANT

🛋️ SOFA

📺 TV

🚽 TOILET

🛋️ SOFA

💺 CHAIR

Results

Among end-to-end methods, DORAEMON attains the highest SR and SPL and the lowest AORI on both HM3Dv2 and GOAT. Against the broader set of prior methods, it attains the highest SR on HM3Dv2 and MP3D.

End-to-End Methods Comparison

(a) HM3Dv2 ObjectNav Benchmark

Method	SR (%) ↑	SPL (%) ↑	AORI (%) ↓
Prompt-only	29.8	10.7	-
PIVOT	24.6	10.6	63.3
VLMNav	51.6	18.3	61.5
DORAEMON (Ours)	62.0	23.0	50.1
Improvement	20.2	10.0	18.5

(b) GOAT Benchmark

Method	SR (%) ↑	SPL (%) ↑	AORI (%) ↓
Prompt-only	11.3	3.7	-
PIVOT	8.3	3.8	64.9
VLMNav	22.1	9.3	63.6
DORAEMON (Ours)	24.3	10.3	56.9
Improvement	10.0	10.8	10.5
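
The Improvement row of table (b) is consistent with a relative change over the strongest baseline, VLMNav, expressed in percent. That reading is inferred from the numbers rather than stated on this page. The helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(ours: float, baseline: float, lower_is_better: bool = False) -> float:
    """Relative change over the baseline in percent; positive means ours is better."""
    delta = baseline - ours if lower_is_better else ours - baseline
    return 100.0 * delta / baseline


# (DORAEMON, VLMNav, lower_is_better) taken from the GOAT table above.
goat = {
    "SR": (24.3, 22.1, False),
    "SPL": (10.3, 9.3, False),
    "AORI": (56.9, 63.6, True),
}
for name, (ours, baseline, lower) in goat.items():
    print(f"{name}: {relative_improvement(ours, baseline, lower):.1f}%")
# SR: 10.0%
# SPL: 10.8%
# AORI: 10.5%
```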

Comprehensive Comparison with State-of-the-Art Methods

ZS: zero-shot; TF: training-free; E2E: end-to-end.

Method	ZS	TF	E2E	HM3Dv1 SR (%) ↑	HM3Dv1 SPL (%) ↑	HM3Dv2 SR (%) ↑	HM3Dv2 SPL (%) ↑	MP3D SR (%) ↑	MP3D SPL (%) ↑
ProcTHOR	✗	✗	✗	54.4	31.8	-	-	-	-
SemEXP	✓	✗	✗	-	-	-	-	36.0	14.4
Habitat-Web	✓	✗	✗	41.5	16.0	-	-	31.6	8.5
PONI	✓	✗	✗	-	-	-	-	31.8	12.1
ProcTHOR-ZS	✓	✗	✗	13.2	7.7	-	-	-	-
ZSON	✓	✗	✗	25.5	12.6	-	-	15.3	4.8
PSL	✓	✗	✗	42.4	19.2	-	-	-	-
Pixel-Nav	✓	✗	✗	37.9	20.5	-	-	-	-
SGM	✓	✗	✗	60.2	30.8	-	-	37.7	14.7
ImagineNav	✓	✗	✗	53.0	23.8	-	-	-	-
CoW	✓	✓	✗	-	-	-	-	7.4	3.7
ESC	✓	✓	✗	39.2	22.3	-	-	28.7	14.2
L3MVN	✓	✓	✗	50.4	23.1	36.3	15.7	34.9	14.5
VLFM	✓	✓	✗	52.5	30.4	63.6	32.5	36.4	17.5
VoroNav	✓	✓	✗	42.0	26.0	-	-	-	-
TopV-Nav	✓	✓	✗	52.0	28.6	-	-	35.2	16.4
SG-Nav	✓	✓	✗	54.0	24.9	49.6	25.5	40.2	16.0
DORAEMON (Ours)	✓	✓	✓	55.6	21.4	66.5	20.6	41.1	15.8
