DORAEMON

Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation

Tianjun Gu¹, Linfeng Li¹, Xuhong Wang², Chenghua Gong¹, Jingyu Gong¹, Zhizhong Zhang¹, Yuan Xie¹,³, Lizhuang Ma¹, Xin Tan¹,²
¹ East China Normal University
² Shanghai AI Lab
³ Shanghai Innovation Institute

Overview

Adaptive navigation in unfamiliar environments is crucial for household service robots but remains challenging due to the need for both low-level path planning and high-level scene understanding. While recent vision-language model (VLM) based zero-shot approaches reduce dependence on prior maps and scene-specific training data, they face significant limitations: spatiotemporal discontinuity from discrete observations, unstructured memory representations, and insufficient task understanding leading to navigation failures.

We propose DORAEMON (Decentralized Ontology-aware Reliable Agent with Enhanced Memory Oriented Navigation), a cognitive-inspired framework whose Ventral and Dorsal Streams mimic human navigation capabilities. The Dorsal Stream implements Hierarchical Semantic-Spatial Fusion and a Topology Map to handle spatiotemporal discontinuities, while the Ventral Stream combines RAG-VLM and Policy-VLM to improve decision-making. DORAEMON also includes Nav-Ensurance, which is designed to keep navigation safe and efficient.

We evaluate DORAEMON on the HM3D, MP3D, and GOAT datasets, where it achieves state-of-the-art performance on both success rate (SR) and success weighted by path length (SPL), outperforming existing methods. We also introduce a new evaluation metric, AORI, to better assess navigation intelligence.
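For readers unfamiliar with the two standard metrics, the sketch below computes SR and SPL using the usual ObjectNav definition, in which each successful episode contributes the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The `Episode` class and the toy numbers are illustrative assumptions, not data from this project, and path lengths are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool         # whether the agent stopped at the goal
    shortest_path: float  # shortest-path length from start to goal (m), > 0
    agent_path: float     # length of the path the agent actually travelled (m)


def success_rate(episodes):
    """Fraction of episodes that ended in success."""
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by path length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            # A failed episode contributes 0; a successful one is
            # discounted by how much longer than optimal the path was.
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


# Toy episodes (hypothetical values, for illustration only).
episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=10.0),
    Episode(success=True, shortest_path=4.0, agent_path=4.0),
    Episode(success=False, shortest_path=6.0, agent_path=12.0),
]

print(f"SR  = {100 * success_rate(episodes):.1f}%")
print(f"SPL = {100 * spl(episodes):.1f}%")
```

Because SPL penalizes detours, a method can raise SR while keeping SPL flat if its successful episodes take longer routes.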

Method

Overall Architecture

DORAEMON employs a decentralized, ontology-aware architecture inspired by human navigation systems, featuring coordinated Dorsal and Ventral Streams with Nav-Ensurance for robust navigation.

Dorsal Stream

The Dorsal Stream handles spatial processing through Hierarchical Semantic-Spatial Fusion and Topology Mapping, addressing spatiotemporal discontinuities in navigation.
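As a rough illustration of what a topological memory can look like, the sketch below stores visited places as nodes and links consecutively visited nodes with edges. This page does not specify the node contents or the merging rule, so `TopoNode`, `TopologyMap`, and `merge_radius` are hypothetical names chosen for this example and should not be read as the project's implementation.

```python
import math
from dataclasses import dataclass, field


@dataclass
class TopoNode:
    node_id: int
    position: tuple        # (x, y) in metres
    step: int              # time step at which the node was created
    description: str       # semantic note attached to the place
    neighbors: set = field(default_factory=set)


class TopologyMap:
    """Toy topological map: nearby observations merge into one node."""

    def __init__(self, merge_radius=1.0):
        self.merge_radius = merge_radius  # assumed merging threshold (m)
        self.nodes = {}
        self._last = None  # id of the most recently visited node

    def add_observation(self, position, step, description):
        # Reuse an existing node if the agent is close to one.
        target = None
        for node in self.nodes.values():
            if math.dist(node.position, position) <= self.merge_radius:
                target = node
                break
        if target is None:
            target = TopoNode(len(self.nodes), position, step, description)
            self.nodes[target.node_id] = target
        # Connect the previous node to the current one (undirected edge).
        if self._last is not None and self._last != target.node_id:
            self.nodes[self._last].neighbors.add(target.node_id)
            target.neighbors.add(self._last)
        self._last = target.node_id
        return target.node_id


tmap = TopologyMap(merge_radius=1.0)
a = tmap.add_observation((0.0, 0.0), 0, "hallway with a door")
b = tmap.add_observation((2.0, 0.0), 1, "living room, sofa visible")
c = tmap.add_observation((0.3, 0.2), 2, "hallway again")
print(a, b, c, len(tmap.nodes))
```

Linking observations through shared nodes is one way to keep discrete views connected over time, since a revisited place maps back to the node created on the first visit.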

Ventral Stream

The Ventral Stream manages object recognition and decision-making through RAG-VLM (Retrieval-Augmented Generation with Vision-Language Models) and Policy-VLM components.
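The retrieval-augmented idea can be sketched generically: rank stored memory entries by similarity to the goal and pass the best matches to the model as context. The example below uses a toy bag-of-words `embed` function in place of a real encoder, and `retrieve` and `build_prompt` are hypothetical helpers; none of this reflects the actual RAG-VLM or Policy-VLM prompts, which this page does not describe.

```python
import math
from collections import Counter


def embed(text):
    # Toy bag-of-words vector; a stand-in for a real text or image encoder.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, memory, k=2):
    """Return the k memory entries most similar to the query."""
    q = embed(query)
    ranked = sorted(memory, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]


def build_prompt(goal, memory, k=2):
    """Assemble a prompt that places retrieved memory before the question."""
    context = "\n".join(f"- {m}" for m in retrieve(goal, memory, k))
    return (
        f"Goal: find a {goal}.\n"
        f"Relevant memory:\n{context}\n"
        "Choose the next action."
    )


memory = [
    "kitchen with a fridge and a sink",
    "living room with a sofa and a tv",
    "bathroom with a toilet",
]
print(build_prompt("sofa", memory, k=1))
```

In a real system the prompt would go to a vision-language model together with the current observation; here it is only printed.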

📺 Demo

Demo videos (one target object per episode): 🛋️ SOFA, 🟦 TABLE, 🛏️ BED, 🌳 PLANT, 🗄️ CABINET, 💺 CHAIR, 🌳 PLANT, 🛋️ SOFA, 📺 TV, 🚽 TOILET, 🛋️ SOFA, 💺 CHAIR

Results

Among end-to-end methods, DORAEMON obtains the highest SR and SPL and the lowest AORI on both HM3Dv2 and GOAT. In the broader comparison with state-of-the-art methods, it reaches the highest SR on HM3Dv2 (66.5%) and MP3D (41.1%).

End-to-End Methods Comparison

(a) HM3Dv2 ObjectNav Benchmark

| Method | SR (%) ↑ | SPL (%) ↑ | AORI (%) ↓ |
| --- | --- | --- | --- |
| Prompt-only | 29.8 | 10.7 | - |
| PIVOT | 24.6 | 10.6 | 63.3 |
| VLMNav | 51.6 | 18.3 | 61.5 |
| DORAEMON (Ours) | 62.0 | 23.0 | 50.1 |
| Improvement | 20.2 | 10.0 | 18.5 |

(b) GOAT Benchmark

| Method | SR (%) ↑ | SPL (%) ↑ | AORI (%) ↓ |
| --- | --- | --- | --- |
| Prompt-only | 11.3 | 3.7 | - |
| PIVOT | 8.3 | 3.8 | 64.9 |
| VLMNav | 22.1 | 9.3 | 63.6 |
| DORAEMON (Ours) | 24.3 | 10.3 | 56.9 |
| Improvement | 10.0 | 10.8 | 10.5 |
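The Improvement row of the GOAT table is consistent with a relative change over VLMNav, the strongest baseline listed. That reading is inferred from the numbers and is not stated on this page; the helper `relative_improvement` below is a hypothetical name used only to show the arithmetic.

```python
def relative_improvement(ours, baseline, lower_is_better=False):
    """Relative gain over the baseline, in percent."""
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / baseline


# (DORAEMON, VLMNav, lower_is_better) taken from the GOAT table.
goat = {
    "SR": (24.3, 22.1, False),
    "SPL": (10.3, 9.3, False),
    "AORI": (56.9, 63.6, True),
}

for name, (ours, base, lower) in goat.items():
    print(f"{name}: {relative_improvement(ours, base, lower):.1f}%")
# SR: 10.0%
# SPL: 10.8%
# AORI: 10.5%
```

For AORI, where lower is better, the sign is flipped so that a reduction counts as a positive improvement.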

Comprehensive Comparison with State-of-the-Art Methods

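| Method | ZS | TF | E2E | HM3Dv1 SR (%) ↑ | HM3Dv1 SPL (%) ↑ | HM3Dv2 SR (%) ↑ | HM3Dv2 SPL (%) ↑ | MP3D SR (%) ↑ | MP3D SPL (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProcTHOR | | | | 54.4 | 31.8 | - | - | - | - |
| SemEXP | | | | - | - | - | - | 36.0 | 14.4 |
| Habitat-Web | | | | 41.5 | 16.0 | - | - | 31.6 | 8.5 |
| PONI | | | | - | - | - | - | 31.8 | 12.1 |
| ProcTHOR-ZS | | | | 13.2 | 7.7 | - | - | - | - |
| ZSON | | | | 25.5 | 12.6 | - | - | 15.3 | 4.8 |
| PSL | | | | 42.4 | 19.2 | - | - | - | - |
| Pixel-Nav | | | | 37.9 | 20.5 | - | - | - | - |
| SGM | | | | 60.2 | 30.8 | - | - | 37.7 | 14.7 |
| ImagineNav | | | | 53.0 | 23.8 | - | - | - | - |
| CoW | | | | - | - | - | - | 7.4 | 3.7 |
| ESC | | | | 39.2 | 22.3 | - | - | 28.7 | 14.2 |
| L3MVN | | | | 50.4 | 23.1 | 36.3 | 15.7 | 34.9 | 14.5 |
| VLFM | | | | 52.5 | 30.4 | 63.6 | 32.5 | 36.4 | 17.5 |
| VoroNav | | | | 42.0 | 26.0 | - | - | - | - |
| TopV-Nav | | | | 52.0 | 28.6 | - | - | 35.2 | 16.4 |
| SG-Nav | | | | 54.0 | 24.9 | 49.6 | 25.5 | 40.2 | 16.0 |
| DORAEMON (Ours) | | | | 55.6 | 21.4 | 66.5 | 20.6 | 41.1 | 15.8 |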