AMT-Net: Attention-based multi-task network for scene depth and semantics prediction in assistive navigation

Yunjia Lei, Joshua Luke Thompson, Son Lam Phung*, Abdesselam Bouzerdoum, Hoang Thanh Le

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Traveling safely and independently in unfamiliar environments remains a significant challenge for people with visual impairments. Conventional assistive navigation systems, while aiming to enhance spatial awareness, typically handle crucial tasks like semantic segmentation and depth estimation separately, resulting in high computational overhead and reduced inference speed. To address this limitation, we introduce AMT-Net, a novel multi-task deep neural network for joint semantic segmentation and monocular depth estimation. AMT-Net employs a single unified decoder, which improves both the model's efficiency and its scalability on portable devices with limited computational resources. We propose two self-attention-based modules, CSAPP and RSAB, to leverage the strengths of convolutional neural networks for extracting robust local features and Transformers for capturing essential long-range dependencies. This design enhances the ability of our model to interpret complex scenes effectively. Furthermore, AMT-Net has low computational complexity and achieves real-time performance, making it suitable for assistive navigation applications. Extensive experiments on the public NYUD-v2 dataset and the TrueSight dataset demonstrate our model's state-of-the-art performance and the effectiveness of the proposed components.
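To illustrate the shared-encoder, single-decoder, two-head layout described above, the sketch below shows a minimal multi-task model in PyTorch. It is an assumption-laden illustration only: the abstract does not specify the internals of CSAPP or RSAB, so a generic multi-head self-attention block (`AttentionBlock`) stands in for them, and all layer sizes, names, and the class count are hypothetical rather than the published architecture.

```python
# Minimal sketch of a shared-decoder multi-task network in the spirit of AMT-Net.
# CSAPP/RSAB internals are not given in the abstract; a generic self-attention
# block is used as a placeholder. All sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionBlock(nn.Module):
    """Placeholder self-attention over spatial positions (stand-in for CSAPP/RSAB)."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)  # residual connection + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class MultiTaskNet(nn.Module):
    """Shared encoder and a single unified decoder feeding two task heads."""

    def __init__(self, num_classes=13):
        super().__init__()
        self.encoder = nn.Sequential(           # shared convolutional feature extractor
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.attention = AttentionBlock(128)     # long-range context on encoder features
        self.decoder = nn.Sequential(            # one decoder shared by both tasks
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_head = nn.Conv2d(64, num_classes, 1)   # semantic segmentation logits
        self.depth_head = nn.Conv2d(64, 1, 1)           # monocular depth map

    def forward(self, x):
        feats = self.decoder(self.attention(self.encoder(x)))
        return self.seg_head(feats), self.depth_head(feats)


if __name__ == "__main__":
    seg, depth = MultiTaskNet()(torch.randn(1, 3, 64, 64))
    print(seg.shape, depth.shape)  # (1, 13, 64, 64) and (1, 1, 64, 64)
```

Sharing one decoder means both outputs are computed in a single forward pass, which is the efficiency argument the abstract makes for resource-constrained portable devices.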

Original language: English
Article number: 129468
Number of pages: 12
Journal: Neurocomputing
Volume: 625
DOIs
Publication status: Published - 7 Apr 2025

Keywords

  • Assistive navigation
  • Depth estimation
  • Multi-task learning
  • Semantic segmentation
  • Vision impairment
  • Vision transformers
