TY - JOUR
T1 - AMT-Net
T2 - Attention-based multi-task network for scene depth and semantics prediction in assistive navigation
AU - Lei, Yunjia
AU - Thompson, Joshua Luke
AU - Phung, Son Lam
AU - Bouzerdoum, Abdesselam
AU - Le, Hoang Thanh
N1 - Publisher Copyright:
© 2025 The Author(s)
PY - 2025/4/7
Y1 - 2025/4/7
AB - Traveling safely and independently in unfamiliar environments remains a significant challenge for people with visual impairments. Conventional assistive navigation systems, while aiming to enhance spatial awareness, typically handle crucial tasks such as semantic segmentation and depth estimation separately, resulting in high computational overhead and reduced inference speed. To address this limitation, we introduce AMT-Net, a novel multi-task deep neural network for joint semantic segmentation and monocular depth estimation. AMT-Net employs a single unified decoder, which improves not only the model's efficiency but also its scalability to portable devices with limited computational resources. We propose two self-attention-based modules, CSAPP and RSAB, to combine the strengths of convolutional neural networks in extracting robust local features with those of Transformers in capturing essential long-range dependencies. This design enhances the model's ability to interpret complex scenes effectively. Furthermore, AMT-Net has low computational complexity and achieves real-time performance, making it suitable for assistive navigation applications. Extensive experiments on the public NYUD-v2 dataset and the TrueSight dataset demonstrate the model's state-of-the-art performance and the effectiveness of the proposed components.
KW - Assistive navigation
KW - Depth estimation
KW - Multi-task learning
KW - Semantic segmentation
KW - Vision impairment
KW - Vision transformers
UR - http://www.scopus.com/inward/record.url?scp=85216689301&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2025.129468
DO - 10.1016/j.neucom.2025.129468
M3 - Article
AN - SCOPUS:85216689301
SN - 0925-2312
VL - 625
JO - Neurocomputing
JF - Neurocomputing
M1 - 129468
ER -