TY - JOUR
T1 - CoST-UNet
T2 - Convolution and swin transformer based deep learning architecture for cardiac segmentation
AU - Islam, Md Rabiul
AU - Qaraqe, Marwa
AU - Serpedin, Erchin
N1 - Publisher Copyright: © 2024 The Author(s)
PY - 2024/10
Y1 - 2024/10
AB - Automatic segmentation of two-dimensional (2D) echocardiograms is beneficial for heart disease diagnosis and assessment. Convolutional Neural Network (CNN) based U-shaped architectures such as UNet have shown remarkable success in medical image segmentation. However, UNet is generally limited in capturing long-range dependencies because of the intrinsic locality of the convolution operation. In contrast, transformer models can capture global-level information using the multi-head attention mechanism. Taken separately, these models exhibit limited localization abilities due to insufficient low-level details. To overcome these limitations, this paper proposes CoST-UNet (Convolution and Swin Transformer-based U-shaped Network), a novel vision transformer architecture that uses CNN layers to leverage spatial information from images in the upper layers and transformer blocks to emphasize global contextual insight in the deeper levels. Unlike existing hybrid models such as TransUNet and UNETR, the transformer block of the proposed model employs a Swin Transformer backbone, which ensures linear computational complexity relative to image size. Furthermore, the primary barrier to improving transformer performance, namely the scarcity of medical images, is effectively addressed by incorporating two convolution layers at the network's uppermost level. Experimental results demonstrate that the model achieves state-of-the-art performance on the ultrasound-based CAMUS dataset (mean Dice Similarity Coefficients of 0.925, 0.851, and 0.895 for segmenting LVendo, LVepi, and LA, respectively, from apical 4CH echocardiograms), as well as competitive results on the MRI-based ACDC dataset, due to its effective capture of local and global context.
KW - CNN-transformer
KW - Echocardiogram
KW - Local-global
KW - Segmentation
KW - Vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85198098625&partnerID=8YFLogxK
U2 - 10.1016/j.bspc.2024.106633
DO - 10.1016/j.bspc.2024.106633
M3 - Article
AN - SCOPUS:85198098625
SN - 1746-8094
VL - 96
JO - Biomedical Signal Processing and Control
JF - Biomedical Signal Processing and Control
M1 - 106633
ER -