Enhanced Swin Transformer and Edge Spatial Attention for Remote Sensing Image Semantic Segmentation

Fuxiang Liu, Zhiqiang Hu, Lei Li*, Hanlu Li, Xinxin Liu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Combining convolutional neural networks (CNNs) and transformers is a crucial direction in remote sensing image semantic segmentation. However, because the two architectures differ in their spatial information focus and feature extraction methods, existing feature transfer and fusion strategies do not effectively integrate the advantages of both. To address these issues, we propose a CNN-transformer hybrid network for precise remote sensing image semantic segmentation. We propose a novel Swin Transformer block that optimizes feature extraction and enables the model to handle remote sensing images of arbitrary sizes. Additionally, we design an Edge Spatial Attention module that focuses attention on local edge structures, effectively integrating global features with local details and facilitating efficient information flow between the Transformer encoder and the CNN decoder. Finally, a multi-scale convolutional decoder fully leverages both the global information from the Transformer and the local features from the CNN, leading to accurate segmentation results. Our network achieves state-of-the-art performance on the Vaihingen and Potsdam datasets, reaching mIoU and F1 scores of 67.37% and 79.82% on Vaihingen, and 72.39% and 83.68% on Potsdam, respectively. Our code is publicly available at: http://github.com/TarsDolores/LZ.
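The abstract does not give the module's internals, but the idea of an edge-guided spatial attention gate can be illustrated with a minimal NumPy sketch: a Sobel edge map is squashed into a spatial attention mask that reweights feature maps so edge regions receive more emphasis. All function names and design details below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def sobel_edges(img):
    # Approximate edge magnitude with 3x3 Sobel filters (edge-replicated padding).
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = p[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)

def edge_spatial_attention(features, gray):
    # Reweight feature maps of shape (C, H, W) by a sigmoid edge-attention map,
    # so pixels near edges are emphasized while smooth regions are attenuated.
    edges = sobel_edges(gray)
    z = (edges - edges.mean()) / (edges.std() + 1e-6)
    att = 1.0 / (1.0 + np.exp(-z))  # values in (0, 1), highest at strong edges
    return features * att[None, :, :]
```

In a learned module the fixed Sobel kernels and sigmoid normalization would typically be replaced by trainable convolutions, but the gating pattern, multiplying features by a spatial mask derived from edge cues, is the same.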

Original language: English
Journal: IEEE Signal Processing Letters
Publication status: Accepted/In press - 2025

Keywords

  • edge detection
  • remote sensing image
  • semantic segmentation
  • Swin transformer
