ACMMM 2024: 14 Papers on Satellite Remote Sensing Imagery
Spatial-Temporal Context Model for Remote Sensing Imagery Compression
Paper analysis: http://www.studyai.com/xueshu/paper/detail/1309bf5fd0
Paper link: (https://openreview.net/forum?id=YTNN0mOPQN)
Abstract
With the increasing spatial and temporal resolutions of obtained remote sensing (RS) images, effective compression becomes critical for storage, transmission, and large-scale in-memory processing.
Although image compression methods have achieved a series of breakthroughs for everyday images, a straightforward application of these methods to the RS domain underutilizes the properties of RS images, such as content duplication, homogeneity, and temporal redundancy.
This paper proposes a Spatial-Temporal Context model (STCM) for RS image compression, jointly leveraging context from a broader spatial scope and across different temporal images.
Specifically, we propose a stacked diagonal masked module to expand the contextual reference scope, which is stackable and maintains its parallel capability.
Furthermore, we propose spatial-temporal contextual adaptive coding to enable the entropy estimation to reference context across different temporal RS images at the same geographic location.
Experiments show that our method outperforms previous state-of-the-art compression methods on rate-distortion (RD) performance.
For downstream task validation, our method reduces the bitrate by 52 times for single-temporal images in the scene classification task while maintaining accuracy…
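To make the entropy-modeling idea concrete, here is a minimal sketch (module names and channel sizes are assumptions, and a standard raster-scan masked convolution stands in for the paper's stacked diagonal masked module): causal spatial context from the current latent is combined with a plain convolution over a co-registered latent from an earlier acquisition to predict Gaussian entropy parameters.

import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Causal (raster-scan) masked convolution, as in common learned-compression context models."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        mask = torch.ones_like(self.weight)
        kh, kw = self.kernel_size
        mask[:, :, kh // 2, kw // 2:] = 0   # block current and future columns in the centre row
        mask[:, :, kh // 2 + 1:, :] = 0     # block all future rows
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)

class SpatialTemporalContext(nn.Module):
    """Predicts Gaussian entropy parameters from causal spatial context plus a temporal reference latent."""
    def __init__(self, ch=192):
        super().__init__()
        self.spatial = MaskedConv2d(ch, ch, kernel_size=5, padding=2)
        self.temporal = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.param_head = nn.Conv2d(2 * ch, 2 * ch, kernel_size=1)

    def forward(self, y_current, y_reference):
        ctx = torch.cat([self.spatial(y_current), self.temporal(y_reference)], dim=1)
        mean, scale = self.param_head(ctx).chunk(2, dim=1)
        return mean, torch.nn.functional.softplus(scale)

model = SpatialTemporalContext()
y_t = torch.randn(1, 192, 16, 16)      # latent of the current acquisition
y_prev = torch.randn(1, 192, 16, 16)   # latent of an earlier acquisition at the same location
mean, scale = model(y_t, y_prev)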
Rethinking the Implicit Optimization Paradigm with Dual Alignments for Referring Remote Sensing Image Segmentation
Paper analysis: http://www.studyai.com/xueshu/paper/detail/187044f80f
Paper link: (https://openreview.net/forum?id=hpWtPMxOjm)
Abstract
Referring Remote Sensing Image Segmentation (RRSIS) is a challenging task that aims to identify specific regions in aerial images that are relevant to given textual conditions.
Existing methods tend to adopt the paradigm of implicit optimization, utilizing a framework consisting of early cross-modal feature fusion and a fixed convolutional kernel-based predictor, neglecting the inherent inter-domain gap and conducting class-agnostic predictions.
In this paper, we rethink the issues with the implicit optimization paradigm and address the RRSIS task from a dual-alignment perspective.
Specifically, we present a dedicated Dual Alignment Network (DANet), comprising an explicit alignment strategy and a reliable agent alignment module.
The explicit alignment strategy effectively reduces domain discrepancies by narrowing the inter-domain affinity distribution.
Meanwhile, the reliable agent alignment module aims to enhance the predictor’s multi-modality awareness and alleviate the impact of deceptive noise interference.
Extensive experiments on two remote sensing datasets demonstrate the effectiveness of our proposed DANet in achieving superior segmentation performance without introducing additional learnable parameters compared to state-of-the-art methods…
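One way to read the explicit alignment strategy is as matching the intra-batch affinity distributions of the visual and textual modalities; the sketch below is an illustrative loss under that reading (the temperature and the KL formulation are assumptions, not the authors' exact objective).

import torch
import torch.nn.functional as F

def affinity_alignment_loss(visual_feats, text_feats, tau=0.1):
    """Illustrative affinity-distribution alignment: match the pairwise-similarity
    distributions of the two modalities within a batch."""
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    av = F.softmax(v @ v.T / tau, dim=-1)   # intra-batch affinity distribution, visual domain
    at = F.softmax(t @ t.T / tau, dim=-1)   # intra-batch affinity distribution, textual domain
    return F.kl_div(av.log(), at, reduction="batchmean")

loss = affinity_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))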
MTSNet: Joint Feature Adaptation and Enhancement for Text-Guided Multi-view Martian Terrain Segmentation
Paper analysis: http://www.studyai.com/xueshu/paper/detail/2b83f22cc9
Paper link: (https://openreview.net/forum?id=SWuns4mMsy)
Abstract
Martian terrain segmentation plays a crucial role in autonomous navigation and safe driving of Mars rovers as well as global analysis of Martian geological landforms.
However, most deep learning-based segmentation models cannot effectively handle the challenges of highly unstructured and unbalanced terrain distribution on the Martian surface, thus leading to inadequate adaptability and generalization ability.
In this paper, we propose a novel multi-view Martian Terrain Segmentation framework (MTSNet) by developing an efficient Martian Terrain text-Guided Segment Anything Model (MTG-SAM) and combining it with a tailored Local Terrain Feature Enhancement Network (LTEN) to capture intricate terrain details.
Specifically, the proposed MTG-SAM is equipped with a Terrain Context attention Adapter Module (TCAM) to efficiently and effectively unleash the model's adaptability and transferability to Mars-specific terrain distributions.
Then, a Local Terrain Feature Enhancement Network (LTEN) is designed to compensate for the limitations of MTG-SAM in capturing fine-grained local terrain features of the Martian surface.
Afterwards, a simple yet efficient Gated Fusion Module (GFM) is introduced to dynamically merge the global contextual features from the MTG-SAM encoder and the locally refined features from the LTEN module for comprehensive terrain feature learning.
Moreover, the proposed MTSNet accepts terrain-specific text as prompts, resolving the efficiency issue of existing methods that require costly annotation of bounding boxes or foreground points.
Experimental results on the AI4Mars and ConeQuest datasets demonstrate that our proposed MTSNet effectively learns the unique Martian terrain feature distribution and achieves state-of-the-art performance on multi-view terrain segmentation from the perspectives of both the Mars rover and satellite remote sensing.
Code is available at https://github.com/raoxuefeng/mtsnet…
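A gated fusion of a global feature map and a local feature map can be written very compactly; the sketch below assumes per-pixel sigmoid gating over concatenated features, which is one plausible instantiation of the described GFM rather than the authors' implementation.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal gated fusion: a learned per-pixel gate blends global (SAM-style)
    and local (LTEN-style) feature maps."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, kernel_size=1), nn.Sigmoid())

    def forward(self, global_feat, local_feat):
        g = self.gate(torch.cat([global_feat, local_feat], dim=1))
        return g * global_feat + (1 - g) * local_feat

fuse = GatedFusion(256)
out = fuse(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))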
Language-Guided Visual Prompt Compensation for Multi-Modal Remote Sensing Image Classification with Modality Absence
Paper analysis: http://www.studyai.com/xueshu/paper/detail/440d5c5e71
Paper link: (https://openreview.net/forum?id=REwjoWjVQm)
Abstract
Joint classification of multi-modal remote sensing images has achieved great success thanks to the complementary advantages of multi-modal images.
However, modality absence is a common dilemma in the real world, caused by imaging conditions, and it leads to a breakdown of most classification methods that rely on complete modalities.
Existing approaches either learn shared representations or train a specific model for each absence case, and thus struggle to balance the complementary advantages of the modalities against scalability across absence cases.
In this paper, we propose a language-guided visual prompt compensation network (LVPCnet) to achieve joint classification in case of arbitrary modality absence using a unified model that simultaneously considers modality complementarity.
It embeds missing modality-specific knowledge into visual prompts to guide the model in capturing complete modal information from available ones for classification.
Specifically, a language-guided visual feature decoupling stage (LVFD-stage) is designed to extract shared and specific modal features from multi-modal images, establishing a complementary representation model of complete modalities.
Subsequently, an absence-aware visual prompt compensation stage (VPC-stage) is proposed to learn visual prompts containing missing modality-specific knowledge through cross-modal representation alignment, further guiding the complementary representation model to reconstruct modality-specific features for missing modalities from available ones based on the learned prompts.
The proposed VPC-stage entails solely training visual prompts to perceive missing information without retraining the model, facilitating effective scalability to arbitrary modality-absence scenarios.
Systematic experiments conducted on three public datasets have validated the effectiveness of the proposed approach…
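Reduced to its core, the prompt-compensation idea amounts to a small bank of learnable tokens, one set per absence case, prepended to the tokens of the available modalities while the backbone stays frozen. The sketch below is an assumed minimal version of that pattern, not the LVPCnet implementation.

import torch
import torch.nn as nn

class VisualPromptBank(nn.Module):
    """Illustrative prompt bank: one learnable prompt per absence case (e.g. 'modality A missing').
    Only the prompts are trained; the complementary-representation backbone stays frozen."""
    def __init__(self, num_cases, num_tokens, dim):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_cases, num_tokens, dim) * 0.02)

    def forward(self, tokens, case_id):
        prompt = self.prompts[case_id].expand(tokens.size(0), -1, -1)
        return torch.cat([prompt, tokens], dim=1)   # prepend prompts to available-modality tokens

bank = VisualPromptBank(num_cases=2, num_tokens=4, dim=256)
augmented = bank(torch.randn(8, 196, 256), case_id=0)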
MultiDAN: Unsupervised, Multistage, Multisource and Multitarget Domain Adaptation for Semantic Segmentation of Remote Sensing Images
Paper analysis: http://www.studyai.com/xueshu/paper/detail/4976e822bb
Paper link: (https://openreview.net/forum?id=shDRfGVRHP)
Abstract
Unsupervised domain adaptation (UDA) has become a crucial approach for cross-domain semantic segmentation of remote sensing images and has made notable advances.
However, most existing efforts focus on single-source, single-target domain adaptation and do not explicitly consider the serious domain shift between multiple source and target domains in real applications, especially the inter-domain shift between different target domains and the intra-domain shift within each target domain.
In this paper, to address simultaneous inter-domain shift and intra-domain shift for multiple target domains, we propose a novel unsupervised, multistage, multisource and multitarget domain adaptation network (MultiDAN), which involves multisource and multitarget domain adaptation (MSMTDA), entropy-based clustering (EC) and multistage domain adaptation (MDA).
Specifically, MSMTDA learns feature-level multiple adversarial strategies to alleviate complex domain shift between multiple target and source domains.
Then, EC clusters the various target domains into multiple subdomains based on entropy of target predictions of MSMTDA.
Besides, we propose a new pseudo label update strategy (PLUS) to dynamically produce more accurate pseudo labels for MDA.
Finally, MDA aligns the clean subdomains, including pseudo labels generated by PLUS, with other noisy subdomains in the output space via the proposed multistage adaptation algorithm (MAA).
The extensive experiments on the benchmark remote sensing datasets highlight the superiority of our MultiDAN against recent state-of-the-art UDA methods…
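As a simplified, two-way version of the entropy-based clustering step (the paper clusters targets into multiple subdomains), one can rank target images by mean prediction entropy and treat the low-entropy portion as the clean subdomain; the split ratio below is an assumed hyper-parameter.

import torch

def entropy_split(prob_maps, ratio=0.5):
    """Illustrative entropy-based split of target images into 'clean' and 'noisy' subdomains."""
    eps = 1e-8
    ent = -(prob_maps * (prob_maps + eps).log()).sum(dim=1)   # per-pixel entropy, shape (N, H, W)
    scores = ent.flatten(1).mean(dim=1)                       # one uncertainty score per image
    order = scores.argsort()                                  # low entropy first
    k = int(len(order) * ratio)
    return order[:k], order[k:]                               # clean indices, noisy indices

probs = torch.softmax(torch.randn(16, 6, 64, 64), dim=1)
clean_idx, noisy_idx = entropy_split(probs)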
Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images
Paper analysis: http://www.studyai.com/xueshu/paper/detail/51d7b03870
Paper link: (https://openreview.net/forum?id=h3UFAF6sdS)
Abstract
Continual learning (CL) breaks away from the one-off training paradigm and enables a model to adapt to new data, semantics, and tasks continuously.
However, current CL methods mainly focus on single tasks.
Besides, CL models are plagued by catastrophic forgetting and semantic drift owing to the lack of old data, which often occurs in remote-sensing interpretation due to its intricate fine-grained semantics.
In this paper, we propose Continual Panoptic Perception (CPP), a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception for universal interpretation in remote sensing images.
Concretely, we propose a collaborative cross-modal encoder (CCE) to extract the input image features, which supports pixel classification and caption generation synchronously.
To inherit the knowledge from the old model without exemplar memory, we propose a task-interactive knowledge distillation (TKD) method, which leverages cross-modal optimization and task-asymmetric pseudo-labeling (TPL) to alleviate catastrophic forgetting.
Furthermore, we also propose a joint optimization mechanism to achieve end-to-end multi-modal panoptic perception.
Experimental results on the fine-grained panoptic perception dataset validate the effectiveness of the proposed model, and also prove that joint optimization can boost sub-task CL efficiency with over 13% relative improvement on PQ…
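A common exemplar-free way to realize pseudo-labeling in continual segmentation, and one plausible reading of TPL, is to let the old model's confident predictions fill in pixels that the new annotation leaves as background; the confidence threshold below is assumed, and the actual task-asymmetric scheme may differ.

import torch
import torch.nn.functional as F

def pseudo_label_merge(old_logits, new_gt, conf_thresh=0.7, ignore_index=255):
    """Illustrative exemplar-free pseudo-labelling: background pixels in the new ground truth
    inherit the old model's confident predictions, keeping old classes alive without stored data."""
    old_prob, old_pred = F.softmax(old_logits, dim=1).max(dim=1)
    merged = new_gt.clone()
    bg = new_gt == 0                                   # background under the new annotation
    confident = old_prob > conf_thresh
    merged[bg & confident] = old_pred[bg & confident]
    merged[bg & ~confident] = ignore_index             # uncertain pixels are ignored in the loss
    return merged

merged = pseudo_label_merge(torch.randn(2, 10, 64, 64), torch.randint(0, 10, (2, 64, 64)))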
Accurate and Lightweight Learning for Specific Domain Image-Text Retrieval
Paper analysis: http://www.studyai.com/xueshu/paper/detail/6cb25b6af4
Paper link: (https://openreview.net/forum?id=1t7RW2Ixps)
Abstract
Recent advances in vision-language pre-trained models like CLIP have greatly enhanced general domain image-text retrieval performance.
This success has led scholars to develop methods for applying CLIP to Specific Domain Image-Text Retrieval (SDITR) tasks such as Remote Sensing Image-Text Retrieval (RSITR) and Text-Image Person Re-identification (TIReID).
However, these methods for SDITR often neglect two critical aspects: the enhancement of modal-level distribution consistency within the retrieval space and the reduction of CLIP’s computational cost during inference, resulting in suboptimal retrieval spaces and unnecessarily high inference computational loads.
To address these issues, this paper presents a novel framework, Accurate and lightweight learning for specific domain Image-text Retrieval (AIR), based on the CLIP architecture.
AIR incorporates a Modal-Level distribution Consistency Enhancement regularization (MLCE) loss and a Self-Pruning Distillation Strategy (SPDS) to improve retrieval precision and computational efficiency.
The MLCE loss harmonizes the sample distance distributions within image and text modalities, fostering a retrieval space closer to the ideal state.
Meanwhile, SPDS employs a strategic knowledge distillation process to transfer deep multimodal insights from CLIP to a shallower level, maintaining only the essential layers for inference and thus yielding a lightweight model.
Comprehensive experiments across various datasets in RSITR and TIReID demonstrate the effectiveness of both MLCE loss and SPDS.
The study also explores the limits of SPDS’s performance and compares it with conventional teacher-student distillation methods.
The findings reveal that MLCE loss secures optimal retrieval on several datasets, while SPDS achieves a favorable balance between accuracy and computational demand during testing…
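One reading of SPDS is that the student is simply the shallow prefix of the same CLIP encoder, trained to reproduce the full-depth embedding; the toy sketch below uses generic transformer blocks as stand-ins for CLIP layers and an assumed cosine distillation loss, so it illustrates the pattern rather than the authors' exact strategy.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunedEncoder(nn.Module):
    """Keeps only the shallow blocks of an encoder for inference."""
    def __init__(self, blocks, keep_layers, proj):
        super().__init__()
        self.blocks = nn.ModuleList(blocks[:keep_layers])
        self.proj = proj

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return self.proj(x.mean(dim=1))   # simple pooling head (assumed)

def distill_loss(student_emb, teacher_emb):
    return 1 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

# toy stand-ins for CLIP transformer blocks
blocks = [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(12)]
student = PrunedEncoder(blocks, keep_layers=4, proj=nn.Linear(512, 512))
tokens = torch.randn(2, 50, 512)
teacher_emb = torch.randn(2, 512)          # would come from the frozen full-depth encoder
loss = distill_loss(student(tokens), teacher_emb)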
Information Fusion with Knowledge Distillation for Fine-grained Remote Sensing Object Detection
Paper analysis: http://www.studyai.com/xueshu/paper/detail/77f69625a9
Paper link: (https://openreview.net/forum?id=IkELeZGD2U)
Abstract
Fine-grained remote sensing object detection aims to locate and identify specific targets with variable scale and orientation against complex backgrounds in high-resolution, wide-swath images, which demands both high precision and real-time processing.
Although traditional knowledge distillation has shown its effectiveness in model compression and accuracy preservation for natural images, the heavy background noise and intra-class similarity of remote sensing images limit the knowledge quality of the teacher model and the learning ability of the student model.
To address these issues, we propose an Information Fusion with Knowledge Distillation (IFKD) method that enhances the student model’s performance by integrating information from external images, frequency domain, and hyperbolic space.
Firstly, we propose an external interference enhancement (EDE) module, which utilizes MobileSAM to introduce external information that enriches the teacher's knowledge set, competes with the teacher for the right to cultivate the student, and weakens the student's dependence on the teacher.
Secondly, to strengthen the representation of key features and improve knowledge quality, a frequency domain reconstruction (FDR) module is proposed, which resamples the low-frequency background components to suppress interference from background noise.
Finally, to address the problem of intra-class similarity, a hyperbolic similarity mask (HSM) module is designed to magnify intra-class differences and guide the student to analyze the teacher's knowledge by exploiting the exponentially growing capacity of hyperbolic space.
Experiments on the optical ShipRSImageNet and SAR Aircraft-1.0 datasets verify that the IFKD method significantly enhances performance in fine-grained recognition tasks compared to existing distillation techniques.
Among them, 65.8% AP50 can be improved by 2.6% on the ShipRSImageNet dataset, and 81.4% AP50 can be improved by 1.4% on SAR Aircraft-1.0…
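The frequency-domain reconstruction step can be illustrated with a basic FFT-based attenuation of the low-frequency band of a feature map; the radius and attenuation factor below are assumed values, and the real FDR module may resample rather than simply damp these components.

import torch

def suppress_low_frequency(feat, radius_ratio=0.1, atten=0.5):
    """Illustrative frequency-domain filtering: attenuate low-frequency (background-dominated)
    components of a feature map before distillation."""
    f = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    h, w = feat.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    low = (dist < radius_ratio * min(h, w)).to(feat.dtype)     # low-frequency mask around the centre
    f = f * (1 - low * (1 - atten))                            # damp the low-frequency band, keep the rest
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real

x = torch.randn(1, 64, 56, 56)
x_filtered = suppress_low_frequency(x)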
Training pansharpening networks at full resolution using degenerate invariance
Paper analysis: http://www.studyai.com/xueshu/paper/detail/7b5dc31e0e
Paper link: (https://openreview.net/forum?id=KFbRc2bvaJ)
Abstract
Pan-sharpening is an important technique for remote sensing imaging systems to obtain high-resolution multispectral images.
Existing deep learning-based methods mostly rely on using pseudo-groundtruth multi-spectral images for supervised learning.
The whole training process remains at the reduced-resolution scale, which means that the impact of the degradation process is ignored and high-quality images cannot be guaranteed at full resolution.
To address the challenge, we propose a new unsupervised framework that does not rely on pseudo-groundtruth but uses the invariance of the degradation process to build a consistent loss function on the original scale for network training.
Specifically, first, we introduce the operator learning method to build an exact mapping function from multi-spectral to panchromatic images and decouple spectral features and texture features.
Then, through joint training, operators and convolutional networks can learn the spatial degradation process and spectral degradation process at full resolution, respectively.
By introducing them to build consistency constraints, we can train the pansharpening network at the original full resolution.
Our approach can be applied to existing pansharpening methods, improving their usability on original data and matching practical application requirements.
The experimental results on different kinds of satellite datasets demonstrate that the new network outperforms state-of-the-art methods both visually and quantitatively…
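The degradation-invariance constraint can be sketched as two full-resolution consistency terms: the fused image, spatially degraded, should reproduce the low-resolution MS input, and, passed through a spectral operator, should reproduce the PAN input. The average-pooling spatial degradation and the 1x1-conv spectral operator below are simplifications of the learned operators described in the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralDegradation(nn.Module):
    """Assumed 1x1-conv spectral operator: mixes multi-spectral bands into a panchromatic band."""
    def __init__(self, ms_bands):
        super().__init__()
        self.mix = nn.Conv2d(ms_bands, 1, kernel_size=1, bias=False)

    def forward(self, ms):
        return self.mix(ms)

def full_resolution_consistency(fused, ms_lr, pan, spectral_op, scale=4):
    """Illustrative degradation-invariance loss at the original (full) resolution."""
    spatial_down = F.avg_pool2d(fused, kernel_size=scale)     # stand-in for the learned spatial degradation
    return F.l1_loss(spatial_down, ms_lr) + F.l1_loss(spectral_op(fused), pan)

spectral_op = SpectralDegradation(ms_bands=4)
fused = torch.randn(1, 4, 256, 256)
loss = full_resolution_consistency(fused, torch.randn(1, 4, 64, 64), torch.randn(1, 1, 256, 256), spectral_op)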
Eliminate Before Align: A Remote Sensing Image-Text Retrieval Framework with Keyword Explicit Reasoning
Paper analysis: http://www.studyai.com/xueshu/paper/detail/a2b6324eda
Paper link: (https://openreview.net/forum?id=n6dMs3Qpax)
Abstract
A great deal of research centers on Remote Sensing Image-Text Retrieval (RSITR), which aims to retrieve the corresponding targets based on a given query.
Among these efforts, transferring Foundation Models (FMs) such as CLIP to the remote sensing domain shows promising results.
However, existing FM-based approaches neglect the negative impact of weakly correlated sample pairs and the key distinctions among remote sensing texts, leading to biased and superficial exploration of sample pairs.
To address these challenges, we propose a novel Eliminate Before Align strategy with Keyword Explicit Reasoning framework (EBAKER) for RSITR.
Specifically, we devise an innovative Eliminate Before Align (EBA) strategy to filter out the weakly correlated sample pairs to mitigate their deviations from optimal embedding space during alignment.
Moreover, we introduce a Keyword Explicit Reasoning (KER) module to facilitate the positive role of subtle key concept differences.
Without bells and whistles, our method achieves a one-step transformation from FM to the RSITR task, obviating the necessity for extra pretraining on remote sensing data.
Extensive experiments on three popular benchmark datasets validate that our proposed EBAKER method outperforms the state-of-the-art methods with less training data.
Our source code will be released soon…
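The Eliminate Before Align idea can be illustrated as dropping the most weakly correlated image-text pairs in a batch before a standard InfoNCE alignment; the keep ratio and temperature below are assumed, and the authors' filtering criterion may differ.

import torch
import torch.nn.functional as F

def eliminate_before_align(img_emb, txt_emb, keep_ratio=0.8, tau=0.07):
    """Illustrative 'eliminate before align': filter weakly correlated pairs, then align the rest."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    pair_sim = (img * txt).sum(dim=-1)            # similarity of each matched image-text pair
    k = max(1, int(len(pair_sim) * keep_ratio))
    keep = pair_sim.topk(k).indices               # keep only the strongly correlated pairs
    logits = img[keep] @ txt[keep].T / tau
    targets = torch.arange(k)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = eliminate_before_align(torch.randn(16, 512), torch.randn(16, 512))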
Multi-scale Change-Aware Transformer for Remote Sensing Image Change Detection
Paper analysis: http://www.studyai.com/xueshu/paper/detail/affba04dff
Paper link: (https://openreview.net/forum?id=kMQ3LAiWpx)
Abstract
Change detection identifies differences between images captured at different times.
Real-world change detection faces challenges from the diverse and intricate nature of change areas, while current datasets and algorithms are often limited to simpler, uniform changes, reducing their effectiveness in practical application.
Existing dual-branch methods process images independently, risking the loss of change information due to insufficient early interaction.
In contrast, single-stream approaches, though improving early integration, lack efficacy in capturing complex changes.
To address these issues, we introduce a novel single-stream architecture, the Multi-scale Change-Aware Transformer (MACT), which features the Dynamic Change-Aware Attention module and the Multi-scale Change-Enhanced Aggregator.
The Dynamic Change-Aware Attention module, integrating local self-attention and cross-temporal attention, conducts dynamic iteration on image differences, thereby targeting feature extraction toward change areas.
The Multi-scale Change-Enhanced Aggregator enables the model to adapt to various scales and complex shapes through local change enhancement and multiscale aggregation strategies.
To overcome the limitations of existing datasets regarding the scale diversity and morphological complexity of change areas, we construct the Mining Area Change Detection dataset.
The dataset offers a diverse array of change areas that span multiple scales and exhibit complex shapes, providing a robust benchmark for change detection.
Extensive experiments demonstrate that our model outperforms existing methods, especially for irregular and multi-scale changes…
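Cross-temporal attention, one ingredient of the Dynamic Change-Aware Attention module, can be sketched with a shared multi-head attention in which each epoch's tokens query the other epoch's tokens; this is a minimal stand-in, not the module's full local and dynamic design.

import torch
import torch.nn as nn

class CrossTemporalAttention(nn.Module):
    """Minimal cross-temporal attention: each epoch attends to the other, so bi-temporal
    interaction happens early instead of after two independent branches."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens_t1, tokens_t2):
        t1_ctx, _ = self.attn(tokens_t1, tokens_t2, tokens_t2)   # t1 queries t2
        t2_ctx, _ = self.attn(tokens_t2, tokens_t1, tokens_t1)   # t2 queries t1
        return tokens_t1 + t1_ctx, tokens_t2 + t2_ctx

cta = CrossTemporalAttention(dim=256)
a, b = cta(torch.randn(1, 1024, 256), torch.randn(1, 1024, 256))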
UrbanCross: Enhancing Satellite Image-Text Retrieval with Cross-Domain Adaptation
Paper analysis: http://www.studyai.com/xueshu/paper/detail/bd0b1643d0
Paper link: (https://openreview.net/forum?id=TMeLmiQOTk)
Abstract
Urbanization challenges underscore the necessity for effective satellite image-text retrieval methods to swiftly access specific information enriched with geographic semantics for urban applications.
However, existing methods often overlook significant domain gaps across diverse urban landscapes, primarily focusing on enhancing retrieval performance within single domains.
To tackle this issue, we present UrbanCross, a new framework for cross-domain satellite image-text retrieval.
UrbanCross leverages a high-quality, cross-domain dataset enriched with extensive geo-tags from three countries to highlight domain diversity.
It employs the Large Multimodal Model (LMM) for textual refinement and the Segment Anything Model (SAM) for visual augmentation, achieving a fine-grained alignment of images, segments and texts, yielding a 10% improvement in retrieval performance.
Additionally, UrbanCross incorporates an adaptive curriculum-based source sampler and a weighted adversarial cross-domain fine-tuning module, progressively enhancing adaptability across various domains.
Extensive experiments confirm UrbanCross’s superior efficiency in retrieval and adaptation to new urban environments, demonstrating an average performance increase of 15% over its version without domain adaptation mechanisms, effectively bridging the domain gap.
Our code is publicly accessible, and the dataset will be made available at https://anonymous.4open.science/r/UrbanCross/…
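An adaptive curriculum-based source sampler could, for instance, prefer source samples closest to the target-domain centroid early in adaptation and gradually widen the pool to harder samples; the schedule and similarity measure below are purely assumptions for illustration.

import torch
import torch.nn.functional as F

def curriculum_source_sample(source_emb, target_emb, step, total_steps, batch_size=32):
    """Illustrative curriculum sampler over source embeddings, guided by target similarity."""
    centroid = F.normalize(target_emb.mean(dim=0), dim=-1)
    sims = F.normalize(source_emb, dim=-1) @ centroid
    pool_frac = 0.3 + 0.7 * step / total_steps            # assumed schedule: 30% -> 100% of the source set
    pool = sims.argsort(descending=True)[: max(batch_size, int(len(sims) * pool_frac))]
    return pool[torch.randperm(len(pool))[:batch_size]]   # random batch from the current (easy) pool

idx = curriculum_source_sample(torch.randn(1000, 512), torch.randn(200, 512), step=0, total_steps=100)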
PSM: Learning Probabilistic Embeddings for Multi-scale Zero-shot Soundscape Mapping
Paper analysis: http://www.studyai.com/xueshu/paper/detail/e1fc57d409
Paper link: (https://openreview.net/forum?id=qnW0LQXY5L)
Abstract
A soundscape is defined by the acoustic environment a person perceives at a location.
In this work, we propose a framework for mapping soundscapes across the Earth.
Since soundscapes involve sound distributions that span varying spatial scales, we represent locations with multi-scale satellite imagery and learn a joint representation among this imagery, audio, and text.
To capture the inherent uncertainty in the soundscape of a location, we additionally design the representation space to be probabilistic.
We also fuse ubiquitous metadata (including geolocation, time, and data source) to enable learning of spatially and temporally dynamic representations of soundscapes.
We demonstrate the utility of our framework by creating large-scale soundscape maps integrating both audio and text with temporal control.
To facilitate future research on this task, we also introduce a large-scale dataset, GeoSound, containing over 300k geotagged audio samples paired with both low- and high-resolution satellite imagery.
We demonstrate that our method outperforms the existing state-of-the-art on both GeoSound and the existing SoundingEarth dataset.
Our dataset and code will be made available at TBD…
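Probabilistic embeddings are typically realized by predicting a mean and log-variance per input and sampling with the reparameterisation trick; the head below is a generic sketch of that pattern, not the paper's exact formulation.

import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Minimal probabilistic embedding head: predicts a Gaussian over the embedding space,
    so matching can account for the inherent uncertainty of a soundscape."""
    def __init__(self, in_dim, emb_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, emb_dim)
        self.logvar = nn.Linear(in_dim, emb_dim)

    def forward(self, x, n_samples=1):
        mu, logvar = self.mu(x), self.logvar(x)
        std = (0.5 * logvar).exp()
        eps = torch.randn(n_samples, *mu.shape)
        return mu + eps * std, mu, logvar          # samples, plus distribution parameters

head = ProbabilisticHead(768, 256)
samples, mu, logvar = head(torch.randn(4, 768), n_samples=8)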
Selection and Reconstruction of Key Locals: A Novel Specific Domain Image-Text Retrieval Method
Paper analysis: http://www.studyai.com/xueshu/paper/detail/ed1c3f2a73
Paper link: (https://openreview.net/forum?id=Z8ViXPfcUr)
Abstract
In recent years, Vision-Language Pre-training (VLP) models have demonstrated rich prior knowledge for multimodal alignment, prompting investigations into their application in Specific Domain Image-Text Retrieval (SDITR) such as Text-Image Person Re-identification (TIReID) and Remote Sensing Image-Text Retrieval (RSITR).
Due to the unique data characteristics in specific scenarios, the primary challenge is to leverage discriminative fine-grained local information for improved mapping of images and text into a shared space.
Current approaches interact with all multimodal local features for alignment, implicitly focusing on discriminative local information to distinguish data differences, which may bring noise and uncertainty.
Furthermore, their VLP feature extractors like CLIP often focus on instance-level representations, potentially reducing the discriminability of fine-grained local features.
To alleviate these issues, we propose an Explicit Key Local information Selection and Reconstruction Framework (EKLSR), which explicitly selects key local information to enhance feature representation.
Specifically, we introduce a Key Local information Selection and Fusion (KLSF) module that utilizes hidden knowledge from the VLP model to interpretably select and fuse key local information.
Secondly, we employ Key Local segment Reconstruction (KLR) based on multimodal interaction to reconstruct the key local segments of images (text), significantly enriching their discriminative information and enhancing both inter-modal and intra-modal interaction alignment.
To demonstrate the effectiveness of our approach, we conducted experiments on five datasets across TIReID and RSITR.
Notably, our EKLSR model achieves state-of-the-art performance on two RSITR datasets…
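Key-local selection guided by the VLP model's hidden knowledge can be sketched as keeping the patch tokens that receive the highest [CLS]-to-patch attention; the value of k and the mean-fusion step below are assumptions, not the KLSF module itself.

import torch

def select_key_locals(patch_tokens, cls_attention, k=12):
    """Illustrative key-local selection: treat [CLS]-to-patch attention as a saliency score
    and keep only the top-k patch tokens, then mean-fuse them."""
    idx = cls_attention.topk(k, dim=-1).indices                      # (B, k) most-attended patches
    gathered = torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1)))
    return gathered, gathered.mean(dim=1)                            # selected locals and a fused summary

tokens = torch.randn(2, 196, 512)
attn = torch.rand(2, 196)
key_tokens, fused = select_key_locals(tokens, attn)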