
International Journal of Applied Earth Observation and Geoinformation 124 (2023) 103540


The Segment Anything Model (SAM) for remote sensing applications: From
zero to one shot
Lucas Prado Osco a,∗, Qiusheng Wu b, Eduardo Lopes de Lemos c, Wesley Nunes Gonçalves c, Ana Paula Marques Ramos d, Jonathan Li e, José Marcato Junior c

a University of Western São Paulo (UNOESTE), Rod. Raposo Tavares, km 572, Limoeiro, Presidente Prudente, 19067-175, Brazil
b University of Tennessee (UT), 1331 Circle Park Drive, Knoxville, 37996-0925, United States
c Federal University of Mato Grosso do Sul (UFMS), Av. Costa e Silva-Pioneiros, Cidade Universitária, Campo Grande, 79070-900, Brazil
d São Paulo State University (UNESP), Centro Educacional, R. Roberto Simonsen, 305, Presidente Prudente, 19060-900, Brazil
e University of Waterloo (UW), 200 University Avenue West, Waterloo, N2L 3G1, Canada

ARTICLE INFO

Dataset links: GitHub: AI-RemoteSensing; GitHub: Segment-Geospatial

Keywords: Artificial intelligence; Image segmentation; Multi-scale datasets; Text-prompt technique

ABSTRACT

Segmentation is an essential step for remote sensing image processing. This study aims to advance the application of the Segment Anything Model (SAM), an innovative image segmentation model by Meta AI, in the field of remote sensing image analysis. SAM is known for its exceptional generalization capabilities and zero-shot learning, making it a promising approach to processing aerial and orbital images from diverse geographical contexts. Our exploration involved testing SAM across multi-scale datasets using various input prompts, such as bounding boxes, individual points, and text descriptors. To enhance the model's performance, we implemented a novel automated technique that combines a text-prompt-derived general example with one-shot training. This adjustment resulted in an improvement in accuracy, underscoring SAM's potential for deployment in remote sensing imagery and reducing the need for manual annotation. Despite the limitations encountered with lower spatial resolution images, SAM exhibits promising adaptability to remote sensing data analysis. We recommend future research to enhance the model's proficiency through integration with supplementary fine-tuning techniques and other networks. Furthermore, we provide the open-source code of our modifications on online repositories, encouraging further and broader adaptations of SAM to the remote sensing domain.

1. Introduction

The field of remote sensing deals with capturing images of the Earth's surface from airborne or satellite sensors. Analyzing these images allows us to monitor environmental changes, manage disasters, and plan urban areas efficiently (Gómez et al., 2016; Song et al., 2023; Yuan et al., 2020). A critical part of this analysis is the ability to accurately identify and segment various objects or regions within these images, a process known as image segmentation. Segmentation allows us to isolate specific objects or areas within an image for further study or monitoring (Kotaridis and Lazaridou, 2021). Traditional segmentation techniques often require extensive human input and intervention for accurate results. However, with the advent of advanced artificial intelligence (AI) and deep learning methods (Bai et al., 2022; Aleissaee et al., 2023), the segmentation process has become more automated, albeit still facing challenges, particularly in the effective segmentation of images with minimal human input.

The Segment Anything Model (SAM), developed by Meta AI, is a groundbreaking approach to image segmentation that has demonstrated exceptional generalization capabilities across a diverse range of image datasets, requiring no additional training for unfamiliar objects (Kirillov et al., 2023). This approach enables it to make accurate predictions with little to no training data. However, its potential can be limited when facing specific domain conditions. To overcome this limitation, SAM can be modified by a re-learning approach (Zhang et al., 2023b), feeding it with a single example of a new class or object for better results.

∗ Corresponding author.
E-mail address: [email protected] (L.P. Osco).

https://doi.org/10.1016/j.jag.2023.103540
Received 5 July 2023; Received in revised form 19 October 2023; Accepted 26 October 2023
Available online 1 November 2023
1569-8432/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Zero-shot learning pertains to a model's capability to accurately process and act upon input data that it has not explicitly encountered during training (Alayrac et al., 2022; Sun et al., 2021). This ability is derived from gaining a generalized understanding of the data rather than of specific instances. Zero-shot learning systems can recognize objects or understand tasks they have never seen before based on learning underlying concepts or relationships. In contrast, one-shot learning denotes a model's ability to interpret and make accurate inferences from just a single example of a new class (Zhang et al., 2023b). By feeding SAM a single example (or 'shot') of this new class, we can potentially enhance its performance, as it has more specific information to work with.

The best-known one-shot methods for SAM are PerSAM, a training-free personalization approach, and PerSAM-F, its fine-tuning variant (Zhang et al., 2023b). Given a single image with a reference mask, PerSAM localizes the target concept using a location prior, i.e., an initial estimate of where the object of interest is likely to be. PerSAM-F, in turn, uses one-shot fine-tuning to reduce mask ambiguity. In this case, the entire SAM is frozen (i.e., its parameters are not updated during the fine-tuning process), and two learnable weights are introduced for multi-scale masks. This one-shot fine-tuning variant requires training only two parameters and can be done in as little as ten seconds to enhance performance (Zhang et al., 2023b). Both are capable of improving SAM, making it a flexible model.
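For illustration, the following minimal PyTorch sketch captures our reading of the PerSAM-F idea: SAM's three mask-scale outputs are blended by exactly two trainable weights while the rest of the network stays frozen. The class and tensor names are ours for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ScaleWeights(nn.Module):
    """Sketch of PerSAM-F's core idea: SAM itself is frozen, and only two
    relative weights over its three multi-scale mask outputs are trainable."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(2))  # the only trainable parameters

    def forward(self, mask_logits: torch.Tensor) -> torch.Tensor:
        # mask_logits: (3, H, W), one logit map per SAM mask scale.
        logits = torch.cat([torch.zeros(1, device=self.w.device), self.w])
        weights = torch.softmax(logits, dim=0)  # relative weight per scale
        return (weights.view(3, 1, 1) * mask_logits).sum(dim=0)
```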
Another important aspect relates to SAM's ability to perform segmentation with minimal input, requiring only a bounding box or a single point as a reference, or even a text prompt as guidance (Kirillov et al., 2023). This capability has the potential to reduce human labor during the annotation process. Many existing techniques require intensive annotations for each new object of interest, resulting in significant computational overhead and potential delays in time-sensitive applications. SAM, on the other hand, presents an opportunity to alleviate this time-intensive task.

Since SAM's release in April 2023, the geospatial community has shown strong interest in adapting SAM for remote sensing image segmentation. However, a more in-depth investigation is needed. In this context, we present a first-of-its-kind evaluation of SAM, examining both its zero- and one-shot learning performance on segmenting remote sensing imagery. We adapted SAM to our data structure, benchmarked it against multiple datasets, and assessed its potential to segment multiscale images. We then evolved SAM's zero-shot characteristic to a one-shot approach and demonstrated that, with only one example of a new class, SAM's segmentation performance can be significantly improved.

Our proposal's innovation lies in the one-shot technique, which involves using a prompt-text-based segmentation as a training sample (instead of a human-labeled sample), making it an automated process for refining SAM on remote sensing imagery. In this study, we also discuss the implications, limitations, and potential future directions of our findings. Understanding the effectiveness of SAM in this domain is of paramount importance for novel development. In short, with its promise of zero-shot and one-shot learning, SAM has the potential to transform current practices by significantly reducing the time and resources needed for training and annotating data, thereby enabling a quicker, more efficient approach.

2. Remote sensing image segmentation: A brief summary

The remote sensing field has experienced impressive advancements in recent years, largely driven by improvements in aerial and orbital platform technologies, sensor capabilities, and computational resources (Toth and Jóźków, 2016; Osco et al., 2021a). One of the most critical tasks in remote sensing is image segmentation, which involves partitioning images into multiple segments or regions, each, ideally, corresponding to a specific object or class (Kotaridis and Lazaridou, 2021). In this section, we focus on providing comprehensive information regarding segmentation processes, deep learning-based methods and techniques, and explain the overall importance of conducting zero-to-one-shot learning.

Traditional image segmentation techniques in remote sensing often rely on pixel-based or object-based approaches. Pixel-based methods, such as clustering and thresholding, involve grouping pixels with similar characteristics, while object-based techniques focus on segmenting images based on properties of larger regions or objects (Hossain and Chen, 2019; Wang et al., 2020b). However, these methods can be limited in their ability to handle the complexity, variability, and high spatial resolution of modern remote sensing imagery (Kotaridis and Lazaridou, 2021).

Segmentation involves various methods designed to separate or group portions of an image based on certain criteria (Zhang et al., 2021). Each method has a unique approach and application. Interactive Segmentation, for example, is a niche within image segmentation that actively incorporates user input to improve the segmentation process, making it more precise and tailored to specific requirements (Li et al., 2020; Wu et al., 2021). Different interactive segmentation methods utilize various strategies to include human intelligence in the loop. This makes interactive segmentation particularly useful in tasks where high precision is required and generic segmentation methods may not suffice.

Super Pixelization is another method that groups pixels in an image into larger units, or "superpixels", based on shared characteristics such as color or texture (Gharibbafghi et al., 2018). This grouping can simplify the image data while preserving the essential structure of the objects. Object Proposal Generation goes a step further by suggesting potential object bounding boxes or regions within an image (Hossain and Chen, 2019; Su et al., 2019). These proposals serve as a guide for a more advanced model to identify and classify the actual objects' pixels. Foreground Segmentation, also known as background subtraction, is a technique primarily used to separate the main subjects or objects of interest (the foreground) from the backdrop (the background) in an image (Zheng et al., 2020; Ma et al., 2022).

Semantic Segmentation is a more comprehensive approach in which every pixel in an image is assigned to a specific class, effectively grouping regions of the image based on semantic interest (Zhang et al., 2020; Adam et al., 2023). Instance Segmentation identifies distinct objects of the same class at the pixel level and treats the individual objects as separate entities or instances (Gao et al., 2021; Qurratulain et al., 2023). Panoptic Segmentation merges the concepts of semantic and instance segmentation, assigning every pixel in the image a class label and a unique instance identifier (Hua et al., 2021; de Carvalho et al., 2022). This method aims to give a complete understanding of the image by identifying and classifying every detail.

All these methods have been intensively studied, but one that surged in recent years, with the advancements of Visual Foundation Models (VFM) and Large Multimodal Models (LMM), is known as "Promptable Segmentation", an approach that aims to create a versatile model capable of adapting to a variety of segmentation tasks (Mialon et al., 2023; Zhang et al., 2023a). This is achieved through "prompt engineering", where prompts are carefully designed to guide the model toward generating the desired output (Lobry et al., 2020; Sun et al., 2021). This concept is a departure from traditional multi-task systems, where a single model is trained to perform a fixed set of tasks. The unique feature of a promptable segmentation model is its ability to take on new tasks at the time of inference, serving as a component in a larger system (Sun et al., 2021; Mialon et al., 2023). For instance, to perform instance segmentation, a promptable segmentation model could be combined with an existing object detector.

Object detection is a crucial task in computer vision, focusing on identifying and locating objects within images. This task is foundational for various applications such as surveillance, autonomous vehicles, and many others. In the realm of object detection and image segmentation, different techniques have been employed. Traditional methods often focus on detecting objects that the model has been specifically trained on, known as closed-set detection. However, real-world applications demand more flexibility and the ability to detect and classify objects not seen during training, known as open-set detection.


One state-of-the-art open-set object detector that stands out is Grounding DINO (GroundDINO), an enhanced transformer-based object detector capable of identifying a broader range of objects based on various human inputs (Liu et al., 2023b). This system is an enhancement of the Transformer-based object detector called DINO (Zhang et al., 2022a), enriched with grounded pre-training to identify a broader range of objects based on human inputs, such as category names or referring expressions. An open-set detector is meant to identify and classify objects that were not part of the model's training data, as opposed to a closed-set detector, which can only recognize objects it has been specifically trained on. The information from Grounding DINO can potentially be used to guide the segmentation process, providing class labels or object boundaries that the segmentation model could use.

Most such models incorporate deep-learning-based networks and, with the rise of these methods, more advanced segmentation techniques have been developed for remote sensing applications. Convolutional Neural Networks (CNNs), which emerged as a popular choice due to their ability to capture local and hierarchical patterns in images (Martins et al., 2021; Bressan et al., 2022), have widely been used as the backbone for these tasks. CNNs consist of multiple convolutional layers that apply filters to learn increasingly complex features, making them well-suited for segmenting objects in many remote sensing images (Yuan et al., 2021; Bai et al., 2022). However, they are computationally intensive and may require substantial training data.

Generative Adversarial Networks (GANs) have also shown potential in the field of image processing. GANs consist of a generator and a discriminator network, where the generator tries to create synthetic data to fool the discriminator, and the discriminator aims to distinguish between real and synthetic data (Jozdani et al., 2022). For image segmentation, GANs can be used to generate realistic images and their corresponding segmentations, which can supplement the training data and improve the robustness of the segmentation models (Benjdira et al., 2019).

The Vision Transformer (ViT), on the other hand, is a recent development in deep learning that has shown promise in image segmentation tasks. Unlike CNNs, which rely on convolutional operations, ViT employs self-attention mechanisms that allow it to model long-range dependencies and global context within images (Li et al., 2023b,a). This approach has demonstrated competitive performance in various computer vision tasks, including remote sensing image segmentation (Aleissaee et al., 2023), and it is currently outperforming CNNs on remote sensing data (Gonçalves et al., 2023).

Another capability of deep learning that can enhance the segmentation process is transfer learning. With it, a model pre-trained on a large dataset is adapted for a different but related task (Tong et al., 2020). For instance, a CNN or ViT trained on a large-scale image recognition dataset like ImageNet can be fine-tuned for the task of remote sensing image segmentation (Osco et al., 2020, 2021b). The advantage of transfer learning is that it can leverage the knowledge gained from the initial task to improve performance on the new task, especially when the amount of labeled data for the new task is limited.

One of the main challenges in applying deep learning techniques to remote sensing image segmentation is the need for large volumes of labeled ground-truth data (Chi et al., 2016). Acquiring and annotating this data can be time-consuming and labor-intensive, requiring expert knowledge and resources that may not be readily available. Furthermore, the variability and complexity of remote sensing imagery can make the labeling process even more difficult (Amani et al., 2020). As such, it becomes imperative to develop robust, efficient, and accessible solutions that can aid in the processing and analysis of such data. A model that can perform segmentation with zero domain-specific information may offer an important advantage for this process.

In this sense, the Segment Anything Model (SAM) has emerged as a potential tool for assisting in the segmentation process of remote sensing images. SAM's design enables it to generalize to new image distributions and tasks effectively and has already resulted in numerous applications (Kirillov et al., 2023). By using minimal human input, such as bounding boxes, reference points, or simply text-based prompts, SAM can perform segmentation tasks without requiring extensive ground-truth data. This capability can reduce the labor-intensive process of manual annotation and be incorporated into the image processing pipeline, potentially accelerating its workflow.

SAM has been trained on an enormous dataset of 11 million images and 1.1 billion masks, and it boasts impressive zero-shot performance on a variety of segmentation tasks (Kirillov et al., 2023). Foundation models such as this, which have shown promising advancements in NLP and, more recently, in computer vision, can carry out zero-shot learning. This means they can learn from new datasets and perform new tasks, often by utilizing 'prompting' techniques, even with little to no previous exposure to these tasks. In the field of NLP, "foundation models" refer to large-scale models that are pre-trained on a vast amount of data and are then fine-tuned for specific tasks. These models serve as the "foundation" for various applications (Mai et al., 2023; Mialon et al., 2023; Wu et al., 2023).

SAM's ability to generalize across a wide range of objects and images makes it particularly appealing for remote sensing applications. That it can be retrained with a single example of each new class at the time of prediction (Zhang et al., 2023b) demonstrates the model's high flexibility and adaptability. The implementation of a one-shot approach may assist in designing models that learn useful information from a small number of examples, in contrast to traditional models, which usually require large amounts of data to generalize effectively. This could potentially revolutionize how we process remote-sensing imagery. As such, by investigating SAM's innovative technology, we may be able to provide more interactive and adaptable remote sensing systems.

3. Materials and methods

In this section, we describe how we evaluated the performance of the Segment Anything Model (SAM), for both the zero- and one-shot approaches, in the context of remote sensing imagery. The method implemented in this study is summarized in Fig. 1. The data for this study consisted of multiple aerial and satellite datasets. These datasets were selected to ensure diverse scenarios and a large range of objects and landscapes. This helped in assessing the robustness of SAM and its adaptability to different situations and geographical regions.

The study particularly investigated SAM's segmentation capacity under different prompting conditions. First, we used the general segmentation approach, in which SAM was tasked to segment objects and landscapes without any guiding prompts. This provided a baseline for SAM's inherent zero-shot segmentation capabilities. For this, we only evaluated its visual quality, since it segments every possible object in the image, instead of just the ones with ground-truth labels. It is also not guided by any means, thus resulting in the segmentation of unknown classes and serving as just a traditional segmentation filter.

In the second scenario, bounding boxes were provided. These rectangular boxes, highlighting specific areas within the images, were used to restrict SAM's segmentation per object and assess its proficiency in recognizing and segmenting them. Next, we conducted segmentation using points as prompts. In this setup, a series of specific points within the images were provided to guide SAM's processing. This allowed us to test the precision potential of SAM. Finally, we experimented with the segmentation process using only textual descriptions as prompts. This was conducted with an implementation of SAM alongside GroundingDINO's method (Liu et al., 2023b). This permitted an evaluation of these models' capabilities to understand, interpret, and transform textual inputs into precise segmentation outputs.

To measure SAM's adaptability and potential to deal with remote sensing imagery, we then devised a one-shot implementation. For each of the datasets, we presented an example of the target class to SAM.


Fig. 1. Schematic representation of the step-by-step process undertaken in this study to evaluate the efficacy of SAM’s approach in remote sensing image processing tasks.

Table 1
Overview of the distinct attributes and specifications of the datasets employed in this study.
| # | Platform | Resolution (m) | Area (ha) | Target | General | Box | Point | Text prompt | Reference |
|---|----------|----------------|-----------|--------|---------|-----|-------|-------------|-----------|
| 00 | UAV | 0.04 | 70 | Tree | Yes | Yes | Centroid | Tree | – |
| 01 | UAV | 0.04 | 70 | House | Yes | Yes | Centroid | House | – |
| 02 | UAV | 0.01 | 4 | Plantation Crop | Yes | No | Multiple | Plantation | Osco et al. (2021a) |
| 03 | UAV | 0.04 | 40 | Plantation Crop | Yes | No | Multiple | Plantation | – |
| 04 | UAV | 0.09 | 90 | Building | Yes | Yes | Centroid | Building | Gao et al. (2021) |
| 05 | UAV | 0.09 | 90 | Car | Yes | Yes | Centroid | Car | – |
| 06 | Airborne | 0.20 | 120 | Tree | Yes | Yes | Centroid | Tree | – |
| 07 | Airborne | 0.20 | 120 | Vehicle | Yes | Yes | Centroid | Vehicle | – |
| 08 | Airborne | 0.45 | 190 | Lake | Yes | Yes | Centroid | Lake | – |
| 09 | Satellite | 0.30 | – | Building; Road; Water; Barren; Forest; Farm | Yes | Yes | Multiple | Building; Road; Water; Barren; Forest; Farm | LoveDA |
| 10 | Satellite | 0.50 | 480 | Building; Street; Water; Vehicle; Tree | Yes | Yes | Yes | Building; Street; Water; Vehicle; Tree | SkySat ESA |

For that, we adapted the model with a novel combination of the text-prompt approach and the one-shot learning method. Specifically, we selected the best possible example (highest logits) of the target object, using textual prompts to define the object for mask generation. This example was then presented to SAM as the sole representative of the class, effectively guiding its learning process. The rationale behind this combined approach was to leverage the context provided by the text prompts and the efficacy of the one-shot learning method, adapting SAM through an automated enhancement process.

3.1. Description of the datasets

We begin by separating our dataset into three categories related to the platform used for capturing the images: 1. Unmanned Aerial Vehicle (UAV); 2. Airborne; and 3. Satellite. Each of these categories provides unique advantages and challenges in terms of spatial resolution and coverage. In our study, we aim to evaluate the performance of SAM across these sources to understand its applicability and limitations in diverse contexts. Their characteristics are summarized in Table 1. We also provide illustrative examples from these datasets in Fig. 2, with bounding boxes and point prompts.
The UAV category comprises data that have the advantage of very-high spatial resolution, returning images and targets with fine details. This makes them particularly suitable for local-scale studies and applications that require high-precision data. However, the coverage area of UAV datasets is limited compared to other data sources. The images comprised mostly single-class objects per dataset, so they were tackled in binary form. In the case of linear objects, specifically continuous plantation crop cover, we used multiple points spread between their extremes to ensure that the model was capable of understanding them better. For more condensed targets such as houses and trees, we used the centered position of the object as a point prompt.

The second category is Airborne data, which includes data collected by manned aircraft. These datasets typically offer a good compromise between spatial resolution and coverage area. We processed these datasets with the same approach as the UAV images, since they also consisted of binary problems. The total quantifiable size of these datasets surpasses 90 Gigabytes and comprises more than 10,000 images and image patches. Part of the dataset, specifically the aerial one (UAV and Airborne), is currently being made public at the following link for others to use: GeomaticsandComputerVision/Datasets. These datasets cover different area sizes, and their corresponding ground-truth masks were generated and validated by specialists in the field.

The third category consists of Satellite data, which provides the widest coverage and is focused on multi-class problems. The spatial resolution of satellite data is generally lower than that of UAV and Airborne data. Furthermore, the quality of the images is more affected by atmospheric conditions, with differing illumination conditions, thus providing additional challenges for the model. These datasets consist of publicly available images from the LoveDA dataset (Wang et al., 2022) and from the SkySat ESA archive (European Space Agency, 2023) and present a multi-class segmentation problem. To facilitate SAM's evaluation, specifically with the guided prompts (bounding box, point, and text), we conducted a one-against-all approach, in which we separated the classes into individual classifications ("specified class" versus "background").
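As a minimal sketch of this one-against-all setup (the class encodings shown are hypothetical and depend on the dataset files):

```python
import numpy as np

def one_against_all(label_map: np.ndarray, class_id: int) -> np.ndarray:
    """Collapse a multi-class label map into a binary 'specified class'
    versus 'background' mask for per-class evaluation."""
    return (label_map == class_id).astype(np.uint8)

# Hypothetical LoveDA-style encoding; the actual ids must match the dataset.
class_ids = {"building": 1, "road": 2, "water": 3,
             "barren": 4, "forest": 5, "farm": 6}
# binary = {name: one_against_all(labels, cid) for name, cid in class_ids.items()}
```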
3.2. Protocol for promptable image segmentation

In this section, we explain how we adapted SAM to the remote sensing domain and how we conducted the promptable image segmentation with it. All of the implemented code, specifically designed for this paper, is made publicly available in an under-construction educational repository (Osco, 2023). Also, as part of our work, we are focusing on developing the "segment-geospatial" package (Wu and Osco, 2023), which implements features that will simplify the process of using SAM models for geospatial data analysis. This is a work in progress, but it is publicly available and offers a suite of tools for performing general segmentation on remote-sensing images using SAM.


Fig. 2. Collection of image samples utilized in our research. The top row features UAV-based imagery with bounding boxes and point labels, serving as prompts for SAM. The
middle row displays airborne-captured data representing larger regions, with both points and a rectangular box provided as model inputs. The bottom row reveals satellite imagery,
again with bounding boxes and points as prompt inputs, offering a trade-off between lower spatial resolution and wider area coverage.

The goal is to enable users to engage with this technology with a minimum of coding effort.

Our geospatial analysis was conducted with the assistance of a custom tool, namely "SamGeo", which is a component of the original module. SAM offers different models, namely ViT-H, ViT-L, and ViT-B (Kirillov et al., 2023). These models have different computational requirements and are distinct in their underlying architecture. In this study, we used the ViT-H SAM model, which is the most advanced and complex model currently available, bringing most of the SAM capabilities to our tests.

To perform the general prompting, we used the generate method of the SamGeo instance. This operation is simple enough, since it segments the entire image and stores it as an image mask file containing the segmentation masks. Each mask delineates the foreground of the image, with each distinct mask allocated a unique value. This allowed us to segment different geospatial features. The result is a non-classified segmented image that can also be converted into a vector shape. As mentioned, we only evaluated this approach visually, since it was not possible to appropriately assign the segmented regions outside of our reference class.
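A minimal sketch of this general (prompt-free) workflow, assuming the segment-geospatial API as documented at the time of writing and placeholder file names:

```python
from samgeo import SamGeo

sam = SamGeo(
    model_type="vit_h",                # the ViT-H SAM backbone used in this study
    checkpoint="sam_vit_h_4b8939.pth",
    automatic=True,                    # automatic mask generation, no prompts
)

# Segment the entire image; each distinct mask receives a unique value.
sam.generate("orthomosaic.tif", output="segmentation.tif", unique=True)

# Optionally convert the raster masks to vector polygons for GIS use.
sam.tiff_to_vector("segmentation.tif", "segmentation.gpkg")
```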


For the bounding box prompt, we used the SamGeo instance in conjunction with the objects' shapefile. Bounding boxes were extracted from each multipart polygon geometry, which returned a list of geometric boundaries for our image data based on its coordinates. To efficiently process these boundaries, we initialized the predictor instance. In this process, the image was segmented and passed through the predictor along with a designated model checkpoint. Once established, the predictor processed each clip box, creating the masks for the segmented regions. This process enabled each bounding box's contents to be individually examined as instance segmentation masks. These binary masks were then merged and saved as a single mosaic raster to create a comprehensive visual representation of the segmented regions. Although not focused on remote sensing data, the official implementation is named Grounded-SAM (IDEA-Research, 2023).
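To make the flow concrete, the sketch below drives the underlying segment_anything predictor with georeferenced boxes; file names, band order, and the merging step are simplified placeholders rather than our exact pipeline.

```python
import numpy as np
import rasterio
import geopandas as gpd
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

with rasterio.open("orthomosaic.tif") as src:
    image = np.moveaxis(src.read([1, 2, 3]), 0, -1)  # (H, W, 3) uint8 RGB assumed
    transform = src.transform
predictor.set_image(image)

merged = np.zeros(image.shape[:2], dtype=np.uint16)
for i, geom in enumerate(gpd.read_file("objects.shp").geometry, start=1):
    minx, miny, maxx, maxy = geom.bounds
    r0, c0 = rasterio.transform.rowcol(transform, minx, maxy)  # top-left pixel
    r1, c1 = rasterio.transform.rowcol(transform, maxx, miny)  # bottom-right pixel
    box = np.array([c0, r0, c1, r1])                # [x1, y1, x2, y2] in pixels
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    merged[masks[0]] = i   # one instance id per box, mosaicked into one raster
```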
The single-point feature prompt was implemented similarly to the bounding-box method. For that, we first defined functions to convert the geodata frame into a list of coordinates [x, y] instead of the previous [x1, y1, x2, y2] ones. We utilized SamGeo again for model prediction, but with the distinction of setting its automatic parameter to 'False' and applying the predictor to individual coordinates instead of the bounding boxes. This approach was conducted by iterating through each point, predicting its features in instances, and saving the resulting mask into a unique file per point (also resulting in instance segmentation masks). After the mask files were generated, we proceeded to merge these masks into a single mosaic raster file, giving us a complete representation of all the segmented regions from the single-point feature prompt.
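The point-prompt variant follows the same pattern, exchanging the [x1, y1, x2, y2] boxes for single [x, y] coordinates. This is again a sketch, reusing the predictor, image, and transform from the box example above:

```python
points = gpd.read_file("points.shp")
merged_pts = np.zeros(image.shape[:2], dtype=np.uint16)
for i, pt in enumerate(points.geometry, start=1):
    row, col = rasterio.transform.rowcol(transform, pt.x, pt.y)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[col, row]]),  # a single [x, y] pixel coordinate
        point_labels=np.array([1]),           # 1 marks a foreground point
        multimask_output=False,
    )
    merged_pts[masks[0]] = i   # one instance mask per point, later mosaicked
```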
process due to its high probability of belonging to the specified class by
The text-based prompt differentiates from the previous approach
the text.
since it required additional steps to be implemented. This method
Once the object has been identified through this method, the next
combines GroundingDINO’s (Liu et al., 2023b) capabilities for zero-
phase involves creating a single-segmented object mask. This mask is
shot visual grounding with SAM’s object segmentation functionality for
used for the retraining of SAM in a one-shot manner. The text-based
retrieving the pre-trained models. For instance, once Grounding DINO
approach adds value by helping SAM distinguish between the different
has detected and classified an object, SAM is used to isolate that object
object instances present in the remote sensing imagery, such as multiple
from the rest. As a result, we have been able to identify and segment ‘‘houses’’, ‘‘cars’’, or ‘‘trees’’, for example. Each object is identified based
objects within our images based on a specified textual prompt. This on its individual likelihood, leading to the creation of a unique mask
procedure opens up a new paradigm in geospatial analysis, harnessing for retraining SAM. The third phase starts once the object with the
the power of state-of-the-art models to extract image features based highest probability has been identified and its mask has been used for
only on natural language input. SAM’s one-shot training. The selected input object is removed from
Since remote sensing imagery often contained multiple instances the original image, making the remaining objects ready for further
of the same object (e.g., several ‘houses’, ‘cars’, ‘trees’, etc.), we have segmentation.
added a looping procedure. The loop identifies the object with the The final phase involves a dynamic, interactive loop, where the
highest probability in the image (i.e. logits), creates a mask for it, remaining objects are continuously segmented until no more objects
removes it from the image, and then restarts the process to identify the are detectable by the PerSAM approach (Zhang et al., 2023b). This
next highest probable object. This process continues until the model phase is critical as it ensures that every potential object within the
reaches a defined minimum threshold for both detection, based on a image is identified and segmented. Here again, the loop approach aids
box threshold, and text prompt association, also based on an specific the process, using a procedure that identifies the next highest probable
threshold. The precise balancing of these thresholds (ranging from 0 to object, as it creates a mask, removes it from the image, and repeats.
1) is crucial, with implications for the accuracy of the model, so we This cycle continues until a breakpoint is reached, where it detects the
manually set them for each dataset based on trial and error tentatively: previous position again.
Another important aspect of the one-shot approach regards the
• Box Threshold: Utilized for object detection in images. A higher value augments model selectivity, isolating only those instances the model identifies with high confidence. A lower value, conversely, expands model tolerance, enhancing overall detections but possibly including less certain ones.
• Text Threshold: Utilized for associating detected objects with provided text prompts. An elevated value mandates a robust association between the object and text, ensuring precision but potentially limiting associations. A diminished value permits broader associations, potentially boosting the number of associations at the cost of precision.

These thresholds are critical for ensuring the balance between precision and recall based on specific data and user requirements. The optimal values may diverge depending on the nature and quality of the images and the specificity of text prompts, warranting user experimentation for optimal performance. The segmented individual images and their corresponding boxes are subsequently generated, while the resulting segmentation mask is saved and mosaicked.
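The looping procedure can be summarized as follows; `detect` and `segment` are hypothetical callables standing in for the GroundingDINO detection and the SAM box-prompt segmentation, and the threshold defaults are only placeholders for the per-dataset values set by trial and error.

```python
import numpy as np

def segment_all_instances(image, prompt, detect, segment,
                          box_thr=0.35, text_thr=0.25):
    """Iteratively extract every instance matching a text prompt."""
    masks, work = [], image.copy()
    while True:
        boxes, logits, phrases = detect(work, prompt, box_thr, text_thr)
        if len(boxes) == 0:
            break                        # nothing above the thresholds remains
        best = int(np.argmax(logits))    # highest-probability object first
        mask = segment(work, boxes[best])
        masks.append(mask)
        work[mask] = 0                   # remove the object, then restart
    return masks
```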
3.3. One-shot text-based approach

The one-shot training was conducted following the recommendations in Zhang et al. (2023b), using its PerSAM and PerSAM-F approaches. We begin by adapting the text-based approach, the combination of the GroundDINO (Liu et al., 2023b) and SAM (Kirillov et al., 2023) methods, to return the overall most probable object belonging to the class specified in its description. By doing so, we enable an automated process of identifying a single object and including it in a personalized pipeline for training SAM with this novel knowledge. In this section, we describe the procedures involved in the one-shot training mechanism as well as the methods used for object identification and personalization. To summarize the whole process, we illustrate the main phases in Fig. 3.

Following Fig. 3, the initial phase of the one-shot training mechanism involves the model derived from the object with the highest logits calculated from the text-based segmentation. This ensures the object is accurately recognized and selected for further steps. It is at this point in the process that the text-based approach starts, capitalizing on GroundDINO's capabilities for zero-shot visual grounding combined with SAM's object segmentation for pre-trained model retrieval. As such, the selected object becomes the "sample" of the one-shot training process due to its high probability of belonging to the class specified by the text.

Once the object has been identified through this method, the next phase involves creating a single-segmented object mask. This mask is used for the retraining of SAM in a one-shot manner. The text-based approach adds value by helping SAM distinguish between the different object instances present in the remote sensing imagery, such as multiple "houses", "cars", or "trees". Each object is identified based on its individual likelihood, leading to the creation of a unique mask for retraining SAM. The third phase starts once the object with the highest probability has been identified and its mask has been used for SAM's one-shot training. The selected input object is removed from the original image, making the remaining objects ready for further segmentation.

The final phase involves a dynamic, interactive loop, in which the remaining objects are continuously segmented until no more objects are detectable by the PerSAM approach (Zhang et al., 2023b). This phase is critical, as it ensures that every potential object within the image is identified and segmented. Here again, the loop aids the process, using a procedure that identifies the next highest probable object, creates a mask, removes it from the image, and repeats. This cycle continues until a breakpoint is reached, where it detects the previous position again.
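A condensed sketch of these four phases is given below; `detect`, `segment`, and `train_persam_f` are hypothetical wrappers around GroundingDINO, SAM, and the PerSAM-F fine-tuning step, not library functions.

```python
import numpy as np

def one_shot_text_pipeline(image, prompt, detect, segment, train_persam_f,
                           box_thr=0.35, text_thr=0.25):
    # Phase 1: text-guided selection of the single best example (highest logits).
    boxes, logits, _ = detect(image, prompt, box_thr, text_thr)
    best = int(np.argmax(logits))
    # Phase 2: its mask becomes the one-shot reference for PerSAM-F.
    reference = segment(image, boxes[best])
    persam_f = train_persam_f(image, reference)
    # Phase 3: remove the training example before segmenting the rest.
    remaining = image.copy()
    remaining[reference] = 0
    # Phase 4: loop until nothing is found or a position repeats (breakpoint).
    results, last_pos = [reference], None
    while True:
        mask, pos = persam_f(remaining)
        if mask is None or pos == last_pos:
            break
        results.append(mask)
        remaining[mask] = 0
        last_pos = pos
    return results
```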
Another important aspect of the one-shot approach regards the choice of the method for its training. An early exploration of both the PerSAM and PerSAM-F methods (Zhang et al., 2023b) was conducted to assess their utility in the context of remote sensing imagery. Our investigations have shown that PerSAM-F emerges as the more suitable choice for this specific domain. PerSAM, in its original formulation, leverages one-shot data through a series of techniques such as target-guided attention, target-semantic prompting, and cascaded post-refinement, delivering favorable personalized segmentation performance for subjects in a variety of poses or contexts. However, there were occasional failure cases, notably where the subjects comprised hierarchical structures to be segmented.


Fig. 3. Visual representation of the one-shot-based text segmentation process in action. The figure provides a step-by-step illustration of how the model identifies and segments
the most probable object based on a text prompt with ‘‘car’’ and ‘‘tree’’ as examples.

Fig. 4. Comparative illustration of tree segmentation using PerSAM and PerSAM-F. On the left, the PerSAM model segments not only the tree but also its shadow and a part of
the car underneath it. On the right, the PerSAM-F model, fine-tuned for hierarchical structures and varying scales, accurately segments only the tree, demonstrating its improved
ability to discern and isolate the target object in remote sensing imagery.

Examples of such cases in traditional images are discussed in Zhang et al. (2023b), where ambiguity poses a challenge for PerSAM in determining the scale of the mask as output (e.g., a "dog wearing a hat" may be segmented entirely, instead of just the "dog"). In the context of remote sensing imagery, such hierarchical structures are commonly encountered. An image may contain a tree over a house, a car near a building, a river flowing through a forest, and so forth. These hierarchical structures pose a challenge to the PerSAM method, as it struggles to determine the appropriate scale of the mask for the segmentation output. An example of such a case, where a tree covers a car, can be seen in Fig. 4.

To address this challenge, we used PerSAM-F, the fine-tuning variant of PerSAM. As previously mentioned, PerSAM-F freezes the entire SAM to preserve its pre-trained knowledge and only fine-tunes two parameters within a ten-second training window (Zhang et al., 2023b). Crucially, it enables SAM to produce multiple segmentation results with different mask scales, thereby allowing for a more accurate representation of hierarchical structures commonly found in remote sensing imagery. PerSAM-F employs learnable relative weights for each scale, which adaptively select the best scale for varying objects. This strategy offers an efficient way to handle the complexity of segmentation tasks in remote sensing imagery, particularly when dealing with objects that exhibit a range of scales within a single image. This, in turn, preserves the characteristics of the segmented objects more faithfully.


As such, PerSAM-F exhibited better segmentation accuracy in our early experiments, thus being the chosen method to be incorporated with the text-based approach. In our training phase with PerSAM-F, the DICE loss and Sigmoid Focal Loss are computed, and their summation forms the final loss that is backpropagated to update the model weights. The learning rate is scheduled using the Cosine Annealing method (Loshchilov and Hutter, 2017), and the model is trained for 1000 epochs. With hardware acceleration incorporated, the model can be trained within a reasonable time frame without requiring excessive computational resources. This careful setup ensures the extraction of meaningful features from the reference image, contributing to the effectiveness of our one-shot text-based approach.
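A self-contained sketch of this training setup is shown below; the random tensors merely stand in for SAM's frozen multi-scale mask logits and the text-selected reference mask, and the optimizer choice is an assumption of ours.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision.ops import sigmoid_focal_loss

def dice_loss(pred_logits, target, eps=1.0):
    pred = torch.sigmoid(pred_logits)
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

# Stand-ins: SAM's frozen multi-scale mask logits and the one-shot reference.
sam_logits = torch.randn(3, 256, 256)
reference = (torch.rand(256, 256) > 0.5).float()

w = torch.nn.Parameter(torch.zeros(2))        # the only trainable parameters
optimizer = torch.optim.AdamW([w], lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=1000)

for epoch in range(1000):                     # 1000 epochs, as described above
    weights = torch.softmax(torch.cat([torch.zeros(1), w]), dim=0)
    pred = (weights.view(3, 1, 1) * sam_logits).sum(dim=0)
    loss = dice_loss(pred, reference) + sigmoid_focal_loss(
        pred, reference, reduction="mean")    # summed DICE + focal loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```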
To evaluate the performance and utility of the text-based one-shot learning method, we conducted a comparative analysis against a traditional one-shot learning approach. The traditional method used for comparison follows the typical approach of one-shot learning, providing the model with a single example from the ground-truth mask, manually labeled by human experts. To ensure fairness, we provided the model with multiple random samples from each dataset and mimicked the image inputs to return a direct comparison for both approaches. We calculated the evaluation metrics for each input and returned the average value alongside its standard deviation. Since the text approach always uses the same input (i.e., the highest-logits object), we were able to return a single measurement of its accuracy.

3.4. Model evaluation

The performance of both the zero-shot and one-shot models was measured by evaluating their prediction accuracy against a ground-truth mask. For that, we used metrics like Intersection over Union (IoU), Pixel Accuracy, and the Dice Coefficient. These metrics are commonly used in evaluating image segmentation, as they provide a more nuanced understanding of model performance. For that, we compared pairs of predicted and ground-truth masks.

Intersection over Union (IoU) is a common evaluation metric for object detection and segmentation problems. It measures the overlap between the predicted segmentation and the ground truth (Rahman and Wang, 2016). The IoU is the area of overlap divided by the area of the union of the predicted and ground-truth segmentation. A higher IoU means a more accurate segmentation. It is computed as:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{1}$$

Here, TP represents True Positives (the correctly identified positives), FP represents False Positives (the incorrectly identified positives), and FN represents False Negatives (the positives that were missed).

Pixel Accuracy is the simplest metric used; it measures the percentage of pixels that were accurately classified (Minaee et al., 2021). It is calculated by dividing the number of correctly classified pixels by the total number of pixels. This metric can be misleading if the classes are imbalanced. It is given by:

$$\mathrm{Pixel\ Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2}$$

Here, TN represents True Negatives (the correctly identified negatives).

The Dice Coefficient (also known as the Sørensen–Dice index) is another metric used to gauge the performance of image segmentation methods. It is particularly useful for comparing the similarity of two samples. The Dice Coefficient is twice the area of overlap of the two segmentations divided by the total number of pixels in both images (the sum of the areas of both segmentations) (Minaee et al., 2021). The Dice Coefficient ranges from 0 (no overlap) to 1 (perfect overlap). It is computed as:

$$\mathrm{Dice} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{3}$$

We also utilized other metrics, particularly the True Positive Rate (TPR) and the False Positive Rate (FPR), to measure the effectiveness of SAM, juxtaposed with the accurately labeled class from each dataset. The interpretation of these metrics as per Powers (2020) is: the True Positive Rate (TPR) denotes the fraction of TP cases among all actual positive instances, while the False Positive Rate (FPR) signifies the fraction of FP instances out of all negative instances. A model with a higher TPR is proficient at correctly pinpointing target pixels, and a lower FPR indicates fewer incorrect detections. Both metrics are calculated as:

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{4}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{5}$$
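Equations (1)–(5) translate directly into a few lines of NumPy over boolean prediction and ground-truth masks (a sketch that assumes non-degenerate masks, so no zero denominators arise):

```python
import numpy as np

def evaluate(pred: np.ndarray, truth: np.ndarray) -> dict:
    """pred and truth are boolean masks of equal shape."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    return {
        "iou": tp / (tp + fp + fn),                    # Eq. (1)
        "pixel_acc": (tp + tn) / (tp + fp + tn + fn),  # Eq. (2)
        "dice": 2 * tp / (2 * tp + fp + fn),           # Eq. (3)
        "tpr": tp / (tp + fn),                         # Eq. (4)
        "fpr": fp / (fp + tn),                         # Eq. (5)
    }
```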
In alignment with the inherent structure of SAM, a transformer network, our objective was to maintain the comprehensive context of our images to fully harness the model's attention mechanism. This consideration led to our decision to process larger image crops or entire orthomosaics as a single unit, rather than fragmenting them into fixed-size smaller patches. While this approach enhances the model's contextual understanding, it understandably augments the computational time.

For most larger patches or quartered orthomosaics, the inference duration on a GPU was kept under 10 min, providing a balance between computational load and contextual analysis. When processing entire datasets as a whole, the time requirement extended to approximately 1 to 2 h. Despite the augmented processing time for larger datasets, the assurance of comprehensive contextual analysis justifies this computational investment. Still, for fixed-size patches such as the ones from the publicly available datasets, the inference time was under a second per patch. These inferences were executed on an NVIDIA RTX 3090 equipped with 24 GB of GDDR6X video memory and 10,496 CUDA cores, operating on Ubuntu 22.04.

4. Results and discussion

4.1. General segmentation

Our exploration of SAM for remote sensing tasks involved an evaluation of its performance across various datasets and scenarios. This section presents the results and discusses their implications for SAM's role in remote sensing image analysis. This process commenced with an investigation of SAM's general segmentation approach, which requires no prompts. By merely feeding SAM with remote sensing images, we aimed to observe its inherent ability to detect and distinguish objects on the surface. Examples at different scales are illustrated in Fig. 5, where we converted the individual regions to vector format. This approach demonstrates its adaptability and suitability for various applications. However, as this method is not guided by a prompt, it does not return specific segmentation classes, making it difficult to measure its accuracy based on our available labels.

As depicted in Fig. 5, the higher the spatial resolution of an image, the more accurately SAM segmented the objects. An interesting observation pertained to the processing of satellite images, where SAM encountered difficulties in demarcating the boundaries between contiguous objects (like large fragments of trees or roads). Despite this limitation, SAM exhibited an ability to distinguish between different regions when considering very-high spatial resolution imagery, indicative of an effective segmentation capability that does not rely on any prompts. This approach offers value for additional applications that are based on object regions, such as classification algorithms. Moreover, SAM can expedite the process of object labeling for refining other models, thereby significantly reducing the time and manual effort required for this purpose.
areas of both segmentations) (Minaee et al., 2021). The Dice Coefficient required for this purpose.


Fig. 5. Examples of segmented objects using SAM’s general segmentation method, drawn from diverse datasets based on their platforms. Objects are represented in random colors.
As the model operates without any external inputs, it deduces object boundaries leveraging its zero-shot learning capabilities.

4.2. Zero-shot segmentation

Following this initial evaluation, we proceeded to test SAM's promptable segmentation abilities using bounding boxes, points, and text features. The resulting metrics for each dataset are summarized in Table 2. Having compiled a dataset across diverse platforms, including UAVs, aircraft devices, and satellites with varying pixel sizes, we noted that SAM's segmentation efficacy is also quantitatively influenced by the image's spatial resolution. These findings underscore the significant influence of spatial resolution on the effectiveness of different prompt types.

For instance, on the UAV platform, text prompts showed superior performance for object segmentation tasks such as trees, with higher Dice and IoU values. However, bounding box prompts were more effective for delineating geometrically well-defined and larger objects like houses and buildings. The segmentation of plantation crops was a unique case. Point prompts performed well at a finer 0.01 m resolution for individual plants. However, as the resolution coarsened to 0.04 m and the plantation types changed, becoming denser with the plant canopy covering entire rows, bounding box prompts outperformed the others. This outcome suggests that, for certain objects, the type of input prompt can greatly influence detection and segmentation in the zero-shot approach.

With the airborne platform, point prompts were highly effective at segmenting trees and vehicles at a 0.20 m resolution. This trend continued for the segmentation of lakes at a 0.45 m resolution. It raises the question of whether the robust performance of point prompts in these scenarios is a testament to their adaptability to very high-resolution imagery or a reflection of the target objects' specific characteristics. These objects primarily consist of very defined features (like cars and vehicles) or share similar characteristics (as in bodies of water).

In the context of satellite-based remote sensing imagery, point prompts proved most efficient for multi-class segmentation at the examined resolutions of 0.30 m and 0.50 m. This can be attributed to the fact that bounding box prompts tend to overshoot object boundaries, producing more false positives compared to point prompts. This finding indicates the strong ability of point prompts to manage a diverse set of objects and categories at coarser resolutions, making them a promising tool for satellite remote sensing applications.


Table 2
Summary of metrics for the image segmentation task across different platforms, targets, and resolutions, and using different prompts for SAM
in zero-shot mode. The values in red indicate the best performance for a particular target under specific conditions.
# Platform Target Resolution (m) Prompt Dice (%) IoU (%) Pixel Acc. (%) TPR (%) FPR (%)
00 UAV Tree 0.04 Box 88.8 79.9 96.0 94.2 3.6
Point 91.8 84.8 97.6 91.6 1.4
Text 92.2 85.2 98.1 92.1 1.2
01 UAV House 0.04 Box 92.7 86.3 98.4 97.4 1.5
Point 70.8 54.8 84.0 96.6 19.2
Text 89.2 79.8 95.6 97.1 10.1
02 UAV Plantation 0.01 Box 86.2 82.8 85.5 88.2 11.1
Point 95.8 92.0 95.0 98.0 9.2
Text 67.1 64.4 66.5 68.6 12.0
03 UAV Plantation 0.04 Box 80.1 68.9 95.2 94.4 10.4
Point 72.7 57.1 93.5 93.4 6.5
Text 44.1 32.8 49.9 45.0 6.1
04 UAV Building 0.09 Box 69.7 53.5 81.3 95.5 22.8
Point 69.1 52.8 84.2 91.1 17.5
Text 66.3 50.9 77.2 90.7 24.0
05 UAV Car 0.09 Box 78.8 65.0 97.0 66.0 0.2
Point 90.0 81.9 99.1 86.7 0.3
Text 92.7 84.3 97.3 89.3 0.1
06 Airborne Tree 0.20 Box 68.8 52.4 91.2 84.4 7.9
Point 91.7 84.7 93.5 88.3 2.9
Text 89.0 82.2 90.7 85.6 3.7
07 Airborne Vehicle 0.20 Box 86.1 75.6 99.5 86.9 0.3
Point 86.3 75.9 99.1 78.5 0.1
Text 84.6 74.4 97.1 76.9 0.2
08 Airborne Lake 0.45 Box 57.4 40.3 98.3 98.8 1.7
Point 97.2 94.5 99.9 99.1 0.1
Text 89.4 86.9 91.9 91.2 0.8
09 Satellite Multiclass 0.30 Box 39.1 22.5 94.5 22.6 0.4
Point 82.3 56.7 87.8 67.8 3.7
Text 74.0 51.0 79.1 61.0 3.9
10 Satellite Multiclass 0.50 Box 26.1 15.0 93.6 15.1 0.5
Point 54.9 37.8 87.0 45.2 4.2
Text 49.4 34.0 78.3 40.7 4.4

The text-based approach was found to be the least effective, primarily due to the model's difficulty in associating low-resolution objects with words. Still, it is important to notice that, of all the datasets, the satellite multiclass problem proved to be the most difficult task for the model, with generally lower metrics than the others.

Qualitatively, our observations also revealed that bounding boxes were particularly effective for larger objects (Fig. 6). However, for smaller objects, SAM tended to overestimate the object size by including shadows in the segmented regions. Despite this overestimation, the bounding box approach still offers a useful solution for applications where an approximate estimate of such larger objects suffices. For these types of objects, a single point or central location does not suffice, as they are defined by a combination of features within a particular area. Bounding boxes provide a more spatially comprehensive prompt, encapsulating the entire object, which makes them more efficient in these instances.

The point-based approach outperformed the others across our dataset, specifically for distinct objects. By focusing on a singular point, SAM was able to provide precise segmentation results, thus proving its capability to work in detail (Fig. 7). In the plantation dataset with 0.01 m resolution, for instance, when considering individual small plants, the point approach returned better results than bounding boxes. This approach may hold particular relevance for applications requiring precise identification and segmentation of individual objects in an image. Also, when isolating entities like single trees and vehicles, these precise spatial hints might suffice for the model to accurately identify and segment the object.

The textual prompt approach also yielded promising results, particularly with very high-resolution images (Fig. 8). While it was found to be relatively comparable in performance with the point and bounding box prompts for the aerial datasets, the text prompt approach had notable limitations when used with lower spatial resolution images. The text-based approach also returned worse predictions on the plantation dataset with 0.04 m resolution. This may be associated with the model's limitation in understanding the characteristics of specific targets, especially when considering the bird's-eye view of remote sensing images. Since it relies on GroundDINO to interpret the text, this may be more of a limitation of it than of SAM, mostly because, when applying the general segmentation, the results visually returned overall better segmentation on these datasets (Fig. 5).

Text prompts, though generally trailing behind in performance, still demonstrated commendable results, often closely following the top-performing prompt type. Text prompts offer ease of implementation as their primary advantage. They do not necessitate specific spatial annotations, which are often time-consuming and resource-intensive to produce, especially for extensive remote sensing datasets. However, their effectiveness hinges on the model's ability to translate text to image information. Currently, their key limitation is that they are typically not trained specifically on remote sensing images, leading to potential inaccuracies when encountering remote sensing-specific terms or concepts. Improving the effectiveness of text prompts can be achieved through fine-tuning models on remote sensing-specific datasets and terminologies. This could enable them to better interpret the nuances of remote sensing imagery, potentially enhancing their performance to match or even surpass spatial prompts like boxes and points.

Fig. 6. Illustrations of images processed using bounding-box prompts. The first column consists of the RGB image, while the second column demonstrates how the prompt was
handled. The ground-truth mask is presented in the third column and the prediction result from SAM in the fourth. The last column indicates the false positive (FP) pixels from
the prediction.

4.3. One-shot segmentation

Regarding our one-shot approach, we noticed that the models' performance is improved in most cases, as evidenced by the segmentation metrics calculated on each dataset. Table 3 presents a detailed comparison of the different models' performance, providing a summary of the segmentation results. Fig. 9 offers a visual illustration of example results obtained from both approaches, particularly highlighting the performance of the model. The metrics indicate that, while the PerSAM approach with a human-sampled example may be more appropriate than the proposed text-based approach, this may not always be the case when considering the metric's standard deviation. This opens up the potential for adopting the automated process instead. However, in some instances, specifically where GroundDINO is not capable of identifying the object to begin with, the human labeling provides a more appropriate result.
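For reference, the per-dataset figures in Table 3 follow the usual confusion-matrix definitions; a minimal sketch of how such mask-level metrics can be computed is given below (array names are illustrative, and degenerate masks with empty denominators are not handled).

```python
# Sketch of the mask-level metrics reported in Table 3, computed from a
# boolean prediction array and its ground truth.
import numpy as np

def mask_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)     # correctly segmented pixels
    fp = np.sum(pred & ~truth)    # pixels wrongly added by the model
    fn = np.sum(~pred & truth)    # target pixels the model missed
    tn = np.sum(~pred & ~truth)   # correctly rejected background
    return {
        "Dice": 2 * tp / (2 * tp + fp + fn),
        "IoU": tp / (tp + fp + fn),
        "Pixel Acc.": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
    }
```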


Fig. 7. Illustrations of images processed using point prompts. The first column presents the RGB image, while the second column demonstrates the handling of the point prompt.
The third column showcases the ground-truth mask, and the fourth column shows the prediction result from SAM. The final column highlights the false positive (FP) pixels from
the prediction.

In its zero-shot form, SAM tends to favor selecting shadows in some instances alongside its target, which can lower its performance in tasks like tree detection. Segmenting objects with similar surrounding elements, especially when dealing with construction materials like streets and sidewalks, can be challenging for SAM, as noticed in our multi-class problem. Moreover, its performance with larger grouped instances, particularly when using the single-point mode, can be unsatisfactory. Also, the segmentation of smaller and irregular objects poses difficulties for SAM independently from the given prompt. SAM may generate disconnected components that do not correspond to actual features, specifically in satellite imagery where the spatial resolution is lower.

The text-based one-shot learning approach, on the other hand, automates the process of selecting the example. It uses the text-based prompt to choose the object with the highest probability (highest logits) from the image as the training example. This not only reduces the need for manual input but also ensures that the selected object is highly representative of the specified class due to its high probability.
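A minimal sketch of this selection step is shown below; the `boxes` and `logits` arrays stand in for the output of a text-grounded detector such as GroundDINO, and the commented SAM call is only illustrative.

```python
# Sketch of the automated exemplar selection: among candidate detections
# returned by the text prompt, the highest-logit one is kept as the
# one-shot training example. The arrays below are illustrative stand-ins.
import numpy as np

boxes = np.array([[10, 20, 80, 90], [105, 30, 160, 85]])  # xyxy candidates
logits = np.array([0.62, 0.87])                           # grounding scores

best = int(np.argmax(logits))   # most probable instance of the class
exemplar_box = boxes[best]      # box that prompts SAM for the exemplar mask
# exemplar_mask = sam_predict(box=exemplar_box)  # hypothetical SAM call
```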


Fig. 8. Examples of images processed through text-based prompts. The first column contains the RGB image, while the second column indicates the text prompt used for the
model. The ground-truth mask is shown in the third column, with the prediction result from SAM in the fourth. The last column indicates the false positive (FP) pixels from the
prediction.

Additionally, the text-based approach is capable of handling multiple instances of the same object class in a more streamlined manner, thanks to the looping mechanism that iteratively identifies and segments objects based on their probabilities. The one-example policy, however, excluded some of the objects in the image, favoring only those similar to the given sample.
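The looping mechanism can be summarized as follows, with `segment_box` standing in for a SAM box-prompt call and the threshold value being an illustrative choice rather than the one used in our experiments.

```python
# Sketch of the looping mechanism: candidates are visited from the highest
# to the lowest logit, and each box prompts one segmentation, so every
# confident instance of the class is handled in turn.
import numpy as np

def segment_all(boxes: np.ndarray, logits: np.ndarray,
                segment_box, threshold: float = 0.3) -> list:
    masks = []
    for i in np.argsort(logits)[::-1]:        # most probable first
        if logits[i] < threshold:             # stop once confidence is low
            break
        masks.append(segment_box(boxes[i]))   # one mask per instance
    return masks
```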
In summary, upon comparing these two methods, we found that the traditional one-shot learning approach outperforms the zero-shot learning approach in all datasets. Additionally, the combination of text-based with one-shot learning, even when not improving on it, gets close enough in most cases. This comparison underscores the benefits and potential of integrating state-of-the-art models with natural language processing capabilities for efficient and accurate geospatial analysis. Nevertheless, it is important to remember that the optimal choice between these methods may vary depending on the specific context and requirements of a given task.

5. Future perspectives on SAM for remote sensing


Table 3
Comparison of segmentation results on different platforms and targets when considering both the one-shot and the text-based one-shot approaches. The baseline values refer to the best metric obtained by the previous zero-shot investigation, be it from a bounding box, a point, or a text prompt. The red colors indicate the best result for each scenario.
# Platform Target Resolution (m) Sample Dice (%) IoU (%) Pixel Acc. (%) TPR (%) FPR (%)
00 UAV Tree 0.04 Baseline 92.2 85.2 98.1 92.1 1.2
PerSAM-F 94.5 ± 4.2 87.4 98.8 94.4 1.1
Text PerSAM-F 95.0 ± 4.9 87.8 99.3 96.3 0.9
01 UAV House 0.04 Baseline 92.7 86.3 98.4 97.4 1.5
PerSAM-F 95.4 ± 2.1 88.9 99.3 98.1 1.1
Text PerSAM-F 95.0 ± 2.7 88.5 98.8 99.8 1.4
02 UAV Plantation Crop 0.01 Baseline 80.1 68.9 95.2 94.4 10.4
PerSAM-F 82.1 ± 6.4 70.6 98.8 96.8 9.6
Text PerSAM-F 64.1 ± 7.2 55.1 76.2 75.5 15.6
03 UAV Plantation Crop 0.04 Baseline 95.8 92.0 95.0 98.0 9.2
PerSAM-F 98.2 ± 1.1 94.3 98.8 100.4 8.5
Text PerSAM-F 76.7 ± 1.3 73.6 76.0 78.4 13.8
04 UAV Building 0.09 Baseline 69.7 53.5 81.3 95.5 22.8
PerSAM-F 87.2 ± 6.2 66.9 98.0 96.6 21.0
Text PerSAM-F 73.2 ± 6.7 54.9 94.3 97.9 21.1
05 UAV Car 0.09 Baseline 92.7 84.3 97.3 89.3 0.1
PerSAM-F 95.0 ± 2.4 86.4 98.8 91.5 0.1
Text PerSAM-F 95.5 ± 3.0 86.9 99.3 93.3 0.1
06 Airborne Tree 0.20 Baseline 91.7 84.7 93.5 88.3 2.9
PerSAM-F 94.0 ± 1.3 86.8 98.8 90.5 2.7
Text PerSAM-F 94.5 ± 1.5 87.3 99.3 92.3 2.1
07 Airborne Vehicle 0.20 Baseline 86.3 75.9 99.1 78.5 0.1
PerSAM-F 88.4 ± 5.6 77.8 99.8 80.4 0.2
Text PerSAM-F 86.7 ± 6.5 76.3 99.6 78.9 0.1
08 Airborne Lake 0.45 Baseline 97.2 94.5 99.9 99.1 0.1
PerSAM-F 97.6 ± 1.5 94.9 99.9 99.5 0.1
Text PerSAM-F 97.3 ± 1.3 94.6 99.8 99.2 0.1
09 Satellite Multiclass 0.30 Baseline 82.3 56.7 87.8 67.8 3.7
PerSAM-F 90.5 ± 5.2 68.0 96.6 74.5 3.5
Text PerSAM-F 89.7 ± 5.3 61.8 95.8 73.9 3.5
10 Satellite Multiclass 0.50 Baseline 54.9 37.8 87.0 45.2 4.2
PerSAM-F 60.3 ± 10.4 45.3 95.7 49.7 3.9
Text PerSAM-F 59.8 ± 12.3 41.2 94.8 49.2 4.0

SAM has several advantages that make it an attractive option for remote sensing applications. First, it offers zero-shot generalization to unfamiliar objects and images without requiring additional training (Kirillov et al., 2023). This capability allows SAM to adapt to the diverse and dynamic nature of remote sensing data, which often consists of varying land cover types, resolutions, and imaging conditions. Second, SAM's interactive input process can significantly reduce the time and labor required for manual image segmentation. The model's ability to generate segmentation masks with minimal input, such as a text prompt, a single point, or a bounding box, accelerates the annotation process and improves the overall efficiency of remote sensing data analysis. Lastly, the decoupled architecture of SAM, comprising a one-time image encoder and a lightweight mask decoder, makes it computationally efficient. This efficiency is crucial for large-scale remote sensing applications, where processing vast amounts of data on time is of utmost importance.
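This design is visible in Meta AI's reference implementation, sketched below: the embedding is computed once per scene and then reused by successive prompts (the checkpoint path and the example click locations are placeholders).

```python
# Sketch of SAM's decoupled design with the segment-anything package: the
# heavy image encoder runs once, and each subsequent prompt only invokes
# the lightweight mask decoder.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for an RGB tile
predictor.set_image(image)                       # one-time, costly encoding

for click in [(100, 150), (300, 320)]:           # many prompts, one embedding
    masks, scores, _ = predictor.predict(
        point_coords=np.array([click]),
        point_labels=np.array([1]),              # 1 marks a foreground point
    )
```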
However, our study consists of an initial exploration of this model, where there is still much to be investigated. In this section, we discuss future perspectives on SAM and how it can be improved upon. Despite its potential, SAM has some limitations when applied to remote sensing imagery. One challenge is that remote sensing data often come in different formats, resolutions, and spectral bands. SAM, which has been trained primarily on RGB images, may not perform optimally with multispectral or hyperspectral data, which are common in remote sensing applications. A possible approach to this issue consists of either adapting SAM to read in multiple bands by performing rotated 3-band combinations or performing fine-tuning for domain adaptation. In our early experiments, a simple example run on different multispectral datasets demonstrated that, although the model has the potential to segment different regions or features, it still needs further exploration. This is something that we intend to explore in future research, but expect that others may look into it as well.
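A minimal sketch of the rotated 3-band idea is given below, with `segment_rgb` standing in for a SAM inference routine; note that the number of composites grows combinatorially with the band count, so a curated subset of combinations may be preferable in practice.

```python
# Sketch of the rotated 3-band idea: cycle over band triplets of a
# multispectral cube, segment each false-color composite, and merge the
# resulting boolean masks.
from itertools import combinations
import numpy as np

def multiband_segment(cube: np.ndarray, segment_rgb) -> np.ndarray:
    """cube: (H, W, B) multispectral array with B >= 3 bands."""
    merged = np.zeros(cube.shape[:2], dtype=bool)
    for bands in combinations(range(cube.shape[2]), 3):
        composite = cube[:, :, list(bands)]   # 3-band false-color image
        merged |= segment_rgb(composite)      # union of per-composite masks
    return merged
```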
Regardless, the current model can be effectively used in various remote sensing applications. For instance, we verified that SAM can be easily employed for land cover mapping, where it can segment forests, urban areas, and agricultural fields. It can also be used for monitoring urban growth and land use changes, enabling policymakers and urban planners to make informed decisions based on accurate and up-to-date information. Furthermore, SAM can be applied in a pipeline process to monitor and manage natural resources. Its efficiency and speed make it suitable for real-time monitoring, providing valuable information to decision-makers. This is also a feature that could be potentially explored by research going forward with its implementation.

Nevertheless, it is crucial to underscore a significant limitation concerning the complexity of our data. While our primary objective was to analyze results across varying spatial resolutions and broad remote sensing segmentation tasks, the limited regional diversity of our data may not fully capture the range of object characteristics encountered worldwide. Future research, therefore, could emphasize utilizing and adapting to a more diverse array of the same object, thereby bolstering the robustness and applicability of the model or its adaptations. For instance, in the detection of buildings and water bodies, exploration of publicly available datasets from diverse regions (Boguszewski et al., 2022; Zhang et al., 2023c) could provide a more comprehensive understanding of these objects' varied characteristics, and contribute to the enhancement of algorithmic performance across varied geographical contexts.

Regarding the one-shot technique based on SAM, which is the capacity to generate accurate segmentation from a single example (Zhang et al., 2023b), our experimental results indicate an improvement in performance across most investigated datasets, especially considering the border of the objects.


Fig. 9. Visual illustration of the segmentation results using PerSAM and text-based PerSAM. The last two columns highlight the difference in pixels from the PerSAM prediction and the text-based PerSAM prediction to its ground truth. The graphic compares the range of the Dice values of both PerSAM and text-based PerSAM, illustrating how the proposed approach remains similar to the traditional PerSAM approach, underscoring the potential for most practices to adopt the automated process in such cases.

However, it is essential to note that one-shot learning may pose challenges to the generalization capability of the model. This may be an issue for remote sensing data, which often exhibit a high degree of heterogeneity and diversity (Zia et al., 2022). For instance, a "healthy" tree can be a good sample for the model, but it can bias it to ignore "unhealthy" trees or canopies with different structures.
Expanding the one-shot learning to a few-shot scenario could potentially improve the model's adaptability to different environments or tasks by enabling it to learn from more than one example (2 to 10) instead of a single one. This would involve using a small set of labeled objects for each land cover type during the training process (Sun et al., 2021; Li et al., 2022a). A more robust learning approach, which uses a larger number of examples for each class, could further enhance the model's ability to capture the nuances and variations within each class. This approach, however, may require more computational resources and training data, and thus may not be suitable for all applications.
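One hypothetical way to generalize the one-shot matching step to such a few-shot bank is sketched below, where the per-pixel score for the class is the best similarity over K exemplar embeddings (all array names are illustrative).

```python
# Hypothetical sketch of extending one-shot matching to a few-shot bank:
# pixel features are compared against K exemplar embeddings (2 to 10), so
# structurally different samples of the same class can all guide the prompt.
import numpy as np

def class_similarity(features: np.ndarray, exemplars: np.ndarray) -> np.ndarray:
    """features: (H, W, C) image features; exemplars: (K, C) vectors."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    e = exemplars / np.linalg.norm(exemplars, axis=-1, keepdims=True)
    sims = np.einsum("hwc,kc->hwk", f, e)  # cosine similarity per exemplar
    return sims.max(axis=-1)               # best match over the K examples
```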
Additionally, while SAM is a powerful tool for image segmentation, its effectiveness can be boosted when combined with other techniques. For example, integrating SAM into another ViT framework in a weakly-supervised manner could potentially improve the segmentation result, better handling the spatial-contextual information. However, it is worth noting that integrating it might also bring new challenges (Wang et al., 2020a). One potential issue could be the increased model complexity and computational requirements, which might limit its feasibility. But, as the training of transformers typically requires large amounts of data, SAM can provide fast and relatively accurate labeled regions for it.

Furthermore, one of the key challenges to tackle would be improving SAM's performance when applied to low spatial resolution imagery. As the original training data of SAM primarily consisted of high-resolution images, it is inherently more suitable for similar high-resolution conditions, even in the remote sensing domain. The noticeable decrease in accuracy at resolutions above 30 cm, noted in our tests, further substantiates this observation. This shortcoming can be further explored by coupling SAM with a Super-Resolution (SR) technique (Yang et al., 2015), for instance, creating a two-step process, where the first step involves using an SR model to increase the spatial resolution of the imagery, and the second step involves using the enhanced resolution image as an input to SAM. It is acknowledged that while this method can theoretically enhance the performance of SAM with low-resolution images, the Super-Resolution techniques themselves can introduce errors, potentially offsetting the benefits (Yang et al., 2015). Therefore, the proposed two-step process should be approached with caution, ensuring meticulous testing and validation. A dedicated exploration into refining and optimizing SAM for lower-resolution images, possibly involving adaptation and training of the model on lower-resolution data, will be integral to ensuring its effective and reliable application in diverse remote sensing scenarios.
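The envisioned data flow is sketched below; a bicubic resize merely stands in for a learned SR model, and `segment` is a placeholder for a SAM call, with `low_res` assumed to be an 8-bit RGB array.

```python
# Sketch of the proposed two-step process: enhance resolution first, then
# segment the enhanced image.
import numpy as np
from PIL import Image

def sr_then_sam(low_res: np.ndarray, segment, scale: int = 4):
    h, w = low_res.shape[:2]
    up = Image.fromarray(low_res).resize((w * scale, h * scale), Image.BICUBIC)
    enhanced = np.asarray(up)   # step 1: resolution enhancement
    return segment(enhanced)    # step 2: SAM on the enhanced image
```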


As we explored the integration of SAM with other types of methods, such as GroundDINO (Liu et al., 2023b), we noticed both strengths and limitations, which were already discussed in the previous section. This combination demonstrates a high degree of versatility and accuracy in tasks such as instance segmentation, where GroundDINO's object detection and classification guided SAM's segmentation process. However, the flexibility of this approach extends beyond these specific models. Any similar models could be swapped in as required, expanding the applications and robustness of the system. Alternatives such as GLIP (Li et al., 2022b) or CLIP (Liu et al., 2023a) may replace GroundDINO, allowing for further experimentation and optimization (Zhang et al., 2022b). Furthermore, integrating language models like ChatGPT (OpenAI, 2023) could offer additional layers of interaction and nuances of understanding, demonstrating the far-reaching potential of combining these expert models. This modular approach underpins a potent and adaptable workflow that could reshape our capabilities in handling remote sensing tasks.
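In its simplest form, this modularity amounts to a fixed "text to boxes, boxes to masks" contract, as in the hypothetical sketch below, where any open-set detector and any promptable segmenter can be plugged in.

```python
# Hypothetical sketch of the modular pipeline: any detector exposing a
# "text -> boxes" interface (GroundDINO, GLIP, CLIP-based, ...) can feed
# the box prompts of any promptable segmenter such as SAM.
def grounded_segmentation(image, prompt, detector, segmenter):
    boxes = detector(image, prompt)                  # swappable detector
    return [segmenter(image, box) for box in boxes]  # one mask per box
```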
The integration of Geographical Information Systems (GIS) with models like SAM holds significant promise for enhancing the annotation process for training specific segmentation and change detection models. A fundamental challenge often lies in the discrepancy between the training data and the image data employed, due to different acquisition times and because the data used could be marred with annotator errors, leading to a compatibility issue with the used image. The integration with SAM could help users optimize the creation of annotations and, when suitable, improve its results with editing, thus creating a quicker and more robust dataset. Lastly, a topic which is not discussed in this paper, but which is an important issue for applications particularly in the area of geospatial intelligence, is AI security. A recent survey paper on this topic is Xu et al. (2023). It discusses issues such as the fact that it can be unclear on which data a (foundation) model has been trained and what deficits may arise from this. Particularly, an adversary might have contaminated the training data.
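As a sketch of this GIS hand-off, a georeferenced SAM mask raster can be vectorized into polygons that analysts can then edit and refine; the file names below are illustrative, and rasterio, shapely, and geopandas are assumed to be installed.

```python
# Sketch of vectorizing a SAM mask raster into GIS-ready polygons.
import geopandas as gpd
import rasterio
from rasterio.features import shapes
from shapely.geometry import shape

with rasterio.open("masks.tif") as src:
    band = src.read(1)
    crs = src.crs
    records = [
        {"geometry": shape(geom), "value": value}
        for geom, value in shapes(band, mask=band > 0, transform=src.transform)
    ]

gdf = gpd.GeoDataFrame(records, crs=crs)
gdf.to_file("masks.gpkg", driver="GPKG")  # editable layer for GIS software
```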
In short, our study focused on demonstrating the potential of SAM's adaptability for the remote sensing domain, as well as presenting a novel, automated approach to retrain the model with one example derived from the text-based approach. While there is much to be explored, it is important to understand how the model works and how it could be improved upon. To summarize this discussion, there are many potential research directions and applications for SAM in remote sensing, which can be condensed as follows:
• Examining the most effective approaches and techniques for adapting SAM to cater to a variety of remote sensing data, including multispectral and hyperspectral data.
• Analyzing the potential of coupling SAM with few-shot or multi-shot learning, to enhance its adaptability and generalization capability across diverse remote sensing scenarios.
• Investigating potential ways to integrate SAM with prevalent remote sensing tools and platforms, such as Geographic Information Systems (GIS), to augment the versatility and utility of these systems.
• An issue particularly important for applications in the area of geospatial intelligence is AI security, where an adversary might, e.g., contaminate the training data for a (foundation) model.
• Assessing the performance and efficiency of SAM in real-time or near-real-time remote sensing applications to understand its capabilities for timely data processing and analysis.
• Exploring how domain-specific knowledge and expertise can be integrated into SAM to enhance its ability to understand and interpret remote sensing data.
• Evaluating the potential use of SAM as an alternative to traditional labeling processes and its integration with other image classification and segmentation techniques in a weakly-supervised manner to boost its accuracy and reliability.
• Integrating SAM with a super-resolution approach to enhance its capability to handle low-resolution imagery, thereby expanding the range of remote sensing imagery it can effectively analyze.
6. Conclusions

In this study, we conducted a comprehensive analysis of both the zero and one-shot capabilities of the Segment Anything Model (SAM) in the domain of remote sensing imagery processing, benchmarking it against aerial and satellite datasets. Our analysis provided insights into the operational performance and efficacy of SAM in the sphere of remote sensing segmentation tasks. We concluded that, while SAM exhibits notable promise, there is a tangible scope for improvement, specifically in managing its limitations and refining its performance for task-specific implementations.

In summary, our data indicated that SAM delivers notable performance when contrasted with the ground-truth masks, thereby underscoring its potential efficacy as a significant resource for remote sensing applications. Our evaluation reveals that the prompt capabilities of SAM (text, point, box, and general), combined with its ability to perform object segmentation with minimal human supervision, can also contribute to a significant reduction in annotation workload. This decrease in human input during the labeling phase may lead to expedited training schedules for other methods, thus promoting more streamlined and cost-effective workflows.

The chosen datasets were also selected with the express purpose of representing a broad and diverse context at varying scales, rather than exemplifying complex or challenging scenarios. By focusing on more straightforward datasets, the study homed in on the fundamental aspects of segmentation tasks, without the additional noise of overly complicated or intricate scenarios. In this sense, future research should be oriented towards improving SAM's capabilities and exploring its potential integration with other methods to address more complex and challenging remote sensing scenarios.

Nevertheless, despite the demonstrated generalization, there are certain limitations to be addressed. Under complex scenarios, the model faces challenges, leading to less optimal segmentation outputs by overestimating most of the objects' boundaries. Additionally, SAM's performance metrics display variability contingent on the spatial resolution of the input imagery (i.e., it is prone to more mistakes as the spatial resolution of the imagery is lowered). Consequently, identifying and rectifying these constraints is essential for further enhancing SAM's applicability within the remote sensing domain.

CRediT authorship contribution statement

Lucas Prado Osco: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing – original draft, Visualization. Qiusheng Wu: Methodology, Software, Writing – review & editing. Eduardo Lopes de Lemos: Data curation, Methodology. Wesley Nunes Gonçalves: Methodology, Writing – review & editing. Ana Paula Marques Ramos: Validation, Visualization, Writing – review & editing. Jonathan Li: Validation, Visualization, Writing – review & editing. José Marcato Junior: Validation, Supervision, Funding acquisition, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Here, we provide an open-access repository designed to facilitate the application of the Segment Anything Model (SAM) within the domain of remote sensing imagery. The incorporated codes and packages provide users the means to implement point and bounding box-based shapefiles in combination with the SAM.

The repositories also include notebooks that demonstrate how to apply the text-based prompt approach, alongside one-shot modifications of SAM. These resources aim to bolster the usability of the SAM approach in diverse remote sensing contexts, and can be accessed via the following online repositories: GitHub: AI-RemoteSensing (Osco, 2023) and GitHub: Segment-Geospatial (Wu and Osco, 2023).

Acknowledgments

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil - Finance Code 001. The authors are funded by the Support Foundation for the Development of Education, Science, Technology of the State of Mato Grosso do Sul, Brazil (FUNDECT; 71/009.436/2022), the Brazilian National Council for Scientific and Technological Development (CNPq; 433783/2018-4, 310517/2020-6; 405997/2021-3; 308481/2022-4; 305296/2022-1), and CAPES Print, Brazil (88881.311850/2018-01).

References

Adam, J.M., Liu, W., Zang, Y., Afzal, M.K., Bello, S.A., Muhammad, A.U., Wang, C., Li, J., 2023. Deep learning-based semantic segmentation of urban-scale 3D meshes in remote sensing: A survey. Int. J. Appl. Earth Obs. Geoinf. 121, 103365. http://dx.doi.org/10.1016/j.jag.2023.103365.

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K., 2022. Flamingo: a visual language model for few-shot learning. arXiv:2204.14198.

Aleissaee, A.A., Kumar, A., Anwer, R.M., Khan, S., Cholakkal, H., Xia, G.-S., Khan, F.S., 2023. Transformers in remote sensing: A survey. Remote Sens. 15 (7), 1860. http://dx.doi.org/10.3390/rs15071860.

Amani, M., Ghorbanian, A., Ahmadi, S.A., Kakooei, M., Moghimi, A., Mirmazloumi, S.M., Moghaddam, S.H.A., Mahdavi, S., Ghahremanloo, M., Parsian, S., Wu, Q., Brisco, B., 2020. Google earth engine cloud computing platform for remote sensing big data applications: A comprehensive review. IEEE J. Select. Top. Appl. Earth Observations Remote Sens. 13, 5326–5350. http://dx.doi.org/10.1109/jstars.2020.3021052.

Bai, Y., Zhao, Y., Shao, Y., Zhang, X., Yuan, X., 2022. Deep learning in different remote sensing image categories and applications: status and prospects. Int. J. Remote Sens. 43 (5), 1800–1847. http://dx.doi.org/10.1080/01431161.2022.2048319.

Benjdira, B., Bazi, Y., Koubaa, A., Ouni, K., 2019. Unsupervised domain adaptation using generative adversarial networks for semantic segmentation of aerial images. Remote Sens. 11 (11), 1369. http://dx.doi.org/10.3390/rs11111369.

Boguszewski, A., Batorski, D., Ziemba-Jankowska, N., Dziedzic, T., Zambrzycka, A., 2022. LandCover.ai: Dataset for automatic mapping of buildings, woodlands, water and roads from aerial imagery. arXiv:2005.02264.

Bressan, P.O., Junior, J.M., Martins, J.A.C., de Melo, M.J., Gonçalves, D.N., Freitas, D.M., Ramos, A.P.M., Furuya, M.T.G., Osco, L.P., de Andrade Silva, J., Luo, Z., Garcia, R.C., Ma, L., Li, J., Gonçalves, W.N., 2022. Semantic segmentation with labeling uncertainty and class imbalance applied to vegetation mapping. Int. J. Appl. Earth Obs. Geoinf. 108, 102690. http://dx.doi.org/10.1016/j.jag.2022.102690.

Chi, M., Plaza, A., Benediktsson, J.A., Sun, Z., Shen, J., Zhu, Y., 2016. Big data for remote sensing: Challenges and opportunities. Proc. IEEE 104 (11), 2207–2219. http://dx.doi.org/10.1109/jproc.2016.2598228.

de Carvalho, O.L.F., Júnior, O.A.d., e Silva, C.R., de Albuquerque, A.O., Santana, N.C., Borges, D.L., Gomes, R.A.T., Guimarães, R.F., 2022. Panoptic segmentation meets remote sensing. Remote Sens. 14 (4), 965. http://dx.doi.org/10.3390/rs14040965.

European Space Agency, 2023. SkySat - EOGateway. URL https://earth.esa.int/eogateway/missions/SkySat.

Gao, K., Chen, M., Narges Fatholahi, S., He, H., Xu, H., Marcato Junior, J., Nunes Gonçalves, W., Chapman, M.A., Li, J., 2021. A region-based deep learning approach to instance segmentation of aerial orthoimagery for building rooftop extraction. Geomatica 75 (3), 148–164. http://dx.doi.org/10.1139/geomat-2021-0009.

Gharibbafghi, Z., Tian, J., Reinartz, P., 2018. Modified superpixel segmentation for digital surface model refinement and building extraction from satellite stereo imagery. Remote Sens. 10 (11), 1824. http://dx.doi.org/10.3390/rs10111824.

Gómez, C., White, J.C., Wulder, M.A., 2016. Optical remotely sensed time series data for land cover classification: A review. ISPRS J. Photogramm. Remote Sens. 116, 55–72. http://dx.doi.org/10.1016/j.isprsjprs.2016.03.008.

Gonçalves, D.N., Marcato, J., Carrilho, A.C., Acosta, P.R., Ramos, A.P.M., Gomes, F.D.G., Osco, L.P., da Rosa Oliveira, M., Martins, J.A.C., Damasceno, G.A., de Araújo, M.S., Li, J., Roque, F., de Faria Peres, L., Gonçalves, W.N., Libonati, R., 2023. Transformers for mapping burned areas in Brazilian pantanal and amazon with PlanetScope imagery. Int. J. Appl. Earth Obs. Geoinf. 116, 103151. http://dx.doi.org/10.1016/j.jag.2022.103151.

Hossain, M.D., Chen, D., 2019. Segmentation for object-based image analysis (OBIA): A review of algorithms and challenges from remote sensing perspective. ISPRS J. Photogramm. Remote Sens. 150, 115–134. http://dx.doi.org/10.1016/j.isprsjprs.2019.02.009.

Hua, X., Wang, X., Rui, T., Shao, F., Wang, D., 2021. Cascaded panoptic segmentation method for high resolution remote sensing image. Appl. Soft Comput. 109, 107515. http://dx.doi.org/10.1016/j.asoc.2021.107515.

IDEA-Research, 2023. Grounded-segment-anything. URL https://github.com/IDEA-Research/Grounded-Segment-Anything.

Jozdani, S., Chen, D., Pouliot, D., Johnson, B.A., 2022. A review and meta-analysis of generative adversarial networks and their applications in remote sensing. Int. J. Appl. Earth Obs. Geoinf. 108, 102734. http://dx.doi.org/10.1016/j.jag.2022.102734.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.-Y., Dollár, P., Girshick, R., 2023. Segment anything. arXiv:2304.02643.

Kotaridis, I., Lazaridou, M., 2021. Remote sensing image segmentation advances: A meta-analysis. ISPRS J. Photogramm. Remote Sens. 173, 309–322. http://dx.doi.org/10.1016/j.isprsjprs.2021.01.020.

Li, X., Deng, J., Fang, Y., 2022a. Few-shot object detection on remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–14. http://dx.doi.org/10.1109/tgrs.2021.3051383.

Li, X., Ding, H., Zhang, W., Yuan, H., Pang, J., Cheng, G., Chen, K., Liu, Z., Loy, C.C., 2023a. Transformer-based visual segmentation: A survey. arXiv:2304.09854.

Li, K., Hu, X., Jiang, H., Shu, Z., Zhang, M., 2020. Attention-guided multi-scale segmentation neural network for interactive extraction of region objects from high-resolution satellite imagery. Remote Sens. 12 (5), 789. http://dx.doi.org/10.3390/rs12050789.

Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y., 2023b. UniFormer: Unifying convolution and self-attention for visual recognition. arXiv:2201.09450.

Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.-N., Chang, K.-W., Gao, J., 2022b. Grounded language-image pre-training. arXiv:2112.03857.

Liu, F., Chen, D., Guan, Z., Zhou, X., Zhu, J., Zhou, J., 2023a. RemoteCLIP: A vision language foundation model for remote sensing. arXiv:2306.11029.

Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L., 2023b. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv:2303.05499.

Lobry, S., Marcos, D., Murray, J., Tuia, D., 2020. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 58 (12), 8555–8566. http://dx.doi.org/10.1109/tgrs.2020.2988782.

Loshchilov, I., Hutter, F., 2017. SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983.

Ma, A., Wang, J., Zhong, Y., Zheng, Z., 2022. FactSeg: Foreground activation-driven small object semantic segmentation in large-scale remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 60, 1–16. http://dx.doi.org/10.1109/tgrs.2021.3097148.

Mai, G., Huang, W., Sun, J., Song, S., Mishra, D., Liu, N., Gao, S., Liu, T., Cong, G., Hu, Y., Cundy, C., Li, Z., Zhu, R., Lao, N., 2023. On the opportunities and challenges of foundation models for geospatial artificial intelligence. arXiv:2304.06798.

Martins, J.A.C., Nogueira, K., Osco, L.P., Gomes, F.D.G., Furuya, D.E.G., Gonçalves, W.N., Sant'Ana, D.A., Ramos, A.P.M., Liesenberg, V., dos Santos, J.A., de Oliveira, P.T.S., Junior, J.M., 2021. Semantic segmentation of tree-canopy in urban environment with pixel-wise deep learning. Remote Sens. 13 (16), 3054. http://dx.doi.org/10.3390/rs13163054.

Mialon, G., Dessí, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., Grave, E., LeCun, Y., Scialom, T., 2023. Augmented language models: a survey. arXiv:2302.07842.

Minaee, S., Boykov, Y.Y., Porikli, F., Plaza, A.J., Kehtarnavaz, N., Terzopoulos, D., 2021. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Machine Intell. 1. http://dx.doi.org/10.1109/tpami.2021.3059968.

OpenAI, 2023. GPT-4 technical report. arXiv:2303.08774.

Osco, L., 2023. AI-RemoteSensing: a collection of Jupyter and Google Colaboratory notebooks dedicated to leveraging artificial intelligence (AI) in remote sensing applications. http://dx.doi.org/10.5281/zenodo.8092269.

Osco, L.P., dos Santos de Arruda, M., Junior, J.M., da Silva, N.B., Ramos, A.P.M., Moryia, É.A.S., Imai, N.N., Pereira, D.R., Creste, J.E., Matsubara, E.T., Li, J., Gonçalves, W.N., 2020. A convolutional neural network approach for counting and geolocating citrus-trees in UAV multispectral imagery. ISPRS J. Photogramm. Remote Sens. 160, 97–106. http://dx.doi.org/10.1016/j.isprsjprs.2019.12.010.

Osco, L.P., Junior, J.M., Ramos, A.P.M., de Castro Jorge, L.A., Fatholahi, S.N., de Andrade Silva, J., Matsubara, E.T., Pistori, H., Gonçalves, W.N., Li, J., 2021a. A review on deep learning in UAV remote sensing. Int. J. Appl. Earth Obs. Geoinf. 102, 102456. http://dx.doi.org/10.1016/j.jag.2021.102456.

Osco, L.P., Nogueira, K., Ramos, A.P.M., Pinheiro, M.M.F., Furuya, D.E.G., Gonçalves, W.N., de Castro Jorge, L.A., Junior, J.M., dos Santos, J.A., 2021b. Semantic segmentation of citrus-orchard using deep neural networks and multispectral UAV-based imagery. Precis. Agric. 22 (4), 1171–1188. http://dx.doi.org/10.1007/s11119-020-09777-5.

Powers, D.M.W., 2020. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv:2010.16061.

Qurratulain, S., Zheng, Z., Xia, J., Ma, Y., Zhou, F., 2023. Deep learning instance segmentation framework for burnt area instances characterization. Int. J. Appl. Earth Obs. Geoinf. 116, 103146. http://dx.doi.org/10.1016/j.jag.2022.103146.

Rahman, M.A., Wang, Y., 2016. Optimizing intersection-over-union in deep neural networks for image segmentation. In: Advances in Visual Computing. Springer International Publishing, pp. 234–244. http://dx.doi.org/10.1007/978-3-319-50835-1_22.

Song, Y., Kalacska, M., Gašparović, M., Yao, J., Najibi, N., 2023. Advances in geocomputation and geospatial artificial intelligence (GeoAI) for mapping. Int. J. Appl. Earth Obs. Geoinf. 120, 103300. http://dx.doi.org/10.1016/j.jag.2023.103300.

Su, H., Wei, S., Yan, M., Wang, C., Shi, J., Zhang, X., 2019. Object detection and instance segmentation in remote sensing imagery based on precise mask R-CNN. In: IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium. IEEE, pp. 1454–1457. http://dx.doi.org/10.1109/igarss.2019.8898573.

Sun, X., Wang, B., Wang, Z., Li, H., Li, H., Fu, K., 2021. Research progress on few-shot learning for remote sensing image interpretation. IEEE J. Select. Top. Appl. Earth Observations Remote Sens. 14, 2387–2402. http://dx.doi.org/10.1109/jstars.2021.3052869.

Tong, X.-Y., Xia, G.-S., Lu, Q., Shen, H., Li, S., You, S., Zhang, L., 2020. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 237, 111322. http://dx.doi.org/10.1016/j.rse.2019.111322.

Toth, C., Jóźków, G., 2016. Remote sensing platforms and sensors: A survey. ISPRS J. Photogramm. Remote Sens. 115, 22–36. http://dx.doi.org/10.1016/j.isprsjprs.2015.10.004.

Wang, S., Chen, W., Xie, S.M., Azzari, G., Lobell, D.B., 2020a. Weakly supervised deep learning for segmentation of remote sensing imagery. Remote Sens. 12 (2), 207. http://dx.doi.org/10.3390/rs12020207.

Wang, Y., Lv, H., Deng, R., Zhuang, S., 2020b. A comprehensive survey of optical remote sensing image segmentation methods. Can. J. Remote Sens. 46 (5), 501–531. http://dx.doi.org/10.1080/07038992.2020.1805729.

Wang, J., Zheng, Z., Ma, A., Lu, X., Zhong, Y., 2022. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv:2110.08733.

Wu, Z., Hou, B., Ren, B., Ren, Z., Wang, S., Jiao, L., 2021. A deep detection network based on interaction of instance segmentation and object detection for SAR images. Remote Sens. 13 (13), 2582. http://dx.doi.org/10.3390/rs13132582.

Wu, Q., Osco, L.P., 2023. samgeo: A Python package for segmenting geospatial data with the Segment Anything Model (SAM). Zenodo, http://dx.doi.org/10.5281/ZENODO.7966658.

Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., Duan, N., 2023. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv:2303.04671.

Xu, Y., Bai, T., Yu, W., Chang, S., Atkinson, P.M., Ghamisi, P., 2023. AI security for geoscience and remote sensing: Challenges and future trends. IEEE Geosci. Remote Sens. Mag. 11 (2), 60–85. http://dx.doi.org/10.1109/mgrs.2023.3272825.

Yang, D., Li, Z., Xia, Y., Chen, Z., 2015. Remote sensing image super-resolution: Challenges and approaches. In: 2015 IEEE International Conference on Digital Signal Processing (DSP). IEEE, pp. 196–200. http://dx.doi.org/10.1109/icdsp.2015.7251858.

Yuan, Q., Shen, H., Li, T., Li, Z., Li, S., Jiang, Y., Xu, H., Tan, W., Yang, Q., Wang, J., Gao, J., Zhang, L., 2020. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 241, 111716. http://dx.doi.org/10.1016/j.rse.2020.111716.

Yuan, X., Shi, J., Gu, L., 2021. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 169, 114417. http://dx.doi.org/10.1016/j.eswa.2020.114417.

Zhang, J., Huang, J., Jin, S., Lu, S., 2023a. Vision-language models for vision tasks: A survey. arXiv:2304.00685.

Zhang, R., Jiang, Z., Guo, Z., Yan, S., Pan, J., Dong, H., Gao, P., Li, H., 2023b. Personalize segment anything model with one shot. arXiv:2305.03048.

Zhang, X., Jin, J., Lan, Z., Li, C., Fan, M., Wang, Y., Yu, X., Zhang, Y., 2020. ICENET: A semantic segmentation deep network for river ice by fusing positional and channel-wise attentive features. Remote Sens. 12 (2), 221. http://dx.doi.org/10.3390/rs12020221.

Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L.M., Shum, H.-Y., 2022a. DINO: DETR with improved DeNoising anchor boxes for end-to-end object detection. arXiv:2203.03605.

Zhang, R., Li, G., Wunderlich, T., Wang, L., 2021. A survey on deep learning-based precise boundary recovery of semantic segmentation for images and point clouds. Int. J. Appl. Earth Obs. Geoinf. 102, 102411. http://dx.doi.org/10.1016/j.jag.2021.102411.

Zhang, H., Zhang, P., Hu, X., Chen, Y.-C., Li, L.H., Dai, X., Wang, L., Yuan, L., Hwang, J.-N., Gao, J., 2022b. GLIPv2: Unifying localization and vision-language understanding. arXiv:2206.05836.

Zhang, Z., Zhang, Q., Hu, X., Zhang, M., Zhu, D., 2023c. On the automatic quality assessment of annotated sample data for object extraction from remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 201, 153–173. http://dx.doi.org/10.1016/j.isprsjprs.2023.05.026.

Zheng, Z., Zhong, Y., Wang, J., Ma, A., 2020. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. arXiv:2011.09766.

Zia, U., Riaz, M.M., Ghafoor, A., 2022. Transforming remote sensing images to textual descriptions. Int. J. Appl. Earth Obs. Geoinf. 108, 102741. http://dx.doi.org/10.1016/j.jag.2022.102741.
