

FoodSAM: Any Food Segmentation


Xing Lan, Jiayi Lyu, Hanyu Jiang, Kun Dong, Zehai Niu, Yi Zhang, Jian Xue
School of Engineering Science, University of Chinese Academy of Sciences
{lanxing19, lyujiayi21, jianghanyu23, dongkun22, niuzehai18, zhangyi214}@mails.ucas.ac.cn
[email protected]

Abstract—In this paper, we explore the zero-shot capability of the Segment Anything Model (SAM) for food image segmentation. To address the lack of class-specific information in SAM-generated masks, we propose a novel framework, called FoodSAM. This innovative approach integrates the coarse semantic mask with SAM-generated masks to enhance semantic segmentation quality. Besides, we recognize that the ingredients in food can be regarded as independent individuals, which motivated us to perform instance segmentation on food images. Furthermore, FoodSAM extends its zero-shot capability to encompass panoptic segmentation by incorporating an object detector, which enables FoodSAM to effectively capture non-food object information. Drawing inspiration from the recent success of promptable segmentation, we also extend FoodSAM to promptable segmentation, supporting various prompt variants. Consequently, FoodSAM emerges as an all-encompassing solution capable of segmenting food items at multiple levels of granularity. Remarkably, this pioneering framework stands as the first-ever work to achieve instance, panoptic, and promptable segmentation on food images. Extensive experiments demonstrate the feasibility and impressive performance of FoodSAM, validating SAM's potential as a prominent and influential tool within the domain of food image segmentation.

Index Terms—Segment Anything Model, Food Recognition, Promptable Segmentation, Zero-Shot Segmentation

(This work was supported by the National Natural Science Foundation of China (62032022, 61929104, 62027827, 61972375) and the Scientific Research Program of Beijing Municipal Education Commission (KZ201911417048). Corresponding author: Jian Xue.)

I. INTRODUCTION

The landscape of natural language processing [1]–[3] has been revolutionized by the emergence of large language models [4]–[6] trained on vast web datasets. Notably, these models showcase impressive zero-shot generalization capabilities, enabling them to transcend their original training domain and exhibit proficiency across a spectrum of tasks and data distributions. Turning to the domain of computer vision, the recent unveiling of the Segment Anything project (SAM) by Meta AI [7] introduces a groundbreaking promptable segmentation task designed to train a robust vision foundation model. This ambitious work represents a significant advancement towards achieving comprehensive cognitive recognition of all objects in the world. The SAM project aims to investigate interactive segmentation challenges while effectively accommodating real-world constraints.

SAM demonstrates significant performance on various segmentation benchmarks, showcasing impressive zero-shot transfer capabilities on 23 diverse segmentation datasets [7]. In this paper, we investigate the zero-shot capability of SAM for the domain of food image segmentation, a pivotal task within the field of food computing [8]–[10]. However, the vanilla mask generated by SAM alone does not produce satisfactory results, primarily due to its inherent deficiency in capturing class-specific information within the generated masks. Moreover, compared to semantic segmentation on general object images, food image segmentation is more challenging due to the large diversity in food appearances and the imbalanced distribution of ingredient categories [11]. Consequently, accurately distinguishing the category and attributes of food items via SAM becomes a challenging task.

To this end, we propose a novel zero-shot framework, named FoodSAM (code available at https://github.com/jamesjg/FoodSAM), to incorporate the original semantic mask with SAM-generated category-agnostic masks. While SAM demonstrates remarkable food image segmentation capabilities, it is impeded by the absence of class-specific information. Conversely, the conventional segmentation method preserves category information, albeit with a trade-off in segmentation quality. To improve semantic segmentation quality, we advocate for the fusion of the original segmentation output with SAM's generated masks. Ascertaining a mask's category through the identification of its predominant elements is a novel and effective approach to enhance the semantic segmentation process.

Moreover, since the ingredients in food are randomly cut and placed, they can be regarded as independent individuals, which motivates us to implement instance segmentation on food images. The masks produced by SAM are therefore inherently associated with individual instances, forming the foundation upon which we execute instance segmentation for food images.

Additionally, it is noteworthy that food images contain various non-food objects, such as forks, spoons, glasses, and dining tables. These objects are not ingredients of the food, but they are still important because they reflect attributes of the food. To capture them, FoodSAM introduces object detection methodologies [12]–[14] to detect the non-food objects in the background. By combining them with the otherwise unused background masks generated by SAM, FoodSAM assigns object category labels to those masks as semantic labels. In this manner, when coupled with our established instance segmentation methodology, the proposed framework enables the successful achievement of panoptic segmentation on food images.
Fig. 1. FoodSAM emerges as an all-encompassing solution capable of segmenting food items at multiple levels of granularity. The different segmentation visualizations are shown from left to right: input image, semantic, instance, panoptic, and promptable, respectively.

Drawing inspiration from the SAM project, a noteworthy addition to our study involves the "prompt to food image segmentation" task, a fresh task proposed by SAM that augments the breadth of our investigation. We design a simple but effective method on top of naive object detection to make FoodSAM support promptable segmentation. In the object detector, we convert the prompt-learning approach into a prompt-prior selection and extend support to diverse prompt variants, such as point, box, and mask prompts. We select objects of interest according to whether a point falls inside them, a box covers them, or a mask overlaps them. Finally, combined with SAM's promptable segmentation and the original semantic mask, we also achieve promptable segmentation across multiple levels of granularity on both food and non-food objects.

Consequently, we present an all-encompassing method that can achieve any segmentation for food, as shown in Fig. 1. In the literature on food image segmentation, this is the first work to accomplish instance segmentation, panoptic segmentation, and promptable segmentation. Through a comprehensive evaluation on the FoodSeg103 [11] and UECFoodPix Complete [15] benchmarks, FoodSAM outperforms the state-of-the-art methods on both datasets, highlighting the exceptional potential of SAM as an influential tool for food image segmentation. In instance segmentation, our method achieves high-quality segmentation of individual ingredients, and in panoptic segmentation, non-food objects are also well segmented with semantic labels.

The contributions of the paper are summarized as follows:
• We present a novel zero-shot framework, named FoodSAM, which exhibits the unique capacity to perform food segmentation across diverse granularity levels. Significantly, our work represents the first exploration of applying SAM to the domain of food image segmentation, effectively extending its zero-shot capabilities.
• To the best of our knowledge, this study stands as the first work to accomplish instance segmentation, panoptic segmentation, and promptable segmentation on food images.
• Experiments demonstrate the feasibility of our FoodSAM, which outperforms the state-of-the-art methods on both the FoodSeg103 and UECFoodPix Complete datasets. Furthermore, FoodSAM performs better than other SAM variants on any food segmentation task.

II. RELATED WORK

A. Foundation Model

Foundation models, trained on broad data for adapting to diverse downstream tasks, have driven recent advances in machine learning. This paradigm often incorporates techniques like self-supervised learning, transfer learning, and prompt tuning.
Natural language processing has particularly benefited, with the Generative Pre-trained Transformer [16], [17] series, pre-trained on massive text corpora, enabling models [18], [19] to tackle translation, question answering, and other applications. Contrastive Language-Image Pre-training [20], [21], trained on image-text pairs, can effectively retrieve images given text prompts, enabling image classification and generation applications. While most approaches extract knowledge from available data, the Segment Anything Model [7] uniquely co-develops alongside model-in-the-loop annotation to construct a custom data engine with over 1 billion masks, conferring strong generalization. These foundation models [22]–[24] have achieved state-of-the-art performance across domains. The paradigm shows immense promise to profoundly advance machine learning across disciplines.

B. Segmentation Task

Image segmentation [25], [26] encompasses various techniques to delineate distinct regions based on specified criteria. Each approach possesses unique characteristics and applications. Interactive segmentation leverages user guidance to enhance accuracy [27], [28], with users providing foreground/background markers to steer the algorithm. Superpixel methods group pixels into larger units called superpixels based on shared attributes like color and texture [29]; this simplifies the image representation while retaining key structural aspects. Additional prevalent methods include thresholding, edge-based, region-based, and graph-based segmentation, exploiting intensity, discontinuity, similarity, and connectivity cues, respectively [30]. The optimal technique depends on factors like task goals, data properties, and required outputs. Segmentation [31] remains an active area of research, with existing methods limited in fully handling real-world complexity.

Semantic segmentation represents a comprehensive approach in which each pixel is classified into a particular category, effectively partitioning the image according to semantic entities [32]–[34]. Recent methods like DANet [35], OCNet [36], and SETR [37] focus on extracting richer feature representations. Building upon semantic segmentation, instance segmentation [38] also delineates individual objects of the same class as distinct instances [39], [40]. Panoptic segmentation unifies semantic and instance segmentation, assigning each pixel both a class label and a unique instance identifier if it is part of a segmented object [41]–[43]. This provides a holistic scene understanding by categorizing and differentiating all present elements. Each formulation provides unique information: semantic segmentation conveys categorical regions, instance segmentation delineates object quantities, and panoptic segmentation characterizes both detailed semantics and individual entity counts.

Promptable segmentation has emerged as a versatile paradigm leveraging recent advancements in natural language models [44], [45]. This approach employs careful "prompt engineering" to guide the model toward desired outputs [46], [47], departing from traditional multi-task frameworks with predefined tasks. At inference, promptable models adapt to new tasks using natural language prompts as context [44]. The Segment Anything Model [7] exemplifies this, training on 11 million images with 1.1 billion masks to segment objects specified in user prompts without additional fine-tuning. While SAM shows promising zero-shot generalization, a key limitation is the lack of semantic meaning in its mask predictions.

C. Food Segmentation

Food image segmentation constitutes an essential and indispensable technology for enabling health applications such as estimating recipes, nutrition [48], and caloric content [49]. Compared to semantic segmentation of general objects, food image segmentation poses greater challenges due to the immense diversity in food appearances [50] and the often imbalanced distribution [51] of ingredient categories. For images containing multiple foods, segmentation represents a necessary precursor for dietary assessment systems. Segmenting overlapping, amorphous, or low-contrast foods lacking distinct color or texture features proves highly challenging. Furthermore, lighting conditions introducing shadows and reflections can adversely impact segmentation performance. Overall, the complex visual properties and arrangements of foods render this domain exceptionally demanding. Advanced segmentation techniques capable of handling food variability in unconstrained environments remain imperative for practical deployment.

Wu et al. [11] proposed ReLeM to reduce the high intra-class variance of ingredients stemming from diverse cooking methods by integrating it into semantic segmentation models [37], [52]. Wang et al. [53] combined a Swin Transformer and a PPM module in their STPPN model to achieve state-of-the-art food segmentation performance. Honbu et al. [54] explored zero-shot and few-shot segmentation techniques in USFoodSeg for unseen food categories. Sinha et al. [55] benchmarked Transformer and convolutional backbones for transferring visual knowledge to food segmentation. While these approaches have driven progress, semantic segmentation provides only categorical predictions without differentiating individual food items.

To the best of our knowledge, no prior work has explored promptable or instance segmentation for food images, and no existing datasets support instance-level or panoptic segmentation for food images. Semantic segmentation only provides categorical predictions without differentiating individual food items. In contrast, instance segmentation can delineate and count distinct food objects, enabling more accurate nutrition and calorie estimation. Furthermore, panoptic segmentation can characterize the surrounding environment, discerning attributes like the food's container and utensils. Such contextual cues provide meaningful signals about food properties and consumption habits. Therefore, advancing instance and panoptic food segmentation represents an important direction, as these task formulations are more informative than semantic segmentation alone for downstream food computing applications.

III. METHODOLOGY

A. Preliminary

Revisit of SAM: The Segment Anything Model (SAM) [7] represents the first application of foundation models to the image segmentation domain.

Fig. 2. The overview of Segment Anything Model (SAM) [7]. SAM contains three components: image encoder, prompt encoder, and mask decoder.

As shown in Fig. 2, the model comprises three key components: an image encoder, a prompt encoder, and a lightweight mask decoder. The image encoder implements a computationally intensive vision transformer architecture with millions of parameters to effectively extract salient visual features from the input image. SAM provides three scale-specific pre-trained image encoder configurations: ViT-B (91M parameters), ViT-L (308M parameters), and ViT-H (636M parameters) [56], [57]. The prompt encoder accepts four types of textual or spatial inputs: points, boxes, free-form text, and existing masks. Points and boxes are represented using positional encodings [58], text is encoded using a pre-trained text encoder from the CLIP model [20], and mask inputs are embedded using convolutions. These prompt embeddings are summed element-wise with the image features. The mask decoder employs a Transformer-based architecture, applying self-attention to the prompt and cross-attention between the prompt and the image encoder outputs. This is followed by a dynamic mask prediction head that outputs pixel-level mask probabilities and predicted Intersection over Union (IoU) scores; transposed convolutions upsample the mask decoder features. Critically, the mask decoder can generate multiple mask outputs to account for inherent ambiguity in the prompt, and the default configuration predicts three masks per prompt. Notably, the image encoder extracts features only once per input image, allowing the cached image embedding to be reused across different prompts for the same image. This separation of the expensive image inference from the lightweight prompt interaction enables novel interactive use cases such as real-time mobile augmented reality prompting. SAM was trained on a large-scale dataset of over 11 million images with 1 billion masks, yielding strong zero-shot transfer capabilities. As the name suggests, SAM can segment virtually any concept, even completely novel objects unseen during training.
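As a concrete illustration of this whole-image ("segment everything") mode, the sketch below generates category-agnostic candidate masks with the publicly released segment-anything package. The checkpoint file name, image path, and device are placeholder assumptions, and the generator is left at its default sampling hyperparameters rather than matching any particular FoodSAM configuration.

```python
# Minimal sketch: produce the class-agnostic candidate masks with SAM's
# automatic mask generator. Paths and device are placeholder assumptions.
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")

# Dense grid-point prompting over the whole image ("segment everything").
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("food_image.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts, one per mask

# Each entry holds a boolean mask plus metadata.
for m in masks[:3]:
    print(m["segmentation"].shape,  # (H, W) boolean array
          m["area"],                # number of positive pixels
          m["predicted_iou"],       # SAM's own quality estimate
          m["point_coords"])        # grid point that prompted this mask
```

Metadata of this kind (prompt point, area, predicted IoU) is exactly what the later stages described in this section rely on when assigning labels and resolving overlaps.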
category match. Furthermore, due to the ingredients of food
Variants of SAM: Very recently, there are serval concurrent
being randomly cut and placed when cooking, they are sup-
works proposed to address the limitations of SAM. Here we
posed as independent individuals, which motivate us to achieve
introduce them by following.
instance segmentation for food images. Besides, to segment
RAM [59] is an innovative image tagging foundational fine-grained objects apart from the background, FoodSAM

Fig. 3. The overview of our framework. FoodSAM contains three basic models: SAM, semantic segmenter, and object detector. SAM generates many class-
agnostic binary masks, the semantic segmenter provides food category labels via mask-category match, and the object detector provides the non-food class
for background masks. It then enhances the semantic mask via merge strategy and produces instance and panoptic results. Moreover, a seamless prompt-prior
selection is integrated into the object detector to achieve promptable segmentation.

Besides, to segment fine-grained objects apart from the background, FoodSAM contains an object detector for detecting non-food objects such as tables, plates, and spoons. Therefore, FoodSAM can also achieve high-quality panoptic segmentation for food images. Inspired by prompt learning, we also introduce promptable segmentation to food images, which is a novel task proposed by SAM. In the object detector, we convert SAM's prompt-learning approach into a prompt-prior selection, which likewise supports point, box, and mask prompts. In this way, FoodSAM supports interactive prompt-based segmentation across multiple levels of granularity on food images.

C. FoodSAM

FoodSAM consists of three main models: the segment anything model Ma, a semantic segmentation module Ms, and an object detector Md, together with novel and seamless integration methods, namely the mask-category match, the merge strategy, and the prompt-prior selection. As shown in Fig. 3, we describe them in the order of the pipeline: enhanced semantic segmentation, semantic-to-instance segmentation, and instance-to-panoptic segmentation. Promptable segmentation is inserted into this pipeline with its corresponding prompts.

Enhanced Semantic Segmentation: Suppose the input food image I ∈ R^(H×W) is forwarded through the semantic segmentation module Ms, which outputs the semantic mask m_s = Ms(I); m_s has the same spatial shape as I, and each pixel value corresponds to the category of that pixel. SAM Ma forwards the image I and outputs the binary candidate masks m_a = Ma(I), m_a ∈ R^(K×H×W), where K is the number of masks and each pixel value is True or False, indicating whether the pixel belongs to the foreground or the background.

In the mask-category match, we use the i-th binary mask m_a^i as indices to obtain the i-th local semantic value set D_i = m_s[m_a^i]. Then, using a voting scheme, we choose the most frequent semantic value s_i in D_i as the semantic label of the i-th mask m_a^i. Meanwhile, we also calculate the confusion degree of the mask's category as d_i = Count(s_i, D_i) / Len(D_i), where d_i denotes the confusion degree of the i-th mask whose semantic label is s_i, the Count function counts the occurrences of the label s_i in the set D_i, and the Len function counts the number of positive values of the mask (its area). We filter out masks whose d_i is lower than a threshold τ, which means the mask label is confused and unstable.

Therefore, with the merge strategy, we obtain the enhanced semantic mask m_s^e by merging the K binary masks m_a^i with their semantic labels s_i. In this strategy, we take into consideration the conflicting areas where masks overlap. We first sort the masks by their area size Len(D_i) and paint them onto the original semantic mask from large to small, which preserves fine object results. This process fully takes advantage of the high-quality masks of SAM and the semantic labels of the semantic segmenter.
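The following NumPy sketch illustrates the mask-category match and the merge strategy just described, assuming the coarse semantic mask and the SAM masks are already available as arrays; the function name and the default threshold value are illustrative assumptions rather than the exact FoodSAM implementation.

```python
import numpy as np

def enhance_semantic_mask(coarse: np.ndarray, sam_masks: list[np.ndarray], tau: float = 0.5):
    """Assign each SAM mask a category by majority vote over the coarse semantic
    mask, drop masks with a confused label (d_i < tau), and paint the survivors
    onto a copy of the coarse mask from largest to smallest area."""
    labeled = []
    for m in sam_masks:                      # m: (H, W) boolean array
        votes = coarse[m]                    # D_i = m_s[m_a^i]
        if votes.size == 0:
            continue
        values, counts = np.unique(votes, return_counts=True)
        s_i = values[np.argmax(counts)]      # most frequent semantic value
        d_i = counts.max() / votes.size      # confusion degree Count(s_i, D_i) / Len(D_i)
        if d_i >= tau:
            labeled.append((votes.size, m, int(s_i)))

    enhanced = coarse.copy()
    # Sort by area and paint large -> small so small fine-grained masks stay on top.
    for _, m, s_i in sorted(labeled, key=lambda x: x[0], reverse=True):
        enhanced[m] = s_i
    return enhanced, labeled
```

Painting from large to small means a plate-sized mask is written first and a small garnish mask written afterwards overrides it, which is the ordering that performs best in the ablation of Section IV.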
Semantic-to-Instance Segmentation: In the enhancement process, we obtain a semantic label for each binary mask. Since the food is randomly cut and placed during cooking, we regard the ingredients as independent individuals. We merge small masks into a nearby mask that has the same category label, and filter out very small masks whose locations are isolated. After filtering out the binary masks that correspond to background categories, we obtain the foreground masks.
For those foreground semantic labels, the t-th binary mask m_a^t with the voted semantic label s_t is taken as the t-th instance mask m_i^t, where t is the index of the foreground mask (the instance id). Finally, we obtain the instance mask m_i by merging the T masks m_i^t together with their instance ids.
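A simplified sketch of this step is given below, under the assumption that the labeled masks come from the previous stage; the merging of small neighboring same-category masks is omitted for brevity, and only the background and minimum-area filters are shown, with the threshold value chosen purely for illustration.

```python
import numpy as np

def build_instances(labeled_masks, background_ids=(0,), min_area=500):
    """Turn labeled SAM masks into an instance map.

    labeled_masks: list of (area, mask, label) tuples from the enhancement stage.
    Returns an int32 (H, W) map of instance ids (0 = no instance) and the
    per-instance category labels."""
    if not labeled_masks:
        raise ValueError("no masks to convert into instances")
    h, w = labeled_masks[0][1].shape
    instance_map = np.zeros((h, w), dtype=np.int32)
    instance_labels = {}

    t = 0
    for area, mask, label in sorted(labeled_masks, key=lambda x: x[0], reverse=True):
        if label in background_ids or area < min_area:
            continue                      # background or isolated tiny mask: skip
        t += 1
        instance_map[mask] = t            # later (smaller) masks overwrite earlier ones
        instance_labels[t] = label
    return instance_map, instance_labels
```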
Instance-to-Panoptic Segmentation: In a food image, the non-food objects, such as tables, plates, and spoons, are also important, as they indicate attributes of the food. Here we likewise regard the non-food objects as independent individuals by introducing the object detector Md. The object detector Md forwards the image I and outputs the non-food bounding boxes Bd with their corresponding class labels Cd. In whole-image mask generation, SAM uses dense grid points as the prompt, and we retain the K points corresponding to m_a. We take the prompt point of each SAM-generated mask to select non-food object candidates by judging whether it is located inside a bounding box in Bd. We then collect the binary masks whose semantic label is background and calculate the IoUs between the non-food candidate bounding boxes and each such binary mask. When the highest IoU is larger than another threshold, we take the corresponding class as the category of that mask; otherwise, the mask preserves its background label. Non-food masks are also merged following the semantic-to-instance practice. Finally, we obtain the panoptic mask m_p by merging the instance mask m_i and the non-food objects' semantic masks.
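The box-matching step for background masks reduces to a few geometric operations; the sketch below is a minimal illustration in which the IoU threshold and the bounding-box extraction from a binary mask are assumptions, not values taken from the paper.

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> np.ndarray:
    # Assumes a non-empty boolean mask; returns its tight xyxy bounding box.
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def assign_nonfood_labels(background_masks, det_boxes, det_classes, iou_thr=0.5):
    """For every background SAM mask, adopt the detector class whose box best
    overlaps the mask's bounding box; otherwise keep the background label."""
    assigned = []
    for mask in background_masks:
        mbox = mask_to_box(mask)
        ious = [box_iou(mbox, np.asarray(b, dtype=np.float32)) for b in det_boxes]
        if ious and max(ious) > iou_thr:
            assigned.append(det_classes[int(np.argmax(ious))])
        else:
            assigned.append(None)  # stays background
    return assigned
```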
Promptable Segmentation: SAM proposes a novel task, promptable segmentation, which enables interactive prompt-based segmentation. In FoodSAM, we also introduce promptable segmentation to food images and further estimate the food category. By leveraging the original semantic segmentation, we first segment the food objects of interest with SAM and can thus achieve interactive prompt-based segmentation at the semantic and instance levels. Moreover, we propose a novel design for the object detector Md to support promptable detection across multiple levels of granularity. In the object detector, we convert SAM's prompt-learning approach into a prompt-prior selection for the object of interest. The concurrent work FastSAM [62] also introduces a selection strategy, but it is applied after segmentation has finished.
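On the food side, the interactive prompts are forwarded to SAM itself. A minimal sketch with the released SamPredictor interface is shown below; the checkpoint name, prompt coordinates, and device are placeholders, and the returned masks would subsequently be labeled with the original semantic mask as described earlier in this section.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to("cuda")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("food_image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)  # image embedding is computed once and cached

# Point prompt: one foreground click on an ingredient (label 1 = foreground).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True)          # SAM returns several candidate masks

# Box prompt: an xyxy box around a dish or utensil.
box_masks, box_scores, _ = predictor.predict(
    box=np.array([100, 80, 400, 360]),
    multimask_output=False)

best = masks[np.argmax(scores)]     # pick the highest-scoring candidate
```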
Different from FastSAM, we use the prompt-prior selection to segment the non-food objects in the background. It mainly involves point prompts, box prompts, and mask prompts, and the selected non-food objects in the background are all assigned their category labels from the object detector. For the point prompt, we first retain the detected objects' boxes as prior boxes, check whether the point lies inside a prior box, and select the prior box whose center is closest to the point. For the box prompt, we perform IoU matching between the prompted box and the boxes detected by the object detector; the goal is to find the prior box with the highest IoU against the prompted box. For the mask prompt, we sample points inside the mask, check whether they lie inside a detected box, and then follow the same procedure as the point prompt. Specifically, for the regular prompt (a mask with the same shape as the input image), nearly all non-food information, i.e., the K points, is used for the subsequent segmentation. Consequently, our FoodSAM achieves promptable segmentation on both food and non-food objects across multiple levels of granularity.
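The prompt-prior selection for non-food objects then reduces to simple geometric tests against the detected boxes. The sketch below covers the point- and box-prompt cases under the assumptions just described; the mask prompt can be handled by sampling points from the mask and reusing the point case.

```python
import numpy as np

def box_iou(a, b):
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def select_prior_by_point(point, det_boxes):
    """Index of the detected box containing the point with the closest center, else None."""
    px, py = point
    best, best_dist = None, float("inf")
    for i, (x0, y0, x1, y1) in enumerate(det_boxes):
        if x0 <= px <= x1 and y0 <= py <= y1:
            dist = (px - (x0 + x1) / 2) ** 2 + (py - (y0 + y1) / 2) ** 2
            if dist < best_dist:
                best, best_dist = i, dist
    return best

def select_prior_by_box(prompt_box, det_boxes):
    """Index of the detected box with the highest IoU against the prompted box, else None."""
    ious = [box_iou(prompt_box, b) for b in det_boxes]
    return int(np.argmax(ious)) if ious else None

# Example: two detected non-food boxes (xyxy) and a click on the second one.
boxes = [(10, 10, 120, 90), (150, 40, 300, 200)]
print(select_prior_by_point((200, 100), boxes))          # -> 1
print(select_prior_by_box((140, 30, 310, 210), boxes))   # -> 1
```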
IV. EXPERIMENTS

A. Experiment Settings

Datasets: UECFoodPix Complete [15] was released in 2020 by the University of Electro-Communications. It includes 102 dishes and comprises 9000 training images as well as 1000 test images, and semantic labels are provided for each food item with 103 class labels. The segmentation masks were obtained semi-automatically using GrabCut, which segments images based on user-initialized seeding [63]; the automatically generated masks were further refined by human annotators based on a set of predefined rules [64].

FoodSeg103 [11] is a recent dataset designed for food image segmentation, consisting of 7,118 images depicting 730 dishes. FoodSeg103 aims to annotate dishes at a more fine-grained level, capturing the characteristics of each dish's individual ingredients. Specifically, the training set contains 4,983 images with 29,530 ingredient masks, while the testing set contains 2,135 images with 12,567 ingredient masks, all obtained through manual annotation. Compared to UECFoodPix Complete, FoodSeg103 proves to be a more challenging benchmark for food image segmentation. Further, unlike UECFoodPix Complete, which covers entire dishes but lacks fine-grained annotation for individual dish components, FoodSeg103 annotates each dish's ingredients individually.

Implementation Details: We conduct experiments on the two datasets mentioned above with an NVIDIA GeForce RTX 3090 GPU. For the components of FoodSAM, we use ViT-H [56] as SAM's image encoder with the same hyperparameters as the original paper. For the object detector, we use UniDet [14] with its Unified-learned COIM RS200 model. For semantic segmentation on FoodSeg103, we use SETR [37] as the baseline, with ViT-B/16 as the encoder and MLA as the decoder, using the checkpoint provided by the official GitHub repository. On UECFoodPix Complete, we take DeepLabV3+ [65] as the baseline and retrain the checkpoint with the same hyperparameters as reported in the paper.

Evaluation Metrics: We evaluate the performance of our model with common metrics, i.e., mIoU (mean IoU over all classes), mAcc (mean accuracy over all classes), and aAcc (accuracy over all pixels). mIoU is a standard indicator in semantic segmentation that assesses the overlap and union between inference and ground truth, defined as follows:

    mIoU = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FP_i + FN_i)    (1)

where N is the number of classes and TP_i, FP_i, and FN_i are described as follows:
• True Positive (TP_i) represents the number of pixels that are correctly classified as class i.
• False Positive (FP_i) denotes the number of pixels that are wrongly classified as class i.
• False Negative (FN_i) is the number of pixels that are wrongly classified as other classes while their true label is class i.
mAcc is the average accuracy over all categories. For a dataset with N classes, it can be formulated as:

    mAcc = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FN_i)    (2)
aAcc directly calculates the ratio of all pixels that are correctly classified, which can be described as:

    aAcc = Σ_{i=1}^{N} TP_i / Σ_{i=1}^{N} (TP_i + FN_i)    (3)
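As a concrete reference for Eqs. (1)–(3), the following sketch computes the three metrics from a confusion matrix accumulated over predicted and ground-truth label maps; it is an illustrative implementation, not the exact evaluation code behind the reported numbers.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """(N, N) confusion matrix; rows are ground-truth classes, columns predictions."""
    idx = gt.astype(np.int64) * num_classes + pred.astype(np.int64)
    return np.bincount(idx.ravel(), minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(conf: np.ndarray):
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp          # predicted as class i but actually another class
    fn = conf.sum(axis=1) - tp          # labeled class i but predicted as another class
    eps = 1e-9
    iou_per_class = tp / (tp + fp + fn + eps)
    acc_per_class = tp / (tp + fn + eps)
    return {
        "mIoU": float(iou_per_class.mean()),            # Eq. (1)
        "mAcc": float(acc_per_class.mean()),            # Eq. (2)
        "aAcc": float(tp.sum() / (conf.sum() + eps)),   # Eq. (3): correct pixels / all pixels
    }

# Example with random label maps over 104 classes (103 ingredient classes + background).
pred = np.random.randint(0, 104, size=(512, 512))
gt = np.random.randint(0, 104, size=(512, 512))
print(segmentation_metrics(confusion_matrix(pred, gt, 104)))
```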
Fig. 4. Visualization comparison with the baseline and ground truth on semantic segmentation. The difference is calculated between the enhanced and coarse masks.

TABLE I
COMPARISON WITH STATE-OF-THE-ART METHODS ON FOODSEG103

Methods                      mIoU (%)   aAcc (%)   mAcc (%)
FPN [41]                     27.28      75.23      36.7
CCNet [52]                   28.6       78.9       47.8
ReLeM-CCNet [11]             29.2       79.3       47.5
ReLeM-FPN-Finetune [11]      30.8       78.9       40.7
Window Attention [66]        31.4       77.62      40.3
Upernet [67]                 39.8       82.02      52.37
STPPN [53]                   40.3       82.13      53.98
CCNet-Finetune               41.3       87.7       53.8
SETR [37] (baseline)         45.1       83.53      57.44
FoodSAM (ours)               46.42      84.10      58.27

TABLE II
COMPARISON WITH STATE-OF-THE-ART METHODS ON UECFOODPIX COMPLETE

Methods                      mIoU (%)   aAcc (%)   mAcc (%)
deeplabV3+ [15], [65]        55.50      66.80      –
YOLACT [68]                  54.85      –          –
GourmetNet [69]              62.88      87.07      75.87
BayesianDeeplabv3+ [70]      64.21      87.29      76.15
deeplabV3+ (baseline)        65.61      88.20      77.56
FoodSAM (ours)               66.14      88.47      78.01
B. Comparison with State-of-the-art Methods

To compare with state-of-the-art methods, we conduct experiments across multiple levels of granularity, covering semantic, instance, and panoptic segmentation. We also compare our FoodSAM with other SAM variants on promptable segmentation tasks. We next discuss the results in detail.

Evaluation on Semantic Segmentation: As shown in Tab. I and Tab. II, we achieve state-of-the-art performance on both FoodSeg103 and UECFoodPix Complete. Specifically, FoodSAM achieves 46.42 mIoU, 58.27 mAcc, and 84.10 aAcc on FoodSeg103, as well as 66.14 mIoU, 78.01 mAcc, and 88.47 aAcc on UECFoodPix Complete. We also compare against other zero-shot methods, as depicted in Tab. III; FoodSAM outperforms recent SAM variants. Notably, all of these zero-shot methods remain below 30 mIoU, well under the best supervised method at 45.1 mIoU.


Fig. 5. Visualization comparison with SSA and RAM on semantic segmentation. SSA and RAM may also output instance information; their results are obtained from their public repositories, and here we only discuss the semantic results on food.

TABLE III
COMPARISON WITH ZERO-SHOT METHODS ON FOODSEG103

Methods           mIoU (%)
Best sup. (UB)    45.1
ZSSeg-B           8.17
ZegFormer-B       10.01
X-Decoder-T       2.61
SAN-B             19.27
OpenSeeD-T        9.0
OVSeg-L           16.43
SAN-L             24.46
CAT-Seg-H         29.06
Gr.-SAM-H         9.97
FoodSAM (ours)    46.42

As depicted in Fig. 4, we also conduct a qualitative analysis of the original semantic mask, the enhanced semantic mask, and the ground-truth mask. By leveraging the impressive segmentation capacity of SAM, FoodSAM presents a powerful semantic segmentation method that compensates for the losses of the original segmenter. We also visualize the comparison with the SAM variants SSA and RAM in Fig. 5, which shows that FoodSAM segments finer ingredients while RAM has more mis-segmentation cases.

Evaluation on Instance Segmentation: Since no previous work has implemented instance segmentation in this area and no related benchmark has been released, we compare qualitative results with SAM variants. As shown in Fig. 6, FoodSAM shows impressive performance and recognizes the instance identity of food well with a predicted semantic mask. Compared with other methods, FoodSAM can segment and recognize special food ingredients that other works miss. There are some mis-segmentation cases in RAM [59], e.g., the mis-segmentation of the fruit and the bread. Moreover, the strawberry is segmented as a whole, not fine-grained enough, in RAM, whereas FoodSAM segments the strawberry into each small piece.

Evaluation on Panoptic Segmentation: Likewise, since no work or datasets have been published for this task, we conduct a qualitative analysis with the SAM variants RAM and SEEM. As shown in Fig. 7, FoodSAM exhibits an excellent segmentation ability for non-food objects. Specifically, for the same input image, it additionally shows the bowl and the plate compared with instance segmentation, which focuses on the food. In contrast, the other methods cannot effectively distinguish fine-grained differences, such as a bowl containing ingredients versus a glass containing milk.

Evaluation on Promptable Segmentation: Drawing inspiration from SAM, we also extend promptable segmentation to FoodSAM. The regular (mask) prompt has been discussed in the subsections above, so here we discuss the point prompt and the box prompt. We directly use panoptic segmentation to compare granularity, as it contains all fine-grained information, including food and non-food objects. As shown in Fig. 8, FoodSAM can identify the categories of the food ingredients, which addresses the limitation of SAM. Moreover, FoodSAM also recognizes and segments the non-food objects within the coarse background with impressive results.

Fig. 6. Visualization comparison with RAM on instance segmentation. RAM may also output non-food instance information; here we only discuss the results on food.

C. Improvement of FoodSAM

In this subsection, we explore the improvement obtained by leveraging SAM's segmentation quality. As shown in Tab. I and Tab. II, FoodSAM achieves a clear improvement over the baselines, which indicates that SAM's impressive segmentation quality compensates for the limitations of the original semantic segmenter. Since SAM produces many masks for every possible object, we experiment with how the number of merged masks affects the improvement.

Specifically, we choose the top-K largest-area masks from FoodSAM to merge with the original semantic mask from the baseline. As shown in Tab. IV and Tab. V, the performance increases with increasing K. When using 80 masks, FoodSAM achieves 66.14 mIoU, 78.00 mAcc, and 88.47 aAcc on UECFoodPix Complete and 46.42 mIoU, 58.27 mAcc, and 84.10 aAcc on FoodSeg103. These experiments demonstrate that the fusion of SAM and a food segmenter is feasible and yields a significant improvement for FoodSAM.

TABLE IV
RESULTS OF MERGING THE ORIGINAL SEMANTIC MASK AND THE TOP K SAM MASKS ON FOODSEG103

Top K      mIoU (%)   mAcc (%)   aAcc (%)
Baseline   45.10      57.44      83.53
+10        45.22      57.19      83.63
+30        46.19      58.01      84.03
+40        46.32      58.15      84.06
+80        46.42      58.27      84.10

TABLE V
RESULTS OF MERGING THE ORIGINAL SEMANTIC MASK AND THE TOP K SAM MASKS ON UECFOODPIX COMPLETE

Top K      mIoU (%)   mAcc (%)   aAcc (%)
Baseline   65.61      77.56      88.2
+10        66.01      77.89      88.39
+30        66.12      77.99      88.46
+40        66.13      78.00      88.46
+80        66.14      78.00      88.47

D. Ablation Study

To verify the function of each part, we conduct an ablation study on the FoodSeg103 benchmark over the following components: filtering masks with a confused category label (FCC) or not, and merging with SAM-generated masks (WSM) or not. With WSM only, we use a zero mask as the initial background mask and take the labels from the original semantic mask, whereas when combining with the baseline, we use the original semantic mask as the initial background. As shown in Tab. VI, WSM alone already achieves comparable performance, and the combination obtains a significant improvement over the baseline. With FCC, we only merge masks that have a reliable category label; for a mask with a confused category, the corresponding area keeps the semantic meaning of the original semantic mask.

TABLE VI
ABLATION STUDY ON INDIVIDUAL PARTS

Baseline   WSM   FCC   mIoU (%)   mAcc (%)   aAcc (%)
✓                      45.10      57.44      83.53
           ✓           44.40      54.48      82.27
✓          ✓           46.33      58.19      84.05
✓          ✓     ✓     46.42      58.27      84.10

In the merging process, we also need to handle conflicting areas where masks overlap. We conduct an ablation study to verify which ordering is most effective, considering two factors: the SAM-predicted IoU and the number of positive values in a mask (its area). As shown in Tab. VII, sorting by area from large to small achieves the best performance; placing the large areas later would cover the small, fine areas. We also find that the SAM-predicted IoU values are all close to 1 and thus provide little discriminative signal.

Before the merging process, if the count of the most frequent category label in a mask is relatively low, we regard the category of that mask as confused and exclude it from the later merging process. Therefore, we also experiment with which threshold is effective for FoodSAM. When the threshold τ equals zero, this filtering is not used.

Fig. 7. Visualization comparison with RAM, SEEM and ours on panoptic segmentation. The visualization results are obtained from their public code repository.

TABLE VII
ABLATION STUDY ON THE SORTING ORDER OF SAM MASKS

Sorted Way             mIoU (%)   mAcc (%)   aAcc (%)
IoU (large → small)    45.70      57.63      83.70
IoU (small → large)    45.96      57.79      83.89
Area (large → small)   46.42      58.27      84.10
Area (small → large)   45.18      57.12      83.46

TABLE VIII
ABLATION STUDY ON THE CONFUSED THRESHOLD OF MASK CATEGORY ON FOODSEG103

Confused threshold   mIoU (%)   mAcc (%)   aAcc (%)
0                    46.33      58.19      84.05
0.3                  46.33      58.19      84.05
0.5                  46.42      58.27      84.1
0.7                  46.25      58.18      84.05
0.9                  45.71      57.75      83.92

TABLE IX
ABLATION STUDY ON THE CONFUSED THRESHOLD OF MASK CATEGORY ON UECFOODPIX COMPLETE

Confused threshold   mIoU (%)   mAcc (%)   aAcc (%)
0                    65.61      77.56      88.20
0.3                  65.78      78.20      87.81
0.5                  65.81      78.21      87.85
0.7                  66.06      78.15      88.27
0.8                  66.14      78.01      88.47
0.9                  66.08      77.96      88.45

The experiments on FoodSeg103 are shown in Tab. VIII and on UECFoodPix Complete in Tab. IX. Using the area of the original semantic mask to represent a confused mask yields a minor improvement on FoodSeg103 and a significant increase on UECFoodPix Complete. The reason is that FoodSeg103 is annotated with fine-grained ingredients, while UECFoodPix Complete only contains coarse-grained food labels. Therefore, when the number of confused labels is larger, our method achieves a higher improvement.

V. CONCLUSION

This paper investigates the zero-shot capability of SAM for food image segmentation, a challenging task in the domain of food computing. The vanilla mask generation method of SAM alone falls short in capturing class-specific information, hindering accurate food item categorization. To address this limitation, we propose FoodSAM, a novel zero-shot framework that combines original semantic masks with SAM-generated category-agnostic masks to enhance semantic segmentation quality. Additionally, we leverage SAM's inherent instance-based masks to perform instance segmentation on food images. FoodSAM also incorporates object detection methodologies to detect non-food objects, allowing for panoptic segmentation. Furthermore, we extend our investigation to promptable segmentation, supporting various prompt variants. Our comprehensive evaluation on benchmark datasets demonstrates FoodSAM's state-of-the-art performance, affirming SAM's potential as an influential tool for food image segmentation. This work encompasses the first exploration of SAM in food segmentation, accomplishing instance, panoptic, and promptable segmentation on food images, and surpassing existing methods in performance.


Fig. 8. Visualization results on promptable segmentation. From left to right: input, double point prompts, double box prompts

REFERENCES

[1] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., "Huggingface's transformers: State-of-the-art natural language processing," arXiv preprint arXiv:1910.03771, 2019.
[2] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, "Transfer learning in natural language processing," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 2019, pp. 15–18.
[3] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," Science China Technological Sciences, vol. 63, no. 10, pp. 1872–1897, 2020.
[4] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., "ChatGPT for good? On opportunities and challenges of large language models for education," Learning and Individual Differences, vol. 103, p. 102274, 2023.
[5] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., "Emergent abilities of large language models," arXiv preprint arXiv:2206.07682, 2022.
[6] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in Neural Information Processing Systems, vol. 35, pp. 24824–24837, 2022.
[7] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., "Segment anything," arXiv preprint arXiv:2304.02643, 2023.
[8] J. Song, Z. Li, W. Min, and S. Jiang, "Towards food image retrieval via generalization-oriented sampling and loss function design," ACM Trans. Multimedia Comput. Commun. Appl., May 2023, just accepted. [Online]. Available: https://doi.org/10.1145/3600095
[9] L. Zhou, C. Zhang, F. Liu, Z. Qiu, and Y. He, "Application of deep learning in food: A review," Comprehensive Reviews in Food Science and Food Safety, vol. 18, no. 6, pp. 1793–1811, 2019.
[10] L. Zhu, P. Spachos, E. Pensini, and K. N. Plataniotis, "Deep learning and machine vision for food processing: A survey," Current Research in Food Science, vol. 4, pp. 233–249, 2021.
[11] X. Wu, X. Fu, Y. Liu, E.-P. Lim, S. C. Hoi, and Q. Sun, "A large-scale benchmark for food image segmentation," in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 506–515.
[12] R. Padilla, S. L. Netto, and E. A. Da Silva, "A survey on performance metrics for object-detection algorithms," in 2020 International Conference on Systems, Signals and Image Processing (IWSSIP). IEEE, 2020, pp. 237–242.
[13] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, "Object detection in 20 years: A survey," Proceedings of the IEEE, 2023.
[14] X. Zhou, V. Koltun, and P. Krähenbühl, "Simple multi-dataset detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7571–7580.
[15] K. Okamoto and K. Yanai, "UEC-FoodPix Complete: A large-scale food image segmentation dataset," in Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, Proceedings, Part V. Springer, 2021, pp. 647–659.
[16] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., "Sparks of artificial general intelligence: Early experiments with GPT-4," arXiv preprint arXiv:2303.12712, 2023.
[17] L. Floridi and M. Chiriatti, "GPT-3: Its nature, scope, limits, and consequences," Minds and Machines, vol. 30, pp. 681–694, 2020.
[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[19] Z. Hu, Y. Dong, K. Wang, K.-W. Chang, and Y. Sun, "GPT-GNN: Generative pre-training of graph neural networks," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1857–1867.
[20] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[21] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, "Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm," arXiv preprint arXiv:2110.05208, 2021.
[22] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., "On the opportunities and risks of foundation models," arXiv preprint arXiv:2108.07258, 2021.
[23] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li et al., "Florence: A new foundation model for computer vision," arXiv preprint arXiv:2111.11432, 2021.
[24] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[25] N. R. Pal and S. K. Pal, "A review on image segmentation techniques," Pattern Recognition, vol. 26, no. 9, pp. 1277–1294, 1993.
[26] H.-D. Cheng, X. H. Jiang, Y. Sun, and J. Wang, "Color image segmentation: Advances and prospects," Pattern Recognition, vol. 34, no. 12, pp. 2259–2281, 2001.
[27] K. McGuinness and N. E. O'Connor, "A comparative evaluation of interactive segmentation algorithms," Pattern Recognition, vol. 43, no. 2, pp. 434–444, 2010.
[28] E. N. Mortensen and W. A. Barrett, "Interactive segmentation with intelligent scissors," Graphical Models and Image Processing, vol. 60, no. 5, pp. 349–384, 1998.
[29] M.-Y. Liu, O. Tuzel, S. Ramalingam, and R. Chellappa, "Entropy rate superpixel segmentation," in CVPR 2011. IEEE, 2011, pp. 2097–2104.
[30] R. M. Haralick and L. G. Shapiro, "Image segmentation techniques," Computer Vision, Graphics, and Image Processing, vol. 29, no. 1, pp. 100–132, 1985.
[31] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, "Image segmentation using deep learning: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3523–3542, 2021.
[32] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, "A review of semantic segmentation using deep neural networks," International Journal of Multimedia Information Retrieval, vol. 7, pp. 87–93, 2018.
[33] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, "Understanding convolution for semantic segmentation," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1451–1460.
[34] S. Hao, Y. Zhou, and Y. Guo, "A brief survey on semantic segmentation with deep learning," Neurocomputing, vol. 406, pp. 302–321, 2020.
[35] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, "Dual attention network for scene segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
[36] Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang, "OCNet: Object context network for scene parsing," arXiv preprint arXiv:1809.00916, 2018.
[37] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, P. H. Torr et al., "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890.
[38] A. M. Hafiz and G. M. Bhat, "A survey on instance segmentation: State of the art," International Journal of Multimedia Information Retrieval, vol. 9, no. 3, pp. 171–189, 2020.
[39] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[40] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9157–9166.
[41] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9404–9413.
[42] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun, "UPSNet: A unified panoptic segmentation network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8818–8826.
[43] O. Elharrouss, S. Al-Maadeed, N. Subramanian, N. Ottakath, N. Almaadeed, and Y. Himeur, "Panoptic segmentation: A review," arXiv preprint arXiv:2111.10250, 2021.
[44] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
[45] T. Lüddecke and A. Ecker, "Image segmentation using text and image prompts," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7086–7096.
[46] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, "GPT understands, too," arXiv preprint arXiv:2103.10385, 2021.
[47] K. Lu, A. Grover, P. Abbeel, and I. Mordatch, "Pretrained transformers as universal computation engines," arXiv preprint arXiv:2103.05247, vol. 1, 2021.
[48] W. Min, C. Liu, L. Xu, and S. Jiang, "Applications of knowledge graphs for food science and industry," Patterns, vol. 3, no. 5, p. 100484, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666389922000691
[49] W. Min, Z. Wang, Y. Liu, M. Luo, L. Kang, X. Wei, X. Wei, and S. Jiang, "Large scale visual food recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[50] W. Min, L. Liu, Z. Wang, Z. Luo, X. Wei, X. Wei, and S. Jiang, "ISIA Food-500: A dataset for large-scale food recognition via stacked global-local attention network," in Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[51] J. Klotz, V. Rengarajan, and A. C. Sankaranarayanan, "Fine-grain prediction of strawberry freshness using subsurface scattering," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2328–2336.
[52] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, "CCNet: Criss-cross attention for semantic segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 603–612.
[53] Q. Wang, X. Dong, R. Wang, and H. Sun, "Swin transformer based pyramid pooling network for food segmentation," in 2022 IEEE 2nd International Conference on Software Engineering and Artificial Intelligence (SEAI). IEEE, 2022, pp. 64–68.
[54] Y. Honbu and K. Yanai, "Unseen food segmentation," in Proceedings of the 2022 International Conference on Multimedia Retrieval, 2022, pp. 19–23.
[55] G. Sinha, K. Parmar, H. Azimi, A. Tai, Y. Chen, A. Wong, and P. Xi, "Transferring knowledge for food image segmentation using transformers and convolutions," arXiv preprint arXiv:2306.09203, 2023.
[56] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[57] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[58] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[59] Y. Zhang, X. Huang, J. Ma, Z. Li, Z. Luo, Y. Xie, Y. Qin, T. Luo, Y. Li, S. Liu et al., "Recognize anything: A strong image tagging model," arXiv preprint arXiv:2306.03514, 2023.
[60] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y. J. Lee, "Segment everything everywhere all at once," 2023.
[61] J. Chen, Z. Yang, and L. Zhang, "Semantic segment anything," https://github.com/fudan-zvg/Semantic-Segment-Anything, 2023.
[62] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M. Li, M. Tang, and J. Wang, "Fast segment anything," arXiv preprint arXiv:2306.12156, 2023.
[63] C. Rother, V. Kolmogorov, and A. Blake, ""GrabCut": Interactive foreground extraction using iterated graph cuts," ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 309–314, 2004.
[64] T. Ege, W. Shimoda, and K. Yanai, "A new large-scale food image segmentation dataset and its application to food calorie estimation based on grains of rice," in Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management, 2019, pp. 82–87.
[65] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
[66] X. Dong, W. Wang, H. Li, and Q. Cai, "Windows attention based pyramid network for food segmentation," in 2021 IEEE 7th International Conference on Cloud Computing and Intelligent Systems (CCIS). IEEE, 2021, pp. 213–217.
[67] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, "Unified perceptual parsing for scene understanding," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 418–434.
[68] E. Battini Sönmez, S. Memiş, B. Arslan, and O. Z. Batur, "The segmented UEC Food-100 dataset with benchmark experiment on food detection," Multimedia Systems, pp. 1–9, 2023.
[69] U. Sharma, B. Artacho, and A. Savakis, "GourmetNet: Food segmentation using multi-scale waterfall features with spatial and channel attention," Sensors, vol. 21, no. 22, p. 7504, 2021.
[70] E. Aguilar, B. Nagarajan, B. Remeseiro, and P. Radeva, "Bayesian deep learning for semantic segmentation of food images," Computers and Electrical Engineering, vol. 103, p. 108380, 2022.
