Lightweight Food Image Recognition With Global Shuffle Convolution
Abstract—Consumer behaviors and habits in food choices impact their physical health and have implications for climate change and global warming. Efficient food image recognition can assist individuals in making more environmentally friendly and healthier dietary choices using end devices, such as smartphones. Simultaneously, it can enhance the efficiency of server-side training, thereby reducing carbon emissions. We propose a lightweight deep neural network named Global Shuffle Net (GSNet) that can efficiently recognize food images. In GSNet, we develop a novel convolution method called global shuffle convolution, which captures the dependence between long-range pixels. Merging global shuffle convolution with classic local convolution yields a framework that works as the backbone of GSNet. Because GSNet captures the dependence between long-range pixels at the start of the network, the number of layers in the middle and rear can be restricted, so the parameters and floating point operations (FLOPs) can be minimized without compromising performance, thus achieving a lightweight goal. Experimental results on four popular food recognition datasets demonstrate that our approach achieves state-of-the-art performance with higher accuracy and fewer FLOPs and parameters. For example, in comparison with the current state-of-the-art model MobileViTv2, GSNet achieves 87.9% top-1 accuracy on the Eidgenössische Technische Hochschule Zürich (ETHZ) Food-101 dataset with 28% fewer parameters, 37% fewer FLOPs, and 0.7% higher accuracy.

Index Terms—Climate change and global warming, deep learning, food recognition, global shuffle convolution, lightweight, long-range dependence.

Manuscript received 30 December 2023; revised 20 February 2024; accepted 6 April 2024. Date of publication 2 May 2024; date of current version 10 October 2024. This article was recommended by Associate Editor C. Josephson. (Corresponding author: Yancun Yang.) Guorui Sheng, Tao Yao, Jingru Song, Yancun Yang, and Lili Wang are with the Department of Information and Electrical Engineering, Ludong University, Yantai 264025, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Weiqing Min and Shuqiang Jiang are with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TAFE.2024.3386713

I. INTRODUCTION

CLIMATE change and global warming have exhibited an alarming escalation in recent years, prompting growing awareness of the impact of dietary choices on the environment among the global population of 7.7 billion people [1], [2]. Increasing numbers of consumers recognize that adopting eco-friendly and sustainable food options can contribute significantly to mitigating these issues at an individual level. Moreover, such choices drive producers and supply chains to embrace more environmentally friendly practices. For instance, reducing meat consumption can lower the greenhouse gas emissions associated with livestock farming, while prioritizing local and seasonal foods helps minimize carbon emissions during transportation. Efficient food image recognition plays a pivotal role as the initial step in empowering individuals to make such sustainable choices. Accurate dietary recommendations derived from this recognition not only assist consumers in selecting environmentally friendly foods but also aid in choosing those that promote personal health. This capability can readily be harnessed through the smartphones that people carry with them daily. However, given the constraints in power consumption and memory of such end devices, it is imperative to optimize the neural network utilized for food recognition.

Food image recognition occupies a pivotal position within the rapidly evolving interdisciplinary realm of food computing [3], playing an indispensable role across various domains, such as dietary analysis, healthcare, and the food industry [4], [5], [6], [7], [8], [9]. The proliferation of diverse cuisines and culinary techniques has led to a surge in food image datasets, posing challenges for the sustainable expansion of server-side food image recognition. Moreover, the substantial carbon footprint resulting from large-scale training of artificial intelligence on server infrastructure has emerged as a pressing concern. Furthermore, food image recognition entails intricate fine-grained analysis, offering valuable insights for refining similar models in the domain of fine-grained recognition [10]. Despite the widespread adoption of deep learning methods in current approaches, characterized by their high parameter count and extensive training and inference durations [11], [12], this article focuses on developing lightweight deep neural network models tailored specifically for food image recognition.

The rapid integration of artificial intelligence, particularly deep learning, has permeated various sectors, including food and agriculture [13], [14], [15], [16], [58]. However, research on lightweight approaches for food image recognition remains relatively sparse. Early endeavors primarily relied on lightweight convolutional neural network (CNN)-based methods for food image analysis. However, the inherent challenge lay in extracting long-range information from images due to the dispersed nature of ingredients. As illustrated in Fig. 1, the discriminative factors in food identification often lie within the scattered arrangement of ingredients, compounded by variations in size, shape, and
Fig. 2. Comparison with SOTA CNN-based (MobileNetV2 [19] and V3 [20]) and hybrid (MobileViTv2 [21]) lightweight models across different datasets. (a) ETHZ Food-101 [41]. (b) Vireo Food-172 [51]. (c) UEC Food-256 [52].

Fig. 3. GSNet. Here, Conv n × n represents a standard n × n convolution. In the global shuffle convolution block, to illustrate the implementation process, assume that both H and W are 6.
B. Lightweight Food Recognition

Recently, Min et al. [3] gave a survey on food computing, including food recognition. In earlier years, various handcrafted features were utilized for recognition [40], [41]. For example, Bossard et al. [41] utilized random forests to mine discriminative image patches as a visual representation. With the rise of deep learning technology, many recognition methods based on deep learning have emerged [11], [12], [42], [43], [44], [45].

Given the necessity of lightweight food image recognition, a lot of related research work has been proposed. Early researchers used lightweight CNN methods for food image recognition [46], [47], [48], [49]. Tan et al. [49] recently proposed a novel lightweight neural architecture search (LNAS) model to self-generate a thin CNN that can be executed on mobile devices, achieving nearly 76% recognition accuracy on the Eidgenössische Technische Hochschule Zürich (ETHZ) Food-101 dataset. The recognition accuracy of these CNN-based lightweight food recognition methods is generally low. The ViT provides a new option for extracting global features of food images: Sheng et al. [18] extracted global and local features with a parallel structure composed of a ViT group and a CNN and obtained SOTA performance. However, due to the multihead attention mechanism of the ViT, the model size is still large.

In contrast to the works that use the ViT, we propose a simple yet effective pure convolution network, which is based upon the characteristics of food images and allows for better control over parameters and calculations. In this architecture, a global shuffle convolution is utilized to identify global features, and a parallel network structure along with a CNN is fashioned to draw out local features, resulting in SOTA performance.

III. METHOD

A. Brief Review of GSNet

Our objective is to propose a network model that can not only effectively deal with the dispersion and diversity of food image features, but also realize a lightweight design so that it can be better extended on the server side and deployed on edge and end devices.

The proposed GSNet is shown in Fig. 3. We use global shuffle convolution to capture the long-range information of food ingredients scattered in food images to enhance the model's expressiveness, and then form a parallel block with local convolution. This parallel block is used as the basic structure of GSNet, which effectively improves the food image recognition accuracy.
Based on the fact that GSNet focuses on capturing long-range dependence among different spatial pixels in the front part of the model, we reduce the number of network layers in the middle and rear parts and correspondingly obtain an effective reduction in the number of parameters and FLOPs. The experimental results show that this strategy can effectively reduce the number of parameters and FLOPs on the premise of ensuring recognition accuracy.

B. Global Shuffle Convolution

The global shuffle convolution method first divides the image into several patches, and then, in each convolution operation, corresponding-position pixels are taken out from each patch to participate in the convolution. Since the patches cover the entire image, this convolution operation extracts scattered correlation information. No matter how far apart the same ingredient lies in the dish, it can be captured by the global shuffle convolution operation. As shown in Figs. 3 and 4(c), by first resetting the rows and then resetting the columns, the distant pixels are concentrated into 2 × 2 patches, and then a normal convolution with a 3 × 3 kernel is performed, so that not only are the four gathered elements involved in the correlation calculation, but also five more elements from a greater distance. Through this, the correlation information between long-range pixels is quickly obtained. Here, the convolution kernel size is set to 3 × 3 and the stride is set to 1. In Fig. 4(c), the middle image is the intermediate result after the rows and columns of the bottom image are reset, and the spatial variation pattern can be seen through pixels of the same color. The top plane is the result of a 3 × 3 convolution of the middle plane.
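No reference implementation is provided in this extract, so the following PyTorch sketch shows one reading of the row/column reset followed by a standard convolution. The class name, the groups_hw parameter (number of groups per spatial direction), and the use of same-size padding are assumptions made for illustration only.

import torch
import torch.nn as nn

class GlobalShuffleConv2d(nn.Module):
    """Sketch: interleave rows (and columns) drawn from groups_hw groups so that
    pixels spaced H/groups_hw (W/groups_hw) apart become adjacent, then apply an
    ordinary local convolution to the shuffled plane."""

    def __init__(self, in_ch, out_ch, kernel_size=3, groups_hw=2):
        super().__init__()
        self.g = groups_hw  # number of groups along each spatial direction (assumed)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride=1,
                              padding=kernel_size // 2)

    def shuffle(self, x):
        b, c, h, w = x.shape
        g = self.g
        # rows: original index k*(h//g) + i moves to i*g + k (same rule for columns)
        x = x.reshape(b, c, g, h // g, w).permute(0, 1, 3, 2, 4).reshape(b, c, h, w)
        x = x.reshape(b, c, h, g, w // g).permute(0, 1, 2, 4, 3).reshape(b, c, h, w)
        return x

    def forward(self, x):
        return self.conv(self.shuffle(x))

# On the 6 x 6 example of Fig. 3 with groups_hw = 2, the distant pixels
# (2, 2), (2, 5), (5, 2), and (5, 5) end up in one adjacent 2 x 2 patch.
plane = torch.arange(36.0).reshape(1, 1, 6, 6)
gs = GlobalShuffleConv2d(1, 1, kernel_size=3, groups_hw=2)
print(gs.shuffle(plane)[0, 0, 4:6, 4:6])  # tensor([[14., 17.], [32., 35.]])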
Compared with global shuffle convolution, local convolution [see Fig. 4(a)] extracts the local correlation in the image through the convolution operation on a local area, and then translates it with a certain step size and performs the convolution operation multiple times to achieve full coverage of local information. It can build deeper and more nonlinear networks but ignores the correlation between pixel vectors in the global scope, leading to information loss compared with the fully connected model. Dilated convolution is a variant of local convolution that expands the receptive field compared with local convolution. As shown in Fig. 4(b), under the same kernel size, the dilated convolution skips some pixel positions to perform convolution operations,

C. Approach

Suppose the input size of a convolution layer is [N, C^{in}, IH^{in}, IW^{in}], the output size is [N, C^{out}, IH^{out}, IW^{out}], and the convolution kernel size is [C^{out}, C^{in}, K_H, K_W], where N denotes the batch size, C denotes the number of channels, and IH and IW denote the height and width of the input images, respectively. During local convolution calculation, the value of a specific feature point X^{(t+1)} of a specific channel of the output feature map with input X^{(t)} is calculated as follows:

X^{(t+1)}_{N_i, C^{out}_j, IH^{out}_k, IW^{out}_l} = B_{C^{out}_j} + \sum_{0 \le c < C^{in}} \sum_{h} \sum_{w} W_{C^{out}_j, c, h, w} \times X^{(t)}_{N_i, c, h, w}    (1)

where B denotes the bias parameters with size C^{out}, W denotes the weight parameters with size [C^{out}, C^{in}, K_H, K_W], h \in [IH^{out}_k, IH^{out}_k + K_H - 1], and w \in [IW^{out}_l, IW^{out}_l + K_W - 1].
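As a concrete check of (1), the short sketch below evaluates the explicit sum over input channels and kernel offsets for a stride-1, unpadded local convolution and compares it with nn.Conv2d sharing the same weights; the tensor sizes are arbitrary illustration values, not the paper's.

import torch
import torch.nn as nn

torch.manual_seed(0)
N, C_in, C_out, H, W, K = 2, 3, 4, 6, 6, 3   # illustration sizes only
x = torch.randn(N, C_in, H, W)
conv = nn.Conv2d(C_in, C_out, K, stride=1, padding=0)

with torch.no_grad():
    H_out, W_out = H - K + 1, W - K + 1
    y = torch.empty(N, C_out, H_out, W_out)
    for n in range(N):                   # N_i in (1)
        for j in range(C_out):           # C_j^out
            for k in range(H_out):       # IH_k^out
                for l in range(W_out):   # IW_l^out
                    # h, w range over [IH_k^out, IH_k^out + K_H - 1] x [IW_l^out, IW_l^out + K_W - 1]
                    patch = x[n, :, k:k + K, l:l + K]
                    y[n, j, k, l] = conv.bias[j] + (conv.weight[j] * patch).sum()
    print(torch.allclose(y, conv(x), atol=1e-5))  # True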
The complete formula of the global shuffle convolution calculation method is more complicated. For simplicity, in an image plane, let the number of groups in the row direction equal K_H and the number of groups in the column direction equal K_W, with H^{in}/K_H = H^{out} and W^{in}/K_W = W^{out}; that is, the number of groups is consistent with the kernel size, the size of each group is the same as the output plane, and the stride of the convolution is taken as the group size. Then, the calculation of the value of the specified feature point Y of the output feature map with input X is similar to formula (1), but with

h \in \{ IH^{out}_k + k' \cdot IH^{out} \mid k' = 0, \ldots, K_H - 1 \}, \quad w \in \{ IW^{out}_l + k' \cdot IW^{out} \mid k' = 0, \ldots, K_W - 1 \}.    (2)

Then, folding the 4-D tensor X^{(t)} into a 2-D matrix, combining the bias parameters into the weight parameters, and adding a constant row to X^{(t)}, the output of the local convolutional network is

f(X^{(0)}) = f^{(T-1)}\big(\cdots f^{(1)}\big(f^{(0)}(X^{(0)} W^{(0)}) W^{(1)}\big) \cdots W^{(T-1)}\big)    (3)

where X^{(t)} (t = 0, \ldots, T - 1) is the 2-D matrix of the input or the 2-D matrix of the output of each layer, T is the number of layers, W^{(t)} is the parameter matrix of each layer, and f^{(t)} is the nonlinear activation function used by each layer. When nonlinear activation functions are not used, f(X^{(0)}) = X^{(0)} W^{(0)} W^{(1)} \cdots W^{(T-1)}; that is, in linear mode, global shuffle convolution and local convolution are equivalent, but their parameter positions are adjusted. However, neural networks are nonlinear, and only (1) is true, i.e., the global shuffle convolution is a series of images whose plane pixels are misaligned (the misalignment pattern is fixed). Using only global shuffle convolutions in the network is generally ineffective unless the dislocation results in clustered color patches similar to normal images.

In our work, the parallel network structures of global shuffle convolution and local convolution are both used; the local convolution represents most of the features of the image, and the global shuffle convolution assists in collecting the food information scattered around the image, which ultimately improves the accuracy of recognition.

D. Implementation Details

The essence of the global shuffle convolution method is to calculate the correlation between several pixels at any distance on the image plane, that is, to convolve several pixel vectors selected at different positions in the entire image, which is equivalent to adjusting these pixel vectors to a local region and then performing local convolution on this region.

When implementing the global shuffle convolution calculation, our actual practice is to first perform the relocation in the column direction of the image plane, then perform the rearrangement in the row direction, and finally perform the local convolution. As shown in Fig. 3, after the rearrangement in the column and row directions, the pixels (2, 2), (2, 5), (5, 2), and (5, 5) are adjusted to be adjacent to each other; these pixels come from scattered positions of the entire image plane, and local convolution on them is equivalent to global shuffle convolution. This implementation achieves certain flexibility: the number of groups does not have to be the same as the size of the convolution kernel, and the stride of the convolution does not have to be the same as the size of the group, so that correlations between more complex plane pixel vectors at different positions can be represented.

E. Network Architecture

This section introduces the basic parallel block, the hierarchical network layout, and the detailed network architecture of GSNet.

Parallel block: The basic block used in our network is parallel: one branch uses local convolution and the other uses global shuffle convolution, and the outputs of the two branches are concatenated and then propagated along the neural network. The local convolution branch is the inverted residual model derived from MobileNetV2, and the other branch replaces its depth-wise convolution part with our proposed global shuffle convolution. In the parallel block, the local convolution branch is responsible for extracting the local features at the pixel level, and the global shuffle convolution branch is responsible for capturing the long-range dependence between pixels in different spatial locations. The local convolution branch is the main bearer since most of the image features are revealed through local correlations. The global shuffle convolution branch provides correlation features between pixels from the entire image plane and is used to add long-range features that improve the expressiveness of the image representation.
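The exact block configuration is not reproduced in this extract, so the sketch below (reusing the GlobalShuffleConv2d sketch from Section III-B) pairs a MobileNetV2-style inverted-residual branch with a branch whose depth-wise convolution is replaced by global shuffle convolution and concatenates the two outputs. The expansion ratio, even channel split, SiLU activation, and omission of skip connections are illustrative assumptions.

import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """Sketch of the parallel block: local branch (inverted residual) and global
    branch (depth-wise conv swapped for global shuffle convolution), concatenated
    along the channel dimension. Residual/skip connections are omitted for brevity."""

    def __init__(self, in_ch, out_ch, groups_hw=2, expand=4):
        super().__init__()
        hidden = in_ch * expand
        half = out_ch // 2

        def pw(cin, cout, act=True):  # 1x1 point-wise conv + BN (+ activation)
            layers = [nn.Conv2d(cin, cout, 1, bias=False), nn.BatchNorm2d(cout)]
            if act:
                layers.append(nn.SiLU())
            return nn.Sequential(*layers)

        # local branch: 1x1 expand -> 3x3 depth-wise -> 1x1 project
        self.local_branch = nn.Sequential(
            pw(in_ch, hidden),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            pw(hidden, half, act=False))
        # global branch: the depth-wise conv is replaced by global shuffle convolution
        self.global_branch = nn.Sequential(
            pw(in_ch, hidden),
            GlobalShuffleConv2d(hidden, hidden, kernel_size=3, groups_hw=groups_hw),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            pw(hidden, out_ch - half, act=False))

    def forward(self, x):
        # concatenate local and long-range features and propagate them onward
        return torch.cat([self.local_branch(x), self.global_branch(x)], dim=1)

x = torch.randn(1, 16, 32, 32)
print(ParallelBlock(16, 32, groups_hw=2)(x).shape)  # torch.Size([1, 32, 32, 32])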
Adjusted network layout: As shown in Fig. 5, we use a network structure that differs from traditional models. In a traditional network, since local convolution can only represent the local correlation of the image, long-range features are obtained only after multiple layers of local convolution. The characteristic of this network structure is that there are fewer layers in the front part of the network and more layers in the back part. Due to the large number of channels in the middle and rear, the network is heavily parameterized. By using the parallel block structure, GSNet obtains the long-range correlation information at the beginning of the network without relying on the shrinking part in the rear of the network with more layers. Therefore, in this work, we reduce the number of layers in the back of the network drastically, effectively reducing the number of parameters and computation, and thus develop a lightweight food recognition network. The following experimental results show that this strategy is effective, reducing the number of parameters and computations while achieving higher accuracy.

TABLE I. NETWORK SPECIFICATION

Network specification: The detailed network specification is given in Table I. The network first obtains an image plane through a local convolution and then passes through a series of parallel block groups. In each parallel block group, the group number is set according to the size of the current image resolution. At the tail of the network, the number of channels is expanded by convolution, then global pooling and dropout are performed to obtain and adjust the single-pixel output, and finally, a fully connected layer is used to map to the number of classes.
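Since Table I is not reproduced in this extract, the sketch below only illustrates the described layout (stem convolution, stacked parallel block groups, channel expansion, global pooling, dropout, and a fully connected classifier), reusing the ParallelBlock sketch above. Every channel width, block count, group number, and the dropout rate are placeholders, not the published specification.

import torch.nn as nn

class GSNetSketch(nn.Module):
    """Layout sketch only: stem conv -> parallel block groups -> 1x1 channel
    expansion -> global pooling + dropout -> fully connected classifier."""

    def __init__(self, num_classes=101, dropout=0.1):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16), nn.SiLU())
        stages, in_ch = [], 16
        # (out_ch, blocks, groups_hw) per group are placeholders, not Table I
        for out_ch, blocks, g in [(32, 2, 4), (64, 2, 2), (96, 1, 2)]:
            for _ in range(blocks):
                stages.append(ParallelBlock(in_ch, out_ch, groups_hw=g))
                in_ch = out_ch
            stages.append(nn.MaxPool2d(2))  # down-sampling between groups (assumed)
        self.stages = nn.Sequential(*stages)
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 384, 1, bias=False), nn.BatchNorm2d(384), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(dropout), nn.Linear(384, num_classes))

    def forward(self, x):
        return self.head(self.stages(self.stem(x)))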
IV. EXPERIMENTS

A. Datasets

To evaluate the proposed model, we conduct experiments on four food datasets: ETHZ Food-101 [41], Vireo Food-172 [51], UEC Food-256 [52], and ISIA Food-500 [53]. ETHZ Food-101 has 101 categories; we use 75 750 images for training and 25 250 for validation. Vireo Food-172 provides 172 categories; we use 66 071 images for training and 44 170 images for validation. UEC Food-256 has 256 categories, where 22 095 images are used for training and 9300 images are used for validation. ISIA Food-500 is a comprehensive food dataset composed of 500 food types from Wikipedia; we use 239 378 images for training and 120 142 images for validation.

B. Training Settings

We train our models using an input image resolution of 256 × 256, a batch size of 256, and the SGD optimizer with 0.9 momentum [54]. We use an initial learning rate of 0.1 for the first 3000 iterations of linear warm-up and then a cosine schedule with the learning rate ranging from 0.0004 to 0.8. Furthermore, we use the same data augmentation method as MobileViTv2 for image preprocessing.
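A minimal sketch of these optimization settings follows. The total number of training iterations and the exact interaction between the 0.1 warm-up rate and the 0.0004–0.8 cosine range are not specified in this extract, so the schedule shape below is an assumed reading.

import math
import torch

def lr_at(step, total_steps, warmup_steps=3000,
          warmup_start=0.1, lr_max=0.8, lr_min=0.0004):
    # Linear warm-up from 0.1 toward 0.8, then cosine annealing down to 0.0004
    # (one reading of the schedule described above; total_steps is assumed).
    if step < warmup_steps:
        return warmup_start + (lr_max - warmup_start) * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

model = GSNetSketch(num_classes=101)  # placeholder model from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda s: lr_at(s, total_steps=150_000) / 0.1)  # LambdaLR rescales the base lr
# call scheduler.step() after each optimizer.step() during training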
C. Experiment Results

TABLE II. PERFORMANCE COMPARISON ON ETHZ FOOD-101 [41]

Results on ETHZ Food-101: Table II presents results on ETHZ Food-101. The results are grouped according to similar numbers of parameters. Our model surpasses all other models in three parameter ranges. Among all models with around 1 M parameters, our model achieves 87.0% top-1 accuracy, which is 0.1%, 4.6%, and 4.6% higher than MobileViTv2, MobileNetV3, and MobileNetV2, respectively. Among models with an around 2–3 M parameter budget, our model's top-1 accuracy is 87.9%, which is 0.7% higher than MobileViTv2 and 2.4% higher than MobileNetV3 and MobileNetV2. Our model also achieves the highest top-1 accuracy of 88.4% in the parameter range of 3–5 M, surpassing MobileViTv2, MobileNetV3, and MobileNetV2 by 0.8%, 2.2%, and 1.9%, respectively. We also compare with recent lightweight food recognition networks; the results show that the recognition accuracy of our network (87.0%) is much higher than that of LNAS-NET (75.9%) and LTBDNN (TD-192) (76.8%) with far fewer parameters.

Results on Vireo Food-172: Table III presents results on Vireo Food-172. Compared with MobileViTv2 in every parameter range, our model achieves better top-1 accuracy of 87.8% versus 87.3%, 89.1% versus 88.0%, and 89.3% versus 88.2%, with much lower FLOPs of 295 M versus 480 M, 665 M versus 1052 M, and 1051 M versus 1843 M. Although MobileNetV3 and MobileNetV2 have much lower FLOPs, they lag our models in accuracy by a margin of more than 2%.

TABLE IV. PERFORMANCE COMPARISON ON UEC FOOD-256 [52]

Results on UEC Food-256: As seen in Table IV, the results are similar to those on the other two datasets. Our models achieve the highest top-1 accuracy in every parameter range. Compared with MobileViTv2, our model has fewer parameters and FLOPs. Compared with MobileNetV3 and MobileNetV2, our model achieves much higher top-1 accuracy with fewer parameters but slightly more FLOPs.

Results on ISIA Food-500: Table V presents experimental results on the ISIA Food-500 dataset. Because of its wide range, large scale, and coverage of both Chinese and Western food, food recognition is harder on Food-500. Even so, our proposed GSNet still achieves competitive results: compared with the SOTA ViT-based lightweight network MobileViTv2, the FLOPs are greatly reduced with almost the same recognition rate. Compared with the SOTA CNN-based lightweight networks MobileNetV2 and V3, our model has significantly better performance with similar parameters: GSNet-1.5/-2.0 obtain 64.3%/64.9% top-1 accuracy, which is 1.6%/1.1% higher than that of MobileNetV2/V3 (63.8%/62.7%) with a similar number of parameters.

The experimental results demonstrate the effectiveness and the generalization of our design. With the proposed parallel block, although we reduce the number of layers in the middle and rear parts of the network, the proposed network provides reasonable accuracy gains over the general network architecture. Considering the consistent results of experiments on four different food datasets, the proposed model should be effective and efficient for general food vision tasks.

Comparison and Analysis with Results Based on Lightweight Networks Using ViT: Experimental results reveal that, compared with the SOTA lightweight model based on ViT, MobileViTv2 [21], GSNet achieves comparable or superior recognition accuracy while requiring fewer parameters and significantly less computational load. We believe this is based on the following reasons: ViT possesses powerful capabilities for extracting global information. However, the common challenges of ViT-based lightweight models include the difficulty of training and the high computational cost stemming from the quadratic number of interactions between tokens. When modeling the global context, ViT also incorporates positional information of patches, further increasing parameter quantity and computational load. A key differentiating feature of food images lies in the correlated characteristics among the same type of ingredients dispersed throughout the image. The distant correlations among the dispersed identical ingredients do not require consideration of specific patch positional information. Our designed GSNet is
Fig. 6. Visualization of experimental results. (a) Samples from ETHZ Food-101. (b) Samples from Vireo Food-172. The left four columns are cases that both local convolution and local+global shuffle convolution identify correctly; the right six columns are cases where local convolution fails but local+global shuffle convolution succeeds. The first row shows the original images, the second row the heat maps generated using only local convolution, and the third row the heat maps generated using local+global shuffle convolution.
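The extract does not state how the Fig. 6 heat maps were produced; Grad-CAM [55] appears in the reference list, so the sketch below shows one common way to generate such maps, with the choice of target layer and the normalization being assumptions.

import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM-style heat map: weight the target layer's feature maps
    by their spatially averaged gradients for the chosen class, apply ReLU, and
    upsample to the input resolution."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(image)                 # image: (1, 3, H, W)
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)           # channel weights
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]

# example call with a hypothetical layer choice:
# heat = grad_cam(model, model.stages[-2], img, pred_class)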
TABLE VII. ABLATION STUDY

effectively mitigates the parameter count and FLOPs. Evaluation across four prominent food image databases demonstrates that our method outperforms existing CNN-based, ViT-based, and hybrid lightweight network models. The development of this lightweight network holds promise for enhancing server-side training efficiency and facilitating the deployment of food recognition applications on mobile platforms. This forms a robust foundation for individuals to make informed, environmentally conscious, and health-driven dietary choices in their daily lives.

Moving forward, our future endeavors will encompass adapting to diverse hardware architectures and operating system environments for end devices. In addition, we aim to deploy lightweight algorithms for food recognition, detection, and segmentation, ultimately offering personalized recommendations for environmentally sustainable and health-conscious dietary choices.
REFERENCES

[1] S. H. Wittwer, Food, Climate, and Carbon Dioxide: The Global Environment and World Food Production. Boca Raton, FL, USA: CRC Press, 1995.
[2] S. J. Vermeulen, B. M. Campbell, and J. S. I. Ingram, "Climate change and food systems," Annu. Rev. Environ. Resour., vol. 37, pp. 195–222, 2012.
[3] W. Min, S. Jiang, L. Liu, Y. Rui, and R. Jain, "A survey on food computing," ACM Comput. Surv., vol. 52, no. 5, pp. 1–36, 2019.
[4] A. Ishino, Y. Yamakata, H. Karasawa, and K. Aizawa, "RecipeLog: Recipe authoring app for accurate food recording," in Proc. ACM Multimedia Conf., 2021, pp. 2798–2800, doi: 10.1145/3474085.3478563.
[5] A. Rostami, N. Nagesh, A. Rahmani, and R. C. Jain, "World food atlas for food navigation," in Proc. 7th Int. Workshop Multimedia Assist. Dietary Manage., 2022, pp. 39–47, doi: 10.1145/3552484.3555748.
[6] A. Rostami, V. Pandey, N. Nag, V. Wang, and R. C. Jain, "Personal food model," in Proc. 28th ACM Int. Conf. Multimedia, 2020, pp. 4416–4424, doi: 10.1145/3394171.3414691.
[7] K. Nakamoto, S. Amano, H. Karasawa, Y. Yamakata, and K. Aizawa, "Prediction of mental state from food images," in Proc. 1st Int. Workshop Multimedia Cooking, Eating, Related Appl., 2022, pp. 21–28, doi: 10.1145/3552485.3554937.
[8] Y. Yamakata, A. Ishino, A. Sunto, S. Amano, and K. Aizawa, "Recipe-oriented food logging for nutritional management," in Proc. 30th ACM Int. Conf. Multimedia, 2022, pp. 6898–6904.
[9] T. Yao et al., "Online latent semantic hashing for cross-media retrieval," Pattern Recognit., vol. 89, pp. 1–11, 2019.
[10] J. Ródenas, B. Nagarajan, M. Bolaños, and P. Radeva, "Learning multi-subset of classes for fine-grained food recognition," in Proc. 7th Int. Workshop Multimedia Assist. Dietary Manage., 2022, pp. 17–26, doi: 10.1145/3552484.3555754.
[11] S. Jiang, W. Min, L. Liu, and Z. Luo, "Multi-scale multi-view deep feature aggregation for food recognition," IEEE Trans. Image Process., vol. 29, pp. 265–276, 2020.
[12] N. Martinel, G. L. Foresti, and C. Micheloni, "Wide-slice residual networks for food recognition," in Proc. IEEE Winter Conf. Appl. Comput. Vis., Lake Tahoe, NV, USA, 2018, pp. 567–576, doi: 10.1109/WACV.2018.00068.
[13] J. Zhao et al., "Deep-learning-based automatic evaluation of rice seed germination rate," J. Sci. Food Agriculture, vol. 103, no. 4, pp. 1912–1924, 2023.
[14] Z. Huang et al., "Fast location and segmentation of high-throughput damaged soybean seeds with invertible neural networks," J. Sci. Food Agriculture, vol. 102, no. 11, pp. 4854–4865, 2022.
[15] W. Min et al., "Vision-based fruit recognition via multi-scale attention CNN," Comput. Electron. Agriculture, vol. 210, 2023, Art. no. 107911.
[16] W. Shafik et al., "Using a novel convolutional neural network for plant pests detection and disease classification," J. Sci. Food Agriculture, vol. 103, no. 12, pp. 5849–5861, 2023.
[17] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. 9th Int. Conf. Learn. Representations, 2021.
[18] G. Sheng, S. Sun, C. Liu, and Y. Yang, "Food recognition via an efficient neural network with transformer grouping," Int. J. Intell. Syst., vol. 37, no. 12, pp. 11465–11481, 2022.
[19] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[20] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1314–1324.
[21] S. Mehta and M. Rastegari, "Separable self-attention for mobile vision transformers," 2022, arXiv:2206.02680.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[23] N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 122–138.
[24] S. Mehta, M. Rastegari, L. G. Shapiro, and H. Hajishirzi, "ESPNetV2: A light-weight, power efficient, and general purpose convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9190–9200.
[25] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., 2019, vol. 97, pp. 6105–6114.
[26] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[27] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 9992–10002.
[28] Y. Li et al., "EfficientFormer: Vision transformers at MobileNet speed," Adv. Neural Inf. Process. Syst., vol. 35, pp. 12934–12949, 2022.
[29] T. Huang, L. Huang, S. You, F. Wang, C. Qian, and C. Xu, "LightViT: Towards light-weight convolution-free vision transformers," 2022, arXiv:2207.05557.
[30] H. Cai et al., "EfficientViT: Lightweight multi-scale attention for high-resolution dense prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 17302–17313.
[31] J. Zhang et al., "MiniViT: Compressing vision transformers with weight multiplexing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12135–12144.
[32] K. Wu et al., "TinyViT: Fast pretraining distillation for small vision transformers," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 68–85.
[33] Y. Chen et al., "Mobile-Former: Bridging MobileNet and transformer," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5260–5269.
[34] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12165–12175.
[35] H. Wu et al., "CvT: Introducing convolutions to vision transformers," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 22–31.
[36] A. Srinivas, T. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, "Bottleneck transformers for visual recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 16519–16529.
[37] J. Li et al., "Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios," 2022, arXiv:2207.05501.
[38] J. Pan et al., "EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 294–311.
[39] S. Mehta and M. Rastegari, "MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer," in Proc. Int. Conf. Learn. Representations, 2022.
[40] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, "Food recognition using statistics of pairwise local features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2249–2256.
[41] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 446–461.
[42] W. Min, L. Liu, Z. Luo, and S. Jiang, "Ingredient-guided cascaded multi-attention network for food recognition," in Proc. ACM Int. Conf. Multimedia, 2019, pp. 1331–1339.
[43] W. Min et al., "Large scale visual food recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9932–9949, Aug. 2023.
[44] S. Horiguchi, S. Amano, M. Ogawa, and K. Aizawa, "Personalized classifier for food image recognition," IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2836–2848, Oct. 2018.
[45] H. Kagaya, K. Aizawa, and M. Ogawa, "Food detection and recognition using convolutional neural network," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 1085–1088.
[46] Y. Kawano and K. Yanai, "Real-time mobile food recognition system," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2013, pp. 1–7.
[47] Y. Kawano and K. Yanai, "FoodCam: A real-time food recognition system on a smartphone," Multimedia Tools Appl., vol. 74, no. 14, pp. 5263–5287, 2015.
[48] P. Pouladzadeh and S. Shirmohammadi, "Mobile multi-food recognition using deep learning," ACM Trans. Multimedia Comput., Commun., Appl., vol. 13, no. 3s, pp. 1–21, 2017.
[49] R. Z. Tan, X. Chew, and K. W. Khaw, "Neural architecture search for lightweight neural network in food recognition," Mathematics, vol. 9, no. 11, 2021, Art. no. 1245.
[50] F. Yu, V. Koltun, and T. A. Funkhouser, "Dilated residual networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 636–644.
[51] M. Klasson, C. Zhang, and H. Kjellström, "A hierarchical grocery store image dataset with visual and semantic labels," in Proc. IEEE Winter Conf. Appl. Comput. Vis., 2019, pp. 491–500.
[52] Y. Kawano and K. Yanai, "FoodCam-256: A large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 761–762.
[53] W. Min et al., "ISIA Food-500: A dataset for large-scale food recognition via stacked global-local attention network," in Proc. ACM Int. Conf. Multimedia, 2020, pp. 393–401.
[54] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Rev., vol. 60, no. 2, pp. 223–311, 2018.
[55] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[56] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[57] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More features from cheap operations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1577–1586.
[58] J. F. Yeh, K.-M. Lin, C.-Y. Lin, and J.-C. Kang, "Intelligent mango fruit grade classification using AlexNet-SPP with mask R-CNN-based segmentation algorithm," IEEE Trans. AgriFood Electron., vol. 1, no. 1, pp. 41–49, Jun. 2023.

Guorui Sheng received the M.E. degree in computer science from Kunsan National University, Gunsan, South Korea, in 2007, and the Ph.D. degree in computer application technology from Nankai University, Tianjin, China, in 2017. From 2017 to 2018, he was a Research Assistant to Scholar Bruce Denby with the School of Computer Science and Technology, Tianjin University. He is currently a Lecturer with the Department of Information and Electrical Engineering, Ludong University, Yantai, China. He has authored or co-authored more than 20 peer-referenced papers in relevant journals and conferences, including ACM Transactions on Multimedia Computing, Communications, and Applications and Nutrients. His research interests include computer vision, deep learning, and food computing.

Tao Yao received the Ph.D. degree in multimedia retrieval from the Dalian University of Technology, Dalian, China, in 2017. He is currently an Associate Professor with the Department of Information and Electrical Engineering, Ludong University, and also a Researcher with the Yantai Research Institute of New Generation Information Technology, Southwest Jiaotong University, Chengdu, China. He has authored or co-authored more than 30 peer-referenced papers in relevant journals and conferences, including IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE TRANSACTIONS ON CYBERNETICS, ACM Transactions on Multimedia Computing, Communications, and Applications, and Pattern Recognition. His research interests include multimedia retrieval, computer vision, and machine learning.

Jingru Song received the B.E. degree in software engineering from the College of Computer Science, Liaocheng University, Liaocheng, China, in 2022. She is currently working toward the M.E. degree in computer science and technology with the College of Information and Electrical Engineering, Ludong University, Yantai, China. Her research interests include multimedia processing, computer vision, and food computing.

Yancun Yang received the Ph.D. degree in management from Shandong University, Jinan, China, in 2008. He is currently a Lecturer with the Department of Information and Electrical Engineering, Ludong University, Yantai, China. He has authored or co-authored more than 10 peer-referenced papers in relevant journals and conferences, including ACM Transactions on Multimedia Computing, Communications, and Applications and Nutrients. His research interests include computer vision, deep learning, and food computing.

Lili Wang received the M.E. and Ph.D. degrees in electromagnetic field and microwave technology from the Electronic Engineering School, Beijing University of Posts and Telecommunications, Beijing, China, in 2006. She is currently a Professor with the School of Information and Electrical Engineering, Ludong University, Yantai, China. Her research interests include broadband communication and multimedia communication.

Weiqing Min (Senior Member, IEEE) received the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an Associate Professor with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences. He has authored or co-authored more than 50 peer-referenced papers in relevant journals and conferences, including Patterns (Cell Press), ACM Computing Surveys, Trends in Food Science and Technology, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON IMAGE PROCESSING, Food Chemistry, ACM MM, AAAI, and IJCAI. His research interests include multimedia content analysis and food computing. Mr. Min is a Senior Member of CCF. He was the recipient of the 2016 ACM Transactions on Multimedia Computing, Communications, and Applications Nicolas D. Georganas Best Paper Award and the 2017 IEEE Multimedia Magazine Best Paper Award. He was a Guest Editor for special issues of international journals, such as IEEE TRANSACTIONS ON MULTIMEDIA, IEEE MULTIMEDIA, and Foods.

Shuqiang Jiang (Senior Member, IEEE) received the Ph.D. degree in computer application technology from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2006. He is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China, and a Professor with the University of CAS. He is also with the Key Laboratory of Intelligent Information Processing, CAS. He has authored or co-authored more than 150 articles. He was supported by the National Science Fund for Distinguished Young Scholars in 2021, the NSFC Excellent Young Scientists Fund in 2013, and the Young Top-Notch Talent of Ten Thousand Talent Program in 2014. His research interests include multimedia analysis and multimodal intelligence. Mr. Jiang is a Senior Member of CCF and a Member of ACM. He was a TPC Member for more than 20 well-known conferences, including ACM Multimedia, CVPR, ICCV, IJCAI, AAAI, ICME, ICIP, and PCM. He was the recipient of the Lu Jiaxi Young Talent Award from CAS in 2012 and the CCF Award of Science and Technology in 2012. He is the Vice Chair of the IEEE CASS Beijing Chapter and the ACM SIGMM China Chapter. He was the General Chair of ICIMCS in 2015 and the Program Chair of the 2019 ACM Multimedia Asia and PCM in 2017. He is an Associate Editor of Multimedia Tools and Applications and ACM Transactions on Multimedia Computing, Communications, and Applications.