
IEEE TRANSACTIONS ON AGRIFOOD ELECTRONICS, VOL. 2, NO. 2, SEPTEMBER/OCTOBER 2024

Lightweight Food Image Recognition With Global Shuffle Convolution

Guorui Sheng, Weiqing Min, Senior Member, IEEE, Tao Yao, Jingru Song, Yancun Yang, Lili Wang, and Shuqiang Jiang, Senior Member, IEEE

Abstract—Consumer behaviors and habits in food choices impact their physical health and have implications for climate change and global warming. Efficient food image recognition can assist individuals in making more environmentally friendly and healthier dietary choices using end devices, such as smartphones. Simultaneously, it can enhance the efficiency of server-side training, thereby reducing carbon emissions. We propose a lightweight deep neural network named Global Shuffle Net (GSNet) that can efficiently recognize food images. In GSNet, we develop a novel convolution method called global shuffle convolution, which captures the dependence between long-range pixels. Merging global shuffle convolution with classic local convolution yields a framework that works as the backbone of GSNet. Because GSNet captures the dependence between long-range pixels at the start of the network, the number of layers in the middle and rear can be restricted, so the parameters and floating-point operations (FLOPs) are minimized without compromising performance, thus achieving the lightweight goal. Experimental results on four popular food recognition datasets demonstrate that our approach achieves state-of-the-art performance with higher accuracy and fewer FLOPs and parameters. For example, in comparison with the current state-of-the-art model MobileViTv2, GSNet achieves 87.9% top-1 accuracy on the Eidgenössische Technische Hochschule Zürich (ETHZ) Food-101 dataset with a 28% reduction in parameters and a 37% reduction in FLOPs, yet 0.7% higher accuracy.

Index Terms—Climate change and global warming, deep learning, food recognition, global shuffle convolution, lightweight, long-range dependence.

Manuscript received 30 December 2023; revised 20 February 2024; accepted 6 April 2024. Date of publication 2 May 2024; date of current version 10 October 2024. This article was recommended by Associate Editor C. Josephson. (Corresponding author: Yancun Yang.)

Guorui Sheng, Tao Yao, Jingru Song, Yancun Yang, and Lili Wang are with the Department of Information and Electrical Engineering, Ludong University, Yantai 264025, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Weiqing Min and Shuqiang Jiang are with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TAFE.2024.3386713

I. INTRODUCTION

CLIMATE change and global warming have exhibited an alarming escalation in recent years, prompting growing awareness of the impact of dietary choices on the environment among the global population of 7.7 billion people [1], [2]. Increasing numbers of consumers recognize that adopting eco-friendly and sustainable food options can contribute significantly to mitigating these issues at an individual level. Moreover, such choices drive producers and supply chains to embrace more environmentally friendly practices. For instance, reducing meat consumption can lower the greenhouse gas emissions associated with livestock farming, while prioritizing local and seasonal foods helps minimize carbon emissions during transportation. Efficient food image recognition plays a pivotal role as the initial step in empowering individuals to make such sustainable choices. Accurate dietary recommendations derived from this recognition not only assist consumers in selecting environmentally friendly foods but also aid in choosing those that promote personal health. This capability can readily be harnessed through the smartphones that people carry with them daily. However, given the constraints in power consumption and memory of such end devices, it is imperative to optimize the neural network utilized for food recognition.

Food image recognition occupies a pivotal position within the rapidly evolving interdisciplinary realm of food computing [3], playing an indispensable role across various domains, such as dietary analysis, healthcare, and the food industry [4], [5], [6], [7], [8], [9]. The proliferation of diverse cuisines and culinary techniques has led to a surge in food image datasets, posing challenges for sustainable expansion of server-side food image recognition. Moreover, the substantial carbon footprint resulting from large-scale training of artificial intelligence on server infrastructure has emerged as a pressing concern. Furthermore, food image recognition entails intricate fine-grained analysis, offering valuable insights for refining similar models in the domain of fine-grained recognition [10]. Despite the widespread adoption of deep learning methods in current approaches, characterized by their high parameter count and extensive training and inference durations [11], [12], this article focuses on developing lightweight deep neural network models tailored specifically for food image recognition.

The rapid integration of artificial intelligence, particularly deep learning, has permeated various sectors, including food and agriculture [13], [14], [15], [16], [58]. However, research on lightweight approaches for food image recognition remains relatively sparse. Early endeavors primarily relied on lightweight convolutional neural network (CNN)-based methods for food image analysis. However, the inherent challenge lay in extracting long-range information from images due to the dispersed nature of ingredients. As illustrated in Fig. 1, the discriminative factors in food identification often lie within the scattered arrangement of ingredients, compounded by variations in size, shape, and

distribution arising from different cooking methods. Capturing these long-range relationships amidst scattered food images is crucial for accurate dish recognition.

Fig. 1. Some samples from ETHZ Food-101 [41] and Vireo Food-172 [51]. Ingredients are scattered throughout the food image.

While the vision transformer (ViT) excels in capturing global information by leveraging attention mechanisms, its computational demands and training complexity pose significant hurdles [17]. To reconcile this, efforts, such as those by Sheng et al. [18], have attempted to amalgamate ViT's global representation capabilities with CNN's local feature extraction prowess. Nonetheless, the resultant models still entail considerable parameter counts and computational overheads.

The challenges in lightweight food image recognition are twofold. First, the scattered distribution of ingredients necessitates a nuanced understanding of long-range pixel correlations crucial for accurate recognition. However, conventional CNN architectures excel at capturing local features, requiring increasingly complex networks to model distant pixel relationships, thus contravening lightweight design principles. Second, while ViT offers a promising avenue for extracting long-range correlations, the quadratic increase in token interactions necessitates extensive computational resources and data for training, making adherence to lightweight constraints challenging.

Our work addresses key challenges in lightweight food recognition, namely, the limited expression of long-range information by CNNs and the complexity of training ViT models. We employ global shuffle convolution to capture the long-range information of dispersed food ingredients within food images, facilitating comprehensive global expression alongside local convolution. This parallel block serves as the foundational structure of Global Shuffle Net (GSNet), markedly enhancing food image recognition accuracy. In addition, recognizing GSNet's emphasis on extracting long-range features in the early stages, we significantly reduce the number of network layers in the intermediate and posterior sections to minimize parameter count and computational complexity. We design GSNet and conduct extensive experiments across various prominent food image databases, demonstrating superior recognition performance compared with existing CNN-based, ViT-based, and hybrid lightweight networks. As illustrated in Fig. 2, GSNet surpasses several widely used lightweight CNN and ViT models renowned for their state-of-the-art (SOTA) performance, such as MobileNetV2 [19], MobileNetV3 [20], and MobileViTv2 [21]. Notably, GSNet achieves 88.4% top-1 accuracy with only 3.1 M parameters, significantly outperforming MobileNetV3, which reaches 86.2% with more parameters (4.3 M).

We summarize our contributions as follows.

1) We design a simple, effective, and easy-to-implement pure convolutional model that captures the dependencies between remote pixels on the food image plane to effectively handle the dispersed distribution of ingredients in food images. Simultaneously extracting short-range and long-range features through a parallel structure effectively improves the accuracy of food image recognition.

2) Based on the fact that the model is dedicated to capturing dependence between long-range pixels at the front of the network, we redesign a new lightweight neural network that adapts to this feature and effectively reduces the number of parameters and calculations.

3) We conduct extensive and comprehensive experiments on four major food image datasets, and the results indicate that our approach achieves SOTA performance with higher accuracy and fewer floating-point operations (FLOPs) and parameters, outperforming SOTA CNN-based, ViT-based, and hybrid lightweight models.

II. RELATED WORKS

A. Lightweight CNNs, ViTs, and Hybrid Models

ResNet [22] is one of the most successful CNN architectures. However, the best-performing CNN models are usually high in parameters and FLOPs. Lightweight CNNs that achieve competitive performance with fewer parameters and FLOPs include ShuffleNetV2 [23], ESPNetV2 [24], EfficientNet [25], MobileNetV2 [19], and MobileNetV3 [20]. MobileNetV3 [20] belongs to the category of models developed specifically for resource-constrained environments, such as mobile devices. The basic blocks of MobileNetV3 [20] include the MobileNetV2 [19] block and the squeeze-and-excite network [26]. The common problem of CNN-based lightweight models is their weak ability to extract global information.

In order to extract global information more efficiently, ViT brings transformer models from natural language processing tasks to the vision domain, especially image recognition. The extensive use of ViT in the field of machine vision has also attracted research on lightweight variants. Most efforts have been focused on improving the self-attention process to increase efficiency, such as SwinT [27], EfficientFormer [28], LightViT [29], EfficientViT [30], MiniViT [31], and TinyViT [32].

The common problems of ViT-based lightweight models are the difficulty of training and the high computational cost due to the quadratic number of interactions between tokens. Recently, some researchers have tried to construct compact hybrid models that integrate CNN and ViT for mobile vision tasks, showing that combining convolution and transformer improves both prediction accuracy and training stability. Subsequently, there has been a large amount of lightweight work on these models, such as MobileFormer [33], CMT [34], CvT [35], BoTNet [36], Next-ViT [37], EdgeViTs [38], MobileViTv1 [39], and MobileViTv2 [21]. Hybrid lightweight models based on CNN and ViT fuse global and local information well, but the problem of large model size remains.

Fig. 2. Comparison with SOTA CNN-based (MobileNetV2 [19] & V3 [20]) and hybrid (MobileViTv2 [21]) lightweight models across different datasets. (a) ETHZ Food-101 [41]. (b) Vireo Food-172 [51]. (c) UEC Food-256 [52].

Fig. 3. GSNet. Here, Conv n × n in the GSNet represents a standard n × n convolution. In the global shuffle convolution block, to illustrate the implementation process, assume that both H and W are 6.

B. Lightweight Food Recognition

Recently, Min et al. [3] gave a survey on food computing including food recognition. In earlier years, various handcrafted features were utilized for recognition [40], [41]. For example, Bossard et al. [41] utilized random forests to mine discriminative image patches as a visual representation. With the rise of deep learning technology, many recognition methods based on deep learning have emerged [11], [12], [42], [43], [44], [45].

Given the necessity of lightweight food image recognition, a lot of related research work has been proposed. Early researchers used lightweight CNN methods for food image recognition [46], [47], [48], [49]. Tan et al. [49] recently proposed a novel lightweight neural architecture search (LNAS) model to self-generate a thin CNN that can be executed on mobile devices, achieving nearly 76% recognition accuracy on the Eidgenössische Technische Hochschule Zürich (ETHZ) Food-101 dataset. The recognition accuracy of these CNN-based lightweight food recognition methods is generally low. ViT provides a new option for extracting global features of food images; Sheng et al. [18] tried to extract global and local features with a parallel structure composed of a ViT group and a CNN and obtained SOTA performance. However, due to the multihead attention mechanism of the ViT, the model size is still large.

In contrast to the works that use ViT, we propose a simple yet effective pure convolution network, which is based upon the characteristics of food images and allows for better control over parameters and calculations. In this architecture, a global shuffle convolution is utilized to identify global features and a parallel network structure along with CNN is fashioned to draw out local features, resulting in SOTA performance.

III. METHOD

A. Brief Review of GSNet

Our objective is to propose a network model that can not only effectively deal with the dispersion and diversity of food image features, but also realize a lightweight design so that it can be better extended on the server side and deployed on edge and end devices.

The proposed GSNet is shown in Fig. 3. We use global shuffle convolution to capture the long-range information of food ingredients scattered in food images to enhance the model's expressiveness, and then form a parallel block with local convolution. This parallel block is used as the basic structure of GSNet,
which effectively improves the food image recognition accuracy. Based on the fact that GSNet focuses on capturing long-range dependence among different spatial pixels in the front part of the model, we reduce the number of network layers in the middle and rear parts, and correspondingly obtain an effective reduction of the number of parameters and FLOPs. The experimental results show that this strategy can effectively reduce the number of parameters and FLOPs on the premise of ensuring recognition accuracy.

Fig. 4. (a) Local convolution. (b) Dilated convolution. (c) Global shuffle convolution.

B. Global Shuffle Convolution

The global shuffle convolution method first divides the image into several patches, and then, in each convolution operation, pixels at corresponding positions are taken out from each patch to participate in the convolution. Since the patches cover the entire image, this convolution operation extracts scattered correlation information. No matter how far apart the same ingredient lies in the dish, it can be captured by the global shuffle convolution operation. As shown in Figs. 3 and 4(c), by first resetting the rows and then resetting the columns, distant pixels are concentrated into 2 × 2 clusters, and then a normal convolution with a 3 × 3 kernel is performed, so that not only are the four elements of one such cluster involved in the correlation calculation, but also five more elements from a greater distance. Through this, the correlation information between long-range pixels is quickly obtained. Here, the convolution kernel size is set to 3 × 3 and the stride is set to 1. In Fig. 4(c), the middle image is the intermediate result after the rows and columns of the bottom image are reset, and the spatial variation law can be seen through pixels of the same color. The top plane is the result of a 3 × 3 convolution of the middle plane.

Compared with global shuffle convolution, local convolution [see Fig. 4(a)] extracts the local correlation in the image through a convolution operation on a local area, and then translates it with a certain step size and performs the convolution operation multiple times to achieve full coverage of local information. It can build deeper and more nonlinear networks but ignores the correlation between pixel vectors in the global scope, leading to information loss compared with the fully connected model. Dilated convolution is a variant of local convolution that expands the receptive field compared with local convolution. As shown in Fig. 4(b), under the same kernel size, dilated convolution skips some pixel positions to perform convolution operations, thus having a larger local field of view [50]. Dilated convolution can express a broader range of local correlations, but because it ignores certain pixels, some information is lost. On the other hand, although dilated convolution can extract long-range related information at different distances by adjusting the dilation rate, ultra-long-distance related information needs to be obtained by stacking more layers, so it is not as efficient as global shuffle convolution at extracting global features. In summary, global shuffle convolution is more appropriate for food image recognition because of its excellent ability to capture comprehensive long-range correlation information.
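To make the shuffle operation concrete, the following is a minimal sketch of the shuffle-then-convolve idea applied to the 6 × 6 example of Fig. 3. It is our own illustration rather than the authors' released code: the class name, the depth-wise convolution choice, and the default group counts are assumptions made only for this example.

```python
# A minimal sketch (not the authors' released code) of global shuffle convolution:
# split each spatial axis into `groups` patches, shuffle so that pixels at
# corresponding positions of every patch become adjacent, then run an ordinary
# small convolution on the shuffled plane.
import torch
import torch.nn as nn


class GlobalShuffleConv2d(nn.Module):
    """Hypothetical implementation: spatial shuffle followed by a local conv."""

    def __init__(self, channels, kernel_size=3, groups_h=2, groups_w=2):
        super().__init__()
        self.gh, self.gw = groups_h, groups_w
        # A depth-wise conv keeps the sketch cheap; the paper's block may differ.
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)

    def shuffle(self, x):
        n, c, h, w = x.shape
        sh, sw = h // self.gh, w // self.gw          # patch (group) sizes
        # (gh, sh) -> (sh, gh): corresponding rows of each patch become adjacent,
        # and likewise for columns.
        x = x.view(n, c, self.gh, sh, self.gw, sw)
        x = x.permute(0, 1, 3, 2, 5, 4).contiguous()
        return x.view(n, c, h, w)

    def forward(self, x):
        return self.conv(self.shuffle(x))


if __name__ == "__main__":
    x = torch.arange(36.0).view(1, 1, 6, 6)          # the 6 x 6 toy plane of Fig. 3
    m = GlobalShuffleConv2d(1)
    y = m.shuffle(x)[0, 0]
    # Originally distant pixels (2,2), (2,5), (5,2), (5,5) now sit in one 2 x 2 block.
    print(y[4:6, 4:6])
```

Running the toy example confirms that the four pixels (2, 2), (2, 5), (5, 2), and (5, 5) mentioned in Section III-D end up in a single 2 × 2 cluster, so a subsequent 3 × 3 local convolution sees them together.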

C. Approach

Suppose the input size of a convolution layer is $[N, C^{in}, IH^{in}, IW^{in}]$, the output size is $[N, C^{out}, IH^{out}, IW^{out}]$, and the convolution kernel size is $[C^{out}, C^{in}, K_H, K_W]$, where $N$ denotes the batch size, $C$ denotes the number of channels, and $IH$ and $IW$ denote the height and width of the input images, respectively. During local convolution, the value of a specific feature point $X^{(t+1)}$ of a specific channel of the output feature map with input $X^{(t)}$ is calculated as

$$X^{(t+1)}_{N_i,\,C^{out}_j,\,IH^{out}_k,\,IW^{out}_l} = B_{C^{out}_j} + \sum_{h}\sum_{w}\sum_{0 \le c < C^{in}} W_{C^{out}_j,\,c,\,h,\,w}\, X^{(t)}_{N_i,\,c,\,h,\,w} \tag{1}$$

where $B$ denotes the bias parameters with size $C^{out}$, $W$ denotes the weight parameters with size $[C^{out}, C^{in}, K_H, K_W]$, $h \in [IH^{out}_k,\, IH^{out}_k + K_H - 1]$, and $w \in [IW^{out}_l,\, IW^{out}_l + K_W - 1]$.

The complete formula of the global shuffle convolution is more complicated. For simplicity, in an image plane, let the number of groups in the row direction be equal to $K_H$, the number of groups in the column direction be $K_W$, and $H^{in}/K_H = H^{out}$, $W^{in}/K_W = W^{out}$; that is, the number of groups is consistent with the kernel size, the size of each group is the same as the output plane, and the stride of the convolution is taken as the group size. Then, the value of a specified feature point $Y$ of the output feature map with input $X$ is calculated as in (1) but with

$$h \in \left\{ IH^{out}_k + m \cdot IH^{out} \;\middle|\; m = 0, \ldots, K_H - 1 \right\}, \qquad w \in \left\{ IW^{out}_l + m \cdot IW^{out} \;\middle|\; m = 0, \ldots, K_W - 1 \right\}. \tag{2}$$

Then, fold the 4-D tensor $X^{(t)}$ into a 2-D matrix, combine the bias parameters into the weight parameters, and add a constant row to $X^{(t)}$; the output of the local convolutional network is

$$f(X^{(0)}) = f^{(T-1)}\big(\cdots f^{(1)}\big(f^{(0)}(X^{(0)} W^{(0)})\, W^{(1)}\big) \cdots W^{(T-1)}\big) \tag{3}$$

where $X^{(t)}$ $(t = 0, \ldots, T-1)$ is the 2-D matrix of the input or of the output of a layer, $T$ is the number of layers, $W^{(t)}$ is the parameter matrix of each layer, and $f^{(t)}$ is the nonlinear activation function used by each layer. When nonlinear activation functions are not used, $f(X^{(0)}) = X^{(0)} W^{(0)} W^{(1)} \cdots W^{(T-1)}$.


For global shuffle convolution, the output of the network is

$$f(X^{(0)}) = f^{(T-1)}\big(\cdots f^{(1)}\big(f^{(0)}(X^{(0)} M^{(0)} W_g^{(0)})\, M^{(1)} W_g^{(1)}\big) \cdots M^{(T-1)} W_g^{(T-1)}\big) \tag{4}$$

where $M^{(t)}$ is a linear transformation matrix and $W_g^{(t)}$ is the parameter matrix. Likewise, when not using a nonlinear activation function,

$$f(X^{(0)}) = X^{(0)} M^{(0)} W_g^{(0)} M^{(1)} W_g^{(1)} \cdots M^{(T-1)} W_g^{(T-1)}. \tag{5}$$

In linear mode, the difference between global shuffle convolution and local convolution can be understood from two perspectives: 1) in the global shuffle convolutional net, during the operation of each layer, the input matrix is first column-transformed, i.e., $X^{(t)} \cdot M^{(t)}$, and then multiplied with the parameter matrix $W_g^{(t)}$; 2) the parameter matrix of the global shuffle convolution corresponds to the parameter matrix of the local convolution, namely

$$M^{(t)} \cdot W_g^{(t)} \leftrightarrow W^{(t)} \tag{6}$$

that is, in linear mode, global shuffle convolution and local convolution are equivalent, only their parameter positions are adjusted. However, neural networks are nonlinear, so only perspective 1) still holds, i.e., global shuffle convolution operates on a series of images whose plane pixels are misaligned (the misalignment pattern is fixed). Using only global shuffle convolutions in the network is generally ineffective unless the dislocation results in clustered color patches similar to normal images.
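The linear-mode equivalence in (6) can be checked numerically; the tiny example below is our own sanity check (not from the paper) that treats the fixed misalignment as a column-permutation matrix M.

```python
# Sanity check of the linear-mode claim around (6): shuffling the input columns
# with a permutation matrix M and then applying W_g is the same linear map as
# applying the merged matrix W = M @ W_g directly (illustration only).
import numpy as np

rng = np.random.default_rng(0)
n, d, d_out = 4, 36, 8                       # e.g., 36 = flattened 6 x 6 plane
X = rng.standard_normal((n, d))
M = np.eye(d)[rng.permutation(d)]            # fixed column shuffle (misalignment)
W_g = rng.standard_normal((d, d_out))        # global shuffle conv parameters
W = M @ W_g                                  # the equivalent local-conv parameters

print(np.allclose((X @ M) @ W_g, X @ W))     # True: equivalent in linear mode
```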
In our work, the parallel network structure uses both global shuffle convolution and local convolution: the local convolution represents most features of the image, and the global shuffle convolution assists in collecting food information scattered around the image, which ultimately improves the accuracy of recognition.

D. Implementation Details

The essence of the global shuffle convolution method is to calculate the correlation between several pixels at any distance on the image plane, that is, to convolve several pixel vectors selected at different positions in the entire image, which is equivalent to adjusting these pixel vectors into a local region and then performing local convolution on this region.

When implementing the global shuffle convolution calculation, our actual practice is to first perform the relocation in the column direction of the image plane, then perform the rearrangement in the row direction, and finally perform the local convolution. As shown in Fig. 3, after the rearrangement in the column and row directions, the pixels (2, 2), (2, 5), (5, 2), and (5, 5) are adjusted to be adjacent to each other; these pixels come from scattered positions of the entire image plane, and local convolution on them is equivalent to global shuffle convolution.

This implementation achieves certain flexibility: the number of groups does not have to be the same as the size of the convolution kernel, and the stride of the convolution does not have to be the same as the size of the group, so that correlations between more complex plane pixel vectors at different positions can be represented.

Fig. 5. Hierarchical network layout. (a) Hierarchical layout of traditional neural networks. (b) By drastically reducing the number of layers in the back of the network, the hierarchical network layout adopted by GSNet effectively reduces the number of parameters and computation.

E. Network Architecture

This section introduces the basic parallel block, the hierarchical network layout, and the detailed network architecture of GSNet.

Parallel block: The basic block used in our network is parallel: one branch uses local convolution and the other uses global shuffle convolution, and the outputs of the two branches are concatenated and then propagated along the neural network. The local convolution branch is the inverted residual block derived from MobileNetV2; the other branch replaces its depth-wise convolution part with our proposed global shuffle convolution. In the parallel block, the local convolution branch is responsible for extracting local features at the pixel level, and the global shuffle convolution branch is responsible for capturing the long-range dependence between pixels in different spatial locations. The local convolution branch is the main bearer, since most of the image features are revealed through local correlations. The global shuffle convolution branch provides correlation features between pixels from the entire image plane and is used to add long-range features that improve the expressiveness of image features.
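A minimal sketch of how such a two-branch block could be wired is given below. It is again our own illustration, not the authors' code: the expansion ratio, the 50/50 channel split, and the use of BatchNorm and HardSwish inside the branches are assumptions, and GlobalShuffleConv2d refers to the sketch shown earlier.

```python
# Minimal sketch of the parallel block described above (our illustration): an
# inverted-residual local branch in parallel with a branch whose depth-wise
# convolution is replaced by global shuffle convolution, outputs concatenated.
# Assumes GlobalShuffleConv2d from the earlier sketch is in scope.
import torch
import torch.nn as nn


def inverted_residual_branch(c_in, c_out, expand=4, spatial_conv=None):
    """1x1 expand -> (depth-wise or global-shuffle) 3x3 -> 1x1 project."""
    hidden = c_in * expand
    mid = spatial_conv if spatial_conv is not None else nn.Conv2d(
        hidden, hidden, 3, padding=1, groups=hidden)   # depth-wise by default
    return nn.Sequential(
        nn.Conv2d(c_in, hidden, 1), nn.BatchNorm2d(hidden), nn.Hardswish(),
        mid, nn.BatchNorm2d(hidden), nn.Hardswish(),
        nn.Conv2d(hidden, c_out, 1), nn.BatchNorm2d(c_out),
    )


class ParallelBlock(nn.Module):
    def __init__(self, c_in, c_out, expand=4, groups_h=2, groups_w=2):
        super().__init__()
        hidden = c_in * expand
        self.local = inverted_residual_branch(c_in, c_out // 2, expand)
        self.global_shuffle = inverted_residual_branch(
            c_in, c_out - c_out // 2, expand,
            spatial_conv=GlobalShuffleConv2d(hidden, 3, groups_h, groups_w))
        self.use_residual = c_in == c_out

    def forward(self, x):
        y = torch.cat([self.local(x), self.global_shuffle(x)], dim=1)
        return x + y if self.use_residual else y


# Example: a 32-channel block on a 64 x 64 feature map.
block = ParallelBlock(32, 32)
print(block(torch.randn(2, 32, 64, 64)).shape)  # torch.Size([2, 32, 64, 64])
```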
Adjusted network layout: As shown in Fig. 5, we use a network structure that differs from traditional models. In a traditional network, since local convolution can only represent the local correlation of the image, long-range features are obtained only after many layers of local convolution. The characteristic of this network structure is that there are fewer layers in the front part of the network and more layers in the back part. Due to the large number of channels in the middle and rear, the network is heavily parameterized. By using the parallel block structure, GSNet obtains the long-range correlation information at the beginning of the network without relying on the shrinking part
in the rear of the network with more layers. Therefore, in this work, we reduce the number of layers in the back of the network drastically, effectively reduce the number of parameters and computation, and thereby develop a lightweight food recognition network. The following experimental results show that this strategy is effective, reducing the number of parameters and computations while achieving higher accuracy.

TABLE I: NETWORK SPECIFICATION

Network specification: The detailed network specification is given in Table I. The network first obtains an image plane through a local convolution and then passes through a series of parallel block groups. In each parallel block group, the group number is set according to the size of the current image resolution. At the tail of the network, the number of channels is expanded by convolution, then global pooling and dropout are performed to obtain and adjust the single-pixel output, and finally, a fully connected layer is used to map to the number of classes.

IV. EXPERIMENTS

A. Datasets

To evaluate the proposed model, we conduct experiments on four food datasets: ETHZ Food-101 [41], Vireo Food-172 [51], UEC Food-256 [52], and ISIA Food-500 [53]. ETHZ Food-101 has 101 categories; we use 75 750 images for training and 25 250 for validation. Vireo Food-172 provides 172 categories; we use 66 071 images for training and 44 170 images for validation. UEC Food-256 has 256 categories, where 22 095 images are used for training and 9300 images are used for validation. ISIA Food-500 is a comprehensive food dataset composed of 500 food types from Wikipedia; we use 239 378 images for training and 120 142 images for validation.

B. Training Settings

We train our models using an input image resolution of 256×256, a batch size of 256, and the SGD optimizer with 0.9 momentum [54]. We use an initial learning rate of 0.1 for the first 3000 iterations of linear warm-up and then a cosine schedule with the learning rate ranging from 0.0004 to 0.8. Furthermore, we use the same data augmentation method as MobileViTv2 for image preprocessing.
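As a rough illustration of this schedule (our reading of the text; the exact pairing of the 0.1 warm-up value with the 0.0004–0.8 cosine range is not fully specified, so the numbers below are indicative only):

```python
# Hedged sketch of the warm-up + cosine learning-rate schedule described above.
import math
import torch

def lr_at(step, total_steps, warmup_steps=3000, warmup_start=0.1,
          lr_max=0.8, lr_min=0.0004):
    """Linear warm-up to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return warmup_start + (lr_max - warmup_start) * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

model = torch.nn.Linear(10, 101)                 # stand-in for GSNet
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
total_steps = 100_000                            # illustrative value
for step in range(total_steps):
    for group in opt.param_groups:
        group["lr"] = lr_at(step, total_steps)
    # ... forward / backward / opt.step() on a batch of 256 images at 256x256 ...
    break  # loop truncated in this sketch
```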
C. Experiment Results

TABLE II: PERFORMANCE COMPARISON ON ETHZ FOOD-101 [41]

Results on ETHZ Food-101: Table II presents results on ETHZ Food-101. The results are grouped according to similar numbers of parameters. Our model surpasses all other models in all three parameter ranges. Among all models with around 1 M parameters, our model achieves 87.0% top-1 accuracy, which is 0.1%, 4.6%, and 4.6% higher than MobileViTv2, MobileNetV3, and MobileNetV2, respectively. Among models in the roughly 2–3 M parameter budget, our model's top-1 accuracy is 87.9%, which is 0.7% higher than MobileViTv2 and 2.4% higher than MobileNetV3 and MobileNetV2. Our model also achieves the highest top-1 accuracy of 88.4% in the parameter range of 3–5 M, surpassing MobileViTv2, MobileNetV3, and MobileNetV2 by 0.8%, 2.2%, and 1.9%, respectively. We also compare with recent lightweight
food recognition networks; the results show that the recognition accuracy of our network (87.0%) is much higher than that of LNAS-NET (75.9%) and LTBDNN (TD-192) (76.8%), with far fewer parameters.

TABLE III: PERFORMANCE COMPARISON ON VIREO FOOD-172 [51]

TABLE IV: PERFORMANCE COMPARISON ON UEC FOOD-256 [52]

TABLE V: PERFORMANCE COMPARISON ON ISIA FOOD-500 [53]

Results on Vireo Food-172: Table III presents results on Vireo Food-172. Compared with MobileViTv2 in every parameter range, our model achieves better top-1 accuracy of 87.8% versus 87.3%, 89.1% versus 88.0%, and 89.3% versus 88.2%, with much lower FLOPs of 295 M versus 480 M, 665 M versus 1052 M, and 1051 M versus 1843 M. Although MobileNetV3 and MobileNetV2 have much lower FLOPs, they lag behind our models in accuracy by a margin of more than 2%.

Results on UEC Food-256: As seen in Table IV, the results are similar to the other two datasets. Our models achieve the highest top-1 accuracy in every parameter range. Compared with MobileViTv2, our model has fewer parameters and FLOPs. Compared with MobileNetV3 and MobileNetV2, our model achieves much higher top-1 accuracy with fewer parameters but slightly more FLOPs.

Results on ISIA Food-500: Table V presents experimental results on the ISIA Food-500 dataset. Because of its wide range, large scale, and coverage of both Chinese and western food, food recognition on Food-500 is harder. Even so, our proposed GSNet still achieves competitive results: compared with the SOTA ViT-based lightweight network MobileViTv2, the FLOPs are greatly reduced at almost the same recognition rate. Compared with the SOTA CNN-based lightweight networks MobileNetV2 and V3, our model has significantly better performance with similar parameters: GSNet-1.5/-2.0 obtain 64.3%/64.9% top-1 accuracy, which is 1.6%/1.1% higher than MobileNetV2/V3 (62.7%/63.8%) with a similar number of parameters.

The experimental results demonstrate the effectiveness and the generalization of our design. With the proposed parallel block, although we reduce the number of layers in the middle and rear parts of the network, the proposed network provides reasonable accuracy gains over the general network architecture. Considering that experiments on four different food datasets give consistent results, the proposed model should be effective and efficient for general food vision tasks.

Comparison and analysis with results based on lightweight networks using ViT: Experimental results reveal that, compared with the SOTA lightweight model based on ViT, MobileViTv2 [21], GSNet achieves comparable or superior recognition accuracy while requiring fewer parameters and significantly less computational load. We believe this is based on the following reasons. ViT possesses powerful capabilities for extracting global information. However, the common challenges of ViT-based lightweight models include the difficulty of training and the high computational cost stemming from the quadratic number of interactions between tokens. When modeling the global context, ViT also incorporates positional information of patches, further increasing parameter quantity and computational load. A key differentiating feature of food images lies in the correlated characteristics among the same type of ingredients dispersed throughout the image. The distant correlations among the dispersed identical ingredients do not require consideration of specific patch positional information. Our designed GSNet is
precisely tailored to exploit this characteristic of food images. Consequently, with parameter counts and computational loads at the level of CNNs, it achieves recognition performance that matches or exceeds that of ViT models.

TABLE VI: PERFORMANCE COMPARISON ON IMAGENET

Results on ImageNet: Table VI presents results on ImageNet-1K. The results are grouped into CNN-based methods and ViT-based methods, all with similar numbers of parameters. In the comparison with the CNN-based lightweight methods, it can be found that GSNet has the same accuracy as the recently released MobileNetV3 when the number of parameters is roughly the same, but the FLOPs of GSNet are higher, which is due to the parallel structure including global shuffle convolution. Compared with ViT-based methods, taking MobileViTv2 as an example, our method has lower accuracy (75.3% versus 78.1%) but also lower FLOPs (1054 M versus 1800 M). Overall, the performance of our method on ImageNet is comparable to the SOTA CNN-based method, worse than the SOTA ViT-based model, and not as strong as the experimental results on the food datasets. We believe this is because global shuffle convolution is more specific to the dispersed distribution of ingredients in food images, since it can effectively extract correlated features between long-range pixels.

D. Qualitative Analysis and Visualization

Different from the image recognition mechanism of traditional local convolution, a network including the global shuffle convolution tends to collect similar color patch information globally in the image plane. Fig. 6 shows the comparison generated with Grad-CAM [55]: results are obtained using only local convolution and using both global shuffle convolution and local convolution. In Fig. 6, the first row is the original image, the second row shows heat maps generated by using only local convolution, and the third row shows heat maps generated by using local and global shuffle convolution. The following can be seen from Fig. 6.

1) Using only local convolution tends to identify locally clustered patches, which can be well focused when they appear in food images. When the background is relatively monotonous and contains similar color blocks, the local convolution will also focus on the background incorrectly and cause recognition failure.

2) Local and global shuffle convolution tends to collect similar color patches globally, and its focal area tends to be wider than local convolution, covering multiple color patches at the same time.

3) Both models are affected if there are distinct color blocks in the background, but the local and global shuffle model is significantly less affected.

In summary, the above results show that the local and global shuffle model is better suited to the scattered-color features of food images and can achieve better recognition results.

Fig. 7 illustrates cases of misrecognition by our method on the ETHZ Food-101 and Vireo Food-172 datasets. Based on the visual results reflected in the heatmaps, we analyze the reasons for recognition failures as follows. Whether employing local convolution or global convolution, both tend to extract features from prominent color blocks present in the image. Global convolution, however, can gather information on the correlation among dispersed but related color blocks in the image, thereby generating global features. Nevertheless, a characteristic of convolutional operations is their susceptibility to being drawn towards color blocks with strong color consistency, making them prone to being misled by the background and failing to focus on the target object. While global convolution may mitigate this issue to some extent by collecting information on the correlation among related color blocks globally, the impact is more significant when using local convolution alone, leading to a higher likelihood of recognition failures.
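For readers who want to reproduce this kind of visualization, a compact Grad-CAM routine in the spirit of [55] is sketched below. It is our generic illustration rather than the authors' visualization script; `model` and `target_layer` stand for any PyTorch classifier (e.g., a trained GSNet) and one of its convolutional layers.

```python
# Minimal Grad-CAM sketch (standard procedure from [55], our own illustration).
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    try:
        logits = model(image)                        # image: [1, 3, H, W]
        if class_idx is None:
            class_idx = int(logits.argmax(dim=1))
        model.zero_grad()
        logits[0, class_idx].backward()
        a, g = feats["a"], grads["a"]                # activations / gradients [1, C, h, w]
        weights = g.mean(dim=(2, 3), keepdim=True)   # global-average-pooled gradients
        cam = F.relu((weights * a).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0]                             # heat map in [0, 1]
    finally:
        h1.remove(); h2.remove()
```

Usage would look like `heatmap = grad_cam(model, img, model.stages[-1])`, where the layer attribute name is hypothetical and depends on how the network is defined.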
Fig. 6. Visualization of experimental results comparison. (a) Samples from ETHZ Food-101. (b) Samples from Vireo Food-172. The left four columns are cases where both local convolution and local+global shuffle convolution identify the dish correctly; the right six columns are cases where local convolution fails but local+global shuffle convolution succeeds. The first row is the original image, the second row shows the heat maps generated by using only local convolution, and the third row shows the heat maps generated by using local+global shuffle convolution.

Fig. 7. Visualization of recognition failure cases. (a) Cases of recognition failure from ETHZ Food-101. (b) Cases of recognition failure from Vireo Food-172. The first row is the original image, the second row shows the heat maps generated by using only local convolution, and the third row shows the heat maps generated by using local+global shuffle convolution.

TABLE VII: ABLATION STUDY

E. Ablation Study

In this section, we ablate important design elements in the proposed model using image classification on four datasets.

Effectiveness of global shuffle convolution: Ablations of the global shuffle convolution effect on four datasets are reported in Table VII. The models with global shuffle convolution blocks obtain higher top-1 accuracy: 69.6% (Food-256), 87.0% (Food-101), and 87.8% (Food-172), compared with models without global shuffle convolution blocks: 69.1% (Food-256), 85.5% (Food-101), and 86.5% (Food-172). This indicates that the global shuffle convolution block is effective in improving the models' accuracy by gathering long-range features. We also exclude local convolution blocks and train models with only global shuffle convolution blocks. Surprisingly, they achieve top-1 accuracies of 56.0% (Food-256), 73.7% (Food-101), and 76.8% (Food-172). The results confirm that global shuffle convolution can indeed extract fairly discriminative features for food images.

Activation function: Compared with traditional networks, we make a significant reduction in parameters and computation by the strategy of reducing the number of layers. Considering that a more radical activation function could be effective in expanding the search domain for the simpler model architecture, we use HardSwish as the activation function of all nonlinear layers. Here, we compare the effectiveness of two typical activation functions, HardSwish and the rectified linear unit (ReLU). Compared with ReLU, whose gradient values are 0 and 1, HardSwish features a steep
curve and a wider range of gradient values, from −1/2 to 3/2. As given in Table VII, the models using HardSwish achieve higher top-1 accuracy: 69.6% (Food-256), 87.0% (Food-101), and 87.8% (Food-172), compared with models using ReLU: 68.9% (Food-256), 86.8% (Food-101), and 87.4% (Food-172). The results show that the HardSwish activation function helps to find better solutions.
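For reference (our addition; the paper does not spell out the formula), the HardSwish nonlinearity introduced with MobileNetV3 [20] and its gradient can be written as

$$\mathrm{HardSwish}(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6} = \begin{cases} 0, & x \le -3 \\ \dfrac{x(x+3)}{6}, & -3 < x < 3 \\ x, & x \ge 3 \end{cases} \qquad \frac{d}{dx}\,\mathrm{HardSwish}(x) = \frac{2x+3}{6} \ \text{for} \ -3 < x < 3.$$

On $-3 < x < 3$ the derivative $(2x+3)/6$ sweeps the interval $(-1/2,\, 3/2)$ quoted above, while outside this interval the derivative is 0 or 1.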

V. CONCLUSION AND FUTURE WORK

Focusing on the specific attributes of food images, we introduce a lightweight and efficient CNN network model tailored for food image recognition. Our model leverages a block structure comprising global shuffle convolution and local convolution in parallel. The integration of global shuffle convolution adeptly addresses the dispersed distribution of ingredients in food images, leading to a notable enhancement in recognition accuracy. To complement this, we strategically reduce the number of layers in the rear portion of the network, capitalizing on the front-end's emphasis on capturing long-range information. This approach
effectively mitigates the parameter count and FLOPs. Evaluation across four prominent food image databases demonstrates that our method outperforms existing CNN-based, ViT-based, and hybrid lightweight network models. The development of this lightweight network holds promise for enhancing server-side training efficiency and facilitating the deployment of food recognition applications on mobile platforms. This forms a robust foundation for individuals to make informed, environmentally conscious, and health-driven dietary choices in their daily lives.

Moving forward, our future endeavors will encompass adapting to diverse hardware architectures and operating system environments for end devices. In addition, we aim to deploy lightweight algorithms for food recognition, detection, and segmentation, ultimately offering personalized recommendations for environmentally sustainable and health-conscious dietary choices.
IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
REFERENCES

[1] S. H. Wittwer, Food, Climate, and Carbon Dioxide: The Global Environment and World Food Production. Boca Raton, FL, USA: CRC Press, 1995.
[2] S. J. Vermeulen, B. M. Campbell, and J. S. I. Ingram, "Climate change and food systems," Annu. Rev. Environ. Resour., vol. 37, pp. 195–222, 2012.
[3] W. Min, S. Jiang, L. Liu, Y. Rui, and R. Jain, "A survey on food computing," ACM Comput. Surv., vol. 52, no. 5, pp. 1–36, 2019.
[4] A. Ishino, Y. Yamakata, H. Karasawa, and K. Aizawa, "RecipeLog: Recipe authoring app for accurate food recording," in Proc. ACM Multimedia Conf., 2021, pp. 2798–2800, doi: 10.1145/3474085.3478563.
[5] A. Rostami, N. Nagesh, A. Rahmani, and R. C. Jain, "World food atlas for food navigation," in Proc. 7th Int. Workshop Multimedia Assist. Dietary Manage., 2022, pp. 39–47, doi: 10.1145/3552484.3555748.
[6] A. Rostami, V. Pandey, N. Nag, V. Wang, and R. C. Jain, "Personal food model," in Proc. 28th Int. Conf. Multimedia, 2020, pp. 4416–4424, doi: 10.1145/3394171.3414691.
[7] K. Nakamoto, S. Amano, H. Karasawa, Y. Yamakata, and K. Aizawa, "Prediction of mental state from food images," in Proc. 1st Int. Workshop Multimedia Cooking, Eating, Related Appl., 2022, pp. 21–28, doi: 10.1145/3552485.3554937.
[8] Y. Yamakata, A. Ishino, A. Sunto, S. Amano, and K. Aizawa, "Recipe-oriented food logging for nutritional management," in Proc. 30th Int. Conf. Multimedia, 2022, pp. 6898–6904.
[9] T. Yao et al., "Online latent semantic hashing for cross-media retrieval," Pattern Recognit., vol. 89, pp. 1–11, 2019.
[10] J. Ródenas, B. Nagarajan, M. Bolaños, and P. Radeva, "Learning multi-subset of classes for fine-grained food recognition," in Proc. 7th Int. Workshop Multimedia Assist. Dietary Manage., 2022, pp. 17–26, doi: 10.1145/3552484.3555754.
[11] S. Jiang, W. Min, L. Liu, and Z. Luo, "Multi-scale multi-view deep feature aggregation for food recognition," IEEE Trans. Image Process., vol. 29, pp. 265–276, 2020.
[12] N. Martinel, G. L. Foresti, and C. Micheloni, "Wide-slice residual networks for food recognition," in Proc. Winter Conf. Appl. Comput. Vis., Lake Tahoe, NV, USA, 2018, pp. 567–576, doi: 10.1109/WACV.2018.00068.
[13] J. Zhao et al., "Deep-learning-based automatic evaluation of rice seed germination rate," J. Sci. Food Agriculture, vol. 103, no. 4, pp. 1912–1924, 2023.
[14] Z. Huang et al., "Fast location and segmentation of high-throughput damaged soybean seeds with invertible neural networks," J. Sci. Food Agriculture, vol. 102, no. 11, pp. 4854–4865, 2022.
[15] W. Min et al., "Vision-based fruit recognition via multi-scale attention CNN," Comput. Electron. Agriculture, vol. 210, 2023, Art. no. 107911.
[16] W. Shafik et al., "Using a novel convolutional neural network for plant pests detection and disease classification," J. Sci. Food Agriculture, vol. 103, no. 12, pp. 5849–5861, 2023.
[17] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. 9th Int. Conf. Learn. Representations, 2021.
[18] G. Sheng, S. Sun, C. Liu, and Y. Yang, "Food recognition via an efficient neural network with transformer grouping," Int. J. Intell. Syst., vol. 37, no. 12, pp. 11465–11481, 2022.
[19] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
[20] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1314–1324.
[21] S. Mehta and M. Rastegari, "Separable self-attention for mobile vision transformers," 2022, arXiv:2206.02680.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[23] N. Ma, X. Zhang, H. Zheng, and J. Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 122–138.
[24] S. Mehta, M. Rastegari, L. G. Shapiro, and H. Hajishirzi, "ESPNetV2: A light-weight, power efficient, and general purpose convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 9190–9200.
[25] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in Proc. Int. Conf. Mach. Learn., vol. 97, 2019, pp. 6105–6114.
[26] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7132–7141.
[27] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 9992–10002.
[28] Y. Li et al., "EfficientFormer: Vision transformers at MobileNet speed," Adv. Neural Inf. Process. Syst., vol. 35, pp. 12934–12949, 2022.
[29] T. Huang, L. Huang, S. You, F. Wang, C. Qian, and C. Xu, "LightViT: Towards light-weight convolution-free vision transformers," 2022, arXiv:2207.05557.
[30] H. Cai et al., "EfficientViT: Lightweight multi-scale attention for high-resolution dense prediction," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 17302–17313.
[31] J. Zhang et al., "MiniViT: Compressing vision transformers with weight multiplexing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12135–12144.
[32] K. Wu et al., "TinyViT: Fast pretraining distillation for small vision transformers," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 68–85.
[33] Y. Chen et al., "Mobile-Former: Bridging MobileNet and transformer," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5260–5269.
[34] J. Guo et al., "CMT: Convolutional neural networks meet vision transformers," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12165–12175.
[35] H. Wu et al., "CvT: Introducing convolutions to vision transformers," in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 22–31.
[36] A. Srinivas, T. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, "Bottleneck transformers for visual recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 16519–16529.
[37] J. Li et al., "Next-ViT: Next generation vision transformer for efficient deployment in realistic industrial scenarios," 2022, arXiv:2207.05501.
[38] J. Pan et al., "EdgeViTs: Competing light-weight CNNs on mobile devices with vision transformers," in Proc. Eur. Conf. Comput. Vis., 2022, pp. 294–311.
[39] S. Mehta and M. Rastegari, "MobileViT: Lightweight, general purpose, and mobile-friendly vision transformer," in Proc. Int. Conf. Learn. Representations, 2022.
[40] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, "Food recognition using statistics of pairwise local features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2249–2256.
[41] L. Bossard, M. Guillaumin, and L. Van Gool, "Food-101 – Mining discriminative components with random forests," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 446–461.
[42] W. Min, L. Liu, Z. Luo, and S. Jiang, "Ingredient guided cascaded multi-attention network for food recognition," in Proc. ACM Int. Conf. Multimedia, 2019, pp. 1331–1339.
[43] W. Min et al., "Large scale visual food recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9932–9949, Aug. 2023.
[44] S. Horiguchi, S. Amano, M. Ogawa, and K. Aizawa, "Personalized classifier for food image recognition," IEEE Trans. Multimedia, vol. 20, no. 10, pp. 2836–2848, Oct. 2018.
[45] H. Kagaya, K. Aizawa, and M. Ogawa, "Food detection and recognition using convolutional neural network," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 1085–1088.
[46] Y. Kawano and K. Yanai, "Real-time mobile food recognition system," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2013, pp. 1–7.
[47] Y. Kawano and K. Yanai, "FoodCam: A real-time food recognition system on a smartphone," Multimedia Tools Appl., vol. 74, no. 14, pp. 5263–5287, 2015.
[48] P. Pouladzadeh and S. Shirmohammadi, "Mobile multi-food recognition using deep learning," ACM Trans. Multimedia Comput., Commun., Appl., vol. 13, no. 3s, pp. 1–21, 2017.
[49] R. Z. Tan, X. Chew, and K. W. Khaw, "Neural architecture search for lightweight neural network in food recognition," Mathematics, vol. 9, no. 11, 2021, Art. no. 1245.
[50] F. Yu, V. Koltun, and T. A. Funkhouser, "Dilated residual networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 636–644.
[51] M. Klasson, C. Zhang, and H. Kjellström, "A hierarchical grocery store image dataset with visual and semantic labels," in Proc. Winter Conf. Appl. Comput. Vis., 2019, pp. 491–500.
[52] Y. Kawano and K. Yanai, "FoodCam-256: A large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights," in Proc. ACM Int. Conf. Multimedia, 2014, pp. 761–762.
[53] W. Min et al., "ISIA Food-500: A dataset for large-scale food recognition via stacked global-local attention network," in Proc. ACM Int. Conf. Multimedia, 2020, pp. 393–401.
[54] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Rev., vol. 60, no. 2, pp. 223–311, 2018.
[55] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 618–626.
[56] A. G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications," 2017, arXiv:1704.04861.
[57] K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, "GhostNet: More features from cheap operations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1577–1586.
[58] J. F. Yeh, K.-M. Lin, C.-Y. Lin, and J.-C. Kang, "Intelligent mango fruit grade classification using AlexNet-SPP with mask R-CNN-based segmentation algorithm," IEEE Trans. AgriFood Electron., vol. 1, no. 1, pp. 41–49, Jun. 2023.

Guorui Sheng received the M.E. degree in computer science from Kunsan National University, Gunsan, South Korea, in 2007, and the Ph.D. degree in computer application technology from Nankai University, Tianjin, China, in 2017. From 2017 to 2018, he was a Research Assistant to Scholar Bruce Denby with the School of Computer Science and Technology, Tianjin University. He is currently a Lecturer with the Department of Information and Electrical Engineering, Ludong University, Yantai, China. He has authored or co-authored more than 20 peer-reviewed papers in relevant journals and conferences, including ACM Transactions on Multimedia Computing, Communications, and Applications and Nutrients. His research interests include computer vision, deep learning, and food computing.

Weiqing Min (Senior Member, IEEE) received the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2015. He is currently an Associate Professor with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences. He has authored or co-authored more than 50 peer-reviewed papers in relevant journals and conferences, including Patterns (Cell Press), ACM Computing Surveys, Trends in Food Science and Technology, IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON IMAGE PROCESSING, Food Chemistry, ACM MM, AAAI, and IJCAI. His research interests include multimedia content analysis and food computing. Mr. Min is a Senior Member of CCF. He was the recipient of the 2016 ACM Transactions on Multimedia Computing, Communications, and Applications Nicolas D. Georganas Best Paper Award and the 2017 IEEE Multimedia Magazine Best Paper Award. He was a Guest Editor for special issues of international journals, such as IEEE TRANSACTIONS ON MULTIMEDIA, IEEE MULTIMEDIA, and Foods.

Tao Yao received the Ph.D. degree in multimedia retrieval from the Dalian University of Technology, Dalian, China, in 2017. He is currently an Associate Professor with the Department of Information and Electrical Engineering, Ludong University, and also a Researcher with the Yantai Research Institute of New Generation Information Technology, Southwest Jiaotong University, Chengdu, China. He has authored or co-authored more than 30 peer-reviewed papers in relevant journals and conferences, including IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE TRANSACTIONS ON CYBERNETICS, ACM Transactions on Multimedia Computing, Communications, and Applications, and Pattern Recognition. His research interests include multimedia retrieval, computer vision, and machine learning.

Jingru Song received the B.E. degree in software engineering from the College of Computer Science, Liaocheng University, Liaocheng, China, in 2022. She is currently working toward the M.E. degree in computer science and technology with the College of Information and Electrical Engineering, Ludong University, Yantai, China. Her research interests include multimedia processing, computer vision, and food computing.

Yancun Yang received the Ph.D. degree in management from Shandong University, Jinan, China, in 2008. He is currently a Lecturer with the Department of Information and Electrical Engineering, Ludong University, Yantai, China. He has authored or co-authored more than 10 peer-reviewed papers in relevant journals and conferences, including ACM Transactions on Multimedia Computing, Communications, and Applications and Nutrients. His research interests include computer vision, deep learning, and food computing.

Lili Wang received the M.E. and Ph.D. degrees in electromagnetic field and microwave technology from the Electronic Engineering School, Beijing University of Posts and Telecommunications, Beijing, China, in 2006. She is currently a Professor with the School of Information and Electrical Engineering, Ludong University, Yantai, China. Her research interests include broadband communication and multimedia communication.

Shuqiang Jiang (Senior Member, IEEE) received the Ph.D. degree in computer application technology from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2006. He is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing, China, and a Professor with the University of CAS. He is also with the Key Laboratory of Intelligent Information Processing, CAS. He has authored or co-authored more than 150 articles. He was supported by the National Science Fund for Distinguished Young Scholars in 2021, the NSFC Excellent Young Scientists Fund in 2013, and the Young Top-Notch Talent of the Ten Thousand Talent Program in 2014. His research interests include multimedia analysis and multimodal intelligence. Mr. Jiang is a Senior Member of CCF and a Member of ACM. He was a TPC Member for more than 20 well-known conferences, including ACM Multimedia, CVPR, ICCV, IJCAI, AAAI, ICME, ICIP, and PCM. He was the recipient of the Lu Jiaxi Young Talent Award from CAS in 2012 and the CCF Award of Science and Technology in 2012. He is the Vice Chair of the IEEE CASS Beijing Chapter and the ACM SIGMM China Chapter. He was the General Chair of ICIMCS in 2015 and the Program Chair of the 2019 ACM Multimedia Asia and PCM in 2017. He is an Associate Editor of Multimedia Tools and Applications and ACM Transactions on Multimedia Computing, Communications, and Applications.
