
This article has been accepted for publication in IEEE Reviews in Biomedical Engineering. This is the author's version, which has not been fully edited, and content may change prior to final publication. Citation information: DOI 10.1109/RBME.2023.3283149

A Review of Image-based Food Recognition and Volume Estimation Artificial Intelligence Systems
Fotios S. Konstantakopoulos, Eleni I. Georga, Member, IEEE and Dimitrios I. Fotiadis, Fellow, IEEE

Abstract— The daily healthy diet and balanced intake of essential nutrients play an important role in the modern lifestyle. The estimation of a meal's nutrient content is an integral component of the management of significant diseases, such as diabetes, obesity and cardiovascular disease. Lately, there has been an increasing interest towards the development and utilization of smartphone applications with the aim of promoting healthy behaviours. The semi-automatic or automatic, precise and real-time estimation of the nutrients of daily consumed meals is approached in the relevant literature as a computer vision problem using food images which are taken via a user's smartphone. Herein, we present the state-of-the-art on automatic food recognition and food volume estimation methods, starting from their basis, i.e., the food image databases. First, by methodically organizing the extracted information from the reviewed studies, this review study enables a comprehensive, fair assessment of the methods and techniques applied for segmenting food images, classifying their food content and computing the food volume, associating their results with the characteristics of the used datasets. Second, by unbiasedly reporting the strengths and limitations of these methods and proposing pragmatic solutions to the latter, this review can inspire future directions in the field of dietary assessment systems.

Index Terms ― Dietary assessment system, Food databases, Food segmentation, Food recognition, Food classification, Food volume estimation, Nutrient information, Computer vision, Machine learning, Deep learning, Artificial Intelligence.

I. INTRODUCTION

The global incidence of chronic diet-related diseases, such as obesity, diabetes, and cardiovascular diseases, shows an ever-increasing trend, which tends to take on epidemic proportions. The number of obese people has nearly tripled since 1975. In 2016, more than 1.9 billion adults were overweight, out of which over 650 million were obese. Moreover, in 2019, 38 million children under the age of five were overweight or obese [1]. Diabetes is considered a major cause of blindness, kidney failure, heart attacks, stroke, and lower limb amputation. The World Health Organization (WHO) estimated that 1.5 million deaths were directly caused by diabetes and that diabetes was the seventh leading cause of death in 2019 [2]. According to the International Diabetes Federation, 463 million people (adults 20-79 years) suffer from diabetes worldwide nowadays [3]. As far as cardiovascular diseases (CVDs) are concerned, they are a group of disorders of the heart and blood vessels that include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. CVDs are the number one cause of death globally; in 2016, 17.9 million people died from CVDs, representing 31% of all global deaths [4]. The above-mentioned diseases are inextricably linked. A healthy diet has been shown to be the common denominator that can either positively or negatively affect the aforementioned diseases. A healthy lifestyle, which includes a balanced diet, maintaining a healthy weight and regular exercise, can significantly reduce the percentage of individuals suffering from these diseases.

Daily diet monitoring by experts is definitely the most appropriate way to achieve a healthy and balanced diet, which includes daily recording of the type and the estimated amount of food consumed [5]. However, since daily diet monitoring by specialists is almost impossible, patients are advised to record their daily eating habits themselves. Although these methods are widely used, their accuracy remains questioned, especially for children and adolescents who lack motivation and the required skills [6], with the average error in estimating the amount of food consumed being more than 20% [7]. Even well-trained individuals with diabetes have difficulty in calculating, with a relative accuracy, the amount of carbohydrates in their meal [8]. The rapid increase in the use of smartphones and their advanced computing capabilities during the last decade has led to the development of smartphone applications [9] that can detect food, recognize its type and calculate its nutritional value, by estimating its quantity, via the analysis of food images [10]. In a typical scenario, the user is asked to take one or more photos or even videotape his/her meal, and then the application computes the corresponding nutritional information.

Nowadays, the advances in the fields of computer vision and Artificial Intelligence (AI) provide users with the possibility to monitor their health every day through appropriate applications [11]. Recent studies have shown that AI-based applications are more popular among users, compared to traditional dietary recording methods, for recording the nutritional composition of food [12]. AI-based methods can be divided into semi-automatic, which require user participation, and automatic, which do not require any human participation. These applications do not aim to replace dieticians; on the contrary, their goal is to provide them with an additional tool for monitoring patients' diet.

This research is supported by the GlucoseML project and has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE – INNOVATE (project code: T1EDK-03990).
F. S. Konstantakopoulos, E. I. Georga, and D. I. Fotiadis are with the Unit of Medical Technology and Intelligent Information Systems, Materials Science and Engineering Department, University of Ioannina, Ioannina, GR 45110 Greece, and the Biomedical Research Institute, FORTH, University of Ioannina, Ioannina, GR 45110 Greece (e-mail: [email protected], [email protected]; corresponding author phone: +302651009006; fax: +302651008889; e-mail: [email protected]).

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Figure 1 An automated vision-based dietary assessment system.

The performance and accuracy of these applications depend to a large extent on various factors, such as the food image databases used for the training of the system and the extraction of the nutritional composition, the food segmentation techniques, the food recognition methods and the volume estimation techniques.

The quality and the quantity of the images of a food database mainly affect the performance of the food recognition [13]. Food classification, which consists of the food segmentation and food recognition steps, is next. Food segmentation is the process of partitioning a food image into multiple segments (sets of pixels) [14]. Food recognition comprises the identification of the foods which are present in the food image through the application of machine and deep learning techniques [15, 16]. The final step is the volume estimation for each food item which is present in the food image. This step depends directly on the previous steps of segmentation and recognition. The volume calculation of each identified segment, in combination with a food nutritional database, is used for the extraction of the nutritional composition [17]. A typical procedure of an automated vision-based dietary assessment system is shown in Figure 1.

In this paper, we present a review of the literature over the past 10 years (2012-2021) in the fields of food image segmentation, food classification and food volume estimation based on smartphone-captured food images, assessing, in parallel, the main characteristics of the employed food image databases. The in-depth analysis of the methods used in each step of the above components of a dietary assessment system comprises the main distinguishing characteristic of this review in comparison with existing reviews in the specified research topic [18-22]. This analysis led to the categorization of the employed methods as: (i) semi-automatic and automatic food image segmentation methods, (ii) traditional machine learning (ML)-based and deep learning-based methods for food image classification, and (iii) 3D reconstruction, pre-built shape templates, perspective transformation, depth camera and deep learning methods for food volume estimation (Table I). The algorithms and techniques pertaining to each of these categories are identified per investigated study, and their performance, strengths and limitations are presented and contrasted. Importantly, we suggest pragmatic solutions to deal with the identified limitations, starting from the construction of relevant datasets to the computation of the food nutrient value. This manuscript is hereunder organized in six sections, with Sections II-V presenting the review of the methods and techniques used in each of the components of a dietary assessment system, and Sections VI and VII being devoted to the discussion of the outcomes and conclusions derived by this review study.
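To make the flow of Figure 1 concrete, the sketch below outlines the four stages as plain Python. Every function and class here is an illustrative placeholder under assumed inputs, not the implementation of any reviewed system.

```python
from dataclasses import dataclass

@dataclass
class FoodItem:
    label: str        # recognized food class
    volume_ml: float  # estimated volume in millilitres

def segment(image):
    """Food segmentation: return one region per food item (placeholder)."""
    return [image]  # pretend the whole image is a single food segment

def recognize(region):
    """Food recognition: map a region to a food class (placeholder)."""
    return "unknown food"

def estimate_volume(region):
    """Food volume estimation: map a region to a volume in ml (placeholder)."""
    return 0.0

def assess(image, kcal_per_ml):
    """Segmentation -> recognition -> volume -> nutrient lookup, as in Figure 1."""
    items = [FoodItem(recognize(r), estimate_volume(r)) for r in segment(image)]
    # Nutrient extraction: combine each class and volume with a nutrient database.
    return [(i.label, i.volume_ml * kcal_per_ml.get(i.label, 0.0)) for i in items]
```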
TABLE I
MAIN TECHNIQUES, METHODS AND PERFORMANCE METRICS FOR EACH STEP IN A DIETARY ASSESSMENT SYSTEM

Step | Methods and Techniques | Performance Metrics
Food segmentation | Semi-automatic approaches (GrabCut algorithm); automatic ML approaches with handcrafted feature extraction (HOG, JSEG); automatic ML approaches using deep learning for feature extraction (CNNs, instance and semantic segmentation) | Intersection over Union (IoU), pixel accuracy, Panoptic Quality (PQ)
Food classification | Traditional approaches (feature extraction: SIFT, SURF, HOG, Gabor, LBP; feature representation: BoF, Fisher vectors; classification: SVM, kNN, RF); deep learning approaches (CNN and DCNN) | Recall, Precision, F1-Score, Top-1 accuracy, Top-5 accuracy
Food volume estimation | 3D reconstruction, pre-built shape templates, perspective transformation, depth camera, deep learning | Mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE)



Figure 2 Food images from UEC-Food100, UEC-Food256, Food-101 and MedGRFood databases.

II. FOOD IMAGE DATABASES

The process of collecting food images, which can be used in the food classification model, is crucial and directly affects the performance of the classification models. A comprehensive collection of food images is the key to a classifier's performance. Large food image databases, such as Food-101 [23], UEC-Food100 [14], VIREO Food-172 [24], and UEC-Food256 [25], are benchmark food databases and are typically used to evaluate machine learning models. Existing databases are distinguished by their different characteristics, such as the cuisine type, the number of images, the number of food classes, the food categories, the way of acquisition, the task of use (classification or segmentation) as well as how many different food items are included in each photo. For instance, Diabetes [26] has 11 classes with a total of 5,420 pictures, out of which 3,800 images are downloaded from the web and 1,620 are captured in a controlled environment. A few food databases have been created by compiling images from existing food databases. For instance, the database Food524DB [27] was created from existing publicly available food image databases: Food-101, UEC-Food256 and VIREO Food-172. Moreover, there are several food image databases that have collected food images from specific types of cuisines. For example, Chen [28] and ChineseFoodNet [29] represent the Chinese cuisine, FFoCat [30] and MedGRFood [31] refer to Mediterranean food, and the Indian food database [32] contains images of local food dishes, while [33-35] present databases with images of fruits and vegetables. FLD-469 [36] refers to Japanese food, while FoodX-251 [37], Menu-Match [38], UPMC Food-101 [39], NutriNet [40] and UNICT-FD889 [41] consist of a mix of eastern and western food images. Moreover, a critical feature of a food image database is whether it is used for classification [42-45] or segmentation tasks [46-51]. For example, Food201-Segmented [52] contains segmented images from the Food-101 dataset for the USA cuisine. Also, an important element for the classifier is the way the pictures were acquired, namely whether they were taken in a controlled environment (in terms of lighting conditions and the food image's background) or in a free environment. In addition, with the increasing use of deep learning methods for image classification, food image databases must contain a large number of images per class to support the training of a deep learning model. Furthermore, the diversity of the images contained in a class leads to a more advanced model, which can classify a food even if it has been cooked in a different way. Figure 2 presents sample images from four food image databases.

Figure 3 Type of cuisine distribution according to the number of classes and how they are used.

Figure 4 Size of existing databases for different types of cuisine annotated by the means of food image collection.


TABLE II
FOOD IMAGE DATABASES

Authors | Database Name | Food Category | Database Use | # classes/# images | Image Source
Chen 2012, [28] | Chen | Chinese | Classification | 50/5,000 | Downloaded from the web
Matsuda et al. 2012, [14] | UEC-Food100 | Japanese | Classification | 100/9,132 | Captured by authors
Anthimopoulos et al. 2014, [26] | Diabetes | European | Classification | 11/5,420 | Downloaded from the web + controlled environment
Bossard et al. 2014, [23] | Food-101 | USA | Classification | 101/101,000 | Downloaded from the web
Kawano et al. 2014, [25] | UEC-Food256 | Japanese | Classification | 256/31,397 | Captured by authors
Farinella et al. 2014, [41] | UNICT-FD889 | Generic | Classification | 889/3,583 | Captured by users
Meyers et al. 2015, [52] | Food201-Segmented | USA | Segmentation | 201/12,625 | Acquired from other databases
Beijbom et al. 2015, [38] | Menu-Match | Generic | Classification | 41/646 | Captured by authors
Wang et al. 2015, [39] | UPMC Food-101 | Generic | Classification | 101/90,840 | Downloaded from the web
Zhou and Lin 2016, [42] | Food-975 | Chinese | Classification | 975/37,785 | Captured by authors
Chen and Ngo 2016, [24] | VIREO Food-172 | Chinese | Classification | 172/110,241 | Downloaded from the web
Ciocca et al. 2016, [13] | UNIMIB2016 | Italian | Classification | 73/1,027 | Captured by authors
Singla et al. 2016, [43] | Food-11 | Generic | Classification | 11/16,643 | Acquired from other databases
Farinella et al. 2016, [44] | UNICT-FD1200 | Generic | Classification | 1,200/4,754 | Captured by authors
Ciocca et al. 2017, [27] | Food524DB | Generic | Classification | 524/247,636 | Acquired from other databases
Chen et al. 2017, [29] | ChineseFoodNet | Chinese | Classification | 208/180,000 | Captured by users
Pandey et al. 2017, [32] | Indian Food Database | Indian | Classification | 50/5,000 | Downloaded from the web
Mezgec et al. 2017, [40] | NutriNet | Generic | Classification | 520*/225,953 | Downloaded from the web
Hou et al. 2017, [33] | VegFru | Fruit and vegetables | Classification | 292/160,731 | Downloaded from the web
Waltner et al. 2017, [34] | FruitVeg-81 | Fruit and vegetables | Classification | 81/15,737 | Captured by authors
Qing Yu et al. 2018, [36] | FLD-469 | Japanese | Classification | 469/209,700 | Captured by authors
Muresan et al. 2018, [35] | Fruits-360 | Fruits | Classification | 131/90,483 | Captured by authors
Aguilar et al. 2019, [45] | MAFood-121 | Generic | Classification | 121/21,175 | Downloaded from the web
Donadello et al. 2019, [30] | FFoCat | Mediterranean | Classification | 156/58,962 | Downloaded from the web
Kaur et al. 2019, [37] | FoodX-251 | Generic | Classification | 251/158,000 | Food-101 + downloaded from the web
Gao et al. 2019, [46] | SUEC Food | Japanese | Segmentation | 256/31,395 | Acquired from other databases
Ege et al. 2019, [47] | UECFoodPix | Japanese | Segmentation | 100/10,000 | Acquired from other databases
Wang et al. 2019, [51] | Mixed dishes | Chinese | Segmentation | 218/12,105 | Captured by authors
Aslan et al. 2020, [50] | Food50Seg | Japanese | Segmentation | 50/5,000 | Acquired from other databases
Konstantakopoulos et al. 2021, [31] | MedGRFood | Mediterranean | Classification | 160/51,840 & 190/5,000 | Downloaded from the web + controlled environment
Okamoto et al. 2021, [48] | UECFoodPix Complete | Japanese | Segmentation | 102/10,000 | Acquired from other databases
Wu et al. 2021, [49] | FoodSeg103/FoodSeg154 | Generic | Segmentation | 730/7,118 & 730/9,490 | Acquired from other databases

The techniques used in the later stages of image-based nutrition analysis systems emphasize the need to create databases that contain a large number of images for each food class. It may be easier nowadays to collect the images for a large food image database, due to the tendency to capture food images using smartphones and to the existence of many images on social networks. Although there is a plethora of food image databases, we note that there are no food image databases related to healthy diet patterns. In addition, there exist only a few annotated databases, mainly referring to the Japanese cuisine, which could be used in the segmentation and classification tasks (Figure 3). Figure 4 illustrates the size (number of images) of existing food image databases for different types of cuisine, annotated by the associated method of construction. We observe that the majority of databases belong to generic and Asian cuisines, while a large number of them are either collected from the web or created using other databases. Finally, it is worth mentioning that there is no benchmark food image database for general classification purposes. As food has no borders and we live in multicultural societies, there is a need to create a large food image database that includes different types of cuisines, to allow the development of systems and applications able to detect and calculate the amount of as many foods as possible. Therefore, the creation of an annotated food image dataset that takes into account the type of cuisine could include foods with the same name but from different regions. For example, an annotated food image database may contain the same food name and characterize it additionally by its cuisine or region.


Table II summarizes the most representative food image databases and their most significant features.
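As a practical note on how such databases are consumed, most deep learning frameworks can ingest a class-per-folder image collection directly. A minimal sketch using TensorFlow follows; the directory name, image size and split are assumptions for illustration, not tied to any database in Table II.

```python
import tensorflow as tf

# One sub-folder per food class, e.g. food_images/moussaka/img001.jpg;
# the path, image size and validation split below are placeholders.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "food_images/",
    image_size=(224, 224),   # resize to the classifier's input resolution
    batch_size=32,
    validation_split=0.2,    # hold out 20% of the images for validation
    subset="training",
    seed=42,
)
class_names = train_ds.class_names  # food classes inferred from folder names
```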
III. FOOD IMAGE SEGMENTATION

Segmentation is the initial step required to identify food and refers to the process of localizing and extracting regions that have different colour and texture features. The purpose of food image segmentation is to localize the food item or food items (if there is more than one) present in an image, and to separate them from the background or from other food items [24]. When the image contains more than one food, food segmentation is considered a necessary step in dietary assessment systems. It is a challenging task to segment foods that overlap each other, foods that have an indeterminate shape, or foods that do not have strong colour or texture features in contrast with the other food items on a plate. In addition, the lighting conditions under which an image is taken can affect the segmentation step by creating shadows and reflections [17]. Although segmentation is a difficult process, its accuracy directly affects the effectiveness of the subsequent steps, such as classification and volume estimation. The main metrics for assessing food image segmentation are the Intersection over Union (IoU):

$IoU = \frac{Y_{true} \cap Y_{pred}}{Y_{true} \cup Y_{pred}}$, (1)

where Ytrue is the ground truth of the food image and Ypred is the prediction mask; the mean IoU for multiclass segmentation:

$meanIoU = \frac{1}{N} \sum_{i=1}^{N} IoU_i$, (2)

where N is the number of food classes; and the pixel accuracy:

$\text{Pixel accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, (3)

where True Positive (TP) represents a pixel that is correctly predicted to belong to the given class, True Negative (TN) represents a pixel that is correctly identified as not belonging to the given class, False Positive (FP) represents a pixel that is wrongly predicted to belong to the given class and False Negative (FN) represents a pixel that is wrongly identified as not belonging to the given class.

Several methods have been proposed to address issues in food image segmentation. An initial classification of the methods is: (i) semi-automatic food segmentation, (ii) automatic ML with handcrafted feature extraction, and (iii) automatic ML with deep learning feature extraction.

In several studies, the use of semi-automatic techniques for food segmentation is preferred, where the user is asked to select regions of interest in the image, the foreground and the background (Figure 5). The results of semi-automatic techniques are highly accurate, distinguishing details of each food item in the image, as the user knows the exact boundaries of the food items contained in the image/tray [53-56].

Figure 5 Example of food image segmentation using the GrabCut algorithm. The blue rectangle represents the region of interest, the white lines represent the foreground and the black lines represent the background.

Hassannejad et al. [57] used a customized interactive graph cut algorithm. Initially, the user imposes a number of hard constraints on the segmentation by marking some pixels. Then, a Gaussian mixture model and K-Means are used to generate image clusters and initialize the graph. Finally, an iterative graph cut algorithm is used to segment the food image. Users who were familiar with the application achieved up to 93% accuracy (images with less than 5% of falsely segmented pixels), while users who were not familiar achieved 88% accuracy.

In automatic food segmentation methods with handcrafted feature extraction, the user only needs to capture the image. Then, existing image processing techniques are employed to solve the segmentation problem by making assumptions about the shape, colour and number of food items on the plate. These approaches use algorithms and techniques to extract texture, shape and colour features, such as the J measure-based segmentation (JSEG), Normalized cuts (NCut) [58], or region merging and growing [59]. For example, Anthimopoulos et al. [60] suggested the use of a five-step food segmentation algorithm based on colour information: CIELAB conversion, pyramidal mean-shift filtering, region growing, region merging and plate detection/background subtraction. The proposed method achieves an 88.5% segmentation accuracy.

In recent years, deep learning approaches [61-64] and Convolutional Neural Networks (CNNs) [65] have shown state-of-the-art performance in computer vision tasks, allowing the use of automated food image segmentation methods. In these approaches, the segmentation models consist of two main parts: (i) the first part acts as an encoder by extracting a large number of features from the image, while (ii) the second part acts as a decoder and is responsible for the image segmentation (Figure 6).

Figure 6 An instance segmentation model.
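For illustration, Eqs. (1)-(3) translate directly into a few lines of numpy; the masks below are toy examples, not data from any reviewed study.

```python
import numpy as np

def iou(y_true, y_pred):
    """Intersection over Union for binary masks, Eq. (1)."""
    inter = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return inter / union

def mean_iou(masks_true, masks_pred):
    """Mean IoU over N food classes, Eq. (2)."""
    return np.mean([iou(t, p) for t, p in zip(masks_true, masks_pred)])

def pixel_accuracy(y_true, y_pred):
    """Pixel accuracy, Eq. (3): (TP + TN) / (TP + TN + FP + FN)."""
    return (y_true == y_pred).mean()

# Toy ground-truth and predicted masks of a single food item.
y_true = np.zeros((64, 64), bool); y_true[16:48, 16:48] = True
y_pred = np.zeros((64, 64), bool); y_pred[20:52, 20:52] = True
print(iou(y_true, y_pred), pixel_accuracy(y_true, y_pred))
```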


TABLE III
FOOD IMAGE SEGMENTATION APPROACHES

Authors | Approach | Performance
Kawano and Yanai 2013, [53] | Bounding box (bbox) and GrabCut segmentation algorithm | Classification accuracy is improved when the ground-truth bounding boxes are given
Shimoda and Yanai 2015, [55] | Bbox using CNNs and GrabCut | Detects bounding boxes around food items with a mean average precision of 49.9%
Hassannejad et al. 2015, [57] | Customized interactive version of the graph-cut algorithm | 93.1% accuracy for users familiar with the application and 88% for unfamiliar users
Inunganbi et al. 2018, [54] | Interactive segmentation; boundary detection/filling and Gappy Principal Component Analysis methods are applied to restore the missing information | Outperforms the existing methods
Fang et al. 2018, [56] | Manual design of bbox, manual selection of food tag and GrabCut | Performs efficiently when used on a large image database
Matsuda et al. 2012, [14] | JSEG segmentation, circle detector and DPM | Overall accuracy 21%
Anthimopoulos et al. 2013, [60] | CIELAB conversion, pyramidal mean-shift filtering, region growing, region merging and plate detection/background subtraction | Accuracy 88.5%
Pouladzadeh et al. 2014, [17] | Graph cut segmentation | Accuracy of 95%
Zhu et al. 2014, [15] | Multiple segmentation hypotheses, selecting segmentations using confidence scores assigned to each segment | Outperforms the normalized cut method and improves the classification accuracy
Meyers 2015, [52] | DeepLab model | Classification accuracy is improved
Wang et al. 2016, [58] | Normalized cut and superpixels | Outperforms some widely used segmentation methods
Ciocca et al. 2016, [13] | A combination of saturation, binarization, JSEG segmentation and morphological operations | Achieves better segmentation accuracy in contrast to the JSEG approach
Zheng et al. 2018, [59] | Adaptive K-means image segmentation | The segmentation accuracy is improved, compared with other traditional methods
Minija and Emmanuel 2020, [16] | Salient region detection, multi-scale segmentation and fast rejection | Classification accuracy is improved
Dehais et al. 2016, [61] | DCNN and region growing/merging techniques | The automatic and semi-automatic segmentation methods reached average accuracies of 88% and 92%, respectively
Bolanos et al. 2016, [64] | Employed a DCNN to simultaneously perform food localization | Outperforms the existing methods
Aguilar et al. 2018, [69] | Fully convolutional network (FCN) and bounding box | IoU over 0.96
Aslan et al. 2018, [70] | DeepLab-v2 for semantic segmentation | mIoU: 0.433 on UNIMIB2016
Ciocca et al. 2019, [62] | DCNN to discriminate food regions from the background in different illumination conditions | IoU 0.79
Pfisterer et al. 2019, [67] | DCNN for semantic segmentation of food on a plate using monocular RGB images | IoU 0.912
Shimoda and Yanai 2020, [71] | Class Activation Mapping (CAM) | The segmentation accuracy is improved, compared with existing weakly-supervised segmentation methods
Yarlagadda et al. 2021, [65] | Finding salient missing objects in before- and after-eating images | AUC (Area Under the Curve): 0.954
Okamoto et al. 2021, [48] | DeepLab V3+ | Mean IoU: 0.555
Poply et al. 2021, [63] | Semantic segmentation with RefineNet | IoU0.75: 0.962 on UNIMIB2016
Wu et al. 2021, [49] | ReLeM semantic segmentation model | Mean IoU 0.439 on FoodSeg103
Park et al. 2021, [66] | Mask R-CNN pretrained on synthetic data | Average precision: 0.522
Nguyen and Ngo 2021, [72] | Terrace-based instance segmentation | MAE: 0.45, PQ: 0.693 on the Mixed dishes dataset
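Several of the semi-automatic approaches in Table III build on the GrabCut algorithm, which is available in OpenCV. The sketch below shows the typical usage with a user-drawn rectangle; the image path and rectangle coordinates are placeholders, not values from any cited study.

```python
import cv2
import numpy as np

img = cv2.imread("meal.jpg")  # placeholder path to a food photo
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state (background)
fgd_model = np.zeros((1, 65), np.float64)  # internal GMM state (foreground)
rect = (50, 50, 400, 300)  # user-drawn box around the food: (x, y, w, h)

# Five iterations of GrabCut, initialized from the rectangle.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Keep pixels labelled (probable) foreground: the segmented food item.
food_mask = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
food = img * food_mask[:, :, np.newaxis].astype(np.uint8)
```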

Several popular CNN models, such as ResNet50 [66, 67] and InceptionV3 [68], are used as the backbone network in the encoder, while well-known architectures, such as the Fully Convolutional Network (FCN) [69] and DeepLab [70], are used as the decoder. Shimoda and Yanai [71] presented a method to enforce consistency between a food segmentation model and a plate segmentation model. More specifically, they used Class Activation Mapping (CAM), which is one of the basic visualization techniques for CNNs. A food category classifier can highlight food regions containing no plate regions, while a food/non-food category classifier can highlight food regions including plate regions. They demonstrated that this boosted the accuracy of weakly-supervised food segmentation. In a recent study, Wu et al. [49] proposed a novel fully automatic semantic segmentation method consisting of a recipe learning module and an image segmentation module. They used a Long Short-Term Memory (LSTM) network as the encoder and a vision transformer architecture as the decoder, achieving 0.439 mIoU on the FoodSeg103 database. In a newer study, Nguyen and Ngo [72] presented an instance segmentation model for multiclass segmentation, using a terrace representation for food items. They employed the panoptic quality metric, a combination of the IoU and pixel accuracy metrics, achieving a score of 0.693.

Although the segmentation step is not necessary in several dietary assessment systems, we observe that the studies using semi-automatic segmentation methods achieve better performance. However, this leads to a delay in calculating the nutritional composition, as it requires interaction with the user of the system. In automatic food segmentation, the use of deep learning techniques has resulted in better performance compared to handcrafted techniques.


Instance segmentation is a technique that has been used on a small scale (Figure 7) in food image segmentation and could further improve the segmentation performance of dietary assessment systems. Moreover, it can be used to segment multiple foods in an image, allowing the development of more realistic applications, as each dish tends to have more than one food item. This presupposes the use of annotated food image databases, which are a requisite for building segmentation models based on deep learning techniques. In recent studies, the food image segmentation step is omitted, and in some others the performance is not reported. In other studies, although the performance of the methods used to segment food images is high and improves the classification accuracy, there are still open issues related to cases where mixed or overlapping foods exist. In these cases, state-of-the-art segmentation techniques, such as semantic and instance segmentation, can be used to improve performance and increase accuracy in the classification step. The main segmentation techniques are summarized in Table III.

Figure 7 The counts of segmentation approaches in dietary assessment systems.

IV. FOOD IMAGE CLASSIFICATION

Food image classification is a complex process that may be affected by many factors, for instance, the way a food is cooked, or the presence of other food items, like a sauce, covering the main food. Provided that the results of classification highly affect the effectiveness of the next steps (the food volume estimation step and the food nutritional composition step), researchers have developed various techniques and methods to improve classification accuracy. The training of the classifier is affected by the number and quality of the images used in the training phase, so the food database plays a crucial role in this process. Moreover, the techniques used to extract the features of the images, through which the images are recognized, greatly affect the accuracy of the classifier. The most basic metrics used for classification models are top-1 and top-5 accuracy. Top-1 accuracy is the accuracy where the true class matches the most probable class predicted by the model, defined as:

$\text{Classification accuracy} = \frac{\text{number of correct predictions}}{\text{number of all predictions}}$, (4)

Top-5 accuracy is the accuracy where the true class matches any one of the 5 most probable classes predicted by the model. Other known metrics for the classification task are:

$Precision = TP/(TP + FP)$, (5)

$Recall = TP/(TP + FN)$, (6)

$F1\text{-}Score = \frac{2 \times (Precision \times Recall)}{Precision + Recall}$. (7)

The task of food image recognition can be divided into two categories: traditional machine learning approaches with handcrafted features and deep learning approaches using convolutional neural networks (Figure 8).

A. Traditional Machine Learning Approaches

Approaches that fall into this category are differentiated based on the technique chosen to extract the image features and on the classifier selected for their classification. Feature extraction is the process in which the most representative features of an image are extracted, creating the corresponding feature vector. There are several feature extraction algorithms, such as speeded-up robust features (SURF), the scale-invariant feature transform (SIFT), local binary patterns (LBP) [73], Gabor filters [74] and histograms of oriented gradients (HOG). In numerous approaches the feature extraction is performed by a combination of the above algorithms, improving the classification accuracy. The extracted features then feed a classifier for training the prediction model, based on machine learning algorithms such as the support vector machine (SVM), bag of features (BoF), random forests (RF), k-nearest neighbours (kNN) [75] and multiple kernel learning (MKL). For example, Bossard et al. [23] introduced a method to mine discriminative parts using RF. To improve the effectiveness of mining and classification, they consider patches that are aligned with image superpixels. For each superpixel, they extracted dense SURF and L*a*b colour features. Then, they trained a multi-class SVM for the final classification, with an average accuracy of 50.8% on the Food-101 image dataset.

Figure 8 A deep learning classification model of food images.
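Eqs. (4)-(7) can be computed directly from a model's class scores; a small numpy sketch with toy values follows (the arrays are illustrative, not results from any reviewed study).

```python
import numpy as np

def top_k_accuracy(probs, labels, k=5):
    """Fraction of samples whose true class is among the k most probable."""
    top_k = np.argsort(probs, axis=1)[:, -k:]  # indices of the k best classes
    return np.mean([l in row for l, row in zip(labels, top_k)])

def precision_recall_f1(tp, fp, fn):
    """Eqs. (5)-(7) from raw true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

probs = np.array([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]])  # per-class scores
labels = np.array([1, 2])                             # true class indices
print(top_k_accuracy(probs, labels, k=1))  # top-1 accuracy, Eq. (4)
```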


TABLE IV
TRADITIONAL CLASSIFICATION APPROACHES

Authors | Features | Classifier | Database | Top-1 accuracy | Top-5 accuracy
Chen et al. 2012, [28] | SIFT, LBP, Gabor and colour | SVM & AdaBoost | Chen | 68.3% | 90.9%
Matsuda et al. 2012, [14] | BoF of SIFT and CSIFT, HOG and Gabor | MKL-SVM | UEC-Food100 | 55.8% | n/a
Kawano and Yanai 2013, [53] | Colour histogram and bag-of-SURF | Linear SVM | 6,781 images/50 food classes | 53.5% | 81.6%
Kawano and Yanai 2014, [76] | Fisher Vector and RootHoG | One-vs-rest linear classifier | UEC-Food256 | 50.1% | 74.4%
Anthimopoulos et al. 2014, [26] | BoF, hsvSIFT and colour moment invariant | SVM | Diabetes | 78.0% | n/a
He et al. 2014, [75] | DCD, MDSIFT, SCD and SIFT | kNN | 1,453 images/42 classes | 64.5% | 84.2% (top-4 accuracy)
Bossard et al. 2014, [23] | SURF and Lab colour | Random Forests | Food-101 | 50.8% | n/a
Pouladzadech et al. 2015, [74] | Gabor and colour | Cloud-based SVM | 6,000 images/30 classes | 94.5% | n/a
Beijbom et al. 2015, [38] | HOG, SIFT, LBP and MR8 | One-vs-rest linear SVM | Menu-Match | 77.4% | 96.2%
Christodoulidis et al. 2015, [73] | Colour histograms, LBP | SVM | Own database | 82.2% | n/a
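The traditional pipeline summarized in Table IV, handcrafted features feeding a shallow classifier, can be sketched in a few lines. The HOG descriptor and linear SVM below are one representative choice, and `images`/`labels` are assumed inputs (lists of RGB arrays and class ids), not data from any cited study.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def extract_features(image):
    """Resize an image and describe it with a HOG feature vector."""
    gray = resize(image, (128, 128)).mean(axis=2)  # crude RGB-to-grayscale
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def train_food_classifier(images, labels):
    """Fit a linear SVM on handcrafted features, as in several reviewed studies."""
    X = np.stack([extract_features(im) for im in images])
    clf = SVC(kernel="linear")
    clf.fit(X, labels)
    return clf
```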

In another study, Kawano and Yanai [76] proposed a food recognition system that can identify 256 food categories using the food image database UEC-Food256. They applied RootHoG and colour features and coded them into a Fisher Vector to train a one-vs-all linear classifier, with 50.1% top-1 accuracy and 74.4% top-5 accuracy. Pouladzadech et al. [74] classified 30 food classes using a cloud-based SVM classifier, achieving 94.5% accuracy. They used a combination of features, including colour, texture, size and shape, while most prevailing methods use only colour and shape features. Table IV summarizes traditional food classification approaches and their main characteristics.

B. Deep Learning Approaches

The CNN is a class of deep neural networks (DNNs) and constitutes the state-of-the-art method in image recognition. CNNs are mostly used to analyse visual imagery and frequently work behind the scenes (in hidden layers) in image classification. A CNN convolves learned features with input data and uses 2D convolutional layers, which makes this type of network ideal for processing 2D images. Compared to other image classification algorithms, CNNs use very little pre-processing: a CNN works by extracting features from images, eliminating the need for manual feature extraction. The features are not hand-engineered but are learned while the network is trained on a set of images, which makes deep learning models extremely accurate for computer vision tasks. CNNs learn feature detection through tens or hundreds of hidden layers, with each layer increasing the complexity of the learned features.

Several studies use pre-trained CNN models [77-82] to classify food images, such as Inception V3 [83, 84] and EfficientNet [85, 86]. Moreover, fine-tuning [87], transfer learning [88] and data augmentation techniques are applied to improve the accuracy of classification models. In the last years, deep learning has definitely been the state of the art for food image classification [89]. Hassannejad et al. [90] evaluated a fine-tuned version of the Inception V3 model, increasing the accuracy and decreasing the computational cost. In particular, they achieved 81.5%, 76.2% and 88.3% top-1 accuracy on the UEC-Food100, UEC-Food256 and Food-101 databases, respectively, and 97.3%, 92.6% and 96.9% top-5 accuracy on the same databases. In another study, a DNN model consisting of two stages was built: the first stage is a residual network, encoding generic visual depictions of food images, while the second stage is a slice network with a slice convolutional layer capturing the vertical food features. The extracted features are concatenated and fed to the fully connected layers that output the classification prediction. Tan and Le [91] proposed a new CNN scaling architecture, EfficientNet. They scaled up the depth, width and resolution of the network, outperforming the state-of-the-art deep learning studies. EfficientNet-B7 achieves 93% accuracy on the Food-101 dataset. In several deep learning-based studies for food recognition, the models are evaluated on the food image databases UEC-Food100 [92], UEC-Food256 [93], Food-101 [94] and VIREO-172 [95].

Figure 9 Boxplot distribution of top-1 accuracy of deep learning-based food recognition algorithms for different food image databases.
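A minimal transfer-learning sketch in the spirit of these studies is shown below, using a pre-trained EfficientNet backbone with a new classification head; the class count, input size and hyperparameters are illustrative assumptions, not the settings of any cited model.

```python
import tensorflow as tf

NUM_CLASSES = 101  # e.g. the number of classes in Food-101 (assumed)

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # freeze ImageNet features, train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy",  # top-1 accuracy
                       tf.keras.metrics.SparseTopKCategoricalAccuracy(5)])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # datasets assumed
```

After the new head converges, the backbone can optionally be unfrozen and trained at a lower learning rate, which is what fine-tuning refers to in the studies above.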


TABLE V
DEEP LEARNING CLASSIFICATION APPROACHES

Authors | Techniques | Database | Top-1 accuracy | Top-5 accuracy
Kagaya et al. 2014, [77] | CNN | Own database | 73.7% | n/a
Christodoulidis et al. 2015, [73] | Patch-wise CNN | Own database | 84.9% | n/a
Pouladzadeh et al. 2016, [80] | CNN | Own database | 99% | n/a
Termritthikun et al. 2017, [81] | NU-InNet1.0 | Own database | 69.8% | 92.3%
He et al. 2020, [78] | CNN | Own database | 88.7% | n/a
Konstantakopoulos et al. 2021, [85] | DCNN | MedGRFood | 83.4% | 97.8%
Ciocca et al. 2016, [13] | CNN | UNIMIB2016 | 78.3% | n/a
Mezgec and Seljak 2017, [40] | NutriNet | UNIMIB2016 | 86.4% | n/a
Chen and Ngo 2016, [24] | Arch-D | VIREO-172 | 82.1% | 95.9%
Min et al. 2019, [94] | IG-CMAN | VIREO-172 | 90.6% | 98.4%
Metwalli et al. 2020, [95] | DenseFood | VIREO-172 | 81.2% | 95.4%
Kawano and Yanai 2014, [82] | Pre-trained DCNN | UEC-Food100 | 72.3% | 92.0%
Yanai and Kawano 2015, [87] | DCNN-Food | UEC-Food100 | 78.8% | 95.2%
Hassannejad et al. 2016, [90] | Inception V3 | UEC-Food100 | 81.5% | 97.3%
Liu et al. 2016, [89] | DeepFood | UEC-Food100 | 76.3% | 94.6%
Liu et al. 2017, [83] | Inception Module | UEC-Food100 | 77.5% | 94.6%
Martinel et al. 2018, [79] | WISeR | UEC-Food100 | 89.6% | 99.2%
Arslan et al. 2021, [92] | ResNeXt101 & DenseNet161 | UEC-Food100 | 90.0% | n/a
Yanai and Kawano 2015, [87] | DCNN-Food | UEC-Food256 | 67.6% | 89.0%
Hassannejad et al. 2016, [90] | Inception V3 | UEC-Food256 | 76.2% | 92.6%
Liu et al. 2016, [89] | DeepFood | UEC-Food256 | 54.7% | 81.5%
Liu et al. 2017, [83] | Inception Module | UEC-Food256 | 54.5% | 87.0%
Martinel et al. 2018, [79] | WISeR | UEC-Food256 | 83.2% | 93.4%
Zhao et al. 2020, [93] | JDNet | UEC-Food256 | 84.0% | 96.2%
Bossard et al. 2014, [23] | CNN | Food-101 | 56.4% | n/a
Yanai and Kawano 2015, [87] | DCNN-Food | Food-101 | 70.4% | n/a
Meyers 2015, [52] | GoogLeNet | Food-101 | 79.0% | n/a
Hassannejad et al. 2016, [90] | Inception V3 | Food-101 | 88.3% | 96.9%
Liu et al. 2016, [89] | DeepFood | Food-101 | 77.4% | 93.7%
Chen and Ngo 2016, [24] | Arch-D | Food-101 | 82.1% | 97.3%
Pandey et al. 2017, [32] | Ensemble Net | Food-101 | 72.1% | 91.6%
Liu et al. 2017, [83] | Inception Module | Food-101 | 77.0% | 94.0%
Cui et al. 2018, [88] | DSTL | Food-101 | 90.4% | n/a
Martinel et al. 2018, [79] | WISeR | Food-101 | 90.3% | 98.7%
Tan and Le 2019, [91] | EfficientNetB7 | Food-101 | 93.0% | n/a
Merchant and Pande 2019, [84] | ConvFood | Food-101 | 70.0% | n/a
Min et al. 2019, [94] | IG-CMAN | Food-101 | 90.4% | 98.4%
Zhao et al. 2020, [93] | JDNet | Food-101 | 91.2% | 98.8%
VijayaKumari et al. 2022, [86] | EfficientNetB0 | Food-101 | 80.0% | n/a

Figure 9 shows the box plots of the top-1 accuracy achieved by deep learning approaches for existing food image databases. We observe that the top-1 accuracy features a high interquartile range for the UEC-Food256 and Food-101 databases; this is an indication of the complexity characterising multi-class problems. On the other hand, a higher and less spread top-1 accuracy is obtained for databases with a small number of classes or focused on specific tasks. Figure 10 presents the percentage usage of existing food image databases as development datasets in food recognition, where databases with a large number of classes are used more often. In addition, a considerable amount of studies (18%) do not report any information about the used databases, diminishing their replicability potential. We observe that Food-101 is the database with the highest percentage of use, while newer databases have been used very little. Table V presents the main characteristics of deep learning approaches applied in food image classification. We can observe that the accuracy of conventional classification models can be improved by combining feature extraction algorithms. Moreover, the combination of different classifiers seems to work better than using only one classifier. In addition, we notice that the traditional approaches are used on small food datasets where deep learning techniques cannot be applied, and it is obvious that deep learning techniques for food image recognition outperform the traditional ones [19]. Although CNNs were firstly used to extract features that feed a classifier, in recent years only deep learning models have been used to classify food images. Furthermore, we note that there is a tendency to use deeper networks to train food image classification models (for example, EfficientNet-B7 consists of 813 layers). However, the need for computing power seems to limit the possibilities of such an approach. In the future, with the ever-increasing computing power to train deep learning models (e.g., deep learning cloud servers) and to build deeper networks, combined with training on larger datasets, their performance can be further improved.


Figure 10 Percentage use of food image databases in food recognition-related studies.

V. FOOD VOLUME ESTIMATION

The last step in food nutritional composition systems comprises the estimation of the food quantity and the analysis of its nutritional composition, such as carbohydrates, proteins, fat and total calories. Accurate estimation of the amount of food assumes that the previous stages of segmentation and recognition of the food have been accomplished correctly. Then, using appropriate approaches, such as 3D reconstruction, pre-built shape templates, perspective transformation, depth cameras and deep learning techniques, the volume of the food is estimated. This is a demanding process which in most cases requires a specific number of photos and a specific way of taking them, a controlled environment and, in many cases, dedicated cameras for capturing food images. In fact, calculating the nutritional composition of a food is a challenging task, even for nutritionists. This is why, in many nutritional estimation systems, it is considered appropriate to have a reference object to determine the depth of the image. The metrics which are used to evaluate the volume of food are the mean absolute error (MAE):

$MAE = \frac{1}{n} \sum_{j=1}^{n} |V_{real} - V_{est}|$, (8)

the mean absolute percentage error (MAPE):

$MAPE = \frac{1}{n} \sum_{j=1}^{n} \left| \frac{V_{real} - V_{est}}{V_{real}} \right| \times 100$, (9)

and the root mean square error (RMSE):

$RMSE = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (V_{real} - V_{est})^2}$, (10)

where Vreal is the real volume of the food, Vest is the estimated volume and n is the total number of foods.

Several studies require taking two or more images of the food for its 3D reconstruction [96, 97]. The first step in these studies is the feature point extraction, using appropriate feature extraction algorithms, among others SIFT and SURF. Then, the relative camera pose is estimated between the captured images. Furthermore, reference objects with known dimensions, for instance a reference card, are used to estimate the scale of the image. Consequently, dense stereo matching is utilized for 3D food reconstruction, projecting the image coordinate system to the world coordinates. The next step is to estimate the volume of the food by removing the background from the image and keeping only the food in it. Finally, the nutritional composition of the food is analysed using the relevant nutrient database, such as the USDA Food and Nutrient Database for Dietary Studies (FNDDS) [98]. Dehais et al. [99] estimated the volume of multi-food meals by capturing two images, with the food placed inside an elliptical plate and a reference card placed next to it. The proposed system comprises three stages. The first stage is extrinsic calibration (computation of the camera rotation and translation matrices), which is performed in three steps: salient point matching, relative pose extraction and scale extraction. The second stage is dense reconstruction, which also consists of three steps: rectification of the images, stereo matching and point cloud generation (Figure 11). Volume estimation is the final stage, which consists of the following steps: food surface extraction, dish surface extraction and volume calculation.

Figure 11 Dense reconstruction steps of two captured images.
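For reference, Eqs. (8)-(10) reduce to one-liners over arrays of real and estimated volumes; the values below are toy numbers for illustration, not measurements from any reviewed study.

```python
import numpy as np

v_real = np.array([250.0, 120.0, 400.0])  # ground-truth volumes in ml
v_est = np.array([230.0, 135.0, 430.0])   # estimated volumes in ml

mae = np.mean(np.abs(v_real - v_est))                    # Eq. (8)
mape = np.mean(np.abs((v_real - v_est) / v_real)) * 100  # Eq. (9)
rmse = np.sqrt(np.mean((v_real - v_est) ** 2))           # Eq. (10)
print(mae, mape, rmse)
```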


TABLE VI
FOOD VOLUME ESTIMATION APPROACHES

Authors | Approach | Techniques | Performance
Rahman et al. 2012, [96] | A pair of stereo images for dense reconstruction | Feature matching and stereo rectification; camera calibration; disparity and depth map generation; 3D reconstruction and 3D volume estimation | The average error is 7.7% for 6 fruits
Dehais et al. 2017, [99] | Two-view dense 3D reconstruction | Extrinsic calibration; dense reconstruction; volume estimation | Mean absolute percentage error ranging from 8.2% to 9.8% on two different datasets
Gao et al. 2018, [97] | Monocular wearable camera based on a SLAM system for food volume estimation | Sparse-map generation by the SLAM method; convex-hull algorithm to form a 3D mesh object; volume estimation based on the 3D mesh object | Mean volume estimation error is 15.98% on static food and 20.54% during food consumption
Konstantakopoulos et al. 2021, [31] | Structure-from-motion 3D reconstruction with a reference card | Feature matching and pose estimation; stereo matching and 3D reconstruction; scale determination and volume estimation | MAPE from 4.6% to 11.1% for seven types of food
Xu et al. 2013, [100] | Use of a pre-built 3D model of food items followed by pose estimation | 3D model generation; pose initialization; pose finalization | The average error is 10% for 5 types of food
Jia et al. 2014, [101] | An electronic device (eButton) captures images every 2 to 4 s | A self-developed image undistortion algorithm applied to the image with the best quality; a virtual shape method used to measure the portion size | 85 out of the 100 food items have less than 30% error
Fang et al. 2015, [102] | Single-view 3D reconstruction using the geometric contextual information from the scene | Points-of-interest estimation; area-based volume estimation through the prism model; weight estimation through the food density | Achieves less than 6% error in energy estimation
Okamoto and Yanai 2016, [103] | Food calorie estimation from a single image | A user needs to register a reference object of known size; a quadratic curve is estimated from the 2D size of foods to their calories; the quadratic curve is trained on training data annotated with real food calories independently | Relative average error on calorie estimation is 21.3% for 60 food images
Jia et al. 2012, [104] | Utilizes the plate and LED methods | Object location and orientation using the plate method; object location and orientation using the LED method | The average error is 12.01% for the plate method and 29.01% for the LED method
Yue et al. 2012, [106] | Single image with a circular container of known size | Perspective transformation; orientation estimation; dimension estimation | Average length and thickness estimation error 3.41%
He et al. 2013, [105] | Shape-template 3D reconstruction and area-based weight estimation for foods | Camera calibration; camera pose information; shape template method to reconstruct a 3D food item; food area and weight estimation | The average error is 11% for beverage images using a cylinder shape, and 10% for area-based weight estimation
Pouladzadeh et al. 2014, [108] | Two captured photos (top and side of the food), with the user's thumb as reference | Food area measurement from the top-view captured image; depth estimation from the side-view captured image | For 5 types of food, 10% error in the worst case and less than 1% error in the best case
Yang et al. 2019, [107] | A fiducial-marker-free method, making use of the smartphone motion sensor | A smartphone motion sensor determines the camera orientation; the length or the width of the smartphone determines the location of any visible point on the tabletop; the food image is captured in a special way | The average absolute error is 16.65% for ten types of food
Chen et al. 2012, [28] | Depth camera (Microsoft Kinect) | Calculate the area of the food container; calculate the depth value of the contained food | The system performance is not reported
Ando et al. 2019, [109] | Food volume and calorie estimation using a depth camera | Take an RGB-D food image; estimate the volumes of food on the dish; calculate the foods' calories using the pre-registered calorie density of each food category | The proposed system achieves higher accuracy than the CalorieCam and AR CalorieCam V2 applications

The system was evaluated on 77 food dishes of known volume, and achieved a MAPE from 8.2% to 9.8% on two different datasets. It is worth mentioning that, in order to extract the relative pose, the researchers modified the classical Random Sample Consensus (RANSAC) algorithm by including local optimization and an adaptive threshold estimation method. 3D food reconstruction is a methodology that can be applied to food of any shape and to food images captured in a non-controlled environment. However, the need to capture at least two images, as well as to extract the features using image processing algorithms such as SIFT or SURF, makes the methodology sensitive to the acquisition of the images and makes the process significantly slower, affecting the food volume estimation accuracy.
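Once a 3D point cloud of a food item has been reconstructed, one simple volume estimate, analogous to the convex-hull step of the SLAM-based system in Table VI, treats the food as its convex hull. The random points below stand in for a reconstructed food surface; note that this estimate overestimates the volume of concave foods.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Placeholder point cloud: 500 points inside a 10 x 8 x 4 cm bounding box.
points = np.random.rand(500, 3) * [0.10, 0.08, 0.04]  # metres (x, y, z)
hull = ConvexHull(points)
volume_ml = hull.volume * 1e6  # m^3 -> cm^3 (= ml)
print(f"estimated volume: {volume_ml:.1f} ml")
```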


TABLE VI (CONTINUED)

Meyers et al. 2015, [52]
Method: Use a CNN model to estimate the 3D volume
• A CNN architecture predicts the depth map
• The depth map is converted to a voxel representation
Performance: absolute error up to 400 ml across the 11 meals of the NFood-3D dataset

Christ et al. 2017, [112]
Method: State-of-the-art deep learning methods
• Predict the depth map using a CNN
• Estimate the bread units using ResNet-50
Performance: RMSE for bread units of 1.53; food categories were evaluated on the Diabetes60 dataset

Fang et al. 2018, [113]
Method: Use of a generative model for food energy estimation
• A GAN is trained on paired images to map a food image to its equivalent energy distribution image
Performance: 10.89% energy estimation error; 2095 paired images were used for the generative network

Fang et al. 2019, [111]
Method: Estimate food energy based on learned energy distribution images
• A GAN estimates the image-to-energy mappings
• A CNN regression model estimates the energy value from the learned energy distribution images
Performance: average food energy estimation error of 209 kcal for 347 food images

Lo et al. 2019, [110]
Method: Vision-based method using real-time 3D reconstruction and deep learning view synthesis
• A mobile phone with depth sensors captures a single depth image
• A fine-tuned Mask R-CNN segments the food items
• The depth image is converted from image to camera coordinates
• The partial point cloud is directed to a point completion network to perform 3D reconstruction
• The portion size of each food item is estimated
Performance: average error ranging from 15 to 79 cm³ for eleven types of food
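Several of the rows above ([52], [109], [110]) ultimately turn a per-pixel depth map into a volume. The sketch below shows one common way to do this under the pinhole camera model; it is a generic illustration, not the code of any cited study, and it assumes a segmented food mask, a known depth for the empty plate plane, and camera focal lengths (all parameter names are illustrative).

```python
import numpy as np

def volume_from_depth(depth_m: np.ndarray, food_mask: np.ndarray,
                      plate_depth_m: float, fx: float, fy: float) -> float:
    """Integrate food volume (in cubic metres) from a depth map.

    depth_m:       HxW per-pixel depth in metres from the depth sensor
    food_mask:     HxW boolean segmentation of the food pixels
    plate_depth_m: depth of the empty plate/table plane
    fx, fy:        camera focal lengths in pixels (pinhole model)
    """
    # Height of the food surface above the plate at every food pixel
    height = np.where(food_mask, np.clip(plate_depth_m - depth_m, 0.0, None), 0.0)
    # At depth z, one pixel covers (z / fx) * (z / fy) square metres of the scene
    pixel_area = (depth_m / fx) * (depth_m / fy)
    # Summing height x footprint over the mask approximates the volume integral
    return float(np.sum(height * pixel_area))
```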

The requirement for predefined geometrical shapes or templates for the 3D reconstruction of food renders these methods extremely difficult to use in systems for daily dietary monitoring, because of the different and irregular shapes that food items present. For instance, in [103], the dimensions of the reference object used by the user must be pre-registered so that the real size of the food region can be calculated. They assume that the food portion height is correlated with the food size, and they estimate the calories of food items directly from the food size. For this purpose, they utilize a quadratic curve estimating food calories from the 2D food size; the quadratic curve of each food is fitted on data annotated with real food calories. This approach gives good results for foods with a regular shape, such as lasagna and cheesecake; otherwise, the calculated amount of food is inaccurate, and the method must be used in conjunction with volume estimation methodologies for foods of irregular shape. For food items with irregular 3D shapes, researchers suggest using area-based volume estimation methods from a single image [104, 105]. The pinhole camera model provides a perspective transformation from the 3D scene to the 2D image plane [106]. Perspective transformation is a linear projection in which 3D objects are projected onto a picture plane; it causes distant objects to appear smaller than nearer ones, and parallel lines appear to intersect in the projected image. In order to accurately determine the food region, the 2D image should be rectified so that the projective distortion is removed; in this case, the existence of a reference object in the 2D image is a prerequisite [107]. In [108], the proposed system requires the user's thumb to be placed beside the dish when capturing the picture. The system, which already knows the dimensions of the user's thumb, can then calculate the area of each food item and multiplies the total area of food (TA) by the depth (d) of the image to estimate its volume. The advantage of the perspective transformation methodology is that it can handle irregular shapes from a single image; its disadvantages are that it requires a special capture procedure for the food images and that the distance cannot be computed accurately.
In order to obtain the depth of the food image, some studies suggest the use of special devices and sensors. In [109], new-generation smartphone cameras (a Time-of-Flight (ToF) sensor or a depth-sensing camera) were utilized to estimate depth and distance, where a pair of rear cameras can create the depth map in real time. The need for an additional depth camera makes this approach less popular. However, as the technology for capturing 3D images with smartphones develops, the depth-camera methodology is expected to dominate in the coming years, although at the moment the high cost of these smartphones prohibits the wide use of such technology.
Figure 12. Percentage use of each volume estimation approach.
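To make the reference-object arithmetic of [108] discussed above concrete, here is a simplified sketch of the TA × d computation. The thumb width value and all parameter names are illustrative assumptions; in the actual system the thumb dimensions are pre-registered per user and the depth d comes from the side-view photo.

```python
THUMB_WIDTH_CM = 2.0  # assumed pre-registered width of the user's thumb

def food_volume_cm3(food_area_px: int, thumb_width_px: int,
                    depth_cm: float) -> float:
    """Approximate volume as total food area (TA) times depth (d)."""
    # The known thumb width fixes the pixel-to-centimetre scale of the top view
    cm_per_px = THUMB_WIDTH_CM / thumb_width_px
    # Total food area TA converted to real-world units
    total_area_cm2 = food_area_px * cm_per_px ** 2
    # Volume approximated as TA x d, as described in the text above
    return total_area_cm2 * depth_cm
```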


In recent years, with the ever-increasing use of deep learning networks in computer vision problems [110], deep learning has also been applied to food volume estimation. Moreover, the ever-increasing computing power has allowed the use of Generative Adversarial Networks (GANs) to estimate the amount of food [111], providing a new dimension to the solution of this problem. In [112], a CNN is employed to deduce the depth from RGB food images, to be used in Bread Unit (BU) regression. For this purpose, they created a large-scale dataset of around 9K RGB-D images of 60 western dishes taken with a Microsoft Kinect v2 sensor, and they proved that depth maps predicted from RGB images can replace RGB-D input data, which is of high importance for the BU regression task. In another study [113], GANs are utilized to estimate the food energy distribution. For the GAN training, they created a food image dataset consisting of 1875 paired images, based on ground-truth food labels and segmentation masks for each food image, including energy information correlated with the food image. The average energy estimation error is 10.89%. In Figure 12 we can observe a quasi-even use of the different food volume estimation approaches, except for depth-camera-based ones, with deep learning and perspective transformation each covering 25% of the studies. Table VI summarizes the main food volume estimation approaches, along with the techniques used to estimate the amount of food and their performance.

VI. DISCUSSION

The 21st century is characterized as the century of the data explosion. With AI and the Internet of Things (IoT) becoming omnipresent technologies, a huge amount of data is now being created. Since the enormous volume of image data we receive is not structured, we rely on advanced techniques, such as machine learning, for efficient image analysis. Food image databases, food image segmentation, food classification and food volume estimation are parts of image analysis and can be used in dietary assessment systems as part of mobile health (mHealth) applications, capturing images through a smartphone. Smartphones are used today by people of all ages and make it easy to capture photos, and food images in particular, offering the possibility of continuous recording of health data in real time. The use of mobile devices and cloud technology to monitor health data and share it with physicians can lead to faster diagnosis and fewer misdiagnoses of diseases such as diabetes and CVDs.
In vision-based dietary assessment systems, all stages are important towards building a reliable integrated system for food nutrition analysis. Although dietary assessment systems have been researched for many years, several challenges remain to be explored.
The way food images are captured plays an important role in the individual steps of these systems. For both the creation of the databases and their input to the food analysis systems, the way the images are taken affects the performance of segmentation, classification, and volume estimation. In database creation, similar foods must be captured in a way that emphasizes their distinguishing features. To input food images in the dietary assessment system, many applications require capturing images from specific shooting angles [99] and with specific objects placed next to them [108]. These prerequisites make such applications difficult to use and deter users from employing them, which renders it imperative to create simpler systems.
In food image databases, the use of deep learning techniques for food recognition tends to require databases with the largest possible number of images for each food class. However, the existing databases are limited in the number of food classes they cover, depending on the dietary habits of the database constructor. Thus, there is a necessity to create a generic food image database which covers as many food categories as possible and represents the types of food from all cuisines. The collection of food images and the creation of food image databases is an easier task nowadays, due to the habit of capturing and posting images on social media. However, creating a database that additionally includes the ingredients of the food or its weight is still a demanding task. Furthermore, annotating a database of food images with their weight, in addition to the type of food, will help build better and more accurate models for the next steps of nutritional analysis systems. Also, one possible way to increase the number of images per food class is to use GAN models. Finally, it is worth mentioning that the acquisition of databases remains difficult, and the creation of a unified food image database has not yet been achieved.
In several recent studies, the step of food image segmentation is omitted, and in some others the performance of this step is not reported. In other studies, although the performance of the methods used to segment food images is high and improves the classification accuracy, there are still open issues related to cases of mixed foods, as well as to cases where lighting conditions create shadows or reflections in the image or blur the food items contained in it. In these cases, state-of-the-art segmentation techniques, such as semantic and instance segmentation, can be used to improve the performance of this step and, in turn, the efficiency of the classification step.
Studies have shown that deep learning techniques perform better than traditional food image classification techniques, which is why they are considered the state-of-the-art methods for food image classification. To classify food images, as mentioned above, databases with a large number of food images are required. This requirement becomes even greater for deep learning techniques, where the number of images in the database affects the performance of the food image classification system. In addition, blurred images, inadequate lighting conditions during capture and the different ways of cooking the same food can lead to misidentification of the food. The use of deeper classification models and the application of transfer learning, fine-tuning, and data augmentation techniques could improve the accuracy of deep learning classification models. The use of pre-trained DNNs on existing food image databases could lead to models with better accuracy and even lower loss.
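As a minimal sketch of the transfer-learning and fine-tuning recipe referred to above, the following PyTorch snippet fine-tunes an ImageNet-pre-trained backbone on a food image database. The class count, augmentation choices and learning rates are illustrative assumptions, one common instantiation rather than a prescription from any reviewed study.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_CLASSES = 101  # e.g., number of food classes in the target database

# Data augmentation mimicking the capture variability discussed above
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),  # lighting variation
    transforms.ToTensor(),
])

# Transfer learning: start from an ImageNet-pre-trained backbone
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new classifier head

# Fine-tuning: smaller learning rate for pre-trained layers, larger for the head
optimizer = torch.optim.SGD(
    [
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("fc.")], "lr": 1e-4},
        {"params": model.fc.parameters(), "lr": 1e-3},
    ],
    momentum=0.9,
)
criterion = nn.CrossEntropyLoss()
```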


TABLE VII
COMPARISON OF EXISTING REVIEW STUDIES
Columns: (1) V. Bruno et al., J. Health Med. Inform., 2017 [18]; (2) M. C. Archundia Herrera et al., Nutrients, 2018 [20]; (3) W. Min et al., ACM Comput. Surv., 2019 [21]; (4) F. P. W. Lo et al., IEEE JBHI, 2020 [19]; (5) W. Wang et al., Trends Food Sci. Technol., 2022 [22]; (6) This review.

Analysis of Food Image Databases                                             (1) (2) (3) (4) (5) (6)
Reporting general database information                                        ✓   -   ✓   -   ✓   ✓
Reporting # of images and # of classes                                        ✓   -   ✓   -   ✓   ✓
Reporting the image acquisition process                                       ✓   -   ✓   -   -   ✓
Reporting database use                                                        -   -   ✓   -   ✓   ✓
Reporting pros and cons                                                       -   -   ✓   -   ✓   ✓
Reporting future directions                                                   ✓   -   ✓   -   ✓   ✓

Analysis of Food Image Segmentation Techniques
Categorisation of techniques to:
  Semi-automatic approaches                                                   -   -   -   -   -   ✓
  Automatic ML-based approaches with handcrafted feature extraction           -   -   -   -   -   ✓
  Automatic ML-based approaches using deep learning for feature extraction    -   -   -   -   ✓   ✓
Reporting the description of the approach followed for each reviewed study    ✓   -   -   -   ✓   ✓
Reporting of performance                                                      ✓   -   -   -   ✓   ✓
Reporting performance metrics                                                 -   -   -   -   ✓   ✓
Reporting pros and cons                                                       ✓   -   -   -   -   ✓
Reporting future directions                                                   -   -   -   -   -   ✓

Analysis of Food Image Classification Techniques
Categorisation of techniques to ML & DL approaches                            ✓   -   ✓   ✓   ✓   ✓
Reporting the database used                                                   ✓   -   -   ✓   ✓   ✓
Reporting of performance                                                      ✓   -   ✓   ✓   ✓   ✓
Reporting performance metrics                                                 -   -   -   -   ✓   ✓
Reporting pros and cons                                                       ✓   -   ✓   ✓   -   ✓
Reporting future directions                                                   -   -   ✓   ✓   ✓   ✓

Analysis of Food Volume Estimation Techniques
Categorisation of techniques to:
  3D reconstruction                                                           -   -   -   ✓   ✓   ✓
  Pre-built shape template                                                    -   -   -   ✓   -   ✓
  Perspective transformation                                                  -   -   -   ✓   -   ✓
  Depth camera                                                                -   -   -   ✓   -   ✓
  Deep learning                                                               -   -   -   ✓   -   ✓
Reporting the description of the approach followed for each reviewed study    ✓   ✓   -   ✓   -   ✓
Reporting of performance                                                      ✓   ✓   -   ✓   ✓   ✓
Reporting performance metrics                                                 -   -   -   -   ✓   ✓
Reporting pros and cons                                                       ✓   ✓   -   ✓   ✓   ✓
Reporting future directions                                                   -   ✓   -   ✓   ✓   ✓

Reporting food intake monitoring devices and apps                             -   ✓   ✓   -   -   -

Volume and nutrient estimation are the most challenging tasks in automated vision-based dietary assessment systems. The controlled environment required for capturing food images, the need to take multiple photos, the inability to estimate the volume of foods with weak texture features (for instance, yogurt), and the creation of databases tailored to the techniques used in each study render the estimation of the amount of food from images the most demanding stage of nutrient analysis systems. In addition, the need for a reference object or a depth camera to calculate the scale and quantity of food limits the possibility of extensive use. Moreover, food estimation techniques based on geometric templates allow volume estimation for only the few foods which have a specific shape. Finally, although the recent use of deep learning techniques in food volume estimation is a very promising approach, studies have shown that they do not outperform the existing techniques. In the 3D reconstruction approach, CNNs could be used instead of image processing algorithms to extract the features, significantly increasing the number of matched features and improving the reconstruction of the 3D food point cloud. One possible approach that would solve many problems regarding the way images are captured, the number of images required and the depth sensors needed would be to build a machine learning model on a food image database annotated with the weight of the food items.


Considering the continuous technological development and the evolving techniques for recording data, the use of alternative ways to enter data and information related to the consumed food (for example, via speech or text) could help optimize the performance of nutritional analysis systems. In particular, combining traditional food recognition and quantity estimation techniques with voice and text input and processing techniques could further improve the performance of nutritional assessment systems. In addition, using advanced deep learning techniques and algorithms, such as reinforcement learning, it is possible to build dietary assessment systems based on personalized nutrition, providing dynamic dietary recommendations by monitoring the user's environment and aiming to optimize a reward function.
Table VII provides a comparative assessment of existing review studies, including our work, with respect to the elements of dietary assessment systems that are reviewed and assessed therein. Considering the level of information (quality, quantity, and granularity) provided by the existing reviews, we aimed herein at improving the completeness of the information by reviewing all the elements of such a system (Sections II-V) and capturing, without bias, all the different classes of methods, techniques and algorithms that have been proposed over the last 10 years in the specified research topic. In this direction, the above discussion of both the strengths and the limitations of the existing approaches, alongside the identification of solutions to their shortcomings, aims at strengthening future research works.

VII. CONCLUSION

This review study assessed and contrasted the methods constituting the intelligence logic of a dietary assessment system, aiming to present to the reader the potentialities of the existing approaches. First, we highlighted the need for annotated food image databases including meals from multiple cuisines and with adequate size per class, in view of their use as training/test sets in image segmentation or image classification tasks. Second, we stressed the potential of instance and semantic image segmentation approaches to augment the performance of food classification models orchestrated under the same pipeline. Third, we verified, as expected, the superiority of deep learning architectures in classifying the content of food images over conventional machine learning algorithms, and the tendency of increasing the number of hidden layers to increase the accuracy of predictions. Finally, further annotation of food images (e.g., with respect to their weight) could complement the current functionality of food volume estimation approaches.

REFERENCES
[1] World Health Organization, Obesity and overweight. Available: https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight [Accessed: 9 June 2021].
[2] World Health Organization, Diabetes. Available: https://www.who.int/news-room/fact-sheets/detail/diabetes [Accessed: 16 September 2022].
[3] International Diabetes Federation, Diabetes facts & figures. Available: https://www.idf.org/aboutdiabetes/what-is-diabetes/facts-figures.html [Accessed: 19 December 2021].
[4] World Health Organization, Cardiovascular diseases (CVDs). Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) [Accessed: 11 June 2021].
[5] M. Rusin et al., "Functionalities and input methods for recording food intake: a systematic review," International Journal of Medical Informatics, vol. 82, no. 8, pp. 653-664, 2013.
[6] M. B. E. Livingstone et al., "Issues in dietary intake assessment of children and adolescents," British Journal of Nutrition, vol. 92, no. S2, pp. S213-S222, 2004.
[7] T. Hernández et al., "Portion size estimation and expectation of accuracy," Journal of Food Composition and Analysis, vol. 19, pp. S14-S21, 2006.
[8] M. R. Graff et al., "How well are individuals on intensive insulin therapy counting carbohydrates?," Diabetes Research and Clinical Practice, no. 50, pp. 238-239, 2000.
[9] F. S. Konstantakopoulos et al., "GlucoseML Mobile Application for Automated Dietary Assessment of Mediterranean Food," in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2022, pp. 1432-1435: IEEE.
[10] J. Ngo et al., "A review of the use of information and communication technologies for dietary assessment," British Journal of Nutrition, vol. 101, no. S2, pp. S102-S112, 2009.
[11] F. Jiang et al., "Artificial intelligence in healthcare: past, present and future," Stroke and Vascular Neurology, vol. 2, no. 4, pp. 230-243, 2017.
[12] M. C. Carter et al., "Adherence to a smartphone application for weight loss compared to website and paper diary: pilot randomized controlled trial," Journal of Medical Internet Research, vol. 15, no. 4, p. e32, 2013.
[13] G. Ciocca et al., "Food recognition: a new dataset, experiments, and results," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 3, pp. 588-598, 2016.
[14] Y. Matsuda et al., "Recognition of multiple-food images by detecting candidate regions," in 2012 IEEE International Conference on Multimedia and Expo, 2012, pp. 25-30: IEEE.
[15] F. Zhu et al., "Multiple hypotheses image segmentation and classification with application to dietary assessment," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 1, pp. 377-388, 2014.
[16] S. J. Minija and W. S. Emmanuel, "Food recognition using neural network classifier and multiple hypotheses image segmentation," The Imaging Science Journal, vol. 68, no. 2, pp. 100-113, 2020.
[17] P. Pouladzadeh et al., "Using graph cut segmentation for food calorie measurement," in 2014 IEEE International Symposium on Medical Measurements and Applications (MeMeA), 2014, pp. 1-6: IEEE.
[18] V. Bruno and C. J. Silva Resende, "A survey on automated food monitoring and dietary management systems," Journal of Health & Medical Informatics, vol. 8, no. 3, 2017.
[19] F. P. W. Lo et al., "Image-Based Food Classification and Volume Estimation for Dietary Assessment: A Review," IEEE Journal of Biomedical and Health Informatics, vol. 24, no. 7, pp. 1926-1939, 2020.
[20] M. C. Archundia Herrera and C. B. Chan, "Narrative review of new methods for assessing food and energy intake," Nutrients, vol. 10, no. 8, p. 1064, 2018.
[21] W. Min et al., "A survey on food computing," ACM Computing Surveys, vol. 52, no. 5, pp. 1-36, 2019.
[22] W. Wang et al., "A review on vision-based analysis for automatic dietary assessment," Trends in Food Science & Technology, 2022.
[23] L. Bossard et al., "Food-101 - mining discriminative components with random forests," in European Conference on Computer Vision, 2014, pp. 446-461: Springer.
[24] J. Chen and C.-W. Ngo, "Deep-based ingredient recognition for cooking recipe retrieval," in Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 32-41.
[25] Y. Kawano and K. Yanai, "Automatic expansion of a food image dataset leveraging existing categories with domain adaptation," in European Conference on Computer Vision, 2014, pp. 3-17: Springer.
[26] M. M. Anthimopoulos et al., "A food recognition system for diabetic patients based on an optimized bag-of-features model," IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 4, pp. 1261-1271, 2014.
[27] G. Ciocca et al., "Learning CNN-based features for retrieval of food images," in International Conference on Image Analysis and Processing, 2017, pp. 426-434: Springer.
[28] M.-Y. Chen et al., "Automatic chinese food identification and quantity estimation," in SIGGRAPH Asia 2012 Technical Briefs, 2012, pp. 1-4.
[29] X. Chen et al., "Chinesefoodnet: A large-scale image dataset for chinese food recognition," arXiv preprint arXiv:1705.02743, 2017.
[30] I. Donadello and M. Dragoni, "Ontology-Driven Food Category Classification in Images," in International Conference on Image Analysis and Processing, 2019, pp. 607-617: Springer.
[31] F. Konstantakopoulos et al., "3D Reconstruction and Volume Estimation of Food using Stereo Vision Techniques," in 2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE), 2021, pp. 1-4: IEEE.
[32] P. Pandey et al., "FoodNet: Recognizing foods using ensemble of deep networks," IEEE Signal Processing Letters, vol. 24, no. 12, pp. 1758-1762, 2017.
[33] S. Hou et al., "Vegfru: A domain-specific dataset for fine-grained visual categorization," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 541-549.
[34] G. Waltner et al., "Personalized dietary self-management using mobile vision-based assistance," in International Conference on Image Analysis and Processing, 2017, pp. 385-393: Springer.
[35] H. Muresan and M. Oltean, "Fruit recognition from images using deep learning," Acta Univ. Sapientiae, vol. 10, no. 1, pp. 26-42, 2018.
[36] Q. Yu et al., "Food image recognition by personalized classifier," in 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 171-175: IEEE.
[37] P. Kaur et al., "Foodx-251: a dataset for fine-grained food classification," arXiv preprint arXiv:1907.06167, 2019.
[38] O. Beijbom et al., "Menu-match: Restaurant-specific food logging from images," in 2015 IEEE Winter Conference on Applications of Computer Vision, 2015, pp. 844-851: IEEE.
[39] X. Wang et al., "Recipe recognition with large multimodal food dataset," in 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2015, pp. 1-6: IEEE.
[40] S. Mezgec and B. Koroušić Seljak, "NutriNet: a deep learning food and drink image recognition system for dietary assessment," Nutrients, vol. 9, no. 7, p. 657, 2017.
[41] G. M. Farinella et al., "A benchmark dataset to study the representation of food images," in European Conference on Computer Vision, 2014, pp. 584-599: Springer.
[42] F. Zhou and Y. Lin, "Fine-grained image classification by exploring bipartite-graph labels," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1124-1133.
[43] A. Singla et al., "Food/non-food image classification and food categorization using pre-trained googlenet model," in Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, 2016, pp. 3-11.
[44] G. M. Farinella et al., "Retrieval and classification of food images," Computers in Biology and Medicine, vol. 77, pp. 23-39, 2016.
[45] E. Aguilar et al., "Regularized uncertainty-based multi-task learning model for food analysis," Journal of Visual Communication and Image Representation, vol. 60, pp. 360-370, 2019.
[46] J. Gao et al., "MUSEFood: Multi-Sensor-based food volume estimation on smartphones," in SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI, 2019, pp. 899-906: IEEE.
[47] T. Ege et al., "A new large-scale food image segmentation dataset and its application to food calorie estimation based on grains of rice," in Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management, 2019, pp. 82-87.
[48] K. Okamoto and K. Yanai, "UEC-FoodPIX Complete: a large-scale food image segmentation dataset," in International Conference on Pattern Recognition, 2021, pp. 647-659: Springer.
[49] X. Wu et al., "A large-scale benchmark for food image segmentation," in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 506-515.
[50] S. Aslan et al., "Benchmarking algorithms for food localization and semantic segmentation," International Journal of Machine Learning and Cybernetics, vol. 11, no. 12, pp. 2827-2847, 2020.
[51] Y. Wang et al., "Mixed dish recognition through multi-label learning," in Proceedings of the 11th Workshop on Multimedia for Cooking and Eating Activities, 2019, pp. 1-8.
[52] A. Meyers et al., "Im2Calories: towards an automated mobile vision food diary," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1233-1241.
[53] Y. Kawano and K. Yanai, "Real-time mobile food recognition system," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013, pp. 1-7.
[54] S. Inunganbi et al., "Classification of food images through interactive image segmentation," in Asian Conference on Intelligent Information and Database Systems, 2018, pp. 519-528: Springer.
[55] W. Shimoda and K. Yanai, "CNN-based food image segmentation without pixel-wise annotation," in International Conference on Image Analysis and Processing, 2015, pp. 449-457: Springer.
[56] S. Fang et al., "cTADA: The design of a crowdsourcing tool for online food image identification and segmentation," in 2018 IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), 2018, pp. 25-28: IEEE.
[57] H. Hassannejad et al., "A Mobile App for Food Detection: New approach to interactive segmentation," in Proceedings of the FORITAAL Conference, Lecco, Italy, 2015, pp. 19-22.
[58] Y. Wang et al., "Efficient superpixel based segmentation for food image analysis," in 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 2544-2548: IEEE.
[59] X. Zheng et al., "Image segmentation based on adaptive K-means algorithm," EURASIP Journal on Image and Video Processing, vol. 2018, no. 1, p. 68, 2018.
[60] M. Anthimopoulos et al., "Segmentation and recognition of multi-food meal images for carbohydrate counting," in 13th IEEE International Conference on BioInformatics and BioEngineering, 2013, pp. 1-4: IEEE.
[61] J. Dehais et al., "Food image segmentation for dietary assessment," in Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, 2016, pp. 23-28.
[62] G. Ciocca et al., "Evaluating CNN-based semantic food segmentation across illuminants," in International Workshop on Computational Color Imaging, 2019, pp. 247-259: Springer.
[63] P. Poply and J. A. A. Jothi, "Refined image segmentation for calorie estimation of multiple-dish food items," in 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), 2021, pp. 682-687: IEEE.
[64] M. Bolaños and P. Radeva, "Simultaneous food localization and recognition," in 2016 23rd International Conference on Pattern Recognition (ICPR), 2016, pp. 3140-3145: IEEE.
[65] S. K. Yarlagadda et al., "Saliency-aware class-agnostic food image segmentation," ACM Transactions on Computing for Healthcare, vol. 2, no. 3, pp. 1-17, 2021.
[66] D. Park, J. Lee et al., "Deep learning based food instance segmentation using synthetic data," in 2021 18th International Conference on Ubiquitous Robots (UR), 2021, pp. 499-505: IEEE.
[67] K. J. Pfisterer et al., "Fully-Automatic Semantic Segmentation for Food Intake Tracking in Long-Term Care Homes," arXiv preprint arXiv:1910.11250, 2019.
[68] C. Szegedy et al., "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[69] E. Aguilar et al., "Grab, pay, and eat: Semantic food detection for smart restaurants," IEEE Transactions on Multimedia, vol. 20, no. 12, pp. 3266-3275, 2018.
[70] S. Aslan et al., "Semantic food segmentation for automatic dietary monitoring," in 2018 IEEE 8th International Conference on Consumer Electronics-Berlin (ICCE-Berlin), 2018, pp. 1-6: IEEE.
[71] W. Shimoda and K. Yanai, "Weakly-Supervised Plate and Food Region Segmentation," in 2020 IEEE International Conference on Multimedia and Expo (ICME), 2020, pp. 1-6: IEEE.
[72] H.-T. Nguyen and C.-W. Ngo, "Terrace-based food counting and segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, no. 3, pp. 2364-2372.
[73] S. Christodoulidis et al., "Food recognition for dietary assessment using deep convolutional neural networks," in International Conference on Image Analysis and Processing, 2015, pp. 458-465: Springer.
[74] P. Pouladzadeh et al., "Cloud-based SVM for food categorization," Multimedia Tools and Applications, vol. 74, no. 14, pp. 5243-5260, 2015.
[75] Y. He et al., "Analysis of food images: Features and classification," in 2014 IEEE International Conference on Image Processing (ICIP), 2014, pp. 2744-2748: IEEE.
[76] Y. Kawano and K. Yanai, "Foodcam-256: a large-scale real-time mobile food recognition system employing high-dimensional features and compression of classifier weights," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 761-762.
[77] H. Kagaya et al., "Food detection and recognition using convolutional neural network," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 1085-1088.
[78] J. He et al., "Multi-task Image-Based Dietary Assessment for Food Recognition and Portion Size Estimation," in 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), 2020, pp. 49-54.
[79] N. Martinel et al., "Wide-slice residual networks for food recognition," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 567-576: IEEE.
[80] P. Pouladzadeh et al., "Food calorie measurement using deep learning neural network," in 2016 IEEE International Instrumentation and Measurement Technology Conference Proceedings, 2016, pp. 1-6: IEEE.
[81] C. Termritthikun et al., "NU-InNet: Thai food image recognition using convolutional neural networks on smartphone," Journal of Telecommunication, Electronic and Computer Engineering (JTEC), vol. 9, no. 2-6, pp. 63-67, 2017.
[82] Y. Kawano and K. Yanai, "Food image recognition with deep convolutional features," in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication, 2014, pp. 589-593.
[83] C. Liu et al., "A new deep learning-based food recognition system for dietary assessment on an edge computing service infrastructure," IEEE Transactions on Services Computing, vol. 11, no. 2, pp. 249-261, 2017.
[84] K. Merchant and Y. Pande, "ConvFood: A CNN-Based Food Recognition Mobile Application for Obese and Diabetic Patients," in Emerging Research in Computing, Information, Communication and Applications: Springer, 2019, pp. 493-502.
[85] F. S. Konstantakopoulos et al., "Mediterranean Food Image Recognition Using Deep Convolutional Networks," in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), 2021, pp. 1740-1743: IEEE.
[86] G. VijayaKumari et al., "Food Classification using Transfer Learning Technique," Global Transitions Proceedings, 2022.
[87] K. Yanai and Y. Kawano, "Food image recognition using deep convolutional network with pre-training and fine-tuning," in 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2015, pp. 1-6: IEEE.
[88] Y. Cui et al., "Large scale fine-grained categorization and domain-specific transfer learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4109-4118.
[89] C. Liu et al., "Deepfood: Deep learning-based food image recognition for computer-aided dietary assessment," in International Conference on Smart Homes and Health Telematics, 2016, pp. 37-48: Springer.
[90] H. Hassannejad et al., "Food image recognition using very deep convolutional networks," in Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, 2016, pp. 41-49.
[91] M. Tan and Q. Le, "Efficientnet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning, 2019, pp. 6105-6114: PMLR.
[92] B. Arslan et al., "Fine-grained food classification methods on the UEC food-100 database," vol. 3, no. 2, pp. 238-243, 2021.
[93] H. Zhao et al., "JDNet: A Joint-Learning Distilled Network for Mobile Visual Food Recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 4, pp. 665-675, 2020.
[94] W. Min et al., "Ingredient-Guided Cascaded Multi-Attention Network for Food Recognition," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1331-1339.
[95] A.-S. Metwalli et al., "Food Image Recognition Based on Densely Connected Convolutional Neural Networks," in 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), 2020, pp. 027-032: IEEE.
[96] M. H. Rahman et al., "Food volume estimation in a mobile phone based dietary assessment system," in 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems, 2012, pp. 988-995: IEEE.
[97] A. Gao et al., "Food volume estimation for quantifying dietary intake with a wearable camera," in 2018 IEEE 15th International Conference on Wearable and Implantable Body Sensor Networks (BSN), 2018, pp. 110-113: IEEE.
[98] N. K. Fukagawa et al., "USDA's FoodData Central: what is it and why is it needed today?," The American Journal of Clinical Nutrition, vol. 115, no. 3, pp. 619-624, 2022.
[99] J. Dehais et al., "Two-View 3D Reconstruction for Food Volume Estimation," IEEE Transactions on Multimedia, vol. 19, no. 5, pp. 1090-1099, 2017.
[100] C. Xu et al., "Model-based food volume estimation using 3D pose," in 2013 IEEE International Conference on Image Processing, 2013, pp. 2534-2538: IEEE.
[101] W. Jia et al., "Accuracy of food portion size estimation from digital pictures acquired by a chest-worn camera," Public Health Nutrition, vol. 17, no. 8, pp. 1671-1681, 2014.
[102] S. Fang et al., "Single-view food portion estimation based on geometric models," in 2015 IEEE International Symposium on Multimedia (ISM), 2015, pp. 385-390: IEEE.
[103] K. Okamoto and K. Yanai, "An automatic calorie estimation system of food images on a smartphone," in Proceedings of the 2nd International Workshop on Multimedia Assisted Dietary Management, 2016, pp. 63-70.
[104] W. Jia et al., "Imaged based estimation of food volume using circular referents in dietary assessment," Journal of Food Engineering, vol. 109, no. 1, pp. 76-86, 2012.
[105] Y. He et al., "Food image analysis: Segmentation, identification and weight estimation," in 2013 IEEE International Conference on Multimedia and Expo (ICME), 2013, pp. 1-6: IEEE.
[106] Y. Yue et al., "Measurement of food volume based on single 2-D image without conventional camera calibration," in 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2012, pp. 2166-2169: IEEE.
[107] Y. Yang et al., "Image-based food portion size estimation using a smartphone without a fiducial marker," Public Health Nutrition, vol. 22, no. 7, pp. 1180-1192, 2019.
[108] P. Pouladzadeh et al., "Measuring calorie and nutrition from food image," IEEE Transactions on Instrumentation and Measurement, vol. 63, no. 8, pp. 1947-1956, 2014.
[109] Y. Ando et al., "DepthCalorieCam: A mobile application for volume-based food calorie estimation using depth cameras," in Proceedings of the 5th International Workshop on Multimedia Assisted Dietary Management, 2019, pp. 76-81.
[110] F. P.-W. Lo et al., "Point2Volume: A Vision-based Dietary Assessment Approach using View Synthesis," IEEE Transactions on Industrial Informatics, vol. 16, no. 1, pp. 577-586, 2019.
[111] S. Fang et al., "An end-to-end image-based automatic food energy estimation technique based on learned energy distribution images: Protocol and methodology," Nutrients, vol. 11, no. 4, p. 877, 2019.
[112] P. Ferdinand Christ et al., "Diabetes60 - Inferring Bread Units From Food Images Using Fully Convolutional Neural Networks," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 1526-1535.
[113] S. Fang et al., "Single-view food portion estimation: Learning image-to-energy mappings using generative adversarial networks," in 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 251-255: IEEE.
