Food Classification
Abstract—The process of identifying food items from an image is quite an interesting field with various applications. Since food monitoring plays a leading role in health-related problems, it is becoming more essential in our day-to-day lives. In this paper, an approach is presented to classify images of food using convolutional neural networks. Unlike traditional artificial neural networks, convolutional neural networks are capable of estimating the score function directly from the image pixels. A 2D convolution layer is utilised, which creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. There are multiple such layers, and their outputs are concatenated at parts to form the final tensor of outputs. We also use the max-pooling function on the data, and the features extracted by this function are used to train the network. An accuracy of 86.97% on the classes of the Food-101 dataset is achieved using the proposed implementation.

Index Terms—Convolution filters; Convolution layer; Convolutional neural networks; Food-101 dataset; Food classification; Image recognition; Max pooling.

I. INTRODUCTION

In the current age, people are more conscious about their food and diet in order to avoid either upcoming or existing diseases. Since people depend on smart technologies, the provision of an application that automatically monitors an individual's diet helps in many aspects. It increases people's awareness of their food habits and diet [1]–[5]. Over the last two decades, research has focused on automatically recognising food and its nutritional information from captured images using computer vision and machine learning techniques. In order to properly assess dietary intake, accurate estimation of the calorie value of food is of paramount importance. A majority of people are overeating and not being active enough. Given how busy and stressed people are today, it is effortless to forget to keep track of the food that they eat, which only increases the importance of proper classification of food.

Recently, smart applications for mobile devices such as Android phones and iPhones have increased tremendously. They are capable of balancing the food habits of users and also warning them about unhealthy food. Due to the advances in the various technologies used in smartphones, their computational power has also increased. They are capable of processing real-time multimedia information with this computational power, whereas traditional mobile phones were incapable of this and hence used to send images to high-performance servers, increasing the communication cost and delay. Since present smartphones can also handle high-quality images, research on food classification is focused on developing real-time applications which capture images and train the machine learning models instantly. This helps users take preventive measures against diseases such as diabetes, high blood pressure, and so on.

Some of the methods currently in use for dietary assessment involve self-reporting and manually recorded instruments. The issue with such methods of assessment is that the evaluation of calorie consumption by a participant is prone to bias [6], i.e. underestimating and under-reporting of food intake. In order to increase the accuracy and reduce the bias, enhancements to the current methods are required. One such potential solution is a mobile cloud computing system, which makes use of devices such as smartphones to capture dietary and calorie information. The next step is to automatically analyse the dietary and calorie information employing the computing capacity of the cloud for an objective assessment. However, users still have to enter the information manually. Over the last few years, plenty of research and development efforts have been made in the field of visual-based dietary and calorie information analysis. However, the efficient extraction of information from food images remains a challenging issue.

In this paper, an effort has been made to classify images of food for further diet monitoring applications using convolutional neural networks (CNNs). Since CNNs are capable of handling a large amount of data and can estimate the features automatically, they have been utilised for the task of food classification. The standard Food-101 dataset has been selected as the working database for this approach.

The rest of the paper is organised as follows. Section II details the related works in the field of food classification with their merits and demerits. Section III explains the proposed methodology, including the database selected, and provides a description of the CNN. Section IV discusses the results and the observations. Finally, Section V concludes the work with some future directions.

II. RELATED WORK

The task of food detection was first initiated with four fast-food classes, namely fries, apple pies, hamburgers and chicken burgers [7]. The images were initially segmented to form a feature vector from size, shape, texture, color (normalised RGB), and other context-based features.
With this motivation, a minimised feature vector built from Gabor filter responses (texture), pixel intensity, and color components is used to categorise 19 classes of foods. However, the performance is good for food replicas, and less efficient performance is observed with real images [8]. The size of the images and the variations in how they are captured could be the reason for the performance degradation. Based on this, scale-invariant feature transform (SIFT) features have been extracted and experimented with on homemade foods, fast food, and fruits [9]. With this, better performance is found with a smaller number of classes, even though each class contains more images.

The bag of features (BoF) model, which is derived from the bag of words (BoW) model, is an emerging trend in recent years. It is strongly influenced by natural language processing, where it is designed to capture frequently appearing words while ignoring the order in which they appear [10], [11]. Similarly, images contain some common visual patterns that are useful in recognising the category of food. This approach reduces the complexity issues raised by direct image matching techniques. Based on this, some works using the BoF approach are found in the literature.

The existing literature has utilised a variety of classifiers. The notable and better-performing classifiers among them are artificial neural networks (ANNs), support vector machines (SVMs), Naive Bayes, and AdaBoost. Further, a pairwise classification framework has been proposed to enhance the recognition rate of food classification [12]. Texton histograms have been utilised to resemble BoF models; it is found that they carry less information and fail to deal with high-resolution images. Moreover, the performance is not good under varying lighting conditions. Hence, a colored checker-board is captured so that the system can be used under varying lighting conditions. However, the performance accuracy reduces from 95% to 80% when the number of classes increases, and an accuracy of 62% is reported [8]. The three-dimensional properties of food shapes have also been used to reconstruct the food items and extract further feature values [20].

Deep convolutional neural networks have recently been used for food recognition [21], with the UEC-100 and UEC-256 datasets used for testing and ImageNet and ILSVRC used for training, combining baseline feature extraction with neural network fine-tuning. Another approach [22] uses convolutional neural networks along with a global average pooling layer, which generates Food Activation Maps (FAMs, heat maps of food probability). Fine-tuning is done for FAM generation, which includes adding a convolutional layer with stride and setting a softmax layer. Additionally, bounding boxes are generated via thresholding.

The present work aims to combine some of the above methodologies to create a food classification system that predicts the class of food in an image and also gives the calorie count based on the visible portion size. This concept has high scope in the health sector, as people want to keep track of what and how much they eat, and simplifying the process into this form of implementation increases usage and awareness of health-related factors. Since CNNs have received less focus in this literature, they have been utilised here due to their inherent capability of computing features automatically.

III. PROPOSED METHODOLOGY

Fig. 1. Block diagram of the proposed system: a food image database is preprocessed, split into train and test data, and used to train the CNN classifier.
Fig. 2. The role of convolutional neural networks in the proposed food classification system.
Fig. 3. The configuration of the CNN, which retrains a Google Inception V3 model.
1) Image preprocessing parameters: The following parameters are considered for image preprocessing; a sketch of such a data generator is given after the list.
• Rotation range = 45: Images are randomly rotated by up to 45 degrees. This ensures that images taken at any angle can be predicted correctly, and that the diversity of the patterns obtained (feature maps) is maintained.
• Width shift range = 0.2: Images are shifted horizontally by this fraction of the total width. This allows "incomplete" or "half" images to be predicted, and the patterns obtained will differ.
• Height shift range = 0.2: Images are shifted vertically by this fraction of the total height. The purpose is the same as for the horizontal shift.
• Horizontal flip = True: Images are flipped horizontally. Random flipping of images helps in identifying different patterns and allows mirrored images to be predicted accurately as well.
• Fill mode = reflect: Points outside the boundaries of the images are filled according to this mode.
• Train datagen.config['random crop size']: Assigns the crop size for the images that are fed to the network, in this case 299x299x3. All images are forced to be cropped to this resolution, which ensures the compatibility and uniformity of the input to the neural network.
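The parameter names above closely mirror the options of a Keras ImageDataGenerator. The following is a minimal sketch of such a generator, assuming a Keras environment and a Food-101-style directory of class-labelled training images ('food-101/train' is a placeholder path); it is an illustration rather than the authors' exact pipeline. Note that flow_from_directory resizes images to the target size; the random-crop step described above would require a custom preprocessing function.

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=45,       # random rotations by up to 45 degrees
    width_shift_range=0.2,   # horizontal shifts as a fraction of the width
    height_shift_range=0.2,  # vertical shifts as a fraction of the height
    horizontal_flip=True,    # random horizontal (mirror) flips
    fill_mode='reflect',     # reflect pixels to fill points outside the borders
)

# Images are resized to 299x299x3 so that every input matches the InceptionV3
# input tensor, and labels are returned one-hot encoded ('categorical').
train_generator = train_datagen.flow_from_directory(
    'food-101/train',        # placeholder path to class-labelled training images
    target_size=(299, 299),
    batch_size=32,
    class_mode='categorical',
)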
D. Image Processing to CNN

The present work utilised the Google InceptionV3 model [13], which is pre-trained on ImageNet. Prior to that, all the images are reshaped to a size of 299x299x3. The global average pooling function is applied, which takes the average of all the feature maps of an image. The dimensionality of the output space is defined using the dense() function. A dropout fraction rate of 0.5 on the input units is considered to avoid overfitting issues. Further, to determine the actual class from the n classes, a softmax activation function is defined. It identifies the class with the maximum probability at the output and ignores the rest.
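As an illustration of the description above, the following minimal sketch builds such a model in Keras: InceptionV3 pre-trained on ImageNet, global average pooling, a dropout of 0.5 and a dense softmax layer over the 101 Food-101 classes. Which base layers are frozen or fine-tuned is an assumption here, since it is not stated explicitly.

from keras.applications.inception_v3 import InceptionV3
from keras.layers import GlobalAveragePooling2D, Dense, Dropout
from keras.models import Model

NUM_CLASSES = 101  # Food-101 categories

# InceptionV3 pre-trained on ImageNet, without its original classification head.
base_model = InceptionV3(weights='imagenet', include_top=False,
                         input_shape=(299, 299, 3))

x = GlobalAveragePooling2D()(base_model.output)         # average each feature map
x = Dropout(0.5)(x)                                     # dropout of 0.5 against overfitting
outputs = Dense(NUM_CLASSES, activation='softmax')(x)   # class probabilities

model = Model(inputs=base_model.input, outputs=outputs)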
E. Neural Network Training

The simple CNN used for the proposed work is depicted in Fig. 2. Stochastic gradient descent with a quickly decreasing learning-rate schedule has been used to achieve better performance. The model is trained for 32 epochs and has three callbacks defined. One callback records the progress into a log file; a learning rate scheduler takes the epoch index as input and returns a new learning rate as output; and model checkpoints are made via the checkpointer callback and saved in the form of .hdf5 files. The best score is tracked so that only the best learned models are saved.
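A minimal sketch of this training setup is given below, reusing the model and train_generator objects from the earlier sketches. The decay constants, file names and monitored quantity are assumptions; only the use of SGD with a quickly decreasing schedule, 32 epochs and three callbacks (logger, learning rate scheduler and checkpointer) is taken from the description above.

from keras.optimizers import SGD
from keras.callbacks import CSVLogger, LearningRateScheduler, ModelCheckpoint

def lr_schedule(epoch):
    # Hypothetical quickly decreasing schedule: halve the rate every four epochs.
    return 0.01 * (0.5 ** (epoch // 4))

model.compile(optimizer=SGD(lr=0.01, momentum=0.9),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

callbacks = [
    CSVLogger('training.log'),              # record progress into a log file
    LearningRateScheduler(lr_schedule),     # epoch index in, new learning rate out
    ModelCheckpoint('model.best.hdf5',      # checkpoints saved as .hdf5 files
                    monitor='loss', save_best_only=True),
]

model.fit_generator(train_generator,
                    steps_per_epoch=len(train_generator),
                    epochs=32,
                    callbacks=callbacks)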
F. Usage of Neural Networks and Web Scraping

This subsection describes the use of neural networks and web scraping for the task of food classification.

1) Classification of Food: The first step in tracking calorie intake using images is to identify the food being consumed. The difficulty arises when one considers the various assortments of cuisines and dishes that exist in the real world. Given the size and variety of the food items in the dataset, this has proven to be quite a formidable task. Neural networks seem better suited to deal with this issue of scale, primarily because of their ability to learn patterns that are not linearly separable and to deal with other factors such as noise in the images.
2) Calorific Value Estimation: The remaining task after the process of classification is mapping the food names to a calorific value. This can be achieved relatively easily by scraping the web for the average calorie value of food items per unit weight. The average calorie values are considered for the different food classes, per 100 g of serving.
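A toy illustration of this mapping step is given below. The calorie values and the portion-size argument are placeholders rather than data from the paper; in the described system the table would be populated by web scraping.

# Placeholder per-100 g calorie table (illustrative values only).
CALORIES_PER_100G = {
    'apple_pie': 237,
    'hamburger': 295,
    'french_fries': 312,
}

def estimate_calories(food_label, portion_grams):
    # Scale the average per-100 g value by the estimated portion size.
    return CALORIES_PER_100G[food_label] * portion_grams / 100.0

print(estimate_calories('hamburger', 150))  # calories in a 150 g serving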
3) CNN specifications: The most popular readily available dataset for image classification is the ImageNet database, which has been used to train the Google Inception CNN, and it already has multiple existing classification categories. To generalize the system, another 101 categories are added to this model by training it on the Food-101 dataset. The specifications of the model are as follows. The input tensor has a size of 299x299x3, with a max-pooling downscale of 2 in each spatial dimension, a dropout rate of 0.4, and the softmax activation function. The optimization function used for this task is stochastic gradient descent, which iteratively searches for the minima of the loss function.
4) Image Augmentation: In this step, one-hot encoding is used to get a set of binary features from each label. This is better than a single feature that can take on any of the n class values. An image augmentation pipeline has been used that comes with cropping tools and the Inception image preprocessor. Using a multiprocessing tool allows GPU usage to be maximized.
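A minimal sketch of the one-hot encoding step, assuming integer class indices 0 to 100 for the 101 Food-101 categories:

import numpy as np
from keras.utils import to_categorical

labels = np.array([0, 3, 100])            # example integer class indices
one_hot = to_categorical(labels, num_classes=101)
print(one_hot.shape)                      # (3, 101): one binary feature per class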
IV. RESULTS AND ANALYSIS

This section discusses the results and the observations made while experimenting, starting with the performance measurement techniques used for food classification.

A. Evaluation of Models

Having multiple saved models, it is possible to evaluate them and to load the model with the lowest loss or highest accuracy. Further, a confusion matrix has been considered based on the outputs obtained by the CNNs. A confusion matrix plots, for each class label, how many times it was correctly labelled versus how many times it was incorrectly labelled as a different class. To evaluate the test set, multiple crops have been used instead of a single crop; this increases accuracy compared to a single-crop evaluation scheme. The output is the top-N predictions for each crop, which in turn is used to obtain the top-5 predictions. Crops are created for every item in the test set and are used to get the predictions; hence, predictions for each image are obtained at this stage. A mapping technique is used to map each test item index to its top predictions.
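The following is a minimal sketch of this multi-crop evaluation, reusing the model object from the earlier sketches; how the crops are generated and the choice of averaging the per-crop probabilities are assumptions rather than details given above.

import numpy as np
from sklearn.metrics import confusion_matrix

def top5_for_image(model, crops):
    # crops: array of shape (num_crops, 299, 299, 3) for one test image.
    probs = model.predict(crops)              # per-crop class probabilities
    mean_probs = probs.mean(axis=0)           # combine the crops by averaging
    return np.argsort(mean_probs)[::-1][:5]   # indices of the top-5 classes

def evaluate(model, crops_per_image, true_labels):
    top1_preds, top5_hits = [], 0
    for crops, label in zip(crops_per_image, true_labels):
        top5 = top5_for_image(model, crops)
        top1_preds.append(top5[0])
        top5_hits += int(label in top5)
    cm = confusion_matrix(true_labels, top1_preds)   # per-class correct vs. incorrect counts
    return top5_hits / len(true_labels), cm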
B. Obtained Results

The confusion matrix discussed earlier shows the correctly versus incorrectly labelled classes. From the results, it is found that CNNs are more appropriate for image classification. They provide operations such as filtering and max-pooling that give a better recognition rate for image classification than traditional neural networks. Convolving the image allows features to be extracted regardless of their orientation and position in the image.

TABLE I
A COMPARISON OF TOP ACCURACIES FOR DIFFERENT MODELS AND DATASETS

Sl. No.  Model                  Dataset          Accuracy (in %)
1.       SVM                    Food-101         50.76
2.       Neural Networks        Food-101         56.40
3.       RFDC-based Approach    Food-101         56.76
4.       Resnet18               Food-101         67.23
5.       CNN                    UEC-FOOD100      78.77
6.       CNN (ILSVRC)           Food-101         79.20
7.       CNN (Food-101)         EgocentricFood   90.90
8.       Proposed Approach      Food-101         86.97
The results obtained for the various classifiers used in the literature on the standard Food-101 dataset are shown in Table I. The present system is developed using CNNs on the Food-101 dataset and gives 86.97% accuracy for the top-1 prediction. It gives better performance for the top-5 prediction, where the accuracy is 97.42%. However, the top-1 prediction alone is considered when comparing with the other models, and the same Food-101 dataset is used to compare with the state-of-the-art systems.
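As an illustration, the top-1 and top-5 accuracies can be computed from the predicted class probabilities as in the short sketch below, where y_prob and y_true are assumed arrays of prediction probabilities and true class indices:

import numpy as np

def top_k_accuracy(y_prob, y_true, k):
    # y_prob: (num_images, num_classes) predicted probabilities,
    # y_true: (num_images,) integer class indices.
    top_k = np.argsort(y_prob, axis=1)[:, ::-1][:, :k]
    hits = [label in preds for label, preds in zip(y_true, top_k)]
    return float(np.mean(hits))

# top_k_accuracy(y_prob, y_true, 1) -> top-1 accuracy
# top_k_accuracy(y_prob, y_true, 5) -> top-5 accuracy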
Initially, the SVMs and neural networks are considered as they are designed to capture patterns which are highly non-linear. The random forest decision classifier (RFDC) is used since it is currently among the most accepted classifiers for non-linear patterns. As assumed, the RFDC gave better accuracy compared with the other two classifiers mentioned above. For these three classifiers, features based on the SIFT technique, BoF models, and other useful features have been computed. Later, recent work using the Resnet18 model has been tested on the Food-101 dataset. The Resnet18 model gives better accuracy, around 86%, when 10 classes are used from the Cifar10 dataset, but it offers lower performance on the Food-101 dataset, around 67.23%. Since CNNs are capable of estimating the features automatically and are highly capable of mapping non-linear relations, a better accuracy of 86.97% is obtained with CNNs. However, experimentation is yet to be done on realistic images and all kinds of food.
V. CONCLUSION AND FUTURE WORK

The performance of the system is high and is considered acceptable from a usage point of view. However, CNNs need high-performance computing machines in order to experiment on huge multimedia datasets. The CNN is capable of training on highly non-linear data, but in contrast it takes more computational time to train the network. Nevertheless, performance matters a lot, and once the system is properly trained it can produce results in less time. The images are properly preprocessed and all kinds of images are tested with the CNN. From this, it is concluded that CNNs are more suitable for classifying images when the number of classes is large.

The task of image classification can be extended using prominent features that can categorize food images. Since CNNs consume a lot of computational time, a feature-based approach is highly desirable. A multi-level (hierarchical) classification approach is suitable to avoid misclassifications when the number of classes is large. Moreover, a dataset containing all food categories is not yet available in the literature.

REFERENCES

[1] W. Wu and J. Yang, "Fast food recognition from videos of eating for calorie estimation," in Multimedia and Expo, 2009. ICME 2009. IEEE International Conference on. IEEE, 2009, pp. 1210–1213.
[2] N. Yao, R. J. Sclabassi, Q. Liu, J. Yang, J. D. Fernstrom, M. H. Fernstrom, and M. Sun, "A video processing approach to the study of obesity," in Multimedia and Expo, 2007 IEEE International Conference on. IEEE, 2007, pp. 1727–1730.
[3] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, "Food recognition using statistics of pairwise local features," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2249–2256.
[4] M. Bosch, F. Zhu, N. Khanna, C. J. Boushey, and E. J. Delp, "Combining global and local features for food identification in dietary assessment," in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp. 1789–1792.
[5] M. M. Anthimopoulos, L. Gianola, L. Scarnato, P. Diem, and S. G. Mougiakakou, "A food recognition system for diabetic patients based on an optimized bag-of-features model," IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 4, pp. 1261–1271, 2014.
[6] P. Pouladzadeh, S. Shirmohammadi, and R. Al-Maghrabi, "Measuring calorie and nutrition from food image," IEEE Transactions on Instrumentation and Measurement, vol. 63, no. 8, pp. 1947–1956, 2014.
[7] G. Shroff, A. Smailagic, and D. P. Siewiorek, "Wearable context-aware food recognition for calorie monitoring," in Wearable Computers, 2008. ISWC 2008. 12th IEEE International Symposium on. IEEE, 2008, pp. 119–120.
[8] F. Zhu, M. Bosch, I. Woo, S. Kim, C. J. Boushey, D. S. Ebert, and E. J. Delp, "The use of mobile devices in aiding dietary assessment and evaluation," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 4, pp. 756–766, 2010.
[9] F. Kong and J. Tan, "DietCam: Automatic dietary assessment with mobile camera phones," Pervasive and Mobile Computing, vol. 8, no. 1, pp. 147–163, 2012.
[10] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 524–531.
[11] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," Machine Learning: ECML-98, pp. 137–142, 1998.
[12] M. Puri, Z. Zhu, Q. Yu, A. Divakaran, and H. Sawhney, "Recognition and volume estimation of food intake using a mobile device," in Applications of Computer Vision (WACV), 2009 Workshop on. IEEE, 2009, pp. 1–8.
[13] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, "PFID: Pittsburgh fast-food image dataset," in Image Processing (ICIP), 2009 16th IEEE International Conference on. IEEE, 2009, pp. 289–292.
[14] T. Joutou and K. Yanai, "A food image recognition system with multiple kernel learning," in Image Processing (ICIP), 2009 16th IEEE International Conference on. IEEE, 2009, pp. 285–288.